<<

Finite-state normalization and processing utilities: The Nisaba Brahmic library

Cibu Johny† Lawrence Wolf-Sonkin‡ Alexander Gutkin† Brian Roark‡ Research †United Kingdom and ‡United States {cibu,wolfsonkin,agutkin,roark}@google.com

Abstract In addition to such normalization issues, some scripts also have well-formedness constraints, .., This paper presents an open-source library for efficient low-level processing of ten - not all strings of characters from a single jor South Asian . The library script correspond to a valid (i.e., legible) provides a flexible and extensible framework sequence in the script. Such constraints do not ap- for supporting crucial operations on Brahmic ply in the basic Latin , where any permuta- scripts, such as NFC, visual normalization, tion of letters can be rendered as a valid string (e.g., reversible , and validity checks, for use as an acronym). The Brahmic family of implemented in Python within a finite-state scripts, however, including the script transducer formalism. We survey some com- mon Brahmic script issues that may adversely used to write , Marathi and many other South affect the performance of downstream NLP Asian , do have such constraints. These tasks, and provide the rationale for finite-state scripts are alphasyllabaries, meaning that they are design and implementation details. structured around orthographic ()̣ as the basic unit.1 One or more Unicode characters 1 Introduction combine when rendering one of thousands of leg- The Unicode Standard separates the representation ible aksara,̣ but many combinations do not corre- of text from its specific graphical rendering: text spond to any aksara.̣ Given a token in these scripts, is encoded as a sequence of characters, which, at one may want to (a) normalize it to a canonical presentation time are then collectively rendered form; and (b) check whether it is a well-formed into the appropriate sequence of glyphs for display. sequence of aksara.̣ This can occasionally result in many-to-one map- Brahmic scripts are heavily used across South pings, where several distinctly-encoded strings can Asia and have official status in , , result in identical display. For example, Latin , and beyond (Cardona and Jain, script letters with such as “é” can gener- 2007; Steever, 2019). Despite evident progress ally be encoded as either: (a) a pair of the base let- in localization standards (, ter (e.g., “e” which is +0065 from Unicode’s - 2019) and improvements in associated technolo- sic Latin block, corresponding to ASCII) and a dia- gies such as input methods (Hinkle et al., 2013) and critic (in this case U+0301 from the Combining Dia- character recognition (Pal et al., 2012), Brahmic critical Marks block); or (b) a single character that script processing still poses important challenges represents the grapheme directly (U+00E9 from the due to the inherent differences between these writ- Latin-1 Supplement ). Both encod- ing systems and those which historically have been ings yield visually identical text, hence text is of- more dominant in information technology (Sinha, ten normalized to a conventionalized normal form, 2009; Bhattacharyya et al., 2019). such as the well-known Normalization Form C In this paper, we present Nisaba, an open-source (NFC), so that visually identical words are mapped software library,2 which provides processing utili- to a conventionalized representative of their equiv- ties for ten major Brahmic scripts of : alence class for downstream processing. Critically, , Devanagari, , , Kan- NFC normalization falls far short of a complete nada, , (Odia), Sinhala, , handling of such many-to-one phenomena in Uni- 1See §3 for details on the scripts. code. 2https://github.com/google-research/nisaba/

14

Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations, pages 14–23 April 19 - 23, 2021. ©2021 Association for Computational Linguistics and . In addition to string normaliza- Murikinati et al., 2020), named entity recogni- tion and well-formedness processing, the library tion (Huang et al., 2019) and part-of-speech tag- also includes utilities for the deterministic and re- ging (Cardenas et al., 2019). versible of these scripts, i.e., translit- Similar to the software mentioned above, our li- eration from each script to and from the Latin brary does provide romanization, but unlike some script (Wellisch, 1978). While the resulting roman- of the packages, such as URoman, we guarantee izations are standardized in a way that may or may reversibility from Latin back to the native script. not correspond to how native speakers tend to ro- Similar to others we do not focus on faithful in- manize the text in informal communication (see, vertible transliteration of named entities which e.g., Roark et al., 2020), such a default romaniza- typically requires model-based approaches (Se- tion can permit easy inspection of an approximate quiera et al., 2014). Unlike the IndicNLP pack- version of the linguistic strings for those who read age, our software does not provide morphologi- the but not the specific Brahmic script cal analysis, but instead offers significantly richer being examined. script normalization capabilities than other pack- As a whole, the library provides important utili- ages. These capabilities are functionally sepa- ties for processing applications of South rated into normalization to Normalization Form Asian languages using Brahmic scripts. The de- C (NFC) and visual normalization. Additionally, sign is based on the observation that, while there our library provides extensive script-specific well- are considerable superficial differences between formedness grammars. Finally, in contrast to these these scripts, they follow the same encoding model other approaches, grammars in our library are in Unicode, and maintain a very similar char- maintained separately from the code for compila- acter repertoire having evolved from the same tion and application, allowing for maintenance of source — the Brāhmī script (Salomon, 1996; Fe- existing scripts and languages plus extension to dorova, 2012). This observation lends itself to the new ones without having to modify any code. This script-agnostic design (outlined in §4) that, unlike is particularly important given that Unicode stan- other approaches reviewed in §2, is based on the dards do change over time and there remain many weighted finite state transducer (WFST) formal- languages left to cover. ism (Mohri, 2004). The details of our system are To the best of our knowledge this is the first provided in §5. publicly available general finite-state grammar ap- proach for low-level processing of multiple Brah- 2 Related Work mic scripts since the early formal syntactic work by Datta (1984) and is the first such library de- The computational processing of Brahmic scripts signed based on an observation by Sproat (2003) is not a new topic, with the first applications that the fundamental organizing principles of the dating back to the early formal syntactic work Brahmic scripts can be algebraically formalized. by Datta (1984). With an increased focus on the In particular, all the core components of our li- South Asian languages within the NLP commu- brary (inverse romanization, normalization and nity, facilitated by advances in machine learning well-formedness) are compactly and efficiently and the increased availability of relevant corpora, represented as finite state transducers. Such for- multiple script processing solutions have emerged. malization lends itself particularly well to run-time Some of these toolkits, such as statistical ma- or offline integration with any finite state process- chine translation-based Brahmi-Net (Kunchukut- ing pipeline, such as decoder components of in- tan et al., 2015), are model-based, while oth- put methods (Ouyang et al., 2017; Hellsten et al., ers, such as URoman (Hermjakob et al., 2018), 2017), text normalization for automatic speech IndicNLP (Kunchukuttan, 2020), and Akshar- recognition and text-to-speech synthesis (Zhang mukha (Rajan, 2020), employ rules. The main fo- et al., 2019), among other natural language and cus of these libraries is script conversion and ro- speech applications. manization. In this capacity they were success- fully employed in diverse downstream multilin- 3 Brahmic Scripts: An Overview gual NLP tasks such as neural machine transla- tion (Zhang et al., 2020; Amrhein and Sennrich, The scripts of interest have evolved from the an- 2020), morphological analysis (Hauer et al., 2019; cient Brāhmī system that was recorded

15 Name Id iv dv c co Visual Legacy sequence NFC normalized Bengali BENG 16 13 43 5 ऩ NA NUKTA (U+0928 U+093C) NNNA (U+0929) Devanagari 19 17 45 4 क़ QA (U+0958) NUKTA (U+0915 U+093C) Gujarati GUJR 16 15 39 5 Gurmukhi GURU 12 9 39 8 Table 2: NFC examples for Devanagari. KNDA 17 15 39 3 Malayalam MLYM 16 16 38 10 bols, like , , and , Oriya ORYA 14 13 38 5 Sinhala SINH 18 17 41 2 which modify the aksarạ as a whole (and follow Tamil TAML 12 11 27 1 and signs in the memory representation). Fi- Telugu TELU 16 15 38 5 nally, there is a class of special characters, such as the religious ॐ, that behave like inde- Table 1: Sizes of core graphemic classes: Independent pendent aksara.̣ 4 (iv), dependent vowel diacritics (dv), conso- nants ( ), coda symbols ( ). c co Unicode Normalization Unicode defines sev- from the 3rd century BCE and fell out of use eral normalization forms which are used for check- by the 5th century CE (Salomon, 1996; Strauch, ing whether the two Unicode strings are equiv- 2012; Fedorova, 2012). The main unit of lin- alent to each other (Unicode Consortium, 2019). ear graphemic representation in Brahmic scripts In our library we support Normalization Form C is known by its traditional -derived name (NFC) which is well suited for comparing visu- aksarạ . As Bright (1999) notes, it is often trans- ally identical strings. This normalization gener- lated as “” although it does not bear di- ally converts strings to the equivalent form that rect correspondence to a syllable of speech, but uses composite characters. Table 2 shows two ex- rather to an orthographic syllable. The structure, amples of legacy sequences corresponding canon- or “grammar” of an aksarạ is based on the follow- ically equivalent forms for Devanagari. ing common principles: an aksarạ often consists Visual Normalization As was mentioned above, of a symbol 퐶, by default bearing an an aksarạ may be represented by multiple Unicode unmarked or attached (de- character sequences and the goal of NFC normal- pendent) vowel sign 푣 (퐶푣); but it may also be an ization is to convert them to their unique canonical independent vowel symbol 푉 , or a consonant sym- form. However, there are many Unicode character bol with its inherent vowel “muted” by a special sequences that fall outside the scope of NFC algo- diacritic ∅ (퐶∅). In any of these preceding rithm. We provide visual normalization that, in ad- scenarios, the base consonant 퐶 can be replaced dition to providing the NFC functionality, also sup- by a where all but the last conso- ports transforming such legacy sequences. Some nant lose their inherent vowel. When the individ- of the rules are provided as “Do Not Use” tables by ual component of the cluster combine the Unicode Consortium (2019) that recommends to form a composite form, precluding the use of an transformations from legacy sequences to their cor- overt virama diacritic, this is known as a “conso- responding canonical form, such as Devanagari { nant conjunct” (e.g., 퐶∅퐶∅퐶 vs [퐶 퐶 퐶 ]3) (Fe- 푖 푗 푘 푖 푗 푘 (U+0905), (U+0945) } → (U+0972). We also dorova, 2013; Bright, 1999; Coulmas, 1999; Share अ ॅ ॲ included transformations for visually identical se- and Daniels, 2016). quences (under many implementations) which are The elements of the aksarạ grammar described commonly found on the Web, such as Devanagari above can be grouped into several natural classes. { (U+0910), (U+0947) } → (U+0910).5 The sizes of the core classes are shown in Ta- ऐ े ऐ ble 1 for each and its correspond- Well-formedness Check A well-formedness ac- ing ISO 15924 identifier in uppercase format (ISO, ceptor verifies whether the given text is readable in 2004). The major classes are the independent vow- a particular script or not. It would be hard for the els (e.g., the Devanagari diphthong औ), the depen- native reader to visually parse the text if the script dent vowel diacritics (e.g., the Gujarati ◌ી), and the rules are not followed. For example, the reader consonants (e.g., the Gurmukhi ੜ). Another im- portant class consists of the coda consonant sym- 4These classes are documented in https://github.com/ google-research/nisaba/blob/main/nisaba/brahmic/ 3Here, surrounding the consonants in square brackets will mappings.md. serve to indicate that the enclosed consonants form a conjunct 5Here the combining vowel sign U+0947 does not affect together. the compound glyph’s visual appearance hence is removed.

16 Script ID Visual Character(s) Translit. U+0AA6 U+0AB8 દ સ 푞 푞 푞 BENG ৎ KHANDA TA ⟨tⸯ⟩ 1 2 3 DEVA इ Non-word initial VOWEL I ⟨.i⟩ 0xE0 0xAA 0xA6 0xE0 0xAA 0xB8 GUJR ૐ Religious sign OM ⟨ōm̐ ⟩ 푞1 푞2 푞3 푞4 푞5 푞6 푞7 GURU ◌ੱ ADDAK ⟨˖⟩ MLYM ൻ CHILLU N ⟨nⸯ⟩ દ સ SINH ඥ JNYA ⟨ᵑǰ⟩ Figure 1: String acceptors for Gujarati word દસ (⟨dasa⟩, TAML ஃப VISARGA, ⟨f⟩ “ten”) over an alphabet of Unicode code points (top) and bytes (bottom). Table 3: Examples for additions to ISO 15919. Require: FSAs: consonant, vowel, vowel_sign, coda, standalone, virama, does not expect two vowels signs on a single con- dead_consonant, accept. sonant and such a thing may not even be possible 1: function 풲(consonant, vowel, vowel_sign, coda, standalone, virama, dead_consonant, accept) to reasonably draw. Furthermore, unlike the Latin 2: cluster ← (consonant + virama)∗ + consonant ? ? script, acronyms are not written using arbitrary let- 3: codable ← (vowel ∪ (cluster + vowel_sign ) ∪ accept) ∪ coda 4: akshara ← codable ∪ (cluster + virama + dead_consonant?) ter sequences, they are formed only as a sequence 5: 푇 ← akshara ∪ standalone 6: return 푇 + ⊳ Kleene plus of aksara.̣ Our approach verifies whether the text is a sequence of well-formed aksarạ using the gram- Figure 2: Simplified construction of the well-formed mar described above. automaton 풲.

Reversible ISO Transliteration ISO 15919 rep- 4 The Finite-State Approach resents a unified 8-bit Latin transliteration scheme for major South Asian Brahmic scripts (ISO, 2001). The Brahmic script manipulation operations Since it has not been updated with the characters described above have a natural intepretation that were introduced to the Unicode standard af- grounded in formal language theory. We treat the ter 2001, we have added additional mappings, with text corpus in a given script as a set of strings some examples shown in Table 3. These additions over some finite alphabet Σ that defines a set of are crucial because they allow us to reverse the admissable script symbols. The set of zero or to get the original Brahmic strings more strings is known as language which, in its back reliably. This property allows various data simplest (regular) form, can be succintly described processing pipelines to use the romanized text as (or recognized) by a finite state automaton (FSA) an internal representation and convert it back to the or acceptor (Yu, 1997). Two simple FSAs that original native script at the output stage. represent the Gujarati word દસ are shown in Figure 1, where the top automaton represents the Language-specific Logic Several South Asian word over an alphabet of Unicode code points languages often share the same script with some, for Gujarati, while the bottom one represents the often minor, language-specific differences. Our same string over the corresponding byte symbols library supports language-specific customizations in UTF-8 encoding (Unicode Consortium, 2019). that can be combined with language-agnostic Our library supports both representations. script logic. For example, the modern Bengali– The aksarạ grammar outlined in the previous Assamese script (Beng) is shared by both Bengali section can be expressed via elementary formal op- and Assamese languages, among others (Brandt erations on the FSAs that describe grammar con- and Sohoni, 2018). For both of these languages stituents. Such set-theoretic operations include our library provides customizations,6 such as union (∪), concatenation (+) and closure, where the transformations required for visual normal- closure is defined as an arbitrary natural number ization of Assamese that transform Bengali let- of concatenations of a language 퐿 over Σ with it- ter into its Assamese equivalent when it par- self, either accepting an empty input {휖} or not, ticipates in a consonant conjunct (which gener- denoted 퐿∗ (Kleene star) and 퐿+ (Kleene plus), ally occurs when following or preceding virama, respectively (Kuich and Salomaa, 1986). These e.g., { (U+09B0), (U+09CD) } → { (U+09F0), র ◌্ ৰ operations represent non-trivial automata which (U+09CD) }). ◌্ are compiled offline resulting in compact and ef- 6https://github.com/google-research/nisaba/ ficient representations. A simplified process for tree/main/nisaba/brahmic/data/lang constructing the automaton 풲 to perform the well-

17 එ/⟨e⟩ 푞1 ක/⟨k⟩ Require: FSTs: consonant, vowel, vowel_sign, coda, standalone, virama.

1: function ℐ(consonant, vowel, vowel_sign, coda, standalone, virama) 푞0 푞2 푞3 푞4 푞5 ද/⟨d⟩ ෙ◌/⟨e⟩ ක/⟨k⟩ 휖/⟨a⟩ 2: del_virama ← virama × Ø ⊳ Delete virama 3: ins_schwa ← Ø × {⟨a⟩} ⊳ Insert inherent vowel 4: deweight ← (휖, 휖, 푤 ↓) ⊳ De-prioritize the path Figure 3: Romanization of Sinhala words එක (“one”) 5: 푇 ← ( 6: (consonant + vowel_sign) ∪ ⊳ (ஸ,⟨sa⟩) + (◌ு,⟨u⟩) → (ஸு,⟨su⟩) and දෙක (“two”) into ⟨eka⟩ and ⟨deka⟩, respectively. 7: (consonant + del_virama + deweight) ∪ 8: (consonant + ins_schwa + deweight) ∪ 9: (vowel + deweight) ∪ coda ∪ standalone ∪ formed check from the previous section is shown 10: …) ⊳ Further logic 11: return 푇 ∗ ⊳ Kleene star in Figure 2. In this simplified example, the paths through the automaton that define a legal conso- Figure 4: Simplified construction of the transliteration transducer ℐ. nant cluster (line 2 of the algorithm) are repre- sented by a sub-automaton that recognizes the lan- guage that consists of strings formed from the con- over the aksara-initiaḷ independent vowels (line 9). sonant and virama symbols only, where each con- The two remaining operations on aksara,̣ sonant, apart from the last one, must be followed namely NFC and visual normalization, are repre- by the virama that removes an inherent vowel. sented in our library using the context-dependent The rest of the operations on the Brahmic scripts, rewrite rules from the formal approach pop- namely the normalization and transliteration, in- ularized by Chomsky and Halle (1968). The volve modifications of the Brahmic script inputs. normalization rules are represented as a sequence Such operations are naturally expressed by finite {휙 → 휓/휆 __ 휌}, where the source 휙 is rewritten state transducers (FSTs), which are a generaliza- as 휓 if its left and right contexts are 휆 and 휌. For tion of the FSA concept used to encode string- an earlier example from §3, a single NFC normal- string relations (or transductions), by modifying ization rule rewrites the Devanagari string 휙 = “न” the automata arcs to have pairs of labels from in- (na, U+0928) + “़” (nukta sign, U+093C) into its put and output , instead of single labels. canonical composition 휓 = “ऩ” (nnna, U+0929). A trivial romanization in our representation of the Kaplan and Kay (1994) proposed an algorithm two Sinhala words එක (⟨eka⟩, “one”) and දෙක for compiling such sequences into an FST. This (⟨deka⟩, “two”) is shown in Figure 3. Note the approach was further improved and extended “vocalization” of the final consonant by insertion to WFSTs by Mohri and Sproat (1996), whose of a via an input 휖-transition. Also note that algorithm we use to compile sequences of NFC the path accepting the second word is longer. The and visual normalization rules into transducers word දෙක consists of three aksarạ and requires denoted 풩 and 풱. modification of the inherent vowel by the depen- Finally, the transducers representing language- dent vowel in order to produce ⟨de⟩. specific customizations of a particular script op- The basic operations on the FSAs outlined eration are compiled by composing the generic above also extend to the FST case and allow language-agnostic transducer, such as the Devana- for similarly succinct final compiled representa- gari visual normalizer, with the transducer rep- tions (Mohri, 2000), such as the simplified con- resenting transformations that capture language- struction of the ISO romanization transducer ℐ for specific use of the script, e.g., Devanagari for converting from Brahmic scripts to Latin - Nepali. , shown in Figure 4. An important extension of FSAs and FSTs are the weighted finite state - 5 System Details and Demo tomata (WFSAs) and transducers (WFSTs) (Mohri, 2004, 2009) that equip each arc in the automaton or The core of the Nisaba Brahmic script manipula- transducer with a weight, thus allowing optimiza- tion library resides under the brahmic directory tion and search algorithms to compute the costs of of the distribution. In this section we provide de- distinct paths, which can be used to determine their tails for how to build and use the library and also relative importance. We use weights in some of our explore its application to visual normalization of grammars to indicate the relative priority of a par- Wikipedia-based text in 9 of these scripts. ticular aksarạ modification. For example, in Fig- ure 4, the paths corresponding to consonants fol- Prerequisites We use Bazel (Google, 2020) as lowed by dependent vowels (line 6) have priority a primary build environment. For compiling the

18 Script Op. Symb. Prop. BENG DEVA GUJR GURU KNDA MLYM ORYA SINH TAML TELU 푁 127 130 113 93 119 122 105 122 75 112 Unicode 푠 푁 475 546 476 418 487 522 452 513 326 485 ℐ 푎 푁 248 235 195 171 210 201 178 192 126 181 Byte 푠 푁푎 384 399 334 288 350 345 305 339 229 318 푁 9 17 1 8 21 8 9 17 11 4 Unicode 푠 푁 158 248 75 78 349 261 160 352 228 163 풩 푎 푁 31 55 1 28 70 27 31 55 37 14 Byte 푠 푁푎 1,812 1,841 255 1,047 2,884 2,322 1,813 2,611 3,098 1,543 푁 103 51,710 98 119 1764 287 60 182 209 57 Unicode 푠 푁 2,423 121,157 2,234 2,322 6,136 3,021 1,732 2,129 1,280 2,249 풱 푎 푁 369 165,168 356 425 5,611 965 232 624 703 225 Byte 푠 푁푎 18,896 266,441 18,684 20,733 30,422 18,598 16,146 15,363 11,830 18,717 푁 11 7 7 7 10 10 7 7 4 6 Unicode 푠 푁 427 446 388 341 465 485 380 361 158 335 풲 푎 푁 38 23 21 23 33 33 22 22 11 19 Byte 푠 푁푎 297 321 284 257 309 297 279 195 130 239

Table 4: Properties of script FSTs arranged by operation and symbol types (Unicode code points and UTF-8 bytes), where ℐ denotes the ISO transliteration operation, 풩 is the NFC normalization, 풱 denotes visual normalization, and 풲 is the well-formed check. The numbers of states and arcs are denoted by 푁푠 and 푁푎, respectively.

Brahmic Brahmic Brahmic Compiling the Transducers Figure 6 presents Offline: Compile Runtime: Python Runtime: C++ the sequence of steps to compile the transduc- ers, including downloading the repository (line 2), Pynini Thrax Pynini Thrax compiling the library and its artifacts (line 5) and running the unit tests (line 7). The artifacts are compiled by Bazel using Pynini and consist of the OpenFst OpenFst OpenFst finite state archive (FAR) files that contain collec- tions of WFSTs (Roark et al., 2012). For each Figure 5: Software dependency diagrams for the three of the four Brahmic script operations we generate modes of operation: compile stage (left), Python run- two FAR files: one for WFSTs over the byte al- time (center) and C++ run-time (right). phabet, and another over the Unicode alphabet.10 Each FAR file contains ten script- automata and transducers we employ Pynini7, a specific transducers whose names correspond to Python library for constructing finite-state gram- the upper-case ISO 15924 script codes. Since the mars and for performing operations on WF- transliteration operation is bidirectional, the name STs (Gorman, 2016; Gorman and Sproat, in press). of each script-specific transliteration transducer In addition, the library depends on Thrax8, an older has the prefix FROM_ for the native-to-Latin direc- relative of Pynini, that provides a custom gram- tion, and TO_ for the inverse. The numbers of states mar manipulation language for WFSTs (Tai et al., (푁푠) and arcs (푁푎) of the resulting transliteration 2011; Roark et al., 2012). Although Thrax has (ℐ), NFC (풩), visual normalization (풱) transduc- been mostly superseded by Pynini, we still rely on ers and well-formedness acceptors (풲) for each some of its utilities for unit testing and its C++ run- script and alphabet type are shown in Table 4. time components. At their core, both Pynini and Thrax depend on the OpenFst library9 for the im- Offline and Online Usage Once the transduc- plementation of most WFST algorithms (Allauzen ers are compiled, they can be applied offline to et al., 2007; Riley et al., 2009). The overall depen- the input files using the rewrite-tester tool pro- dency diagram is shown on the left-hand side of vided by Thrax, as shown in lines 8–13 of the ex- Figure 5 (the minimal dependency on Thrax is in- ample in Figure 6, where the visual normalization dicated by a dotted arrow). At build time, Bazel transducer 풱 for Kannada that resides in the vi- pulls in these dependencies remotely from their re- sual_norm.far archive is applied to words in in- spective repositories. put file words.txt. We provide lightweight run-time interfaces for 7http://pynini.opengrm.org/ 8http://thrax.opengrm.org 10The Unicode code point FARs rather misleadingly have 9http://www.openfst.org the suffix utf8 in their name for historical reasons.

19 % Changed 1 # Download Nisaba repository. 2 git clone https://github.com/google-research/nisaba.git Language Script Types Tokens 3 cd nisaba Bengali BENG 0.53 0.06 4 # Compile the transducers and tests. 5 bazel build -c opt //nisaba/brahmic/... Gujarati GUJR 0.46 0.09 6 # Run the unit tests. Hindi DEVA 1.41 0.18 7 bazel test -c opt //nisaba/brahmic/... Kannada KNDA 4.19 1.66 8 # Compile Thrax rewrite helper tool. Malayalam MLYM 6.33 4.19 9 bazel build -c opt @org_opengrm_thrax//:rewrite-tester 10 # Run visual normalization for Kannada. Marathi DEVA 1.51 0.40 11 bazel-bin/external/org_opengrm_thrax/rewrite-tester \ Punjabi GURU 1.67 0.33 12 --far=bazel-bin/nisaba/brahmic/visual_norm.far \ Sinhala SINH 3.55 0.71 13 --rules=KNDA < words.txt Tamil TAML 0.59 0.17 Telugu TELU 1.97 0.63 Figure 6: Compiling the transducers.

import unittest Table 5: Percentage of types and tokens changed by vi- from nisaba import brahmic sual normalization from native script Wikipedia train- class BrahmicTest(unittest.TestCase): ing partitions of the Dakshina dataset. def testBasicOperations(self): # Check romanization. iso_to_deva = brahmic.IsoTo(’Deva’) self.assertEqual(’क् ़लब’, iso_to_deva.ApplyOnText(’⟨kˑlaba⟩’)) these scripts, we normalized publicly available cor- # Check valid inputs. wellformed_mlym = brahmic.WellFormed(’Mlym’) pora and measured how frequently words in the self.assertTrue(wellformed_mlym.AcceptText(’സ്വര’)) # Visual normalizer. samples were modified. The Dakshina dataset visual_norm_deva = brahmic.VisualNorm(’Deva’) (Roark et al., 2020) includes (among other things) self.assertEqual(’औ’, visual_norm_deva.ApplyOnText(’औ’)) collections of monolingual Wikipedia sentences in Figure 7: Run-time Python interface example. 12 South Asian languages, 10 of which use Brah- mic scripts. We applied visual normalization to the training partitions of the collections in these 10 lan- both Python and C++, their dependencies shown guages, and Table 5 presents the percentage of both in the center and the right-hand side of Figure 5, types and tokens that were changed by the normal- respectively. The Python interface is provided via ization.11 Malayalam is the language with the high- several wrappers around the pynini.Fst abstrac- est percentage of both types and tokens changed by tion, with a simple example shown in Figure 7. visual normalization, largely due to frequent con- In addition to performing simple operations on in- version to chillu letters from alternative encodings. dividual strings, more WFST-specific operations, For example, the relatively frequent word തൻ്റെ such as transducer composition, are provided by (“yours”) is normalized to the encoding with the Pynini. The C++ interface is provided by the Gram- chillu letter ൻ instead of ന. mar helper class, shown in Figure 8, that includes the necessary methods for initializing the WFSTs and performing rewrites (for transducers) and ac- 6 Conclusion and Future Work ceptance tests (for acceptors). In addition, many more operations on WFSTs are available through We presented finite-state automata-based utilities the OpenFst library, if required. for processing the major Brahmic scripts. The fi- Prevalence of Normalization To demonstrate nite state transducer formalism provides an effi- the prevalence of text requiring normalization in cient and scalable framework for expressing Brah- mic script operations and is suitable for many NLP applications, such as those reported in Kumar et al. #include (2020) and Kakwani et al. (2020), which may ben- // Generic wrapper around FST archive with Brahmic transducers. class Grammar { efit from the reduction in “noise” present in unnor- public: // Constructs given the FAR path, its name and the name of WFST. malized text. In the future, we will continue to im- Grammar(const std::string& far_path, const std::string& far_name, prove the support for existing scripts and extend const std::string& fst_name); // Initializes the transducer. our work to other Brahmic scripts. bool Load(); // Rewrites into . bool Rewrite(const std::string& input, std::string *output) const; // Checks whether the grammar accepts . 11Tokenization was simply based on whitespace, with no bool Accept(const std::string& input) const; other processing such as separation, so the total }; number of distinct types is accordingly relatively high. The texts from that dataset were already NFC normalized. Figure 8: Run-time C++ interface. 20 Acknowledgments Liudmila L. Fedorova. 2013. The development of graphic representation in writing: The ak- The authors would like to thank Işın Demirşahin shara’s grammar. Lingua Posnaniensis, 55(2):49– for valuable discussion on this project. 66. Google. 2020. Bazel. http://bazel.build. [Online], Accessed: 2020-12-10. References Kyle Gorman. 2016. Pynini: A Python library for Cyril Allauzen, Michael Riley, Johan Schalkwyk, Wo- weighted finite-state grammar compilation. In Pro- jciech Skut, and Mehryar Mohri. 2007. OpenFst: A ceedings of the SIGFSM Workshop on Statistical general and efficient weighted finite-state transducer NLP and Weighted Automata, pages 75–80, Berlin, library. In International Conference on Implemen- Germany. Association for Computational Linguis- tation and Application of Automata, pages 11–23. tics. Springer. Kyle Gorman and Richard Sproat. in press. Finite-State Chantal Amrhein and Rico Sennrich. 2020. On Roman- Text Processing. Human Language Technologies. ization for model transfer between scripts in neural Morgan & Claypool, Williston, VT. machine translation. In Findings of the Association Bradley Hauer, Amir Ahmad Habibi, Yixing Luan, for Computational Linguistics: EMNLP 2020, pages Rashed Rubby Riyadh, and Grzegorz Kondrak. 2019. 2461–2469, Online. Association for Computational Cognate projection for low-resource inflection gen- Linguistics. eration. In Proceedings of the 16th Workshop on Computational Research in Phonetics, Phonology, Pushpak Bhattacharyya, Hema Murthy, Surangika and Morphology, pages 6–11, Florence, Italy. Asso- Ranathunga, and Ranjiva Munasinghe. 2019. Indic ciation for Computational Linguistics. language computing. Communications of the ACM, 62(11):70–75. Lars Hellsten, Brian Roark, Prasoon Goyal, Cyril Al- lauzen, Françoise Beaufays, Tom Ouyang, Michael Carmen Brandt and Sohoni. 2018. Script and Riley, and David Rybach. 2017. Transliterated mo- identity – the politics of writing in South Asia: an in- bile keyboard input via weighted finite-state trans- troduction. South Asian History and Culture, 9(1):1– ducers. In Proceedings of the 13th International 15. Conference on Finite State Methods and Natural Language Processing (FSMNLP 2017), pages 10– William Bright. 1999. A matter of typology: Alphasyl- 19, Umeå, Sweden. Association for Computational labaries and . Written Language & , Linguistics. 2(1):45–55. Ulf Hermjakob, Jonathan May, and Kevin Knight. Ronald Cardenas, Ying Lin, Heng Ji, and Jonathan May. 2018. Out-of-the-box universal Romanization tool 2019. A grounded unsupervised universal part-of- uroman. In Proceedings of ACL 2018, System speech tagger for low-resource languages. In Pro- Demonstrations, pages 13–18, Melbourne, Australia. ceedings of the 2019 Conference of the North Amer- Association for Computational Linguistics. ican Chapter of the Association for Computational Linguistics: Human Language Technologies, Vol- Lauren Hinkle, Albert Brouillette, Sujay Jayakar, Leigh ume 1 (Long and Short Papers), pages 2428–2439, Gathings, Miguel Lezcano, and Jugal Kalita. 2013. Minneapolis, Minnesota. Association for Computa- Design and evaluation of soft keyboards for Brahmic tional Linguistics. scripts. ACM Transactions on Asian Language Infor- mation Processing (TALIP), 12(2):1–37. George Cardona and Danesh Jain. 2007. The Indo- Xiaolei Huang, Jonathan May, and Nanyun Peng. 2019. Aryan Languages. Routledge Language Family Se- What matters for neural -lingual named entity ries. Routledge, New York. recognition: An empirical analysis. In Proceedings of the 2019 Conference on Empirical Methods in Nat- Noam Chomsky and Morris Halle. 1968. The Sound ural Language Processing and the 9th International Pattern of English. Harper & Row, New York. Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 6395–6401, Hong Kong, Florian Coulmas. 1999. The Blackwell Encyclopedia of . Association for Computational Linguistics. Writing Systems. John Wiley & Sons, Oxford. ISO. 2001. ISO 15919: Transliteration of Devana- A. K. Datta. 1984. A generalized formal approach gari and related Indic scripts into Latin characters. for description and analysis of major Indian scripts. https://www.iso.org/standard/28333.html. Interna- IETE Journal of Research, 30(6):155–161. tional Organization for Standardization. Liudmila L Fedorova. 2012. The development of ISO. 2004. ISO 15924: Codes for the representation of structural characteristics of in deriva- names of scripts. https://www.iso.org/obp/ui/#iso: tive writing systems. Written Language & Literacy, std:iso:15924:ed-1:v1:en. International Organiza- 15(1):1–25. tion for Standardization.

21 Divyanshu Kakwani, Anoop Kunchukuttan, Satish 17th SIGMORPHON Workshop on Computational Golla, Gokul N. C., Avik Bhattacharyya, Mitesh M. Research in Phonetics, Phonology, and Morphology, Khapra, and Pratyush Kumar. 2020. iNLPSuite: pages 189–197, Online. Association for Computa- Monolingual corpora, evaluation benchmarks and tional Linguistics. pre-trained multilingual language models for Indian languages. In Proc. of the 2020 Conference on Em- Tom Ouyang, David Rybach, Françoise Beaufays, and pirical Methods in Natural Language Processing: Michael Riley. 2017. Mobile keyboard input decod- Findings, EMNLP 2020, pages 4948–4961, Online ing with finite-state transducers. Event. Association for Computational Linguistics. Umapada Pal, Ramachandran Jayadevan, and Nabin Ronald M. Kaplan and Martin Kay. 1994. Regular mod- Sharma. 2012. Handwriting recognition in Indian re- els of phonological rule systems. Computational gional scripts: A survey of offline techniques. ACM Linguistics, 20(3):331–378. Transactions on Asian Language Information Pro- cessing (TALIP), 11(1):1–35. Werner Kuich and Arto Salomaa. 1986. Semirings, Au- tomata, Languages, volume 5 of Monographs in The- Vinodh Rajan. 2020. Aksharamukha. https://github. oretical Computer Science. Springer, Berlin. com/virtualvinodh/aksharamukha. Saurav Kumar, Saunack Kumar, Diptesh Kanojia, and Pushpak Bhattacharyya. 2020. “A passage to In- Michael Riley, Cyril Allauzen, and Martin Jansche. dia”: Pre-trained word embeddings for Indian lan- 2009. OpenFst: An open-source, weighted finite- guages. In Proc. of the 1st Joint Workshop on Spoken state transducer library and its applications to speech Language Technologies for Under-resourced lan- and language. In Proceedings of Human Language guages (SLTU) and Collaboration and Computing Technologies: The 2009 Annual Conference of the for Under-Resourced Languages (CCURL), pages North American Chapter of the Association for Com- 352–357, Marseille, France. European Language Re- putational Linguistics, Companion Volume: Tutorial sources association. Abstracts, pages 9–10, Boulder, Colorado. Associa- tion for Computational Linguistics. Anoop Kunchukuttan. 2020. The IndicNLP Li- brary. https://github.com/anoopkunchukuttan/ Brian Roark, Richard Sproat, Cyril Allauzen, Michael indic_nlp_library. Riley, Jeffrey Sorensen, and Terry Tai. 2012. The OpenGrm open-source finite-state grammar soft- Anoop Kunchukuttan, Ratish Puduppully, and Pushpak ware libraries. In Proceedings of the ACL 2012 Sys- Bhattacharyya. 2015. Brahmi-net: A transliteration tem Demonstrations, pages 61–66, Jeju Island, Ko- and script conversion system for languages of the In- rea. Association for Computational Linguistics. dian subcontinent. In Proceedings of the 2015 Con- ference of the North American Chapter of the Asso- Brian Roark, Lawrence Wolf-Sonkin, Christo Kirov, ciation for Computational Linguistics: Demonstra- Sabrina J. Mielke, Cibu Johny, Isin Demirsahin, and tions, pages 81–85, Denver, Colorado. Association Keith Hall. 2020. Processing South Asian languages for Computational Linguistics. written in the Latin script: the Dakshina dataset. In Proc. of 12th Language Resources and Evaluation Mehryar Mohri. 2000. Minimization algorithms for se- Conference (LREC), pages 2413–2423, Marseille, quential transducers. Theoretical Computer Science, France. 234(1-2):177–201. Richard G. Salomon. 1996. Brahmi and Kharoshthi. In Mehryar Mohri. 2004. Weighted finite-state transducer Peter . Daniels and William Bright, editors, The algorithms. An overview. In Carlos Martín-Vide, World’s Writing Systems, pages 373–383. Oxford Victor Mitrana, and Gheorghe Păun, editors, For- University Press, New York, NY. mal Languages and Applications, pages 551–563. Springer, Berlin; Heidelberg. Royal Denzil Sequiera, Shashank S. Rao, and B. R. Mehryar Mohri. 2009. Weighted automata algorithms. Shambavi. 2014. Word-level language identifica- In Manfred Droste, Werner Kuich, and Heiko Vogler, tion and back transliteration of romanized text. In editors, Handbook of Weighted Automata, Mono- Proceedings of the Forum for Information Retrieval graphs in Theoretical Computer Science, pages 213– Evaluation, pages 70–73, Bangalore, India. 254. Springer. David L. Share and Peter T. Daniels. 2016. Aksha- Mehryar Mohri and Richard Sproat. 1996. An efficient ras, alphasyllabaries, abugidas, alphabets and ortho- compiler for weighted rewrite rules. In 34th An- graphic depth: Reflections on Rimzhim, Katz and nual Meeting of the Association for Computational Fowler (2014). Writing systems research, 8(1):17– Linguistics, pages 231–238, Santa Cruz, California, 31. USA. Association for Computational Linguistics. R. Mahesh K. Sinha. 2009. A journey from Indian Nikitha Murikinati, Antonios Anastasopoulos, and Gra- scripts processing to Indian language processing. ham Neubig. 2020. Transliteration for cross-lingual IEEE Annals of the History of Computing, 31(1):8– morphological inflection. In Proceedings of the 31.

22 Richard Sproat. 2003. A formal computational analysis of Indic scripts. In In International Symposium on Indic Scripts: Past and Future, Tokyo, . Sanford B. Steever. 2019. The , 2nd edition. Routledge Language Family Series. Routledge, New York. Ingo Strauch. 2012. The character of the Indian Kharosṭhị̄ script and the “Sanskrit Revolution”: A writing system between identity and assimilation. In Alexander J. de Voogt and Joachim Friedrich Quack, editors, The Idea of Writing: Writing Across Borders, pages 131–168. Brill, Leiden; Boston. Terry Tai, Wojciech Skut, and Richard Sproat. 2011. Thrax: An open source grammar compiler built on OpenFst. In Proc. of IEEE Automatic Speech Recog- nition and Understanding Workshop (ASRU), vol- ume 12, Hawaii, USA. Unicode Consortium. 2019. The Unicode Stan- dard. Online: http://www.unicode.org/versions/ Unicode12.1.0/. Version 12.1.0, Mountain View, . Hans H. Wellisch. 1978. The Conversion of Scripts: Its Nature, History, and Utilization. Information sci- ences series. John Wiley & Sons, New York. Sheng Yu. 1997. Regular languages. In Grzegorz Rozenberg and Arto Salomaa, editors, Handbook of Formal Languages, volume 1: Word, Language, Grammar, pages 41–110. Springer, Berlin. Hao Zhang, Richard Sproat, Axel H Ng, Felix Stahlberg, Xiaochang Peng, Kyle Gorman, and Brian Roark. 2019. Neural models of text normal- ization for speech applications. Computational Lin- guistics, 45(2):293–337. Yuhao Zhang, Ziyang Wang, Runzhe Cao, Binghao Wei, Weiqiao Shan, Shuhan Zhou, Abudurexiti Re- heman, Tao Zhou, Xin Zeng, Laohu Wang, Yongyu Mu, Jingnan Zhang, Xiaoqian Liu, Xuanjun Zhou, Yinqiao Li, Bei Li, Tong Xiao, and Jingbo Zhu. 2020. The NiuTrans machine translation systems for WMT20. In Proceedings of the Fifth Conference on Machine Translation, pages 338–345, Online. Asso- ciation for Computational Linguistics.

23