Finite-State Script Normalization and Processing Utilities: the Nisaba Brahmic Library
Total Page:16
File Type:pdf, Size:1020Kb
Finite-state script normalization and processing utilities: The Nisaba Brahmic library Cibu Johny† Lawrence Wolf-Sonkin‡ Alexander Gutkin† Brian Roark‡ Google Research †United Kingdom and ‡United States {cibu,wolfsonkin,agutkin,roark}@google.com Abstract In addition to such normalization issues, some scripts also have well-formedness constraints, i.e., This paper presents an open-source library for efficient low-level processing of ten ma- not all strings of Unicode characters from a single jor South Asian Brahmic scripts. The library script correspond to a valid (i.e., legible) grapheme provides a flexible and extensible framework sequence in the script. Such constraints do not ap- for supporting crucial operations on Brahmic ply in the basic Latin alphabet, where any permuta- scripts, such as NFC, visual normalization, tion of letters can be rendered as a valid string (e.g., reversible transliteration, and validity checks, for use as an acronym). The Brahmic family of implemented in Python within a finite-state scripts, however, including the Devanagari script transducer formalism. We survey some com- mon Brahmic script issues that may adversely used to write Hindi, Marathi and many other South affect the performance of downstream NLP Asian languages, do have such constraints. These tasks, and provide the rationale for finite-state scripts are alphasyllabaries, meaning that they are design and system implementation details. structured around orthographic syllables (aksara)̣ as the basic unit.1 One or more Unicode characters 1 Introduction combine when rendering one of thousands of leg- The Unicode Standard separates the representation ible aksara,̣ but many combinations do not corre- of text from its specific graphical rendering: text spond to any aksara.̣ Given a token in these scripts, is encoded as a sequence of characters, which, at one may want to (a) normalize it to a canonical presentation time are then collectively rendered form; and (b) check whether it is a well-formed into the appropriate sequence of glyphs for display. sequence of aksara.̣ This can occasionally result in many-to-one map- Brahmic scripts are heavily used across South pings, where several distinctly-encoded strings can Asia and have official status in India, Bangladesh, result in identical display. For example, Latin Nepal, Sri Lanka and beyond (Cardona and Jain, script letters with diacritics such as “é” can gener- 2007; Steever, 2019). Despite evident progress ally be encoded as either: (a) a pair of the base let- in localization standards (Unicode Consortium, ter (e.g., “e” which is U+0065 from Unicode’s Ba- 2019) and improvements in associated technolo- sic Latin block, corresponding to ASCII) and a dia- gies such as input methods (Hinkle et al., 2013) and critic (in this case U+0301 from the Combining Dia- character recognition (Pal et al., 2012), Brahmic critical Marks block); or (b) a single character that script processing still poses important challenges represents the grapheme directly (U+00E9 from the due to the inherent differences between these writ- Latin-1 Supplement Unicode block). Both encod- ing systems and those which historically have been ings yield visually identical text, hence text is of- more dominant in information technology (Sinha, ten normalized to a conventionalized normal form, 2009; Bhattacharyya et al., 2019). such as the well-known Normalization Form C In this paper, we present Nisaba, an open-source (NFC), so that visually identical words are mapped software library,2 which provides processing utili- to a conventionalized representative of their equiv- ties for ten major Brahmic scripts of South Asia: alence class for downstream processing. Critically, Bengali, Devanagari, Gujarati, Gurmukhi, Kan- NFC normalization falls far short of a complete nada, Malayalam, Oriya (Odia), Sinhala, Tamil, handling of such many-to-one phenomena in Uni- 1See §3 for details on the scripts. code. 2https://github.com/google-research/nisaba/ 14 Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations, pages 14–23 April 19 - 23, 2021. ©2021 Association for Computational Linguistics and Telugu. In addition to string normaliza- Murikinati et al., 2020), named entity recogni- tion and well-formedness processing, the library tion (Huang et al., 2019) and part-of-speech tag- also includes utilities for the deterministic and re- ging (Cardenas et al., 2019). versible romanization of these scripts, i.e., translit- Similar to the software mentioned above, our li- eration from each script to and from the Latin brary does provide romanization, but unlike some script (Wellisch, 1978). While the resulting roman- of the packages, such as URoman, we guarantee izations are standardized in a way that may or may reversibility from Latin back to the native script. not correspond to how native speakers tend to ro- Similar to others we do not focus on faithful in- manize the text in informal communication (see, vertible transliteration of named entities which e.g., Roark et al., 2020), such a default romaniza- typically requires model-based approaches (Se- tion can permit easy inspection of an approximate quiera et al., 2014). Unlike the IndicNLP pack- version of the linguistic strings for those who read age, our software does not provide morphologi- the Latin script but not the specific Brahmic script cal analysis, but instead offers significantly richer being examined. script normalization capabilities than other pack- As a whole, the library provides important utili- ages. These capabilities are functionally sepa- ties for language processing applications of South rated into normalization to Normalization Form Asian languages using Brahmic scripts. The de- C (NFC) and visual normalization. Additionally, sign is based on the observation that, while there our library provides extensive script-specific well- are considerable superficial differences between formedness grammars. Finally, in contrast to these these scripts, they follow the same encoding model other approaches, grammars in our library are in Unicode, and maintain a very similar char- maintained separately from the code for compila- acter repertoire having evolved from the same tion and application, allowing for maintenance of source — the Brāhmī script (Salomon, 1996; Fe- existing scripts and languages plus extension to dorova, 2012). This observation lends itself to the new ones without having to modify any code. This script-agnostic design (outlined in §4) that, unlike is particularly important given that Unicode stan- other approaches reviewed in §2, is based on the dards do change over time and there remain many weighted finite state transducer (WFST) formal- languages left to cover. ism (Mohri, 2004). The details of our system are To the best of our knowledge this is the first provided in §5. publicly available general finite-state grammar ap- proach for low-level processing of multiple Brah- 2 Related Work mic scripts since the early formal syntactic work by Datta (1984) and is the first such library de- The computational processing of Brahmic scripts signed based on an observation by Sproat (2003) is not a new topic, with the first applications that the fundamental organizing principles of the dating back to the early formal syntactic work Brahmic scripts can be algebraically formalized. by Datta (1984). With an increased focus on the In particular, all the core components of our li- South Asian languages within the NLP commu- brary (inverse romanization, normalization and nity, facilitated by advances in machine learning well-formedness) are compactly and efficiently and the increased availability of relevant corpora, represented as finite state transducers. Such for- multiple script processing solutions have emerged. malization lends itself particularly well to run-time Some of these toolkits, such as statistical ma- or offline integration with any finite state process- chine translation-based Brahmi-Net (Kunchukut- ing pipeline, such as decoder components of in- tan et al., 2015), are model-based, while oth- put methods (Ouyang et al., 2017; Hellsten et al., ers, such as URoman (Hermjakob et al., 2018), 2017), text normalization for automatic speech IndicNLP (Kunchukuttan, 2020), and Akshar- recognition and text-to-speech synthesis (Zhang mukha (Rajan, 2020), employ rules. The main fo- et al., 2019), among other natural language and cus of these libraries is script conversion and ro- speech applications. manization. In this capacity they were success- fully employed in diverse downstream multilin- 3 Brahmic Scripts: An Overview gual NLP tasks such as neural machine transla- tion (Zhang et al., 2020; Amrhein and Sennrich, The scripts of interest have evolved from the an- 2020), morphological analysis (Hauer et al., 2019; cient Brāhmī writing system that was recorded 15 Name Id iv dv c co Visual Legacy sequence NFC normalized Bengali BENG 16 13 43 5 ऩ NA NUKTA (U+0928 U+093C) NNNA (U+0929) Devanagari DEVA 19 17 45 4 क़ QA (U+0958) KA NUKTA (U+0915 U+093C) Gujarati GUJR 16 15 39 5 Gurmukhi GURU 12 9 39 8 Table 2: NFC examples for Devanagari. Kannada KNDA 17 15 39 3 Malayalam MLYM 16 16 38 10 bols, like anusvara, chandrabindu, and visarga, Oriya ORYA 14 13 38 5 Sinhala SINH 18 17 41 2 which modify the aksarạ as a whole (and follow Tamil TAML 12 11 27 1 and vowel signs in the memory representation). Fi- Telugu TELU 16 15 38 5 nally, there is a class of special characters, such as the religious symbol Om ॐ, that behave like inde- Table 1: Sizes of core graphemic classes: Independent pendent aksara.̣ 4 vowels (iv), dependent vowel diacritics (dv), conso- nants ( ), coda symbols ( ). c co Unicode Normalization Unicode defines sev- from the 3rd century BCE and fell out of use eral normalization forms which are used for check- by the 5th century CE (Salomon, 1996; Strauch, ing whether the two Unicode strings are equiv- 2012; Fedorova, 2012). The main unit of lin- alent to each other (Unicode Consortium, 2019).