A Method for the Affixed Word Transliteration to the Balinese Script on the Learning Web Application

Total Page:16

File Type:pdf, Size:1020Kb

A Method for the Affixed Word Transliteration to the Balinese Script on the Learning Web Application Turkish Journal of Computer and Mathematics Education Vol.12 No.6 (2021), 2849-2857 Research Article A Method for the Affixed Word Transliteration to the Balinese Script on the Learning Web Application G. Indrawan1, I K. Paramarta2, D. P. Ramendra3, I G. A. Gunadi1, Sariyasa1 1Computer Science Department 2Balinese Language Education Department 3English Language Education Department Universitas Pendidikan Ganesha (Undiksha), Singaraja, Bali, Indonesia 81116 Abstract: This research proposed a method for the affixed word transliteration to the Balinese Script since there has not been studied yet and it is important since the affixed word needs to be transliterated, inevitably. This research is one of the efforts to preserve digitally the endangered Balinese local language knowledge in Indonesia through the multi-discipline collaboration between Computer Science and Language discipline. The proposed method was taken care of two related aspects, i.e.; (1) A Latin root word has its related Balinese Script root word by using default or special transliteration rule; and (2) A Latin root word with a special transliteration rule for its Balinese Script root word, also need a special transliteration rule for its affixed word. This study was conducted on the pioneering web-based transliteration learning application, BaliScript, that receives the Latin text input and outputs the Balinese Script by using the Noto Serif Balinese font with its dedicated Balinese Unicode. Through the experiment, the proposed method gave the expected transliteration results, added a certain perspective, and strengthened the transliteration knowledge. Future work is to enhance and reuse this method on the mobile computing device, as a part of the Balinese Language ubiquitous learning that supports Balinese Language education, which is a mandatory local subject from the elementary school to the high school in Bali Province. Keywords: affixed word, Balinese Script, transliteration, learning web 1. Introduction The Balinese Script as an endangered local language (G. Indrawan, Paramarta, Agustini, & Sariyasa, 2018) in Indonesia affects its Latin-to-Balinese Script transliteration knowledge, as part of the transliteration research in general (Karimi, Scholer, & Turpin, 2011; Kaur & Singh, 2014). The preservation effort by the Bali Government has already been conducted through the Bali Governor Circular Letter (Bali Government, 2018) that follows up the Bali Governor Regulation (Bali Government, 2019). Also, from the previous preservation regulation, the Balinese Language (including all aspects of the Balinese Script) was considered and running as a mandatory local subject from elementary school to senior high school in Bali Province. As the preservation effort of the language can be done by that governmental/political approach, multiple approaches should strengthen it and should have a greater impact. This research joined the effort through the technological approach, through the multi-discipline collaboration between Computer Science and Language discipline, by proposing a method for the affixed word transliteration to the Balinese Script since there has not been studied yet and it is important since the affixed word needs to be transliterated, inevitably. Moreover, this study was conducted on the pioneering web-based transliteration learning application, BaliScript, as part of the planned Balinese Language ubiquitous learning that supports Balinese Language education. The proposed method can be reused in the future for the mobile application as part of ubiquitous learning to give the transliteration knowledge for the next generations to come. As known, ubiquitous learning improves learning by providing learners with content and interaction anytime and anywhere (Hwang, Tsai, & Yang, 2008) by mobile and embedded computers and wireless networks in everyday life (Ogata, Matsuka, Bishouty, & Yano, 2009). This research proposes an unexposed method for the affixed word transliteration to the Balinese Script based on the supporting computer font Noto Serif Balinese (NSB). Also, this work advances the previous authors‟ work on the transliteration process by: (1) accommodating the rule from the Balinese Language, Script, and Literature Advisory Agency (Anom et al., 2009). This Bali Province government agency (Bali Government, 1992) carries out guidance and formulates programs for the maintenance, study, development, and preservation of the Balinese Language, Script, and Literature; (2) accommodating seventeen types of special words (G. Indrawan, Paramarta, & Agustini, 2019; G. Indrawan, Paramarta, et al., 2018) by using a particular table structure in the database which is avoiding hard-coded special word repositories in the learning application; (3) utilizing the Noto Serif Balinese font (Google, 2020b; The Unicode Consortium, 2020b, 2020a) with more developed and less bug than the Noto Sans Balinese font (Google, 2020a). Also, it uses dedicated Balinese __________________________________________________________________________________ 2849 Turkish Journal of Computer and Mathematics Education Vol.12 No.6 (2021), 2849-2857 Research Article Unicode as a standard Balinese Script font on the computer system including mobile devices (so the method can be reused on the mobile application); and (4) maximizing the learning application, that used this method, for the delivery of the knowledge in the area of transliteration and also translation, at the same time, specifically for the Balinese Script, Indonesian, and English (see the next Figure 2). Overall, all of those advances were considered as the contribution of this work. This paper is organized into several sections, i.e.: Section 1 (Introduction) states the problem background related to the Balinese Script in general, and the Latin-to-Balinese Script transliteration in specific; Section 2 (Related Work) describes the related works in the area of the Latin-to-Balinese Script transliteration; Section 3 (Research Method) exposes the supporting transliteration computer font (which is the NSB font) with its dedicated Balinese Unicode, and the supporting rule-based algorithm for the affixed word transliteration to the Balinese Script; Section 4 (Result and Discussion) covers the analysis of the testing result; and finally, Section 5 (Conclusion and Future Work) consists of some important conclusion and future work points. 2. Related Work As the Balinese Script adheres with its scriptio-continua writing style like its relatively close (geographically and culturally) Javanese Script (Widiarti & Pulungan, 2020), several related works on Latin-to- Balinese Script transliteration were conducted on the previous author‟s works (Crisnapati et al., 2019; G. Indrawan, Dantes, Aryanto, & Paramarta, 2020; G. Indrawan, Gunadi, Gitakarma, & Paramarta, 2021; G. Indrawan, Paramarta, et al., 2019; G. Indrawan, Puspita, Paramarta, & Sariyasa, 2018; G. Indrawan, Sariyasa, & Paramarta, 2019; G. Indrawan, Setemen, Sutaya, & Paramarta, 2020; G. Indrawan, Swastika, Sariyasa, & Paramarta, 2020; Loekito, Indrawan, Sariyasa, & Paramarta, 2020). Various non-dedicated Balinese Unicode fonts (i.e. Bali Simbar (Suatjana, 1999) and Bali Simbar Dwijendra (Suatjana, 2009)) and dedicated Balinese Unicode font (The Unicode Consortium, 2020b, 2020a) (i.e. Noto Sans Balinese (Google, 2020a)) were used for displaying Balinese Script output on those previous research. A more developed dedicated Balinese Unicode font was found on Noto Serif Balinese font (Google, 2020b) rather than Noto Sans Balinese font. Bali Simbar font was used in (G. Indrawan, Sariyasa, et al., 2019) with a relatively good accuracy result on testing cases from The Balinese Alphabet document (Sudewa, 2003). It was also used on the developed robotic system that writing Balinese Script from Latin text input (Crisnapati et al., 2019), and on the transliteration line-breaking handling exploration (G. Indrawan, Setemen, et al., 2020). Bali Simbar Dwijendra (BSD) font, as the improvement of Bali Simbar (BS) font, was used in (G. Indrawan, Swastika, et al., 2020) with additional testing cases from the Balinese Script dictionary (Anom et al., 2009), in addition to the same testing cases on (G. Indrawan, Sariyasa, et al., 2019). It was also used on the transliteration exploration of the mathematical expression (G. Indrawan, Dantes, et al., 2020). Ten transliteration lessons were also learned by using this font on the extended testing data (G. Indrawan et al., 2021) other than the initial established testing data (G. Indrawan, Paramarta, et al., 2019; G. Indrawan, Sariyasa, et al., 2019). Noto Sans Balinese font was used in (G. Indrawan, Paramarta, et al., 2019) with the same testing cases in (G. Indrawan, Sariyasa, et al., 2019) and gave a relatively good accuracy result. It was also used on the developed robotic system that writing Balinese Script from Latin text input (G. Indrawan, Puspita, et al., 2018). Extensive accuracy analysis on the developed algorithm (G. Indrawan, Paramarta, et al., 2019) was conducted in (Loekito et al., 2020) for improvement in the future. For Balinese Script preservation effort and learning, both sides of transliteration knowledge need to be explored from a technological point of view for digitalization. Ref. (G. Indrawan, Ariawan, Agustini, & Paramarta, 2020) was related to the Balinese Script-to-Latin transliteration along with the other authors‟ work (Gede Indrawan, Gunadi, & Paramarta, 2020) that utilizing GNU Optical Character Recognition (OCR), i.e. Ocrad (Antonio
Recommended publications
  • A Method to Accommodate Backward Compatibility on the Learning Application-Based Transliteration to the Balinese Script
    (IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 12, No. 6, 2021 A Method to Accommodate Backward Compatibility on the Learning Application-based Transliteration to the Balinese Script 1 3 4 Gede Indrawan , I Gede Nurhayata , Sariyasa I Ketut Paramarta2 Department of Computer Science Department of Balinese Language Education Universitas Pendidikan Ganesha (Undiksha) Universitas Pendidikan Ganesha (Undiksha) Singaraja, Indonesia Singaraja, Indonesia Abstract—This research proposed a method to accommodate transliteration rules (for short, the older rules) from The backward compatibility on the learning application-based Balinese Alphabet document 1 . It exposes the backward transliteration to the Balinese Script. The objective is to compatibility method to accommodate the standard accommodate the standard transliteration rules from the transliteration rules (for short, the standard rules) from the Balinese Language, Script, and Literature Advisory Agency. It is Balinese Language, Script, and Literature Advisory Agency considered as the main contribution since there has not been a [7]. This Bali Province government agency [4] carries out workaround in this research area. This multi-discipline guidance and formulates programs for the maintenance, study, collaboration work is one of the efforts to preserve digitally the development, and preservation of the Balinese Language, endangered Balinese local language knowledge in Indonesia. The Script, and Literature. proposed method covered two aspects, i.e. (1) Its backward compatibility allows for interoperability at a certain level with This study was conducted on the developed web-based the older transliteration rules; and (2) Breaking backward transliteration learning application, BaliScript, for further compatibility at a certain level is unavoidable since, for the same ubiquitous Balinese Language learning since the proposed aspect, there is a contradictory treatment between the standard method reusable for the mobile application [8], [9].
    [Show full text]
  • Improvement Accuracy of Recognition Isolated Balinese Characters with Deep Convolution Neural Network
    Journal of Applied Intelligent System (e-ISSN : 2502-9401 | p-ISSN : 2503-0493) Vol. 4 No. 1, 2019, pp. 22 – 27 Improvement Accuracy of Recognition Isolated Balinese Characters with Deep Convolution Neural Network Ida Bagus Teguh Teja Murti*1 Universitas Pendidikan Ganesha, Jalan Udayana No 11 Singaraja Bali 8116, (+62362) 22570 E-mail : [email protected]*1 *Corresponding author Abstract - The numbers of Balinese script and the low quality of palm leaf manuscripts provide a challenge for testing and evaluation for character recognition. The aim of high accuracy for character recognition of Balinese script,we implementation Deep Convolution Neural Network using SmallerVGG (Visual Geometry Group) Architectur for character recognition on palm leaf manuscripts. We evaluated the performance that methods and we get accuracy 87,23% . Keywords - Classfication, Deep Neural Network, Balinese Characters, Visual Geometry Group 1. INTRODUCTION Isolated handwritten character recognition (IHCR) has been the subject of vigorous research in recent years. Some populer methods in this case are CNN for recognition numbers in the MNIST digits image database [1]⁠ . There are challenges in character recognition in some scripts, such as Chinese characters [2]⁠ and Balinese characters [3]⁠ , which have more characters than numbers in MNIST. Beside numbers character there is issue in recognition character in document made palm leaf manuscript that is the condition of document has been degenerated. Therefore the researcher must carry out the process of digitizing the palm leaf manuscript, automatic analysis and the indexing system of the script simultaneously. The activity goal is bring added value to digital palm leaf manuscripts by developing tools for analyzing, developing and accessing quickly and efficiently script content.
    [Show full text]
  • Iso/Iec Jtc1/Sc2/Wg2 N3022 L2/06-002
    ISO/IEC JTC1/SC2/WG2 N3022 L2/06-002 2006-01-09 Universal Multiple-Octet Coded Character Set International Organization for Standardization Organisation Internationale de Normalisation Международная организация по стандартизации Doc Type: Working Group Document Title: Proposal for encoding the Sundanese script in the BMP of the UCS Source: Michael Everson Status: Individual Contribution Action: For consideration by JTC1/SC2/WG2 and UTC Date: 2006-01-09 The Sundanese script, or aksara Sunda, is used for writing the Sundanese language, one of the languages of the island of Java in Indonesia. It is a descendent of the ancient Brahmi script of India, and so has many similarities with modern scripts of South Asia and Southeast Asia which are also members of that family. The script has official support and is taught in schools and is used on road signs. The SIL Ethnologue indicates that there are 27,000,000 Sundanese. Sundanese has been written in a number of scripts. Pallawa or Pra-Nagari was first used in West Java to write Sanskrit from the fifth to eighth centuries CE, and from Pallawa was derived Sunda Kuna or Old Sundanese which was used in the Sunda Kingdom from the 14th to 18th centuries. Both Javanese and Arabic script were used from the 17th to 19th centuries and the 17th to the mid-20th centuries respectively. Latin script has had currency since the 20th century. The modern Sundanese script, called Sunda Baku or Official Sundanese, was made official in 1996. The modern script itself was derived from Old Sundanese, the earliest example of which is the Prasasti Kawali stone (see Figure 1).
    [Show full text]
  • 1 Brahmi Word Recognition by Supervised Techniques
    Preprints (www.preprints.org) | NOT PEER-REVIEWED | Posted: 5 June 2020 doi:10.20944/preprints202006.0048.v1 Brahmi word recognition by supervised techniques Neha Gautam 1*, Soo See Chai1, Sadia Afrin2, Jais Jose3 1 Faculty of Computer Science and Information Technology, University Malaysia Sarawak 2 Institute of Cognitive Science, Universität Osnabrück 3Amity University, Noida, India *[email protected] [email protected] Abstract: Significant progress has made in pattern recognition technology. However, one obstacle that has not yet overcome is the recognition of words in the Brahmi script, specifically the recognition of characters, compound characters, and word because of complex structure. For this kind of complex pattern recognition problem, it is always difficult to decide which feature extraction and classifier would be the best choice. Moreover, it is also true that different feature extraction and classifiers offer complementary information about the patterns to be classified. Therefore, combining feature extraction and classifiers, in an intelligent way, can be beneficial compared to using any single feature extraction. This study proposed the combination of HOG +zonal density with SVM to recognize the Brahmi words. Keeping these facts in mind, in this paper, information provided by structural and statistical based features are combined using SVM classifier for script recognition (word-level) purpose from the Brahmi words images. Brahmi word dataset contains 6,475 and 536 images of Brahmi words of 170 classes for the training and testing, respectively, and the database is made freely available. The word samples from the mentioned database are classified based on the confidence scores provided by support vector machine (SVM) classifier while HOG and zonal density use to extract the features of Brahmi words.
    [Show full text]
  • Introduction to Unicode and Beyond Mike Mckenna Mimckenna(At)Paypal.Com Craig Cummings I18ncraig(At)Gmail.Com Tex Texin Textexin(At)Xencraft.Com V.1.4 November, 2016
    Introduction to Unicode and Beyond Mike McKenna mimckenna(at)paypal.com Craig Cummings i18ncraig(at)gmail.com Tex Texin textexin(at)xencraft.com v.1.4 November, 2016 Yahoo! C onfidential Agenda • Brief Background • Characters – Structure – Properties – Encoding • Key Specifications for Internationalization • Unicode in the Real World • How Unicode is Evolving Unicode Tutorial Inter nationaliz ation and U nic ode C onfer enc e 40 – Santa C lar a – November 2016 2 40th Internationalization and Unicode Conference ‹#› Santa Clara, November 2016 Unicode in Brief Yahoo! C onfidential Unicode - Summary • Universal Character Set replaces – ASCII – 8-bit – Double-byte and some multibyte • Encompasses over 240 known coded character sets in use today • Covers virtually all modern business languages • Additional archaic and academic scripts • Integrated with ISO 10646 as BMP • Information: http://www.unicode.org Unicode Tutorial Inter nationaliz ation and U nic ode C onfer enc e 40 – Santa C lar a – November 2016 4 40th Internationalization and Unicode Conference ‹#› Santa Clara, November 2016 History: Character Sets • Evolved when and where the need arose • Developed for stand-alone applications • Many developed their own • Example: 3270 nightmare – A different encoding for each European country! • Now: – Over 250 character sets in use around the world – Conversion? Any to any: (n)(n-1) > 62,000 Unicode Tutorial Inter nationaliz ation and U nic ode C onfer enc e 40 – Santa C lar a – November 2016 5 Solution: Unicode • ISO 10646 - draft – Make everyone happy? • Each standard with own 16-bit plane – Big mess! Not feasible • Joe Becker/Xerox and XCCS • Project began 1988 • Industry Consortium formed 1991 – Xerox and Apple initially • ISO 10646 merged with Unicode 1992 Unicode Tutorial Inter nationaliz ation and U nic ode C onfer enc e 40 – Santa C lar a – November 2016 6 40th Internationalization and Unicode Conference ‹#› Santa Clara, November 2016 Some terminology … • A “glyph” is a single visual unit of text.
    [Show full text]
  • Balinese Latin Text Becomes Aksara Bali Using Rule Base Method
    I Putu Agus Eka Darma Udayana et. al., International Journal of Research in IT, Management and Engineering, ISSN 2249-1619, Impact Factor: 6.123, Volume 07 Issue 05, May 2017, Page 1-7 Balinese Latin Text Becomes Aksara Bali Using Rule Base Method I Putu Agus Eka Darma Udayana1, Made Sudarma2, and I Nyoman Satya Kumara3 1(Magister Program of Electrical and Computer Engineering, Udayana University Graduate Program, Sudirman Denpasar, Bali-Indonesia) 2,3(Department of Electrical and Computer Engineering, Faculty of Engineering, Udayana University Jimbaran Campus, Bali-Indonesia) Abstract: Lontar is one of the cultural heritage which has information about history of Balinese civilization in the past away. Problems encountered today are that the lontar are not well maintained. While, the lontar that used as letter of Balinese alphabet will be worn out soon, because it doesn’t have good endurance for long time. The exploiting of technology is one of media that can be used as a solution to solve the problems. The digitalization process of letter Balinese alphabet can be done by rewrite the script of that lontar in Balinese alphabet by using translation script with Rule Base and Levenshtein Distance Approach. The exploiting of technology will make the lontar becoming digital form and it won’t be worn out when the lontar is kept safe for long time and information that consisted in lontar can be protected for long time. Keywords: Translitation, rule base, levenshtein distance, aksara bali. I. INTRODUCTION Aksara Bali (Balinese Alphabet) is a symbol visual system that showed in a media, it has function to uncover the elements that expressing a language(Sudarma et al., 2016).
    [Show full text]
  • Association Relative À La Télévision Européenne G.E.I.E. Gtld: .Arte Status: ICANN Review Status Date: 2015-10-27 17:43:58 Print Date: 2015-10-27 17:44:13
    ICANN Registry Request Service Ticket ID: S1K9W-8K6J2 Registry Name: Association Relative à la Télévision Européenne G.E.I.E. gTLD: .arte Status: ICANN Review Status Date: 2015-10-27 17:43:58 Print Date: 2015-10-27 17:44:13 Proposed Service Name of Proposed Service: Technical description of Proposed Service: ARTE registry operator, Association Relative à la Télévision Européenne G.E.I.E. is applying to add IDN domain registration services. ARTE IDN domain name registration services will be fully compliant with IDNA 2008, as well as ICANN's Guidelines for implementation of IDNs. The language tables are submitted separately via the GDD portal together with IDN policies. ARTE is a brand TLD, as defined by the Specification 13, and as such only the registry and its affiliates are eligible to register .ARTE domain names. The full list of languages/scripts is: Azerbaijani language Belarusian language Bulgarian language Chinese language Croatian language French language Greek, Modern language Japanese language Korean language Kurdish language Macedonian language Moldavian language Polish language Russian language Serbian language Spanish language Swedish language Ukrainian language Arabic script Armenian script Avestan script Balinese script Bamum script Batak script Page 1 ICANN Registry Request Service Ticket ID: S1K9W-8K6J2 Registry Name: Association Relative à la Télévision Européenne G.E.I.E. gTLD: .arte Status: ICANN Review Status Date: 2015-10-27 17:43:58 Print Date: 2015-10-27 17:44:13 Bengali script Bopomofo script Brahmi script Buginese
    [Show full text]
  • Amendment No. 2 to Registry Agreement
    Amendment No. 2 to Registry Agreement The Internet Corporation for Assigned Names and Numbers and Infibeam Incorporation Limited agree, effective as of _______________________________ (“Amendment No. 2 Effective Date”), that the modification set forth in this amendment No. 2 (the “Amendment”) is made to the 09 January 2014 .OOO Registry Agreement between the parties, as amended (the “Agreement”). The parties hereby agree to amend Exhibit A of the Agreement by deleting section 5.3 in its entirety: [OLD TEXT] “5.3 Registry Operator may offer registration of IDNs in the following languages/scripts (IDN Tables and IDN Registration Rules will be published by the Registry Operator as specified in the ICANN IDN Implementation Guidelines): 5.3.1. Arabic script 5.3.2. Armenian script 5.3.3. Avestan script 5.3.4. Azerbaijani language 5.3.5. Balinese script 5.3.6. Bamum script 5.3.7. Batak script 5.3.8. Belarusian language 5.3.9. Bengali script 5.3.10. Bopomofo script 5.3.11. Brahmi script 5.3.12. Buginese script 5.3.13. Buhid script 5.3.14. Bulgarian language 5.3.15. Canadian Aboriginal script 5.3.16. Carian script 5.3.17. Cham script 5.3.18. Cherokee script 5.3.19. Chinese language 5.3.20. Coptic script 5.3.21. Croatian language 5.3.22. Cuneiform script 5.3.23. Cyrillic script 5.3.24. Devanagari script 5.3.25. Egyptian Hieroglyphs script 5.3.26. Ethiopic script 5.3.27. French language 5.3.28. Georgian script 5.3.29. Glagolitic script 5.3.30.
    [Show full text]
  • Hindi Transliteration and Translation of Sanskrit-Old
    __________________________________ Journal of Religious Culture Journal für Religionskultur Ed. by / Hrsg. von Edmund Weber in Association with / in Zusammenarbeit mit Matthias Benad, Mustafa Cimsit, Natalia Diefenbach, Martin Mittwede, Vladislav Serikov, Ajit S. Sikand, Ida Bagus Putu Suamba & Roger Töpelmann Goethe-Universität Frankfurt am Main in Cooperation with the Institute for Religious Peace Research / in Kooperation mit dem Institut für Wissenschaftliche Irenik Assistant Editor/ Redaktionsassistentin: Susan Stephanie Tsomakaeva ISSN 1434-5935 - © E.Weber – E-mail: [email protected]; [email protected] ; irenik.org/publikationen/jrc; http://publikationen.ub.uni-frankfurt.de/solrsearch/index/search/searchtype/series/id/16137; http://web.uni-frankfurt.de/irenik/ew.htm; http://irenik.org/; http://www.wissenschaftliche-irenik.org/ ____________________________________ No. 237 (2018) Hindi Transliteration and Translation of Sanskrit-Old Javanese Texts: An attempt to introduce Indian cultural influences abroad to native India1 By Ida Bagus Putu Suamba* 2 Abstract Strong imprints of Indian culture in various forms or modes of expressions are significantly found in Java. Sanskrit-Old Javanese texts, amongst those texts and traditions, were produced in the island in the periods between 9th to 15th cen. A.D. It covers various genres and subjects enriching indigenous culture in the archipelago. Tutur or tattva texts were one genre of them recorded the dynamic of Javanese intellectuals or poet-sages in pursuing the truth; they reveal metaphysical or theological aspects of Brahmanism, Saivism, Buddhism, Tantrism, Samkhya, Yoga, etc. This paper attempts to study ideas behind the Hindi transliteration and translation of those texts. This is a library reserach, the data were collected from Hindi translation of those texts.
    [Show full text]
  • Benchmarking of Document Image Analysis Tasks for Palm Leaf Manuscripts from Southeast Asia
    Journal of Imaging Article Benchmarking of Document Image Analysis Tasks for Palm Leaf Manuscripts from Southeast Asia Made Windu Antara Kesiman 1,2,* ID , Dona Valy 3,4, Jean-Christophe Burie 1, Erick Paulus 5, Mira Suryani 5, Setiawan Hadi 5, Michel Verleysen 2, Sophea Chhun 4 and Jean-Marc Ogier 1 1 Laboratoire Informatique Image Interaction (L3i), Université de La Rochelle, 17042 La Rochelle, France; [email protected] (J.-C.B.); [email protected] (J.-M.O.) 2 Laboratory of Cultural Informatics (LCI), Universitas Pendidikan Ganesha, Singaraja, Bali 81116, Indonesia; [email protected] 3 Institute of Information and Communication Technologies, Electronic, and Applied Mathematics (ICTEAM), Université Catholique de Louvain, 1348 Louvain-la-Neuve, Belgium; [email protected] 4 Department of Information and Communication Engineering, Institute of Technology of Cambodia, Phnom Penh, Cambodia; [email protected] 5 Department of Computer Science, Universitas Padjadjaran, Bandung 45363, Indonesia; [email protected] (E.P.); [email protected] (M.S.); [email protected] (S.H.) * Correspondence: [email protected] Received: 15 December 2017; Accepted: 18 February 2018; Published: 22 February 2018 Abstract: This paper presents a comprehensive test of the principal tasks in document image analysis (DIA), starting with binarization, text line segmentation, and isolated character/glyph recognition, and continuing on to word recognition and transliteration for a new and challenging collection of palm leaf manuscripts from Southeast Asia. This research presents and is performed on a complete dataset collection of Southeast Asian palm leaf manuscripts. It contains three different scripts: Khmer script from Cambodia, and Balinese script and Sundanese script from Indonesia.
    [Show full text]
  • Unicode Character Properties
    Unicode character properties Document #: P1628R0 Date: 2019-06-17 Project: Programming Language C++ Audience: SG-16, LEWG Reply-to: Corentin Jabot <[email protected]> 1 Abstract We propose an API to query the properties of Unicode characters as specified by the Unicode Standard and several Unicode Technical Reports. 2 Motivation This API can be used as a foundation for various Unicode algorithms and Unicode facilities such as Unicode-aware regular expressions. Being able to query the properties of Unicode characters is important for any application hoping to correctly handle any textual content, including compilers and parsers, text editors, graphical applications, databases, messaging applications, etc. static_assert(uni::cp_script('C') == uni::script::latin); static_assert(uni::cp_block(U'[ ') == uni::block::misc_pictographs); static_assert(!uni::cp_is<uni::property::xid_start>('1')); static_assert(uni::cp_is<uni::property::xid_continue>('1')); static_assert(uni::cp_age(U'[ ') == uni::version::v10_0); static_assert(uni::cp_is<uni::property::alphabetic>(U'ß')); static_assert(uni::cp_category(U'∩') == uni::category::sm); static_assert(uni::cp_is<uni::category::lowercase_letter>('a')); static_assert(uni::cp_is<uni::category::letter>('a')); 3 Design Consideration 3.1 constexpr An important design decision of this proposal is that it is fully constexpr. Notably, the presented design allows an implementation to only link the Unicode tables that are actually used by a program. This can reduce considerably the size requirements of an Unicode-aware executable as most applications often depend on a small subset of the Unicode properties. While the complete 1 Unicode database has a substantial memory footprint, developers should not pay for the table they don’t use. It also ensures that developers can enforce a specific version of the Unicode Database at compile time and get a consistent and predictable run-time behavior.
    [Show full text]
  • Universal Scripts Project: Statement of Significance and Impact
    Universal Scripts Project: Statement of Significance and Impact The Universal Scripts Project expands the capabilities of the Internet by providing digital access to text materials from a variety of modern and historical cultures whose writing systems are not currently included in the international standard for electronic representation of scripts, known as Unicode. People who write in these scripts find it difficult to use email, compose and send documents electronically, and post documents on the World Wide Web, without relying on nonstandard fonts or other cumbersome workarounds, and are therefore left out of the “technological revolution.” About 66 scripts are currently included in the Unicode standard, but over 80 are not. Some 40 of these missing scripts belong to modern linguistic minorities in Africa, the Indian subcontinent, China, and other countries in Southeast Asia; about 40 are scripts of historical importance. The project’s goal for 2007–2008 is to provide the standards bodies overseeing character sets with proposals for 15 scripts to be included in the Unicode standard. The scripts selected for inclusion include 9 modern minority scripts and 6 historical scripts. The need is urgent, because the entire process, from first proposal to acceptance, typically takes from 2 to 5 years, and support among corporations and national bodies for adding more scripts to Unicode is uncertain. If the proposals are not submitted soon, these user communities will not be able to use their scripts in the near future. The scripts selected for this grant have established scholarly and user-community connections, which will help guarantee that the proposals meet the users' needs.
    [Show full text]