Optical Character Recognition System for Baybayin Scripts Using Support Vector Machine
Total Page:16
File Type:pdf, Size:1020Kb
Load more
Recommended publications
-
Schwa Deletion: Investigating Improved Approach for Text-To-IPA System for Shiri Guru Granth Sahib
ISSN (Online) 2278-1021 ISSN (Print) 2319-5940 International Journal of Advanced Research in Computer and Communication Engineering Vol. 4, Issue 4, April 2015 Schwa Deletion: Investigating Improved Approach for Text-to-IPA System for Shiri Guru Granth Sahib Sandeep Kaur1, Dr. Amitoj Singh2 Pursuing M.E, CSE, Chitkara University, India 1 Associate Director, CSE, Chitkara University , India 2 Abstract: Punjabi (Omniglot) is an interesting language for more than one reasons. This is the only living Indo- Europen language which is a fully tonal language. Punjabi language is an abugida writing system, with each consonant having an inherent vowel, SCHWA sound. This sound is modifiable using vowel symbols attached to consonant bearing the vowel. Shri Guru Granth Sahib is a voluminous text of 1430 pages with 511,874 words, 1,720,345 characters, and 28,534 lines and contains hymns of 36 composers written in twenty-two languages in Gurmukhi script (Lal). In addition to text being in form of hymns and coming from so many composers belonging to different languages, what makes the language of Shri Guru Granth Sahib even more different from contemporary Punjabi. The task of developing an accurate Letter-to-Sound system is made difficult due to two further reasons: 1. Punjabi being the only tonal language 2. Historical and Cultural circumstance/period of writings in terms of historical and religious nature of text and use of words from multiple languages and non-native phonemes. The handling of schwa deletion is of great concern for development of accurate/ near perfect system, the presented work intend to report the state-of-the-art in terms of schwa deletion for Indian languages, in general and for Gurmukhi Punjabi, in particular. -
A Comparative Study of Shan and Standard Thai Morphology
A COMPARATIVE STUDY OF SHAN AND STANDARD THAI MORPHOLOGY Kittisara A Thesis Submitted in Partial Fulfilment of the Requirements for the Degree of Master of Arts (Linguistics) Graduate School Mahachulalongkornrajavidayalaya University C.E. 2018 A Comparative Study of Shan and Standard Thai Morphology Kittisara A Thesis Submitted in Partial Fulfilment of the Requirements for the Degree of Master of Arts (Linguistics) Graduate School Mahachulalongkornrajavidayalaya University C.E. 2018 (Copyright by Mahachulalongkornrajavidyalaya University) i Thesis Title : A Comparative Study of Shan and Standard Thai Morphology Researcher : Kittisara Degree : Master of Arts in Linguistics Thesis Supervisory Committee : Assoc. Prof. Nilratana Klinchan B.A. (English), M.A. (Political Science) : Asst. Prof. Dr. Phramaha Suriya Varamedhi B.A. (Philosophy), M.A. (Linguistics), Ph.D. (Linguistics) Date of Graduation : March 19, 2019 Abstract The purpose of this research is to explore the comparative study of Shan and standard Thai Morphology. The objectives of the study are classified into three parts as the following; (1) To study morpheme of Shan and standard Thai, (2) To study the word-formation of Shan and standard Thai and (3) To compare the morpheme and word-classes of Shan and standard Thai. This research is the qualitative research. The population referred to this research, researcher selects Shan people who were born at Tachileik in Shan state consisting of 6 persons. Area of research is Shan people at Tachileik in Shan state union of Myanmar. Research method, the tool used in the research, the researcher makes interview and document research. The main important parts in this study based on content analysis as documentary research by selecting primary sources from the books, academic books, Shan dictionary, Thai dictionary, library, online research and the research studied from informants' native speakers for 6 persons. -
Finite-State Script Normalization and Processing Utilities: the Nisaba Brahmic Library
Finite-state script normalization and processing utilities: The Nisaba Brahmic library Cibu Johny† Lawrence Wolf-Sonkin‡ Alexander Gutkin† Brian Roark‡ Google Research †United Kingdom and ‡United States {cibu,wolfsonkin,agutkin,roark}@google.com Abstract In addition to such normalization issues, some scripts also have well-formedness constraints, i.e., This paper presents an open-source library for efficient low-level processing of ten ma- not all strings of Unicode characters from a single jor South Asian Brahmic scripts. The library script correspond to a valid (i.e., legible) grapheme provides a flexible and extensible framework sequence in the script. Such constraints do not ap- for supporting crucial operations on Brahmic ply in the basic Latin alphabet, where any permuta- scripts, such as NFC, visual normalization, tion of letters can be rendered as a valid string (e.g., reversible transliteration, and validity checks, for use as an acronym). The Brahmic family of implemented in Python within a finite-state scripts, however, including the Devanagari script transducer formalism. We survey some com- mon Brahmic script issues that may adversely used to write Hindi, Marathi and many other South affect the performance of downstream NLP Asian languages, do have such constraints. These tasks, and provide the rationale for finite-state scripts are alphasyllabaries, meaning that they are design and system implementation details. structured around orthographic syllables (aksara)̣ as the basic unit.1 One or more Unicode characters 1 Introduction combine when rendering one of thousands of leg- The Unicode Standard separates the representation ible aksara,̣ but many combinations do not corre- of text from its specific graphical rendering: text spond to any aksara.̣ Given a token in these scripts, is encoded as a sequence of characters, which, at one may want to (a) normalize it to a canonical presentation time are then collectively rendered form; and (b) check whether it is a well-formed into the appropriate sequence of glyphs for display. -
'14 Oct 28 P6:23
SIXTEENTH CONGRESS Of ) THE REPUBLIC OF THE PHJUPPINES ) Second Regular Session ) '14 OCT 28 P6:23 SENATE S.B. NO. 2440 Introduced by SENATOR LOREN LEGARDA AN ACT DECLARING "BAYBAYIN" AS THE NATIONAL WRITING SYSTEM OF THE PHILIPPINES, PROVIDING FOR ITS PROMOTION, PROTECTION, PRESERVATION AND CONSERVATION, AND FOR OTHER PURPOSES EXPLANATORY NOTE The "Baybayin" is an ancient Philippine system of writing composed of a set of 17 cursive characters or letters which represents either a single consonant or vowel or a complete syllable. It started to spread throughout the country in the 16th century. However, today, there is a shortage of proof of the existence of "baybayin" scripts due to the fact that prior to the arrival of Spaniards in the Philippines, Filipino scribers would engrave mainly on bamboo poles or tree barks and frail materials that decay qUickly. To recognize our traditional writing systems which are objects of national importance and considered as a National Cultural Treasure, the "Baybayin", an indigenous national writing script which will be Tagalog-based national written language and one of the existing native written languages, is hereby declared collectively as the "Baybayin" National Writing Script of the country. Hence, it should be promoted, protected, preserved and conserved for posterity for the next generation. This bill, therefore, seeks to declare "baybayin" as the National Writing Script of the Philippines. It shall mandate the NCCA to promote, protect, preserve and conserve the script through the following (a) -
Recognition of Baybayin Symbols (Ancient Pre-Colonial Philippine Writing System) Using Image Processing
ISSN 2278-3091 Volume 9, No.1, January – February 2020 Mark Jovic A. Daday et al., International Journal of Advanced Trends in Computer Science and Engineering, 9(1), January – February 2020, 594 – 598 International Journal of Advanced Trends in Computer Science and Engineering Available Online at http://www.warse.org/IJATCSE/static/pdf/file/ijatcse83912020.pdf https://doi.org/10.30534/ijatcse/2020/83912020 Recognition of Baybayin Symbols (Ancient Pre-Colonial Philippine Writing System) using Image Processing Mark Jovic A. Daday1, Arnel C. Fajardo2, Ruji P. Medina3 1 Technological Institute of the Philippines, Quezon City, Philippines, [email protected] 2 School of Computer Science Manuel L. Quezon University, Diliman, Quezon City, Philippines, [email protected] 3 Technological Institute of the Philippines, Quezon City, Philippines, [email protected] because of the recognition problems encountered. Due to its ABSTRACT problem faces throughout recognition, computer is incapable to take out the features correctly when scanning them in [4]. The goal of this paper is to accomplish an Optical Character Recognition (OCR) that gives an extremely contribution to Alternatively, unusual modern methods for a several the advancement of technology in terms of image recognition concealed datasets of text and images were faced into more in Machine Learning. The researcher introduces the experiments, that directs to a compilation of multiple type of Feed-Forward Neural Network with Dropout Method font and unusual ruin degree in [5]. The goal of this paper is to (FFNNDM) and Convolutional Neural Network with Dropout accomplish an OCR system for the Baybayin handwriting Method (CNNDM) for the recognition of the Baybayin symbols or so called (“Alibata) an ancient and national symbols. -
Second Language Writing System Word Recognition (With a Focus on Lao)
Second Language Writing System Word Recognition (with a focus on Lao) Christine Elliott University of Wisconsin-Madison Abstract Learning a second language (L2) with a script different from the learner’s first language (L1) presents unique challenges for both stu- dent and teacher. This paper looks at current theory and research examining issues of second language writing system (L2WS) acquisi- tion, particularly issues pertaining to decoding and word recognition1 by adult learners. I argue that the importance of word recognition and decoding in fluent L1 and L2 reading has been overshadowed for several decades by a focus on research looking at top-down reading processes. Although top-down reading processes and strategies are clearly components of successful L2 reading, I argue that more atten- tion needs to be given to bottom-up processing skills, particularly for beginning learners of an L2 that uses a script that is different from their L1. I use the example of learning Lao as a second language writing system where possible and suggest preliminary pedagogical implications. Introduction Second language writing systems have increasingly become the focus of a growing body of research drawing on the fields of psy- chology, education, linguistics, and second language acquisition, among others. The term writing system is used to refer to the ways in which written symbols represent language in a systematic way (Cook and Bassetti, 2005). Further, a writing system can be discussed in terms of both its script and its orthography. Cook and Bassetti de- fine script as the physical implementation of a writing system (i.e. the written symbols) and orthography as “the rules for using a script in a 1 Following Koda (2005), I define word recognition as “the process of extract- ing lexical information from graphic displays of words,” and decoding as the specific process of extracting phonological information. -
Simplified Abugidas
Simplified Abugidas Chenchen Ding, Masao Utiyama, and Eiichiro Sumita Advanced Translation Technology Laboratory, Advanced Speech Translation Research and Development Promotion Center, National Institute of Information and Communications Technology 3-5 Hikaridai, Seika-cho, Soraku-gun, Kyoto, 619-0289, Japan fchenchen.ding, mutiyama, [email protected] phonogram segmental abjad Abstract writing alphabet system An abugida is a writing system where the logogram syllabic abugida consonant letters represent syllables with Figure 1: Hierarchy of writing systems. a default vowel and other vowels are de- noted by diacritics. We investigate the fea- ណូ ន ណណន នួន ននន …ជិតណណន… sibility of recovering the original text writ- /noon/ /naen/ /nuən/ /nein/ vowel machine ten in an abugida after omitting subordi- diacritic omission learning ណន នន methods nate diacritics and merging consonant let- consonant character ters with similar phonetic values. This is merging N N … J T N N … crucial for developing more efficient in- (a) ABUGIDA SIMPLIFICATION (b) RECOVERY put methods by reducing the complexity in abugidas. Four abugidas in the south- Figure 2: Overview of the approach in this study. ern Brahmic family, i.e., Thai, Burmese, ters equally. In contrast, abjads (e.g., the Arabic Khmer, and Lao, were studied using a and Hebrew scripts) do not write most vowels ex- newswire 20; 000-sentence dataset. We plicitly. The third type, abugidas, also called al- compared the recovery performance of a phasyllabary, includes features from both segmen- support vector machine and an LSTM- tal and syllabic systems. In abugidas, consonant based recurrent neural network, finding letters represent syllables with a default vowel, that the abugida graphemes could be re- and other vowels are denoted by diacritics. -
14 South and Central Asia-III 14 Ancient Scripts
The Unicode® Standard Version 12.0 – Core Specification To learn about the latest version of the Unicode Standard, see http://www.unicode.org/versions/latest/. Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and the publisher was aware of a trade- mark claim, the designations have been printed with initial capital letters or in all capitals. Unicode and the Unicode Logo are registered trademarks of Unicode, Inc., in the United States and other countries. The authors and publisher have taken care in the preparation of this specification, but make no expressed or implied warranty of any kind and assume no responsibility for errors or omissions. No liability is assumed for incidental or consequential damages in connection with or arising out of the use of the information or programs contained herein. The Unicode Character Database and other files are provided as-is by Unicode, Inc. No claims are made as to fitness for any particular purpose. No warranties of any kind are expressed or implied. The recipient agrees to determine applicability of information provided. © 2019 Unicode, Inc. All rights reserved. This publication is protected by copyright, and permission must be obtained from the publisher prior to any prohibited reproduction. For information regarding permissions, inquire at http://www.unicode.org/reporting.html. For information about the Unicode terms of use, please see http://www.unicode.org/copyright.html. The Unicode Standard / the Unicode Consortium; edited by the Unicode Consortium. — Version 12.0. Includes index. ISBN 978-1-936213-22-1 (http://www.unicode.org/versions/Unicode12.0.0/) 1. -
A Abjad, 430, 891 Abugida, 430, 891 Academic Systems, 429, 448
Index A Anchor points, 717 Abjad, 430, 891 Angular radial transform (ART) descriptors, Abugida, 430, 891 526, 530, 532 Academic systems, 429, 448 Animated, 625, 630, 633, 636, 637 Accidental forgery, 932 Anisotropic Gaussian, 269–271 Accuracy, 44, 50, 51, 54, 1023, 1024, ANN. See Artificial neural networks (ANN) 1027, 1033 Annotation, 630, 639, 966–968, 976, 983–1002 Acid-free paper, 15 Application domains, 182, 198 Acquisition, 11–60, 984–986, 989–991, Approximate NN search, 170 993, 996 APTI, 450 Action plan, 919, 921, 926, 929, 930 Arabic AdaBoost, 862, 865, 866 alphabet, 430, 434–436, 449 Adaptive local connectivity map, 476 extension, 435 Address block(s), 715, 717, 720, 721, 723, 724 letters, 432, 435, 436 Address block location, 720, 723–724, 727 OCR, 450 Address database, 715, 720, 729, 730, 744 writing styles, 437–438 Address interpretation, 723 writing system, 432 Address recognition systems, 709, 710, Arabic and Persian signatures, 930 720–723, 732, 733, 744 Arabic and Syriac are cursive, 428 Adjacency Arabic recognition approaches, 446 grammars, 528, 537 Aramaic letters, 433 matrices, 541, 544 Arc detection, 505, 511, 512 Affine covariant, 629 Architectural drawings, 493, 495, 511–513 AHARONI, 440 Area under the curve (AUC), 931 Algebraic invariants, 617 Area Voronoi diagram, 146, 147, 156, 157, Allograph, 303, 304 161, 163 Alphabet(s), 7–9, 303, 304, 306, 307, 312, Arial, 323 891, 892, 896, 897, 902, 904, 906, Arrowheads, 493, 512, 513 908, 913 Artificial intelligence, 332 Alphabetic class, 429 Artificial neural networks (ANN), 617, 632, Alphabetic fields, 719 636, 821, 835 Ambiguity, 680 Aruspix, 753, 768, 771 Analysis of Invoices, 184 Aryan language, 304 Analysis System to Interpret Areas in Ascenders and descenders, 325, 442–444, 810, Single-sided Letters (ANASTASIL), 812, 814 181, 182, 207, 208 ASCII characters, 817 Analytical word recognition, 725 Asian scripts, 460–462, 471, 475, 483 D. -
General Historical and Analytical / Writing Systems: Recent Script
9 Writing systems Edited by Elena Bashir 9,1. Introduction By Elena Bashir The relations between spoken language and the visual symbols (graphemes) used to represent it are complex. Orthographies can be thought of as situated on a con- tinuum from “deep” — systems in which there is not a one-to-one correspondence between the sounds of the language and its graphemes — to “shallow” — systems in which the relationship between sounds and graphemes is regular and trans- parent (see Roberts & Joyce 2012 for a recent discussion). In orthographies for Indo-Aryan and Iranian languages based on the Arabic script and writing system, the retention of historical spellings for words of Arabic or Persian origin increases the orthographic depth of these systems. Decisions on how to write a language always carry historical, cultural, and political meaning. Debates about orthography usually focus on such issues rather than on linguistic analysis; this can be seen in Pakistan, for example, in discussions regarding orthography for Kalasha, Wakhi, or Balti, and in Afghanistan regarding Wakhi or Pashai. Questions of orthography are intertwined with language ideology, language planning activities, and goals like literacy or standardization. Woolard 1998, Brandt 2014, and Sebba 2007 are valuable treatments of such issues. In Section 9.2, Stefan Baums discusses the historical development and general characteristics of the (non Perso-Arabic) writing systems used for South Asian languages, and his Section 9.3 deals with recent research on alphasyllabic writing systems, script-related literacy and language-learning studies, representation of South Asian languages in Unicode, and recent debates about the Indus Valley inscriptions. -
A Baybayin Word Recognition System
A Baybayin word recognition system Rodney Pino, Renier Mendoza and Rachelle Sambayan Institute of Mathematics, University of the Philippines Diliman, Quezon City, Metro Manila, Philippines ABSTRACT Baybayin is a pre-Hispanic Philippine writing system used in Luzon island. With the effort in reintroducing the script, in 2018, the Committee on Basic Education and Culture of the Philippine Congress approved House Bill 1022 or the ''National Writing System Act,'' which declares the Baybayin script as the Philippines' national writing system. Since then, Baybayin OCR has become a field of research interest. Numerous works have proposed different techniques in recognizing Baybayin scripts. However, all those studies anchored on the classification and recognition at the character level. In this work, we propose an algorithm that provides the Latin transliteration of a Baybayin word in an image. The proposed system relies on a Baybayin character classifier generated using the Support Vector Machine (SVM). The method involves isolation of each Baybayin character, then classifying each character according to its equivalent syllable in Latin script, and finally concatenate each result to form the transliterated word. The system was tested using a novel dataset of Baybayin word images and achieved a competitive 97.9% recognition accuracy. Based on our review of the literature, this is the first work that recognizes Baybayin scripts at the word level. The proposed system can be used in automated transliterations of Baybayin texts transcribed in old books, tattoos, signage, graphic designs, and documents, among others. Subjects Computational Linguistics, Computer Vision, Natural Language and Speech, Optimization Theory and Computation, Scientific Computing and Simulation Keywords Baybayin, Optical character recognition, Support vector machine, Baybayin word recognition Submitted 17 March 2021 Accepted 25 May 2021 INTRODUCTION Published 16 June 2021 Baybayin is a pre-colonial writing system primarily used by Tagalogs in the northern Corresponding author Philippines. -
3 Writing Systems
Writing Systems 43 3 Writing Systems PETER T. DANIELS Chapters on writing systems are very rare in surveys of linguistics – Trager (1974) and Mountford (1990) are the only ones that come to mind. For a cen- tury or so – since the realization that unwritten languages are as legitimate a field of study, and perhaps a more important one, than the world’s handful of literary languages – writing systems were (rightly) seen as secondary to phonological systems and (wrongly) set aside as unworthy of study or at best irrelevant to spoken language. The one exception was I. J. Gelb’s attempt (1952, reissued with additions and corrections 1963) to create a theory of writ- ing informed by the linguistics of his time. Gelb said that what he wrote was meant to be the first word, not the last word, on the subject, but no successors appeared until after his death in 1985.1 Although there have been few lin- guistic explorations of writing, a number of encyclopedic compilations have appeared, concerned largely with the historical development and diffusion of writing,2 though various popularizations, both new and old, tend to be less than accurate (Daniels 2000). Daniels and Bright (1996; The World’s Writing Systems: hereafter WWS) includes theoretical and historical materials but is primarily descriptive, providing for most contemporary and some earlier scripts information (not previously gathered together) on how they represent (the sounds of) the languages they record. This chapter begins with a historical-descriptive survey of the world’s writ- ing systems, and elements of a theory of writing follow.