Neural Machine Translation of Logographic Languages Using Sub-character Level Information


Longtu Zhang and Mamoru Komachi
Tokyo Metropolitan University, 6-6 Asahigaoka, Hino, Tokyo 191-0065, Japan
[email protected], [email protected]

Abstract

Recent neural machine translation (NMT) systems have been greatly improved by encoder-decoder models with attention mechanisms and sub-word units. However, important differences between languages with logographic and alphabetic writing systems have long been overlooked. This study focuses on these differences and uses a simple approach to improve the performance of NMT systems, utilizing decomposed sub-character level information for logographic languages. Our results indicate that our approach not only improves the translation capabilities of NMT systems between Chinese and English, but also further improves NMT systems between Chinese and Japanese, because it utilizes the shared information brought by similar sub-character units.

(Proceedings of the Third Conference on Machine Translation (WMT), Volume 1: Research Papers, pages 17-25, Brussels, Belgium, October 31 - November 1, 2018. © 2018 Association for Computational Linguistics. https://doi.org/10.18653/v1/W18-64003)

1 Introduction

Neural machine translation (NMT) systems (Cho et al., 2014) based on sequence-to-sequence models (Sutskever et al., 2014) have recently become the de facto standard architecture. The models use attention mechanisms (Bahdanau et al., 2015; Luong et al., 2015) to keep records of all encoding results, and can focus on particular parts of these results during decoding, so that the model can produce longer and more accurate translations. Sub-word units are another technique, first introduced by Sennrich et al.'s (2016) application of the byte pair encoding (BPE) algorithm; they are used to break up words in both source and target sentences into sequences of smaller units, learned without supervision. This alleviates the risk of producing <unk> symbols when the model encounters infrequent "unknown" words, also known as the out-of-vocabulary (OOV) problem. Moreover, sub-word units, which can be viewed as learned stems and affixes, can help the NMT model better encode the source sentence and decode the target sentence, particularly when the source and target languages share some similarities.

Almost all of the methods used to improve NMT systems were developed for alphabetic languages such as English, French, and German as either the source or target language, or both. An alphabetic language typically uses an alphabet: a small set of letters (basic writing symbols) that each roughly represents a phoneme in the spoken language. Words are composed of ordered letters, and sentences are composed of space-segmented ordered words. However, in other major writing systems, namely logographic (or character-based) languages such as Chinese, Japanese, and traditional Korean, strokes are used to construct ideographs; ideographs are used to construct characters, which are the basic units of meaningful words. Words can then further compose sentences. In alphabetic languages, sub-word units are easy to identify, whereas in logographic languages, a similar effect can be achieved only if sub-character level information is taken into consideration.[1]

[1] Taking the ASPEC corpus as an example, the average word lengths are roughly 1.5 characters (Chinese words, tokenized by the Jieba tokenizer), 1.7 characters (Japanese words, tokenized by the MeCab tokenizer), and 5.7 characters (English words, tokenized by the Moses tokenizer), respectively. Therefore, when a sub-word model of similar vocabulary size is applied directly, English sub-words usually contain several letters, which are more effective in facilitating NMT, whereas Chinese and Japanese sub-words are largely just characters.

Having noticed this significant difference between these two writing systems, Shi et al. (2015), Liu et al. (2017), Peng et al. (2017), and Cao et al. (2017) used stroke-level information for logographic languages when constructing word embeddings; Toyama et al. (2017) used visual information for strokes and Japanese Kanji radicals in a text classification task.[2]

[2] To be more precise, there is another, so-called syllabic, writing system, which uses individual symbols to represent syllables rather than phonemes. Japanese hiragana and katakana are actually syllabic symbols rather than ideographs. In this paper, we focus only on the logographic part.

Some studies have performed NMT tasks using various sub-word "equivalents". For instance, Du and Way (2017) trained factored NMT models using "Pinyin"[3] sequences on the source side. Unfortunately, they did not apply a BPE algorithm during training, and their model also cannot perform factored decoding. Wang et al. (2017)[4] directly applied a BPE algorithm to character sequences before building NMT models. However, they did not take advantage of sub-character level information during the training of sub-word and NMT models. Kuang and Han (2018) also attempted to use a factored encoder for Chinese NMT systems using radical data. It is worth noting that although the idea of using ideographs and strokes in NLP tasks (particularly in NMT tasks) is not new, no previous NMT research has focused on the decoding process. If it is also possible to construct an ideograph/stroke decoder, we can further investigate translations between logographic languages. Additionally, no NMT research has previously used stroke data.

[3] An official Romanization system for standard Chinese in mainland China. Pinyin includes both letters and diacritics, which represent phonemic and tonal information, respectively.
[4] We use the term "logographic" to refer to writing systems such as Chinese characters and Japanese Kanji, and "ideograph" to refer to the character components.

To summarize, there are three potential information gaps associated with current studies on NMT systems for logographic languages using sub-character level data: 1) no research has been performed on the decoding process; 2) no studies have trained models using sub-character level sub-words; and 3) no studies have attempted to build NMT models for logographic language pairs, despite their sharing many similarities. This study investigates whether sub-character information can facilitate both encoding and decoding in NMT systems and between logographic language pairs, and aims to determine the best sub-character unit granularity for each setting.

The main contributions of this study are threefold:

1. We create a sub-character database of Chinese character-based languages, and conduct MT experiments using various types of sub-character NMT models.

2. We facilitate the encoding or decoding process by using sub-character sequences on either the source or target side of the NMT system. This will improve translation performance; if sub-character information is shared between the encoder and decoder, it will further benefit the NMT system.

3. Specifically, Chinese ideograph data and Japanese stroke data are the best choices for the relevant NMT tasks.

2 Background

2.1 NMT with Attention Mechanisms and Sub-word Units

In this study, we applied a sequence-to-sequence model with an attention mechanism (Bahdanau et al., 2015). The basic recurrent unit is the long short-term memory (LSTM) unit (Hochreiter and Schmidhuber, 1997). Because of the nature of the sequence-to-sequence model, the vocabulary size must be limited for the computational efficiency of the Softmax function. In such cases, the decoder outputs an <unk> symbol for any word that is not in the vocabulary, which harms the translation quality. This is called the out-of-vocabulary (OOV) problem.

Sub-word unit algorithms (such as the BPE algorithm) first break up a sentence into the smallest possible units. Then, two adjacent units at a time are merged according to some standard (e.g., the co-occurrence frequency). Finally, after n steps, the algorithm collects the merged units as "sub-word" units. By using sub-word units, it is possible to represent a large number of words with a small vocabulary. Originally, sub-word units were only applied to unknown words (Sennrich et al., 2016). However, in the recent GNMT (Wu et al., 2016) and Transformer (Vaswani et al., 2017) systems, all words are broken up into sub-word units to better represent the shared information.

For alphabetic languages, researchers have indicated that sub-word units are useful for solving OOV problems, and that shared information can further improve translation quality. The Sentencepiece project[5] compared several combinations of word-piece (Kudo, 2018) and BPE sub-word models in English/Japanese NMT tasks. The sub-word units were trained on character (Japanese Kanji and Hiragana/Katakana) sequences. Similarly, Wang et al. (2017) attempted to compare the effects of different segmentation methods on NMT tasks, including "BPE" units trained on

[5] https://github.com/google/sentencepiece

Table 1: Examples of decomposed ideographs of Chinese characters. The composing ideographs of different functionality might be shared across different characters.

Character   Meaning   Semantic ideograph   Phonetic ideograph   Pinyin
驰           run       马 (horse)           也                   chí
池           pool      水 (氵, water)       也                   chí
施           impose    方 (direction)       也                   shī
弛           loosen    弓 (bow)             也                   chí
地           land      土 (soil)            也                   dì
驱           drive     马 (horse)           区                   qū

Table 2: Examples of multi-character words in Chinese and their ideograph sequences.

Word   Meaning   Ideograph sequence
树木    wood      木对木
森林    forest    木木木木木

A few ideographs can also be treated as standalone characters. To the best of our knowledge, however, no research has been performed on logographic language NMT beyond character-level data, except in the work of Du and Way (2017), who attempted to use Pinyin sequences instead of character sequences in Chinese-English NMT tasks. Considering the fact that there are a large number of homophones and homonyms in Chinese languages, it was difficult for this method to be used to reconstruct characters in the decoding step.