Character Decomposition for Japanese-Chinese Character-Level Neural Machine Translation


IALP 2019, Shanghai, Nov 15-17, 2019
978-1-7281-5014-7/19/$31.00 © 2019 IEEE

Jinyi Zhang (Graduate School of Engineering, Gifu University, Gifu, Japan; Email: [email protected])
Tadahiro Matsumoto (Faculty of Engineering, Gifu University, Gifu, Japan; Email: [email protected])

Abstract—After years of development, Neural Machine Translation (NMT) has produced richer translation results than ever over various language pairs, becoming a new machine translation model with great potential. An NMT model, however, can only translate words/characters contained in the training data, and the handling of low-frequency words/characters in the training data is one of its open problems. In this paper, we propose a method for removing characters whose frequencies of appearance are less than a given minimum threshold by decomposing such characters into their components and/or pseudo-characters, using the Chinese character decomposition table we made. Experiments on Japanese-to-Chinese and Chinese-to-Japanese NMT with the ASPEC-JC (Asian Scientific Paper Excerpt Corpus, Japanese-Chinese) corpus show that the BLEU scores, the training time and the number of parameters vary with the given minimum threshold for character decomposition.

Keywords—character decomposition; neural machine translation; Japanese-Chinese; character-level; LSTM; encoder; decoder

I. INTRODUCTION

Machine translation's performance has greatly improved over Statistical Machine Translation (SMT) with the appearance of Neural Machine Translation (NMT). One problem in NMT is the handling of low-frequency words/characters in the vocabulary of the training data [1]. For NMT models, the computational complexity becomes enormous as the vocabulary size increases. Therefore, in a typical word-level NMT model, the vocabulary size (the number of different words) is usually limited to about tens of thousands of words, and the remaining low-frequency words are uniformly treated as unknown words. An increasing number of unknown words reduces translation performance, so the handling of low-frequency words is a major problem in NMT.

Byte Pair Encoding (BPE), proposed by Sennrich et al. [2], made NMT models capable of open-vocabulary translation by encoding low-frequency and unknown words as sequences of subword units, thereby addressing the low-frequency word problem.

However, Chinese mainly uses Chinese characters (Hanzi), which are logograms. Many Chinese words are written with one or two Chinese characters; as a result, it is difficult to divide a Chinese word into high-frequency subword units. Therefore, the character level is considered suitable for NMT between Japanese and Chinese. Character-level NMT also has the advantage that errors and fluctuations do not occur in the process of dividing sentences into words (word segmentation).

Compared with word-level NMT, the vocabulary size is kept small in character-level NMT, but there are still many characters of extremely low frequency in the vocabulary. At the word level, replacing a low-frequency word of low statistical reliability with a related high-frequency word has been attempted [3], but such a substitution is difficult for characters.

Therefore, we devised a method for reducing low-frequency characters in character-level NMT between Japanese and Chinese by dividing low-frequency Chinese characters into constituent elements of the character (radicals: traditionally recognized components of Chinese characters) and pseudo partial characters. We investigated the effects of the method on translation results and on the number of parameters of the model by experiments.

We used Luong's NMT system as the base system [4], which follows an encoder-decoder architecture with global attention, at the character level. We chose character-level NMT as the baseline because character-level NMT between Japanese and Chinese has better translation performance than word-level NMT.

The main contributions of this paper are the following. We created a Chinese character composition table for finding a character's constituent elements. We demonstrate the possibility of improving the translation performance of NMT systems by dividing Chinese and Japanese characters into constituent elements and sharing them with the other characters in the vocabulary, without changing the neural network architecture. We believe this capability makes our approach applicable to different NMT architectures.

In the remainder of this paper, Section II presents related work. Section III gives a brief explanation of the architecture of the NMT system that we used as the base system and of the ASPEC-JC (Asian Scientific Paper Excerpt Corpus, Japanese-Chinese) corpus. Section IV describes the proposed method: how to divide Chinese and Japanese characters into constituent elements and share them with the other characters in the vocabulary. Section V reports the experimental framework and the results obtained in Japanese-Chinese and Chinese-Japanese translation (with ASPEC-JC [5]). Finally, Section VI concludes with the contributions of this paper and further work.

II. RELATED WORK

The characters used in a language are usually much fewer than the words of the language. Character-level neural language models [6] and character-level MT have been explored and have achieved respectable results. Previous works on POS tagging [7], named entity recognition [8], parsing [9], learning word representations [10] and character embeddings [11] have shown different advantages of using character-level information in Natural Language Processing (NLP).

Besides, subword-based representations (intermediate between word-based and character-based representations) have been explored in NMT [2] and applied to English and other western languages, where most words consist of several to a dozen characters. Contrastingly, Chinese characters, which are used in Chinese, Japanese and some other Asian languages, are typical logograms. A logogram is a character that represents a concept or thing, namely a word; thus, it is difficult to split such words into subwords. Recently, Meng et al. [12] found that character-based models consistently outperform subword-based and word-based models in deep learning-based Chinese NLP tasks.

For Chinese-Japanese NMT, sub-character-level information has improved translation performance [13], using sub-character sequences on either the source or the target side; however, their character decomposition still needs further exploration. Du and Way [14] trained factored NMT models using "Pinyin" sequences on the source side (Pinyin is the official romanization system for Chinese); this approach applies only to Chinese source-side NMT. Zhang and Matsumoto [15] also attempted a factored encoder for a Japanese-Chinese NMT system using radical information, but did not achieve good results in Chinese-to-Japanese NMT. Wang et al. [16] directly applied a BPE algorithm to sequences before building NMT models; this method has only been tested in the Chinese-English direction and is not comprehensive enough.

III. NEURAL MACHINE TRANSLATION AND ASPEC-JC CORPUS

A. Neural Machine Translation

NMT adopts a fully neural network approach to compute the conditional probability p(y|x) of the target sentence y given the source sentence x. We follow the NMT architecture of Luong et al. [4], which we briefly describe here. This NMT system is implemented as a global attentional encoder-decoder neural network with Long Short-Term Memory (LSTM) units, and we simply use it at the character level.

The decoder is a recurrent neural network with LSTM units that predicts a target sequence y = (y_1, ..., y_n). Every word (or character, in the case of character-level NMT) y_i is predicted based on a recurrent hidden state s_i, the previously predicted word (or character) y_{i-1}, and a context vector c_i. c_i is computed as the weighted sum of the annotations h_j, where the weight of each annotation h_j is computed through an alignment (or attention) model α_{ij}, which models the probability that y_i is aligned to x_j.

The forward states of the encoder are expressed as:

\overrightarrow{h}_j = \tanh\left(\overrightarrow{W} E x_j + \overrightarrow{U} \overrightarrow{h}_{j-1}\right)   (1)

where E ∈ R^{m×V_x} is a word embedding matrix, \overrightarrow{W} ∈ R^{n×m} and \overrightarrow{U} ∈ R^{n×n} are weight matrices, and m, n and V_x are the word embedding size, the number of hidden units and the vocabulary size of the source language, respectively.

B. ASPEC-JC Corpus

We implement our system with the ASPEC-JC corpus, which was constructed by manually translating Japanese scientific papers into Chinese [5]. The Japanese scientific papers are either the property of the Japan Science and Technology Agency (JST) or stored in Japan's largest electronic journal platform for academic societies (J-STAGE).

ASPEC-JC is composed of four parts: training data (672,315 sentence pairs), development data (2,090 sentence pairs), development-test data (2,148 sentence pairs) and test data (2,107 sentence pairs), on the assumption that it would be used for machine translation research.

ASPEC-JC contains both abstracts and some parts of the body texts. It includes only eight fields ("Medicine", "Information", "Biology", "Environmentology", "Chemistry", "Materials", "Agriculture" and "Energy") because it was difficult to include all scientific fields. These fields were selected by investigating the important scientific fields in China and the usage tendencies of literature databases by researchers and engineers in Japan. In these fields, sentences belonging to the same article are not included.

Compared with other language pairs such as English-French, which usually comprise millions of parallel sentences, the ASPEC-JC corpus has only about 672k sentence pairs. Moreover, an LSTM+attention model is usually more robust than the Transformer model [17] on smaller datasets, due to its smaller number of parameters [12].

IV. REDUCTION OF LOW-FREQUENCY CHARACTERS BY CHARACTER DECOMPOSITION

During the training and translation process, the training
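The reduction step proposed above can be sketched in a few lines: count character frequencies over the training corpus, then rewrite every character that falls below the minimum threshold as the component sequence given by a decomposition table. This is a minimal illustration only; the function name, the example table entries and the threshold are assumptions, and the paper's actual decomposition table and pseudo-character inventory are not reproduced here.

```python
from collections import Counter

def decompose_rare_chars(sentences, table, min_freq):
    """Replace characters occurring fewer than min_freq times in the
    corpus with the component sequence given by `table`.

    `table` stands in for a Chinese character decomposition table
    (hypothetical entries in the test below); characters with no
    entry are left unchanged.
    """
    freq = Counter(ch for sent in sentences for ch in sent)

    def expand(ch):
        # Keep high-frequency characters and characters we cannot decompose.
        if freq[ch] >= min_freq or ch not in table:
            return ch
        # One-level decomposition into radicals / pseudo partial characters.
        return "".join(table[ch])

    return ["".join(expand(ch) for ch in sent) for sent in sentences]
```

Because the emitted components are themselves ordinary vocabulary entries shared with other characters, the vocabulary shrinks as the threshold rises, which is what drives the variation in parameter count and training time reported in the experiments.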