Arxiv:2106.03958V2 [Cs.CL] 9 Jun 2021 Can Be an Effective Way to Adapt Lms for Lrls, However, There Is a Challenge

Total Page:16

File Type:pdf, Size:1020Kb

Arxiv:2106.03958V2 [Cs.CL] 9 Jun 2021 Can Be an Effective Way to Adapt Lms for Lrls, However, There Is a Challenge Exploiting Language Relatedness for Low Web-Resource Language Model Adaptation: An Indic Languages Study Yash Khemchandani1∗ Sarvesh Mehtani1∗ Vaidehi Patil1 Abhijeet Awasthi1 Partha Talukdar2 Sunita Sarawagi1 1Indian Institute of Technology Bombay, India 2Google Research, India fyashkhem,smehtani,awasthi,[email protected] [email protected], [email protected] Abstract Recent research in multilingual language mod- els (LM) has demonstrated their ability to effectively handle multiple languages in a single model. This holds promise for low web-resource languages (LRL) as multilingual models can enable transfer of supervision from high resource languages to LRLs. However, in- corporating a new language in an LM still re- mains a challenge, particularly for languages with limited corpora and in unseen scripts. In Figure 1: Number of wikipedia articles for top-few In- this paper we argue that relatedness among lan- dian Languages and English. The height of the English guages in a language family may be exploited bar is not to scale as indicated by the break. Number of to overcome some of the corpora limitations of English articles is roughly 400x more than articles in LRLs, and propose RelateLM. We focus on In- Oriya and 800x more than articles in Assamese. dian languages, and exploit relatedness along two dimensions: (1) script (since many In- dic scripts originated from the Brahmic script), languages. For example, the Multilingual BERT and (2) sentence structure. RelateLM uses (mBERT) (Devlin et al., 2019) model was trained transliteration to convert the unseen script of on 104 different languages. When fine-tuned for limited LRL text into the script of a Re- various downstream tasks, multilingual LMs have lated Prominent Language (RPL) (Hindi in our demonstrated significant success in generalizing case). While exploiting similar sentence struc- across languages (Hu et al., 2020; Conneau et al., tures, RelateLM utilizes readily available bilin- gual dictionaries to pseudo translate RPL text 2019). Thus, such models make it possible to into LRL corpora. Experiments on multiple transfer knowledge and resources from resource real-world benchmark datasets provide valida- rich languages to Low Web-Resource Languages tion to our hypothesis that using a related lan- (LRL). This has opened up a new opportunity to- guage as pivot, along with transliteration and wards rapid development of language technologies pseudo translation based data augmentation, for LRLs. arXiv:2106.03958v2 [cs.CL] 9 Jun 2021 can be an effective way to adapt LMs for LRLs, However, there is a challenge. The current rather than direct training or pivoting through English. paradigm for training Mutlilingual LM requires text corpora in the languages of interest, usually in 1 Introduction large volumes. However, such text corpora is often available in limited quantities for LRLs. For exam- BERT-based pre-trained language models (LMs) ple, in Figure1 we present the size of Wikipedia, have enabled significant advances in NLP (Devlin a common source of corpora for training LMs, for et al., 2019; Liu et al., 2019; Lan et al., 2020). Pre- top-few scheduled Indian languages1 and English. trained LMs have also been developed for the mul- The top-2 languages are just one-fiftieth the size of tilingual setting, where a single multilingual model is capable of handling inputs from many different 1According to Indian Census 2011, more than 19,500 lan- guages or dialects are spoken across the country, with 121 of ∗Authors contributed equally them being spoken by more than 10 thousand people. इक है … … ਇਹ ਇਕ ਬਾਂਦਰ ਹੈਬ拓ਦਰ इह इक है बांदर है इह MASK बांदर है ... ... MLM Training … इह एक है बंदर है ... … … इह इक है बांदर है ਇਹ : एक है इक है : एक है ਬ拓ਦਰ : बंदर, वानर बांदर : बंदर, वानर ... ... इह एक है बंदर है Alignment Loss … … अ楍छा : ਚੰਗਾ, ਵਧੀਆ अ楍छा : चँगा, वधीआ बालक है : ਮੁੰਡਾ बालक है : मअच्छा बालकहैुँडा ... ... … राम अच्छाबालकहै चँगा मअच्छा बालकहैुँडा है राम अच्छा बालक हैचँगामअच्छाबालकहैुँडा ... राम अच्छाबालकहै अ楍छा बालक है है … Alignment Loss राम अच्छा बालक हैअ楍छाबालक ... Transliteration Pre-training with Pseudo Translation into R’s script MLM and Alignment Figure 2: Pre-training with MLM and Alignment loss in RelateLM with LRL L as Punjabi (pa) in Gurumukhi script and RPL R as Hindi (hi) in Devanagari script. RelateLM first transliterates LRL text in the monolingual corpus (DL) and bilingual dictionaries (BL!R and BR!L) to the script of the RPL R. The transliterated bilingual dictionaries are then used to pseudo translate the RPL corpus (DR) and transliterated LRL corpus (DLR ). This pseudo translated data is then used to adapt the given LM M for the target LRL L using a combination of Masked Language Model (MLM) and alignment losses. For notations and further details, please see Section3. English, and yet Hindi is seven times larger than as they are all part of the Indo-Aryan family, ge- the O(20,000) documents of languages like Oriya ographically, and syntactically matching on key and Assamese which are spoken by millions of peo- features like the Subject-Object-Verb (SOV) order ple. This calls for the development of additional as against the Subject-Verb-Object (SVO) order mechanisms for training multilingual LMs which in English. Even though the scripts of several In- are not exclusively reliant on large monolingual dic languages differ, they are all part of the same corpora. Brahmic family, making it easier to design rule- Recent methods of adapting a pre-trained mul- based transliteration libraries across any language tilingual LM to a LRL include fine-tuning the full pair. In contrast, transliteration of Indic languages model with an extended vocabulary (Wang et al., to English is harder with considerable phonetic 2020), training a light-weight adapter layer while variation in how words are transcribed. The geo- keeping the full model fixed (Pfeiffer et al., 2020b), graphical and phylogenetic proximity has lead to and exploiting overlapping tokens to learn embed- significant overlap of words across languages. This dings of the LRL (Pfeiffer et al., 2020c). These are implies that just after transliteration we are able general-purpose methods that do not sufficiently to exploit overlap with a Related Prominent Lan- exploit the specific relatedness of languages within guage (RPL) like Hindi. On three Indic languages the same family. we discover between 11% and 26% overlapping We propose RelateLM for this task. RelateLM tokens with Hindi, whereas with English it is less exploits relatedness between the LRL of interest than 8%, mostly comprising numbers and entity and a Related Prominent Language (RPL). We names. Furthermore, the syntax-level similarity focus on Indic languages, and consider Hindi as between languages allows us to generate high qual- the RPL. The languages we consider in this pa- ity data augmentation by exploiting pre-existing per are related along several dimensions of linguis- bilingual dictionaries. We generate pseudo parallel tic typology (Dryer and Haspelmath, 2013; Lit- data by converting RPL text to LRL and vice-versa. tell et al., 2017): phonologically, phylogenetically These allow us to further align the learned embed- dings across the two languages using the recently is the lack of model’s capacity to accommodate proposed loss functions for aligning contextual em- all languages simultaneously. This has led to in- beddings of word translations (Cao et al., 2020; Wu creased interest in adapting multilingual LMs to and Dredze, 2020). LRLs and we discuss these in the following two In this paper, we make the following contributions: settings. • We address the problem of adding a Low Web- Resource Language (LRL) to an existing pre- LRL adaptation using monolingual data For trained LM, especially when monolingual cor- eleven languages outside mBERT, Wang et al. pora in the LRL is limited. This is an impor- (2020) demonstrate that adding a new target lan- tant but underexplored problem. We focus guage to mBERT by simply extending the embed- on Indian languages which have hundred of ding layer with new weights results in better per- millions of speakers, but traditionally under- forming models when compared to bilingual-BERT studied in the NLP community. pre-training with English as the second language. • We propose RelateLM which exploits relat- Pfeiffer et al.(2020c) adapt multilingual LMs to edness among languages to effectively in- the LRLs and languages with scripts unseen during corporate a LRL into a pre-trained LM. We pre-training by learning new tokenizers for the un- highlight the relevance of transliteration and seen script and initializing their embedding matrix pseudo translation for related languages, and by leveraging the lexical overlap w.r.t. the lan- use them effectively in RelateLM to adapt a guages seen during pre-training. Adapter (Pfeiffer pre-trained LM to a new LRL. et al., 2020a) based frameworks like (Pfeiffer et al., 2020b; Artetxe et al., 2020; Ust¨ un¨ et al., 2020) ad- • Through extensive experiments, we find that dress the lack of model’s capacity to accommodate RelateLM is able to gain significant improve- multiple languages and establish the advantages of ments on benchmark datasets. We demon- adding language-specific adapter modules in the strate how RelateLM adapts mBERT to Oriya BERT model for accommodating LRLs. These and Assamese, two low web-resource Indian methods generally assume access to a fair amount languages by pivoting through Hindi. Via ab- of monolingual LRL data and do not exploit relat- lation studies on bilingual models we show edness across languages explicitly. These methods that RelateLM is able to achieve accuracy of provide complimentary gains to our method of di- zero-shot transfer with limited data (20K doc- rectly exploiting language relatedness.
Recommended publications
  • The Festvox Indic Frontend for Grapheme-To-Phoneme Conversion
    The Festvox Indic Frontend for Grapheme-to-Phoneme Conversion Alok Parlikar, Sunayana Sitaram, Andrew Wilkinson and Alan W Black Carnegie Mellon University Pittsburgh, USA aup, ssitaram, aewilkin, [email protected] Abstract Text-to-Speech (TTS) systems convert text into phonetic pronunciations which are then processed by Acoustic Models. TTS frontends typically include text processing, lexical lookup and Grapheme-to-Phoneme (g2p) conversion stages. This paper describes the design and implementation of the Indic frontend, which provides explicit support for many major Indian languages, along with a unified framework with easy extensibility for other Indian languages. The Indic frontend handles many phenomena common to Indian languages such as schwa deletion, contextual nasalization, and voicing. It also handles multi-script synthesis between various Indian-language scripts and English. We describe experiments comparing the quality of TTS systems built using the Indic frontend to grapheme-based systems. While this frontend was designed keeping TTS in mind, it can also be used as a general g2p system for Automatic Speech Recognition. Keywords: speech synthesis, Indian language resources, pronunciation 1. Introduction in models of the spectrum and the prosody. Another prob- lem with this approach is that since each grapheme maps Intelligible and natural-sounding Text-to-Speech to a single “phoneme” in all contexts, this technique does (TTS) systems exist for a number of languages of the world not work well in the case of languages that have pronun- today. However, for low-resource, high-population lan- ciation ambiguities. We refer to this technique as “Raw guages, such as languages of the Indian subcontinent, there Graphemes.” are very few high-quality TTS systems available.
    [Show full text]
  • Tai Lü / ᦺᦑᦟᦹᧉ Tai Lùe Romanization: KNAB 2012
    Institute of the Estonian Language KNAB: Place Names Database 2012-10-11 Tai Lü / ᦺᦑᦟᦹᧉ Tai Lùe romanization: KNAB 2012 I. Consonant characters 1 ᦀ ’a 13 ᦌ sa 25 ᦘ pha 37 ᦤ da A 2 ᦁ a 14 ᦍ ya 26 ᦙ ma 38 ᦥ ba A 3 ᦂ k’a 15 ᦎ t’a 27 ᦚ f’a 39 ᦦ kw’a 4 ᦃ kh’a 16 ᦏ th’a 28 ᦛ v’a 40 ᦧ khw’a 5 ᦄ ng’a 17 ᦐ n’a 29 ᦜ l’a 41 ᦨ kwa 6 ᦅ ka 18 ᦑ ta 30 ᦝ fa 42 ᦩ khwa A 7 ᦆ kha 19 ᦒ tha 31 ᦞ va 43 ᦪ sw’a A A 8 ᦇ nga 20 ᦓ na 32 ᦟ la 44 ᦫ swa 9 ᦈ ts’a 21 ᦔ p’a 33 ᦠ h’a 45 ᧞ lae A 10 ᦉ s’a 22 ᦕ ph’a 34 ᦡ d’a 46 ᧟ laew A 11 ᦊ y’a 23 ᦖ m’a 35 ᦢ b’a 12 ᦋ tsa 24 ᦗ pa 36 ᦣ ha A Syllable-final forms of these characters: ᧅ -k, ᧂ -ng, ᧃ -n, ᧄ -m, ᧁ -u, ᧆ -d, ᧇ -b. See also Note D to Table II. II. Vowel characters (ᦀ stands for any consonant character) C 1 ᦀ a 6 ᦀᦴ u 11 ᦀᦹ ue 16 ᦀᦽ oi A 2 ᦰ ( ) 7 ᦵᦀ e 12 ᦵᦀᦲ oe 17 ᦀᦾ awy 3 ᦀᦱ aa 8 ᦶᦀ ae 13 ᦺᦀ ai 18 ᦀᦿ uei 4 ᦀᦲ i 9 ᦷᦀ o 14 ᦀᦻ aai 19 ᦀᧀ oei B D 5 ᦀᦳ ŭ,u 10 ᦀᦸ aw 15 ᦀᦼ ui A Indicates vowel shortness in the following cases: ᦀᦲᦰ ĭ [i], ᦵᦀᦰ ĕ [e], ᦶᦀᦰ ăe [ ∎ ], ᦷᦀᦰ ŏ [o], ᦀᦸᦰ ăw [ ], ᦀᦹᦰ ŭe [ ɯ ], ᦵᦀᦲᦰ ŏe [ ].
    [Show full text]
  • Download Download
    Phonotactic frequencies in Marathi KELLY HARPER BERKSON1 and MAX NELSON Indiana University Bloomington, Indiana ABSTRACT: Breathy sonorants are cross-linguistically rare, occurring in just 1% of the languages indexed in the UCLA Phonological Segment Inventory Database (UPSID) and 0.2% of those in the PHOIBLE database (Moran, McCloy and Wright 2014). Prior work has shed some light on their acoustic properties, but little work has investigated the language-internal distribution of these sounds in a language where they do occur, such as Marathi (Indic, spoken mainly in Maharashtra, India). With this in mind, we present an overview of the phonotactic frequencies of consonants, vowels, and CV-bigrams in the Marathi portion of the EMILLE/CIIL corpus. Results of a descriptive analysis show that breathy sonorants are underrepresented, making up fewer than 1% of the consonants in the 2.2 million-word corpus, and that they are disfavored in back vowel contexts. 1 Introduction Languages are rife with gradience. While some sounds, morphological patterns, and syntactic structures occur with great frequency in a language, others—though legal—occur rarely. The same is observed crosslinguistically: some elements and structures are found in language after language, while others are quite rare. For at least the last several decades, cognitive and psycholinguistic investigation of language processing has revealed that language users are sensitive not only to categorical phonological knowledge—such as, for example, the sounds and sound combinations that are legal and illegal in one’s language— but also to this kind of gradient information about phonotactic probabilities. This has been demonstrated in a variety of experimental tasks (Ellis 2002; Frisch, Large, and Pisoni 2000; Storkel and Maekawa 2005; Vitevitch and Luce 2005), which indicate that gradience influences language production, comprehension, and processing phenomena (Ellis 2002; Vitevitch, Armbrüster, and Chu 2004) for infants as well as adults (Jusczyk and Luce 1994).
    [Show full text]
  • Neural Machine Translation System of Indic Languages - an Attention Based Approach
    Neural Machine Translation System of Indic Languages - An Attention based Approach Parth Shah Vishvajit Bakrola Uka Tarsadia University Uka Tarsadia University Bardoli, India Bardoli, India [email protected] [email protected] Abstract—Neural machine translation (NMT) is a recent intervention. This task can be effectively done using machine and effective technique which led to remarkable improvements translation. in comparison of conventional machine translation techniques. Machine Translation (MT) is described as a task of compu- Proposed neural machine translation model developed for the Gujarati language contains encoder-decoder with attention mech- tationally translate human spoken or natural language text or anism. In India, almost all the languages are originated from their speech from one language to another with minimum human ancestral language - Sanskrit. They are having inevitable similari- intervention. Machine translation aims to generate translations ties including lexical and named entity similarity. Translating into which have the same meaning as a source sentence and Indic languages is always be a challenging task. In this paper, we grammatically correct in the target language. Initial work on have presented the neural machine translation system (NMT) that can efficiently translate Indic languages like Hindi and Gujarati MT started in early 1950s [2], and has advanced rapidly that together covers more than 58.49 percentage of total speakers since the 1990s due to the availability of more computational in the country. We have compared the performance of our capacity and training data. Then after, number of approaches NMT model with automatic evaluation matrices such as BLEU, have been proposed to achieve more and more accurate ma- perplexity and TER matrix.
    [Show full text]
  • An Introduction to Indic Scripts
    An Introduction to Indic Scripts Richard Ishida W3C [email protected] HTML version: http://www.w3.org/2002/Talks/09-ri-indic/indic-paper.html PDF version: http://www.w3.org/2002/Talks/09-ri-indic/indic-paper.pdf Introduction This paper provides an introduction to the major Indic scripts used on the Indian mainland. Those addressed in this paper include specifically Bengali, Devanagari, Gujarati, Gurmukhi, Kannada, Malayalam, Oriya, Tamil, and Telugu. I have used XHTML encoded in UTF-8 for the base version of this paper. Most of the XHTML file can be viewed if you are running Windows XP with all associated Indic font and rendering support, and the Arial Unicode MS font. For examples that require complex rendering in scripts not yet supported by this configuration, such as Bengali, Oriya, and Malayalam, I have used non- Unicode fonts supplied with Gamma's Unitype. To view all fonts as intended without the above you can view the PDF file whose URL is given above. Although the Indic scripts are often described as similar, there is a large amount of variation at the detailed implementation level. To provide a detailed account of how each Indic script implements particular features on a letter by letter basis would require too much time and space for the task at hand. Nevertheless, despite the detail variations, the basic mechanisms are to a large extent the same, and at the general level there is a great deal of similarity between these scripts. It is certainly possible to structure a discussion of the relevant features along the same lines for each of the scripts in the set.
    [Show full text]
  • A Multi-Indic-Lingual Approach for COVID Fake-Tweet Detection
    No Rumours Please! A Multi-Indic-Lingual Approach for COVID Fake-Tweet Detection Debanjana Kar* Mohit Bhardwaj* Suranjana Samanta Amar Prakash Azad Dept. of CSE Dept. of CSE IBM Research IBM Research IIT Kharagpur, India IIIT Delhi, India Bangalore, India Bangalore, India [email protected] [email protected] [email protected] [email protected] Abstract—The sudden widespread menace created by the present global pandemic COVID-19 has had an unprecedented effect on our lives. Man-kind is going through humongous fear and dependence on social media like never before. Fear inevitably leads to panic, speculations, and spread of misinformation. Many governments have taken measures to curb the spread of such misinformation for public well being. Besides global measures, to have effective outreach, systems for demographically local languages have an important role to play in this effort. Towards this, we propose an approach to detect fake news about COVID- 19 early on from social media, such as tweets, for multiple Fig. 1. Two examples of COVID-19 related tweets; Left one shows an Indic-Languages besides English. In addition, we also create an example of fake tweet, which has high number of retweet and like counts; annotated dataset of Hindi and Bengali tweet for fake news Right one shows an example of tweet with real facts, which is has low number of retweets. Number of retweets and likes is proportional to the popularity of detection. We propose a BERT based model augmented with a tweet. additional relevant features extracted from Twitter to identify fake tweets.
    [Show full text]
  • The Formal Kharoṣṭhī Script from the Northern Tarim Basin in Northwest
    Acta Orientalia Hung. 73 (2020) 3, 335–373 DOI: 10.1556/062.2020.00015 Th e Formal Kharoṣṭhī script from the Northern Tarim Basin in Northwest China may write an Iranian language1 FEDERICO DRAGONI, NIELS SCHOUBBEN and MICHAËL PEYROT* L eiden University Centre for Linguistics, Universiteit Leiden, Postbus 9515, 2300 RA Leiden, Th e Netherlands E-mail: [email protected]; [email protected]; *Corr esponding Author: [email protected] Received: February 13, 2020 •Accepted: May 25, 2020 © 2020 The Authors ABSTRACT Building on collaborative work with Stefan Baums, Ching Chao-jung, Hannes Fellner and Georges-Jean Pinault during a workshop at Leiden University in September 2019, tentative readings are presented from a manuscript folio (T II T 48) from the Northern Tarim Basin in Northwest China written in the thus far undeciphered Formal Kharoṣṭhī script. Unlike earlier scholarly proposals, the language of this folio can- not be Tocharian, nor can it be Sanskrit or Middle Indic (Gāndhārī). Instead, it is proposed that the folio is written in an Iranian language of the Khotanese-Tumšuqese type. Several readings are proposed, but a full transcription, let alone a full translation, is not possible at this point, and the results must consequently remain provisional. KEYWORDS Kharoṣṭhī, Formal Kharoṣṭhī, Khotanese, Tumšuqese, Iranian, Tarim Basin 1 We are grateful to Stefan Baums, Chams Bernard, Ching Chao-jung, Doug Hitch, Georges-Jean Pinault and Nicholas Sims-Williams for very helpful discussions and comments on an earlier draft. We also thank the two peer-reviewers of the manuscript. One of them, Richard Salomon, did not wish to remain anonymous, and espe- cially his observation on the possible relevance of Khotan Kharoṣṭhī has proved very useful.
    [Show full text]
  • Designing Devanagari Type
    Designing Devanagari type The effect of technological restrictions on current practice Kinnat Sóley Lydon BA degree final project Iceland Academy of the Arts Department of Design and Architecture Designing Devanagari type: The effect of technological restrictions on current practice Kinnat Sóley Lydon Final project for a BA degree in graphic design Advisor: Gunnar Vilhjálmsson Graphic design Department of Design and Architecture December 2015 This thesis is a 6 ECTS final project for a Bachelor of Arts degree in graphic design. No part of this thesis may be reproduced in any form without the express consent of the author. Abstract This thesis explores the current process of designing typefaces for Devanagari, a script used to write several languages in India and Nepal. The typographical needs of the script have been insufficiently met through history and many Devanagari typefaces are poorly designed. As the various printing technologies available through the centuries have had drastic effects on the design of Devanagari, the thesis begins with an exploration of the printing history of the script. Through this exploration it is possible to understand which design elements constitute the script, and which ones are simply legacies of older technologies. Following the historic overview, the character set and unique behavior of the script is introduced. The typographical anatomy is analyzed, while pointing out specific design elements of the script. Although recent years has seen a rise of interest on the subject of Devanagari type design, literature on the topic remains sparse. This thesis references books and articles from a wide scope, relying heavily on the works of Fiona Ross and her extensive research on non-Latin typography.
    [Show full text]
  • Töwkhön, the Retreat of Öndör Gegeen Zanabazar As a Pilgrimage Site Zsuzsa Majer Budapest
    Töwkhön, The ReTReaT of öndöR GeGeen ZanabaZaR as a PilGRimaGe siTe Zsuzsa Majer Budapest he present article describes one of the revived up to the site is not always passable even by jeep, T Mongolian monasteries, having special especially in winter or after rain. Visitors can reach the significance because it was once the retreat and site on horseback or on foot even when it is not possible workshop of Öndör Gegeen Zanabazar, the main to drive up to the monastery. In 2004 Töwkhön was figure and first monastic head of Mongolian included on the list of the World’s Cultural Heritage Buddhism. Situated in an enchanted place, it is one Sites thanks to its cultural importance and the natural of the most frequented pilgrimage sites in Mongolia beauties of the Orkhon River Valley area. today. During the purges in 1937–38, there were mass Information on the monastery is to be found mainly executions of lamas, the 1000 Mongolian monasteries in books on Mongolian architecture and historical which then existed were closed and most of them sites, although there are also some scattered data totally destroyed. Religion was revived only after on the history of its foundation in publications on 1990, with the very few remaining temple buildings Öndör Gegeen’s life. In his atlas which shows 941 restored and new temples erected at the former sites monasteries and temples that existed in the past in of the ruined monasteries or at the new province and Mongolia, Rinchen marked the site on his map of the subprovince centers. Öwörkhangai monasteries as Töwkhön khiid (No.
    [Show full text]
  • 3 Writing Systems
    Writing Systems 43 3 Writing Systems PETER T. DANIELS Chapters on writing systems are very rare in surveys of linguistics – Trager (1974) and Mountford (1990) are the only ones that come to mind. For a cen- tury or so – since the realization that unwritten languages are as legitimate a field of study, and perhaps a more important one, than the world’s handful of literary languages – writing systems were (rightly) seen as secondary to phonological systems and (wrongly) set aside as unworthy of study or at best irrelevant to spoken language. The one exception was I. J. Gelb’s attempt (1952, reissued with additions and corrections 1963) to create a theory of writ- ing informed by the linguistics of his time. Gelb said that what he wrote was meant to be the first word, not the last word, on the subject, but no successors appeared until after his death in 1985.1 Although there have been few lin- guistic explorations of writing, a number of encyclopedic compilations have appeared, concerned largely with the historical development and diffusion of writing,2 though various popularizations, both new and old, tend to be less than accurate (Daniels 2000). Daniels and Bright (1996; The World’s Writing Systems: hereafter WWS) includes theoretical and historical materials but is primarily descriptive, providing for most contemporary and some earlier scripts information (not previously gathered together) on how they represent (the sounds of) the languages they record. This chapter begins with a historical-descriptive survey of the world’s writ- ing systems, and elements of a theory of writing follow.
    [Show full text]
  • Initial Literacy in Devanagari: What Matters to Learners
    Initial literacy in Devanagari: What Matters to Learners Renu Gupta University of Aizu 1. Introduction Heritage language instruction is on the rise at universities in the US, presenting a new set of challenges for language pedagogy. Since heritage language learners have some experience of the target language from their home environment, language instruction cannot be modeled on foreign language instruction (Kondo-Brown, 2003). At the same time, there may be significant gaps in the learner’s competence; in describing heritage learners of South Asian languages, Moag (1996) notes that their acquisition of the native language has often atrophied at an early stage. For learners of South Asian languages, the writing system presents an additional hurdle. Moag (1996) states that heritage learners have more problems than their American counterparts in learning the script; the heritage learner “typically takes much more time to master the script, and persists in having problems with both reading and writing far longer than his or her American counterpart” (page 170). Moag attributes this to the different purposes for which learners study the language, arguing that non-native speakers rapidly learn the script since they study the language for professional purposes, whereas heritage learners study it for personal reasons. Although many heritage languages, such as Japanese, Mandarin, and the South Asian languages, use writing systems that differ from English, the difficulties of learning a second writing system have received little attention. At the elementary school level, Perez (2004) has summarized differences in writing systems and rhetorical structures, while Sassoon (1995) has documented the problems of school children learning English as a second script; in addition, Cook and Bassetti (2005) offer research studies on the acquisition of some writing systems.
    [Show full text]
  • 6 the Hindu-Arabic System (800 BC)
    6 The Hindu-Arabic System (800 BC) Today the most universally used system of numeration is the Hindu-Arabic system, also known as the decimal system or base ten system. The system was named for the Indian scholars who invented it at least as early as 800 BC and for the Arabs who transmitted it to the western world. Since the base of the system is ten, it requires special symbols for the numbers zero through nine. The following list the features of this system: (1) Digits: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9. These symbols can be used in combination to represent all possible numbers. (2) Grouping by ten: Grouping into sets of 10 is a basic principle of this sys- tem. Figure 6.1 shows how grouping is helpful when representing a collection of objects. Figure 6.1 (3) Place value: The place value assigns a value of a digit depending on its placement in a numeral. To find the value of a digit in a whole number, we multiply the place value of the digit by its face value, where the face value is a digit. For example, in the numeral 5984, the 5 has place value "thousands", the 9 has place value "hundreds", the 8 has place value "tens", and 4 has place value "units". (4) Expanded form: We can express the numeral 5984 as the sum of its digits times their respective place values, i.e. in expanded form 5984 = 5 × 1000 + 9 × 100 + 8 × 10 + 4 = 5 × 103 + 9 × 102 + 8 × 101 + 4: Example 6.1 Express the number 45,362 in expanded form.
    [Show full text]