Samsung R&D Institute Poland Submission to WAT 2021 Indic


Samsung R&D Institute Poland submission to WAT 2021 Indic Language Multilingual Task

Adam Dobrowolski, Marcin Szymański, Marcin Chochowski, Paweł Przybysz
Samsung R&D Institute, Warsaw, Poland
{a.dobrowols2, m.szymanski, m.chochowski, p.przybysz}@samsung.com

MultiIndicMT: An Indic Language Multilingual Task
Team ID: SRPOL

Proceedings of the 8th Workshop on Asian Translation, pages 224–232, Bangkok, Thailand (online), August 5-6, 2021. ©2021 Association for Computational Linguistics

Abstract

This paper describes the submission to the WAT 2021 Indic Language Multilingual Task by Samsung R&D Institute Poland. The task covered translation between 10 Indic languages (Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil and Telugu) and English.

We combined a variety of techniques: transliteration, filtering, backtranslation, domain adaptation, knowledge distillation and finally ensembling of NMT models. We applied an effective approach to low-resource training that consists of pretraining on backtranslations and tuning on parallel corpora. We experimented with two different domain-adaptation techniques which significantly improved translation quality when applied to monolingual corpora. We researched and applied a novel approach for finding the best hyperparameters for ensembling a number of translation models.

All techniques combined gave a significant improvement, up to +8 BLEU over baseline results. The quality of the models was confirmed by the human evaluation, where SRPOL models scored best for all 5 manually evaluated languages.

1 Introduction

The Samsung R&D Poland team researched effective techniques that worked especially well for low-resource languages: transliteration and iterative backtranslation followed by tuning on parallel corpora. We successfully applied these techniques during the WAT2021 competition (Nakazawa et al., 2021). Specifically for the competition we also applied custom domain-adaptation techniques, which substantially improved the final results.

Most of the applied techniques and ideas are commonly used in work on machine translation of Indian languages (Chu and Wang, 2018)(Dabre et al., 2020).

This document is structured as follows. In Section 2 we describe the sources and techniques of corpora preparation used for the training. In Sections 3 and 4 we describe the model architecture and the techniques used in training, tuning and ensembling. Finally, Section 5 presents the results we obtained at every stage of the training.

All trainings were performed on Transformer models. We used the standard Marian NMT v.1.9 framework (github.com/marian-nmt/marian).

2 Data

2.1 Multilingual trainings

Multilingual models trained for the competition use a target language tag at the beginning of the sentence to select the direction of the translation.

2.2 Transliteration

Indian languages use a variety of scripts. Using transliteration between scripts of similar languages may improve the quality of multilingual models, as described in (Bawden et al., 2019)(Goyal and Sharma, 2019). The transliteration we applied replaced Indian letters of various scripts with their equivalents in the Devanagari script. We used the indic-NLP library (https://github.com/anoopkunchukuttan/indic_nlp_library) to perform the transliteration.

In our previous experiments with Indian languages we noticed an overall improvement of quality for multi-Indian models, so we used transliteration in all trainings. However, additional experiments on transliteration during the competition were not conclusive. The results for trainings on raw corpora, without transliteration, were similar (see Table 1).

2.3 Parallel Corpora Filtering

The base corpus for all trainings was the concatenation of the complete bilingual corpora provided by the organizers (further referenced as bitext), 11M lines in total. No filtering or preprocessing (apart from the transliteration) was performed on this corpus. The corpus included parallel data from: CVIT-PIB, PMIndia, IITB 3.0, JW, NLPC, UFAL EnTam, Uka Tarsadia, Wiki Titles, ALT, OpenSubtitles, Bible-uedin, MTEnglish2Odia, OdiEnCorp 2.0, TED, WikiMatrix.

During the competition we performed several experiments to enrich or filter this parallel corpus:

• Inclusion of the CCAligned corpus
• Removing far-from-domain sentence pairs, such as religious corpora
• Removing sentence pairs of low probability (according to e.g. sentence lengths, detected language etc.)
• Domain adaptation by fastText
• Domain adaptation by language model

None of these techniques applied to the parallel corpora led to a quality improvement, which is why we decided to continue with the basic non-filtered corpora as the base for further trainings.

2.4 Backtranslation

Backtranslation of monolingual corpora is a commonly used technique for improving machine translation, especially for low-resource languages where only small bilingual corpora are available (Edunov et al., 2018). Training on backtranslations enriches the target language model, which improves the overall translation quality. The synthetic backtranslated corpus was joined with the original bilingual corpus for the trainings.

Using backtranslations of the full monolingual corpora improved results in the Indian-to-English directions by 1.2 BLEU on average. There was no improvement in the opposite directions. See Tables 5 and 6.

2.5 Domain adaptation

We enriched the parallel training corpora with backtranslated monolingual data, selecting only sentences similar to the PMI domain in order to increase the rate of in-domain data in the training corpus. We used two different techniques to select the in-domain sentences for backtranslation. With these techniques we trained two separate families of MT models.

Domain adaptation by fastText (FT). We applied the domain adaptation described in (Yu et al., 2020). Following the hints from that paper, we trained a fastText (Joulin et al., 2017) model using a balanced corpus containing sentences from PMIndia labelled as in-domain and CCAligned sentences labelled as out-domain. Using the trained model we filtered the parallel as well as the monolingual corpora.

Domain adaptation by language model (LM). As the second approach to selecting a subset of the best PMI-like sentences from the monolingual general-domain AI4Bharat (Kunchukuttan et al., 2020) corpora available for the task, we used the approach described in (Axelrod et al., 2011). For each of the 10 Indian languages two RNN language models were constructed using the Marian toolkit: an in-domain model trained on a particular part of the PMI corpus, and an out-of-domain model created using a similar number of lines from a mix of all other corpora available for that language. All these models were regularized with exponential smoothing of 0.0001 and dropout of 0.2, along with source and target word token dropout of 0.1. For ranking the sentences of the AI4Bharat mono corpus, we used the cross-entropy difference between the scores of the two models, as suggested in (Axelrod et al., 2011), normalized by the line length. Only sentences with a score above an arbitrarily chosen threshold were selected for further processing.

We noticed a significant influence of domain adaptation when selecting the mono corpora used for backtranslation (see Table 3).

2.6 Multi-Agent Dual Learning

For some of the trainings, we used the simplified version of Multi-Agent Dual Learning (MADL) (Wang et al., 2019), proposed in Kim et al. (2019), to generate additional training data from the parallel corpus. We generated n-best translations of both the source and the target sides of the parallel data, with strong ensembles of, respectively, the forward and the backward models. Next, we picked the best translation from among the n candidates w.r.t. the sentence-level BLEU score. Thanks to these steps, we tripled the number of sentences by combining three types of datasets:

1. original source – original target,
2. original source – synthetic target,
3. synthetic source – original target,

where the synthetic target is the translation of the original source with the forward model, and the synthetic source is the translation of the original target with the backward model.

2.7 Postprocessing

In comparison to our competitors we noticed significantly weaker performance in the En-Or direction. After the analysis we found that the generated corpora contain sequences of characters (U+0B2F-U+0B3C, U+0B5F) not present in the devset corpora. Replacing these sequences with the sequence (U+0B5F-U+0B3E) gave a significant improvement for En-Or of about +4 BLEU.

Parallel                       En-In   In-En
bitext                         18.03   31.41
CCAligned                       6.82   12.15
PMIndia                         5.59   11.94
bitext+CC                      17.62   30.56
bitext, no religious           15.33   29.02
bitext, filtered FT            17.84   29.38
bitext, most likely            17.98   31.00
bitext, no transliteration     18.36   31.27
With backtranslation
bitext+BT filtered LM          18.22   31.38
bitext+BT filtered FT          18.71   32.77
bitext+CC+BT filtered FT       18.21   30.64
MADL
MADL                           18.87   31.94
MADL+BT filtered FT            18.83   33.25

Table 1: Average BLEU for preliminary trainings (4.1) on different corpora.

3 NMT System Overview

All of our systems are trained with the Marian NMT (Junczys-Dowmunt et al., 2018) framework.

3.1 Baseline systems for preliminary experiments

First experiments were performed with transformer models (Vaswani et al., 2017), which we will now refer to as transformer-base. The only difference is that we used 8 encoder layers and 4 decoder layers instead of the default 6-6 configuration. The model has the default embedding dimension of 512 and a feed-forward layer dimension of 2048.

[...] layer dimension of 4096. The transformer-big trainings were regularized with a dropout between transformer layers of 0.1 and a label smoothing of 0.1, unlike the transformer-base, which was trained without dropout.

4 Trainings

4.1 Preliminary trainings

During preliminary trainings, we tested which techniques of filtering/backtranslation/MADL work best for the task. Preliminary trainings were performed for all 20 directions on a single transformer-base model with no dropout.
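As an illustration of the target language tag from Section 2.1, the following minimal sketch prefixes each source sentence with a token naming the desired output language. The paper does not state the tag format; the `<2xx>` pattern and the function name `add_target_tag` are assumptions made only for this example.

```python
def add_target_tag(source_sentence: str, target_lang: str) -> str:
    """Prefix a source sentence with a target-language token.

    The "<2xx>" tag format is an assumption for illustration; the paper
    only says that a tag is placed at the beginning of the sentence.
    """
    return f"<2{target_lang}> {source_sentence}"


if __name__ == "__main__":
    # The same English sentence routed to two different target languages.
    print(add_target_tag("How are you?", "hi"))
    print(add_target_tag("How are you?", "ta"))
```

With such a tag, a single multilingual model can serve all translation directions, because the decoder is conditioned on the tag that opens the input.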
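The language-model selection in Section 2.5 can be sketched as follows. This is a toy version under stated assumptions: add-one smoothed unigram models stand in for the RNN language models trained with Marian, the sign convention (in-domain log-probability minus out-of-domain log-probability, so that higher means more in-domain) is chosen to match the "score above threshold" wording, and all function names and example sentences are hypothetical.

```python
import math
from collections import Counter


def train_unigram(sentences):
    """Return a log-probability function for an add-one smoothed unigram LM.

    A stand-in for the RNN language models used in the paper.
    """
    counts = Counter(tok for s in sentences for tok in s.split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot for unseen tokens

    def logprob(sentence):
        return sum(
            math.log((counts[tok] + 1) / (total + vocab))
            for tok in sentence.split()
        )

    return logprob


def domain_score(sentence, logp_in, logp_out):
    """Cross-entropy difference normalized by sentence length.

    Higher values mean the sentence looks more in-domain (assumed convention).
    """
    length = max(len(sentence.split()), 1)
    return (logp_in(sentence) - logp_out(sentence)) / length


def select_in_domain(sentences, logp_in, logp_out, threshold):
    """Keep only sentences whose score exceeds the threshold."""
    return [s for s in sentences if domain_score(s, logp_in, logp_out) > threshold]


if __name__ == "__main__":
    logp_in = train_unigram(["the minister announced the scheme",
                             "the government launched the scheme"])
    logp_out = train_unigram(["i love this movie", "you love this song"])
    mono = ["the minister launched the scheme", "i love this song"]
    print(select_in_domain(mono, logp_in, logp_out, threshold=0.0))
```

Normalizing by length keeps long sentences from dominating the ranking purely because they accumulate more log-probability mass.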
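The data generation of Section 2.6 can be sketched as below. Several parts are assumptions for illustration: the sentence-level BLEU is a simplified add-one smoothed variant written here to keep the example self-contained (not the metric implementation used by the authors), each candidate is scored against the original sentence on the opposite side of the pair, and the n-best lists are passed in as plain dictionaries instead of being produced by model ensembles.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(hypothesis, reference, max_n=4):
    """Simplified sentence-level BLEU with add-one smoothing (illustrative)."""
    hyp, ref = hypothesis.split(), reference.split()
    if not hyp:
        return 0.0
    log_precision = 0.0
    for n in range(1, max_n + 1):
        hyp_ngrams, ref_ngrams = ngrams(hyp, n), ngrams(ref, n)
        overlap = sum((hyp_ngrams & ref_ngrams).values())  # clipped matches
        total = sum(hyp_ngrams.values())
        log_precision += math.log((overlap + 1) / (total + 1))
    # Brevity penalty in log space: zero when the hypothesis is long enough.
    brevity = min(0.0, 1.0 - len(ref) / len(hyp))
    return math.exp(log_precision / max_n + brevity)


def pick_best(candidates, reference):
    """Choose the n-best candidate closest to the reference."""
    return max(candidates, key=lambda c: sentence_bleu(c, reference))


def build_madl_corpus(pairs, forward_nbest, backward_nbest):
    """Triple the corpus: original, synthetic-target and synthetic-source pairs."""
    corpus = []
    for src, tgt in pairs:
        synthetic_tgt = pick_best(forward_nbest[src], tgt)
        synthetic_src = pick_best(backward_nbest[tgt], src)
        corpus.append((src, tgt))            # 1. original source, original target
        corpus.append((src, synthetic_tgt))  # 2. original source, synthetic target
        corpus.append((synthetic_src, tgt))  # 3. synthetic source, original target
    return corpus


if __name__ == "__main__":
    pairs = [("src sentence", "the cat sat on the mat")]
    forward = {"src sentence": ["the cat sat", "a cat sat on the mat"]}
    backward = {"the cat sat on the mat": ["src sentence", "source"]}
    for pair in build_madl_corpus(pairs, forward, backward):
        print(pair)
```

Selecting the candidate closest to the original sentence keeps the synthetic side faithful to the reference while still adding lexical variety to the training data.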