Linguistic Issues in Encoding Sanskrit. Peter M. Scharf and Malcolm D. Hyman. 2010

Linguistic Issues in Encoding Sanskrit

Peter M. Scharf (Brown University)
Malcolm D. Hyman (MPIWG)

June 21, 2011

Scharf, Peter M. and Malcolm D. Hyman. Linguistic Issues in Encoding Sanskrit. Providence: The Sanskrit Library, 2011.

Copyright © 2011 by The Sanskrit Library. All rights reserved. Reproduction in any medium is restricted.

Foreword by GEORGE CARDONA

Questions surrounding the encoding of speech have been considered since scholars began to consider the history of different writing systems and of writing itself. In modern times, attention has been paid to such issues as standardizing systems for portraying in Roman script the scripts used for recording other languages, and this has given rise to discussions about distinctions such as that between transliteration and transcription. In recent times, moreover, the advent and general use of digital technology has allowed us not only to replicate with relative ease details of various scripts and to produce machine searchable texts but also to reproduce images of manuscripts that can be viewed and manipulated, a true boon to philologists in that they are thus enabled to consult and study materials with all the details found in original manuscripts, such as different hands that can be discerned and clues to modifications made due to features of different scripts.

At the source of such endeavors lie the facts of language: phonological and phonetic matters that scripts portray with various degrees of fidelity. India can justifiably lay claim to being the home of what is doubtless the most thorough and sophisticated consideration of speech production, phonetics, and phonology in ancient times.
The preservation of Vedic texts and their proper recitation according to the norms of various groups of reciters led to the early analysis of continuously recited texts (saṁhitāpāṭha) into constituents, called pada, characterized by phonological alternations that appear at word boundaries, including boundaries before particular morphemes within syntactic words. A text that includes such elements is termed padapāṭha. At least one such analyzed text predates the grammarian Pāṇini, the padapāṭha to the Ṛgveda by Śākalya. The padapāṭha related to any saṁhitāpāṭha obviously derives from the latter, its source. On the other hand, the separate padas of the padapāṭha can be viewed theoretically as the source of the continuously recited text, gotten by removing pauses at boundaries and thereby applying phonological rules that take effect between contiguous units. This is, in fact, the theoretical stance taken by authors of texts called prātiśākhya, which formulate phonological rules modifying padas in contiguity with other padas. Thus, phonological alternations within Vedic texts were objects of concern by at least the early sixth century B.C.

Pāṇini himself, who can hardly be dated later than around 500 B.C., composed a generalized grammatical work, his śabdānuśāsana, which includes both a set of rules, called Aṣṭādhyāyī, serving to account through a derivational system for the accepted usage of his time and place as well as certain dialectal differences and features particular to earlier Vedic. One of the appendices to the Aṣṭādhyāyī is an inventory of sounds, referred to as the akṣarasamāmnāya by early students of Pāṇini’s work, that is divided into fourteen sets, each set off from the others by a final consonantal marker (it), which serves to form abbreviatory terms (pratyāhāra) referring to groups of sounds with respect to phonological rules as formulated in the Aṣṭādhyāyī.

The order of sounds in Pāṇini’s akṣarasamāmnāya shows properties best explained as due to its being a reworking of an earlier source. The five sets of stops in such earlier inventories, moreover, show an obvious phonetic ordering, from velar to labial, that is, an order based on the production of sounds, from the back of the oral cavity to the front. Moreover, prātiśākhyas not only state rules of phonological replacement but also describe the production of sounds, a topic which is dealt with in works on phonetics (śikṣā) such as the Āpiśaliśikṣā of Āpiśali. Accordingly, scholars are justified in maintaining that early Indian texts reflect a sophisticated investigation of Sanskrit phonology and phonetics.

Scholars have also frequently debated whether or not writing played a role in the composition and transmission of such early works as the prātiśākhyas and the Aṣṭādhyāyī. There can be no doubt whatever that the latter was later transmitted orally. It is also most plausible that Pāṇini himself composed and transmitted his work orally. Thus, Pāṇini formulates a group of rules identifying certain sounds as markers, given the class name it, and provides that such sounds are unconditionally deleted before any other operations apply. Had he transmitted his work in writing, thus being able to make use of script particularities such as placing given sounds above or below a line, Pāṇini would not have needed such rules.
That works such as Pāṇini’s were transmitted orally does not mean, however, that the society in which Pāṇini lived was not literate. To the contrary, he lived in a part of the subcontinent, Śalātura in the extreme north-west, that at his time was under Persian control, and the Achemenid rulers had inscriptions recorded. Nevertheless, a literate society does not imply necessarily that compositions must be put in writing and thus transmitted; later Indian traditions, for example, stress the oral transmission, though writing was clearly known then. The earliest attested written documents on the subcontinent, nevertheless, come several centuries after Pāṇini. These are the inscriptions of the emperor Aśoka in the third century B.C., which for the most part employ two scripts: Brāhmī everywhere except the northwest, where Kharoṣṭhī is used; in the extreme north-west, one finds also Aramaic and Greek used.

Peter M. Scharf and the late Malcolm D. Hyman have written a valuable work, Linguistic Issues in Encoding Sanskrit, in which Sanskrit and its systems of description and transmission serve as a background to more general discussions concerning encoding of language. The authors explain the need for a work such as this and set forth their general aims in the introduction (p. 2) as follows:

    Today people use computers to manipulate linguistic and textual data in sophisticated ways; yet current encoding systems tend to reflect visual and orthographic design factors to the exclusion of more relevant information-processing principles. Thus these systems reproduce deficiencies inherent in the traditional orthographies themselves. In this book we examine some fundamental issues in the coding of natural language texts. We consider above all the relation the information selected for encoding bears to natural language structure. We focus on Sanskrit, which is characterized by an extensive oral tradition, a highly phonetic orthography, and a copious literature. We survey various Sanskrit encoding schemes in past and present use and investigate their suitability for particular applications. We conclude by advancing some concrete proposals.

Although this book centers on Sanskrit, it covers a great many important issues and history relative to the general subject of encoding. The second and third chapters take up different coding systems. A brief sketch of the history of Indian printing serves as a background to presenting coding systems, including Roman transliterations, keyboard arrangements, and Unicode. These are subjected to a critique that centers on issues of ambiguity and redundancy consequent to their being based on Devanāgarī and Roman transliteration. The fourth chapter may well be the most important one from a theoretical viewpoint. Here the authors take up what they deem to be the basis for encoding. Their discussion is organized around three axes, as follows (p. 47):

    Axis I: Graphic–phonetic: Is the basic unit of the encoding a written character or a speech sound?
    Axis II: Synthetic–analytic: Are units encoded as a single Gestalt? Or are they decomposed into distinctively encoded features?
    Axis III: Contrastive–non-contrastive: Are codepoints selected only for units that contrast minimally (graphemes or phonemes)?

The sixth and seventh chapters deal with the basic issue of encoding elements of speech or writing. The discussion of distinctive elements in chapter six is particularly wide ranging and includes succinct presentation of issues in areas such as generative grammar and historical linguistics. Given that the principal emphasis throughout is on Sanskrit, it is appropriate that these discussions are preceded, in chapter five, by considerations of Sanskrit phonetics and phonology.
These include both presentations of what was said in various prātiśākhyas and śikṣās, including treatments of these statements by modern scholars, and feature analysis (section 5.2.6). In the eighth and final chapter, the authors emphasize that, since computers now are used to carry out many tasks in addition to displaying data, this can no longer be considered the primary factor in determining a scheme for encoding.
