ICON 2016 Workshop on Bridging the Gap Between Sanskrit Computational Linguistics Tools and Management of Sanskrit Digital Libraries 17 – 20 December 2016, IIT-BHU
edited by Amba Kulkarni, Gérard Huet, Srinivas Varakhedi, and Pawan Goyal

28 November 2016

Contents

1 Sanskrit Library conventions
  1 Character encoding
  2 TEI text markup
  3 TEI bibliography
  4 TEI dictionary
  5 Dictionary headword coordination
  6 Root lists
  7 Morphological and syntactic tagging
    7.1 Lexical tagging
    7.2 Inflectional tagging
    7.3 Syntactic tagging
  8 Meter recognition

Sanskrit Library conventions of digital representation and annotation of texts, lexica, and manuscripts

Peter M. Scharf
The Sanskrit Library
[email protected]

The ICON 2016 workshop on bridging the gap between Sanskrit computational linguistics tools and management of Sanskrit digital libraries aims to improve interoperability between software and corpora. The first and most important step in facilitating such interoperability is the explicit description and publication of conventions used. Hence in order to foster interoperability, the current paper surveys the best of the conventions developed and used at the Sanskrit Library and the efforts to coordinate them with colleagues in the Sanskrit Computational Linguistics Consortium. These conventions include character encoding, text markup, bibliographic annotation, morphological tagging, syntactic tagging, meter recognition, and the markup and lexical representation of dictionaries and root lists. This survey makes no claim to being comprehensive but only to contributing to productive interchange at the workshop.

1 Character encoding

Contemporary uses of information technology demand higher standards of encoding than the inherited systems prevalent today. Guided by visual factors, current encoding systems reproduce deficiencies inherent in traditional writing systems.
The contemporary use of computers for the manipulation of linguistic and textual data demands more relevant information-processing principles. The fundamental issue in encoding natural language texts concerns the relation that the information selected bears to natural language structure. Selecting appropriate information to encode demands addressing whether to encode written characters or speech sounds, whether to encode segments or features, and what criteria to use to contrast items selected for encoding. Scharf and Hyman (2011) and Scharf (2014) considered these issues in detail and provided extensive bibliography.

Preventing loss of knowledge requires determining what information is essential and devising a lossless method of encoding that information while ignoring the extraneous. Efficiency in representing essential information implies the fundamental principle of character encoding, namely, to avoid ambiguity and redundancy. Avoiding ambiguity and redundancy requires that an encoding system be characterized by a one-to-one correspondence between characters and items to be encoded, and that all encoded items be of the same kind (e.g., phonemes or written characters).

Most contemporary encoding systems for Sanskrit designed for use with the computer are based either upon the Indic script structure represented by Devanāgarī or upon Sanskrit Romanization. The encodings therefore retain features inherent in the scripts themselves and likewise reproduce the following five deficiencies inherent in these scripts:

1. In the Devanāgarī standards, there are separate characters for vowels when they appear post-consonantally versus when not preceded by a consonant. In the Roman standards, a single character is used in all contexts.

2. In the Devanāgarī standards, post-consonantal /a/ is implicitly indicated by the graph of the preceding consonant, while its absence is explicitly represented by a sign indicating the cessation of speech (virāma). In the Roman standards, the distribution of ⟨a⟩ corresponds exactly to the distribution of the vowel /a/.

3. In the Roman standards, certain single sounds are represented by digraphs: the aspirate stops (kh, gh, ch, jh, ṭh, ḍh, th, dh, ph, bh) and the open diphthongs (ai, au). In the Devanāgarī standards, single characters represent each of these segments.

4. In both the Devanāgarī and Roman standards the aspirate retroflex lateral flap /ḷh/ is represented by a digraph: ळ्ह, ḷh.

5. An additional discrepancy exists between the encoding of accent in Devanāgarī script and the encoding in standard Romanization. The Romanization encodes lexical or post-prosodic high pitch and independent circumflex, or deep accent. Devanāgarī encodes manifest pitch or surface accent. The failure of scholars to recognize the difference has led to confused explanations of Devanāgarī accentual systems and the obfuscation of genuinely different traditions of recitation and dialects (Scharf 2012).

In items (1), (3), and (4), a single sound is represented by more than one character, and in (2), a sound is inversely represented: that is, the presence of the sound is represented by the absence of a character, and the absence of the sound by the presence of a character. The departure from the principle of a one-to-one correspondence between what is to be represented and the representation signals confusion concerning the principles of encoding. In the digraphs that represent aspirated consonants in Sanskrit, the ⟨h⟩ represents aspiration, a phonetic feature rather than a segment, while by itself ⟨h⟩ represents a phonetic segment. The Devanāgarī representation of the aspirate retroflex lateral flap /ḷh/ suffers the same fault.
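The inverse representation described in item (2), and the context-dependent vowel signs of item (1), can be made concrete with a small transcoder. The following is only a minimal sketch, not the Sanskrit Library's own software: the function name `slp1_to_devanagari` and the table names are hypothetical, the SLP1 letter values are assumed from the published SLP1 scheme, and accents and Vedic signs are left out.

```python
# Minimal SLP1 -> Devanagari sketch (hypothetical helper, not Sanskrit Library code).
VIRAMA = "\u094d"

# One SLP1 character per consonant; Devanagari glyphs carry an inherent /a/.
CONSONANTS = dict(zip(
    "kKgGNcCjJYwWqQRtTdDnpPbBmyrlvSzshL",
    "कखगघङचछजझञटठडढणतथदधनपफबभमयरलवशषसहळ",
))
# The aspirate retroflex lateral is a single character in SLP1
# but a conjunct of two letters in Devanagari (item 4 above).
CONSONANTS["|"] = "ळ" + VIRAMA + "ह"

# Vowel forms used when no consonant precedes.
INDEPENDENT = dict(zip("aAiIuUfFxXeEoO", "अआइईउऊऋॠऌॡएऐओऔ"))

# Dependent vowel signs used after a consonant; /a/ is written by writing nothing.
MATRAS = dict(zip(
    "AiIuUfFxXeEoO",
    "\u093e\u093f\u0940\u0941\u0942\u0943\u0944\u0962\u0963\u0947\u0948\u094b\u094c",
))
MATRAS["a"] = ""

SIGNS = {"M": "\u0902", "H": "\u0903"}  # anusvara, visarga


def slp1_to_devanagari(text: str) -> str:
    """Transcode an SLP1 string (no accents) to Devanagari."""
    out = []
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if ch in CONSONANTS:
            out.append(CONSONANTS[ch])
            nxt = text[i + 1] if i + 1 < n else ""
            if nxt in MATRAS:
                # A following vowel becomes a dependent sign (nothing for /a/).
                out.append(MATRAS[nxt])
                i += 2
            else:
                # Absence of a vowel must be marked explicitly with virama.
                out.append(VIRAMA)
                i += 1
        elif ch in INDEPENDENT:
            out.append(INDEPENDENT[ch])
            i += 1
        else:
            out.append(SIGNS.get(ch, ch))  # pass unknown characters through
            i += 1
    return "".join(out)


if __name__ == "__main__":
    for word in ("ka", "k", "rAma", "vAk", "icCA"):
        print(word, slp1_to_devanagari(word))
```

In this sketch the two-character input `ka` yields a single Devanāgarī letter, while the one-character input `k` yields two code points (the letter plus virāma), which is the inverse relation the text describes.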
Like the ⟨h⟩ of the aspirate digraphs, the graphs ⟨a⟩, ⟨i⟩, and ⟨u⟩ in the digraphs ⟨ai⟩ and ⟨au⟩ that represent open diphthongs in Romanization represent subsegments, while independently they represent independent segments. While the dual use of ⟨h⟩ (and ⟨ह⟩ in Devanāgarī) does not lead to ambiguity because the sequence of consonant plus /h/ does not occur in Sanskrit, the dual use of ⟨i⟩ and ⟨u⟩ in the Romanization is ambiguous. The identical sequence ⟨ai⟩ in taiḥ and manaicchā, for example, does not itself indicate that it represents a diphthong in the former and a sequence of two independent vowels in the latter. Similarly the sequence of characters ⟨au⟩ represents the sequence of two simple vowels in prauga but represents a diphthong in prauḍha. Although one could disambiguate the Romanization by introducing a diaeresis over the second character to show that the two characters represent two vowels in sequence rather than a diphthong (thus manaïcchā, praüga), such a procedure introduces redundancy. There would then be two ways of representing the phonetic segments /i/ and /u/, one with and one without diaeresis. In practical terms, someone searching for the occurrence of the word icchā would have to search a second time for ïcchā as well, or software developers would have to accommodate the anomalous encoding in their search program. Introduction of the diaeresis to remove the ambiguity is a visual patch that reveals deeper structural problems in the encoding.

Digital archives and software that persist in the use of Unicode Devanāgarī or Unicode representation of the International Alphabet of Sanskrit Transliteration (IAST) perpetuate these structural problems. While encodings such as WX address them to a limited extent,¹ the Sanskrit Library phonetic encoding schemes address them in a consistent and comprehensive manner.
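The ambiguity of ⟨ai⟩ and ⟨au⟩ and its effect on searching can be illustrated with a short transcoder from IAST to SLP1. This is a sketch under stated assumptions rather than the Sanskrit Library's transcoding software: the function name `iast_to_slp1` and the table are hypothetical, the table covers only unaccented classical Sanskrit, IAST ⟨ḷ⟩ is assumed to mean the vocalic sound, and a greedy longest-match rule is assumed for digraphs.

```python
import unicodedata

# Partial IAST -> SLP1 table (illustrative assumption; no accents or Vedic signs).
_RAW_TABLE = {
    # digraphs that stand for ONE sound
    "ai": "E", "au": "O",
    "kh": "K", "gh": "G", "ch": "C", "jh": "J", "ṭh": "W", "ḍh": "Q",
    "th": "T", "dh": "D", "ph": "P", "bh": "B",
    # vowels
    "a": "a", "ā": "A", "i": "i", "ī": "I", "u": "u", "ū": "U",
    "ṛ": "f", "ṝ": "F", "ḷ": "x", "ḹ": "X", "e": "e", "o": "o",
    # a diaeresis marks a separate vowel, so it maps to the plain vowel
    "ï": "i", "ü": "u",
    # consonants and signs
    "k": "k", "g": "g", "ṅ": "N", "c": "c", "j": "j", "ñ": "Y",
    "ṭ": "w", "ḍ": "q", "ṇ": "R", "t": "t", "d": "d", "n": "n",
    "p": "p", "b": "b", "m": "m", "y": "y", "r": "r", "l": "l", "v": "v",
    "ś": "S", "ṣ": "z", "s": "s", "h": "h", "ṃ": "M", "ḥ": "H",
}
# Normalize keys so that precomposed and decomposed input both match.
IAST_TO_SLP1 = {unicodedata.normalize("NFC", k): v for k, v in _RAW_TABLE.items()}


def iast_to_slp1(text: str) -> str:
    """Greedy longest-match transcoding of lowercase IAST to SLP1."""
    text = unicodedata.normalize("NFC", text)
    out = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if len(pair) == 2 and pair in IAST_TO_SLP1:
            out.append(IAST_TO_SLP1[pair])  # digraph -> one SLP1 character
            i += 2
        else:
            ch = text[i]
            out.append(IAST_TO_SLP1.get(ch, ch))  # unknown characters pass through
            i += 1
    return "".join(out)


if __name__ == "__main__":
    for word in ("taiḥ", "manaïcchā", "prauḍha", "praüga", "prauga"):
        print(word, "->", iast_to_slp1(word))
```

Under these assumptions, the diaeresis forms manaïcchā and praüga come out with separate vowels, whereas the bare spelling prauga is read as containing the diphthong, since nothing in the character sequence says otherwise. Once text is stored in SLP1, a single search for `icCA` finds the word whether or not the Romanized source carried a diaeresis.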
These encoding schemes are capable of showing all the contrastive information in the corpus of Sanskrit texts using a broad concept of phoneme. Unlike Devanāgarī script and the encoding schemes based upon it but like Romanizations, the Sanskrit Library basic encoding scheme (SLP1) represents vowels, including the vowel /a/, by unique characters, regardless of whether they follow a consonant or not. Unlike most Romanizations, SLP1 represents diphthongs and aspirated stops by a single character as well. Unlike both Devanāgarī and Romanizations, it represents the aspirated retroflex lateral flap by a single character too. Yet because SLP1 is limited to ASCII characters, accents, nasalization and other modifications of basic sound segments are represented by modifier characters that represent features rather than segments. SLP2 is a strictly segmental encoding that represents each segment by a single codepoint. SLP3 is a strictly featural encoding that represents only phonetic features; segments are represented by bundles of features rather than by single codepoints.

¹ Like SLP1, WX usually represents a single segment by a single character. The scheme is, however, limited, since it does not have characters that designate long vocalic l̥̄, the unaspirated and aspirated retroflex lateral flaps ḷ and ḷh, any system of accents, or other sounds peculiar to Vedic traditions.

The Sanskrit Library Phonetic encoding schemes simplify linguistic processing. Hence the Sanskrit Library utilizes these encodings for the storage of a corpus of Sanskrit texts and for linguistic processing. We segregate these functions from the functions of data-entry and display and employ transcoding software to interface between the functions. Besides simplifying linguistic processing, the segregation of functions allows for a high degree of flexibility in data-entry and display options thereby permitting