Automatic Phonetic Transcription of Tamil in Roman Script
Total Page:16
File Type:pdf, Size:1020Kb
Proe. Indian Acad. Sci., Vol. 86 A, No. 6, December 1977, pp. 503-512, 9 Printed in India. Automatic phonetic transcription of Tamil in Roman script E V KRISHNAMURTHY Department of Applied Mathematics, Indian Institute of Science, Bangalore 560 012 MS received 14 Febmary 1977; revised 20 June 1977 Abstract. Simple formalized rules are proposed for automatic phonetic transcription of Tamil words into Roman script. These rules are syntax-directed and require a one-symbol look-ahead facility and hence easily automated in a digital computer. Some suggestions are also put forth for the linearization of Tamil script for handling these by modern machinery. Keywords. Automatic phonetic transcription; data representation; linearization of Tamil script; Roman script; Tamil script; transcription. 1. Introduction Ir is well known that in the oldest of the Dravidian languages, Tamil, thepronuncia- tion of the letters eorresponding to the eonstants ~, g~, ,', ~, t~, ~ (called Surds) and their eonsonanto-vowels* vary eonsiderably depending upon the sequence in which these letters are embedded (hence may be called phonetically eontext-sensitive) e.g. there is only one set of a consonanto-vowel (CV) corresponding to the four different sound-values** in the words below; these are underlined. Na hai Ka t tu Pakhkhu Ponggal In other words there is a many-to-one mapping from sound-values to the Surd anda one-to-many mapping from the surd to its sound value. Accordingly, Tamil is nota strictly phonetic language (in relation to its writing system) unlike all other Indian languages. The advantage of not being strictly phonetic in this sense is rewarding, since there is ala enormous reduction in the number of letters in the alphabet; the disadvantage, however, from the point of view of a non-native speaker is the difficulty in leaxning the actual pronunciation which varies depending upon the context. It is further important to study whether there exist simple formalized sound- generative rules (which are context-sensitive) for the correct pronunciation (as in *Consonanto-vowel=a letter representing a consonant-vowel (CV) combination. **These are the sound-values met with in the author's own dialect. Other sound values may be found in other dialects. The same method would still hold with suitable modi¡ in figure 2. As our airo in this paper is towards an automatic approach to the problem, we confine to as few rules as possible. We believe further studies will result in improvement. 503 504 E V Krishnamurthy current-day-usage) of Tamil words. Tlª paper is a prcliminaxy attempt towaxds this goal. We will see later that the set of formalized rulest axe so few and simple and that these are purely dependent upon syntax; therefore it is possible to generate Tamil words with correct pronunciation by an algorithm and hence by the modero machinery (artificial laxynx anda hybrid computer). Accordingly, this form will be useful for the automatic generation of speech from T text. At this point one might wonder as to whether such formalized rules were non- existent in ah ancient, rich and glorious language, sucia as Taxnil. This question has been raised and answered in a different manner by scveral renowned scholars. In fact Caldwell (see Varadaxajan 1963) refers to this phenomenon as convertibility of surds into sonants (sec also Burrow 1968 and Masica et al 1963) and asserts that this has been in existence in Tamil since ancient times. Caldwell did not, however, make any detailed study of the set of formalized rules. Vaxadaxajan (1963), however, is of the view that this convertibility phenomenon might have arisen due to the influence of Sanskrit. His view is based on the following three aspects: (i) None of the classical Tamil grammar works of great repute--such as Tolkappiam, Nannool, Yapparungalam which deal with many other more intricate and complicated aspects of grammax have paid attcntion to this phenomenon. (ii) In othcr Dravidian languages which have extensivcly borrowed words from Sanskrit, this phcnomenon is nearly absent. (ª Instead of borrowing words, Tamil might have adapted some of its own wotds with foreign sounds. Howevcr, the present author is of the view that this phenomenon of convertibility of sounds might have evolvcd with a view to economize on the effort so as to satisfy Zipf's principle of least effort, (see Chcrry 1971; Zipf 1959); and accordingly, might have cxistcd since ancient times. Ir sectas possible to substantiate this economy aspcct purely from acoustical energy considerations. Another reason for the lack of formalized rules for this phenomenon may be the absence of a suitable meta-language (the language for description of the formalized system) since many of these variations in sounds cannot be expressed in written forro without using the near-equivalent sounds of Devanagari or English (as is done in the present paper) whieh were not available then. Therefore, ir can be assumed that these rules were oraUy transmitted in ancient days. In any case, leaving aside the above eontroversy, it appeaxs that no systematic set of rules are available for automating the process of pronunciation of Tamil words. Ir is the object of this paper to propose certain basic rules for the sake of arriving at a mechanized pronunciation scheme. This study has given rise to a very interesting result that Tamil has achieved the economy in the number of letters as well as in the effort required for pronunciation, by resorting to very simple syntax-directed rules based on one-symbol look-ahead without back-tracking (Wirth 1976). Of course, one important aspect of the Tamil language to be borne in mind is that the ehange of sound value in Surds does not produce words of different meanings, although the pronunciation may be a depaxture from the theoretieal norm. tExtensive analysis of dialr162of Tamil may perhaps demanda r in this basic set of mies; we hope to do this eventually after gaining enough experiencr Automatic phonetic transeription of Tamil in Roman script 505 In our study, we would confine our attention to only that part of sound change at the syntax level and not considered the more complex rules of timing of each alphabet in a typical real-time conversation; these are more complex to handle at present. It is also assumed that the words used satisfy the standard Tamil grammar NannooI (Arumuga 1968; see also Ramasubramaniam 1968). 2. Defmitiom, terminology and input data representation Although ir is possible to use ah automatic recognition device to convert Tamil script into Roman script through the formalized rules we describe, the lack of these devices forces us to resort to some other convenient representation for input into a computer. After many serious considerations ir is found convenient to use a two digit base 20 system for representing the Tamil ]etters; twenty symbols being {0--9; A--J}. Figures 1 and 2 represent the vowels, consonants, their Devanagari and Roman letter equiva.lents and Base-20 representations. In figure 2 we have de¡ the sound values of the surds by the following scheme, with near equivalent Devanagari letters corresponding to those sounds. --l=Lax; 0----Neutral; l=Aspirated; 2=Voiced. 2.1. Remarks (i) In using RomaR-speUing we have neaxly followed the excellent suggestions of Ganeshsundaram (1976) except for minor deviations to facilitate easy reading. "O.A. A AA , . u u. E EE Al 0 O0 AU BASE 20 01 02 03 O 05 06 07 08 09 OA O60C 1(O) CON., ~ ~~,~ oEv. ~ VT . ,-Z ROMAN NG NJ N& NH M N BASE 20 10 20 30 4O 50 60 '. ', i CON, OEV. "~ ROMAN, y R L v ZH BASE 20 DO CO FO GO HO I0 l(b) Figure la. Tamil vowels and equivalents Figure lb, Tamil consonants and equivalcnts 506 E V Krishnamurthy 7O #0 $0 AO li CO .$ i~. Ks - ~ O ! Tll Flgmre 2. Tamilsurds and sound values (ª Note that two vowels cannot occur adjacenfly (except in por in a Tamil word; accordingly, the Roman-spelling suggr wil[ not IeaA to any ambiguity. ~iii) We have used in vr places ampersand (&) insteazi of X used by Ganesh- sundaram; this is psychologically more pleasing and is indicative of the doubling of the 3ound of the preceding letter, at the sarae time e[iminating arabiguity. (ir) The pronunciation of r~ and ,0 ate nearly identical; except for context (see rule 5 below). (v) Although ~ and ~r have the same pronunciation, we dislinguish these for the sake of uniqueness in transcribing from Roman to Tamil script. (vi) By adding the Base-20 code words of vowcls and consonants we can obtain the Base-20 code words of CV. e.g., 85-----~--SU. (vii) It is easy to recognize the nasals by code words and also ir is easy to de- compose any Tamil letter into its vowel § consonant parts by using the above data representation. This will be useful for parsing and other operations involved in Tamil grammar. Correspondingly the product set CV viz. (C x V) will have (I 2 • 32) sound leveIs. We now illustrate the pronunciation of certain words using the above scale levels (marked as superscript). a. u o ~-I ,~ ~o o ~ ~: Pa ha li I Ku ra ng gu va n dhu Pa zha m Thi ndra thu. b. u o ~x ~,t ~t ~x ~o Pa ch chai kh khu ra th thi oo si vi t ra l&. c. ~ m-x ,~x ,~Irr ~ Na ha th th a I Moo kh khai ku ra ng gu Kee ri ya thu, 3. Formalized pronunclation mies 3.1. Rule I (Word-startingru/e) When any member ofV or any pr member of CV occurs at the beginning of a word, it is pronounced at ir3 zero level (indicated by OL), (Note membr of C, Automatie phonette transcription of Tamtl In Roman serlpt 507 as well as members of CV derived from ,', t~, ~,/~, ~r, ~~r, ~r cannot occur as the ¡ letter).