<<

Proe. Indian Acad. Sci., Vol. 86 A, No. 6, December 1977, pp. 503-512, 9 Printed in .

Automatic of Tamil in Roman

E V KRISHNAMURTHY Department of Applied Mathematics, Indian Institute of Science, Bangalore 560 012

MS received 14 Febmary 1977; revised 20 June 1977

Abstract. Simple formalized rules are proposed for automatic phonetic transcription of Tamil words into Roman script. These rules are syntax-directed and require a one-symbol look-ahead facility and hence easily automated in a digital computer. Some suggestions are also put forth for the linearization of for handling these by modern machinery.

Keywords. Automatic phonetic transcription; data representation; linearization of Tamil script; Roman script; Tamil script; transcription.

1. Introduction

Ir is well known that in the oldest of the , Tamil, thepronuncia- tion of the letters eorresponding to the eonstants ~, g~, ,', ~, t~, ~ (called Surds) and their eonsonanto-* vary eonsiderably depending upon the sequence in which these letters are embedded (hence may be called phonetically eontext-sensitive) .g. there is only one set of a consonanto- (CV) corresponding to the four different sound-values** in the words below; these are underlined.

Na hai t tu Pakhkhu Ponggal

In other words there is a many-to-one mapping from sound-values to the Surd anda one-to-many mapping from the surd to its sound value. Accordingly, Tamil is nota strictly phonetic language (in relation to its system) unlike all other Indian languages. The advantage of not being strictly phonetic in this sense is rewarding, since there is ala enormous reduction in the number of letters in the ; the disadvantage, however, from the point of view of a non-native speaker is the difficulty in leaxning the actual pronunciation which varies depending upon the context. It is further important to study whether there exist simple formalized sound- generative rules (which are context-sensitive) for the correct pronunciation (as in

*Consonanto-vowel=a letter representing a -vowel (CV) combination. **These are the sound-values met with in the author's own dialect. Other sound values may be found in other dialects. The same method would still hold with suitable modi¡ in figure 2. As our airo in this paper is towards an automatic approach to the problem, we confine to as few rules as possible. We believe further studies will result in improvement.

503 504 E V Krishnamurthy

current-day-usage) of Tamil words. Tlª paper is a prcliminaxy attempt towaxds this goal. We will see later that the set of formalized rulest axe so few and simple and that these are purely dependent upon syntax; therefore it is possible to generate Tamil words with correct pronunciation by an algorithm and hence by the modero machinery (artificial laxynx anda hybrid computer). Accordingly, this form will be useful for the automatic generation of speech from T text. At this point one might wonder as to whether such formalized rules were non- existent in ah ancient, rich and glorious language, sucia as Taxnil. This question has been raised and answered in a different manner by scveral renowned scholars. In fact Caldwell (see Varadaxajan 1963) refers to this phenomenon as convertibility of surds into sonants (sec also Burrow 1968 and Masica et al 1963) and asserts that this has been in existence in Tamil since ancient times. Caldwell did not, however, make any detailed study of the set of formalized rules. Vaxadaxajan (1963), however, is of the view that this convertibility phenomenon might have arisen due to the influence of . His view is based on the following three aspects: () None of the classical works of great repute--such as Tolkappiam, Nannool, Yapparungalam which deal with many other more intricate and complicated aspects of grammax have paid attcntion to this phenomenon. (ii) In othcr Dravidian languages which have extensivcly borrowed words from Sanskrit, this phcnomenon is nearly absent. (ª Instead of borrowing words, Tamil might have adapted some of its own wotds with foreign sounds. Howevcr, the present author is of the view that this phenomenon of convertibility of sounds might have evolvcd with a view to economize on the effort so as to satisfy Zipf's principle of least effort, (see Chcrry 1971; Zipf 1959); and accordingly, might have cxistcd since ancient times. Ir sectas possible to substantiate this economy aspcct purely from acoustical energy considerations. Another reason for the lack of formalized rules for this phenomenon may be the absence of a suitable meta-language (the language for description of the formalized system) since many of these variations in sounds cannot be expressed in written forro without using the near-equivalent sounds of or English (as is done in the present paper) whieh were not available then. Therefore, ir can be assumed that these rules were oraUy transmitted in ancient days. In any case, leaving aside the above eontroversy, it appeaxs that no systematic set of rules are available for automating the process of pronunciation of Tamil words. Ir is the object of this paper to propose certain basic rules for the sake of arriving at a mechanized pronunciation scheme. This study has given rise to a very interesting result that Tamil has achieved the economy in the number of letters as well as in the effort required for pronunciation, by resorting to very simple syntax-directed rules based on one-symbol look-ahead without back-tracking (Wirth 1976). Of course, one important aspect of the to be borne in mind is that the ehange of sound value in Surds does not produce words of different meanings, although the pronunciation may be a depaxture from the theoretieal norm.

tExtensive analysis of dialr162of Tamil may perhaps demanda r in this basic set of mies; we hope to do this eventually after gaining enough experiencr Automatic phonetic transeription of Tamil in Roman script 505

In our study, we would confine our attention to only that part of sound change at the syntax level and not considered the more complex rules of timing of each alphabet in a typical real-time conversation; these are more complex to handle at present. It is also assumed that the words used satisfy the standard Tamil grammar NannooI (Arumuga 1968; see also Ramasubramaniam 1968).

2. Defmitiom, terminology and input data representation

Although ir is possible to use ah automatic recognition device to convert Tamil script into Roman script through the formalized rules we describe, the lack of these devices forces us to resort to some other convenient representation for input into a computer. After many serious considerations ir is found convenient to use a two digit base 20 system for representing the Tamil ]etters; twenty symbols being {0--9; A--J}. Figures 1 and 2 represent the vowels, , their Devanagari and Roman letter equiva.lents and Base-20 representations. In figure 2 we have de¡ the sound values of the surds by the following scheme, with near equivalent Devanagari letters corresponding to those sounds.

--l=Lax; 0----Neutral; l=Aspirated; 2=Voiced.

2.1. Remarks (i) In using RomaR-speUing we have neaxly followed the excellent suggestions of Ganeshsundaram (1976) except for minor deviations to facilitate easy reading.

".A. A AA , . u. E EE Al 0 O0

BASE 20 01 02 03 O 05 06 07 08 09 OA O60C

1(O)

CON., ~ ~~,~ oEv. ~ VT . ,-Z

ROMAN NG NJ N& NH M N

BASE 20 10 20 30 4O 50 60 '. ', i CON, OEV. "~

ROMAN, y R L v ZH

BASE 20 DO CO FO GO HO I0 l(b) Figure . Tamil vowels and equivalents Figure lb, Tamil consonants and equivalcnts 506 E V Krishnamurthy

7O #0 $0 AO li CO

.$ i~. Ks - ~

O

! Tll

Flgmre 2. Tamilsurds and sound values

(ª Note that two vowels cannot occur adjacenfly (except in por in a Tamil word; accordingly, the Roman-spelling suggr wil[ not IeaA to any ambiguity. ~iii) We have used in vr places ampersand (&) insteazi of X used by Ganesh- sundaram; this is psychologically more pleasing and is indicative of the doubling of the 3ound of the preceding letter, at the sarae time e[iminating arabiguity. (ir) The pronunciation of r~ and ,0 ate nearly identical; except for context (see rule 5 below). (v) Although ~ and ~r have the same pronunciation, we dislinguish these for the sake of uniqueness in transcribing from Roman to Tamil script. (vi) By adding the Base-20 code words of vowcls and consonants we can obtain the Base-20 code words of CV. e.g., 85-----~--SU. (vii) It is easy to recognize the nasals by code words and also ir is easy to de- compose any Tamil letter into its vowel § consonant parts by using the above data representation. This will be useful for parsing and other operations involved in Tamil grammar. Correspondingly the product set CV viz. (C x V) will have (I 2 • 32) sound leveIs. We now illustrate the pronunciation of certain words using the above scale levels (marked as superscript). a. u o ~-I ,~ ~o o ~ ~: li I Ku ng gu n dhu

Pa zha m Thi ndra thu. b. u o ~x ~,t ~t ~x ~o Pa ch chai kh khu ra th thi oo si vi t ra l&. c. ~ m-x ,~x ,~Irr ~ Na ha th th a I Moo kh khai ku ra ng gu

Kee ri thu,

3. Formalized pronunclation mies 3.1. Rule I (Word-startingru/e) When any member ofV or any pr member of CV occurs at the beginning of a word, it is pronounced at ir3 zero level (indicated by OL), (Note membr of C, Automatie phonette transcription of Tamtl In Roman serlpt 507 as well as members of CV derived from ,', t~, ~,/~, ~r, ~~r, ~r cannot occur as the ¡ letter).

Exampl~:

3.2. Rule 2 ( Diminutive rule)

Ir a member of CV or V, or precedes another member of CV, then the following CV is pronounced at its lowest permissible sound value (LSL).

Exa.mples:

3.3. Rule 3 (Nasal augment rule)

Whenever any permissible member of CV follows the members of the set of Nasals (t~, ~, ~, ~r, ,~, ~) the following letter is pronounced at the maximum permissible sound value (HSL.)

Examples:

3.4. Rule 4 (Consonant augment rule) Ifa member of CV (other tlmn that derived from ,~) is preceded by its own correspond- ing member in C from which ir was derived, or ," or ~ then the following CV is pronounc~d at the sound-level value 1 (denoted by IL).

Examples:

~A~u~ u~tr~g'~; urraca; ~~u~~;

3.5.1. Rule 5a (T-rule)

(i) Whenever ~ is foUowed by a CV derived from it, then the two together pro- nounced, as though a T precedes the corresponding CV.

Exmnples:

Mutram, Mutri, Mutrum 5O8 E V l~sknam~thy

3.5.2. Ru/e 5b (D--tu/e)

Wh the nasal d~ precedes a member of CV derived from ~ thr this CV is pronounced as though a D precedes ir.

Examples:

4. MechaMzatioa of the algorithm

A flow-r for mechanization of the above rules is given in figure 3. In this diagram, the parsing is carried out one word ata time, each work consisting of a sequence of letters of the Tamil alphabet. The symbol j~ denotes blank. The other symbols are explained in the diagram. The following words msy be usr to study the flow of the steps.

| *'* * * I I 3~) i

i~j I

l~~re ~. Word-startand diminutivr rules Automatlc phonetic transcription of Tamll in Roman script 509

PtOnO~cr 01

3r~)

~J(c) (

Fig~e 3b. Nazi augment rule Figm~ 3c. Consonant augment rulo

a,l Of

3(d) f=:: "t

3 (e) Ÿ

Fl8me 3d. T-rule Figure 3e. D-rulo 510 E V Krishnamurthy

5. Coneluding remarks

(i) The algorithmic rules described above require that the choice of every analysis step depends only on the present state of computation and on a single next symbol being read and that no step wiU have to be reinvoked later on. These two aspects make the algorithm for pronunciation as one-symbol-look-ahead without back- tracldng, (Wirth 1976). (ii) Recently there have been attempts to introduce a cerumen script for Indian languages. The above formalized rules would be of help in this attempt, (Ganesh- sundaram 1976; Saran 1969; see also remark vii). 0ii) It is also evident that the above set of rules provide a method for automatic conversion from reading into a typescript. However, the process of generating actual speech from tb.e typescript and vice-versa would require a lot of experimenta- tion and further theoretical studies. (iv) In ah excellent report on the relative ei¡ of Indian languages, Rama- krishna et al (I 962) have shown the following interesting features of Tamil which make it as one among the eificient languages from the point of view of information theory. (a) The use of unaspirated consonants and short vowels constitute the bulk (90 %) of Tamil speech (and writing), thereby minimizing the effort and time. (b) The number of bits of information required in Tamil to convey the same semantic content ofa message is lower than that required for English and much lower than other Indian languages. (c) Telegraphic transmission in Tamil is nearly as e¡ as English. Ramakrislma et al (1962) have assumed that Tamil does not use aspirated consonants. As we saw earlier, this is indeed not true ifwe consider Tamil speech. The discrepancy has arisen due to the fact that the analysis was based on written material the statistics was based on counting the letters of the alphabet. Therefore, the percentage of aspirated consonants carne down to zero. (v) We already mentioned that although Tamil has no letters in the alphabet corresponding to the aspirated consonants, these are generated when the unaspirated consonants occur in conjunction with the nasals or other consonanto-vowels (as dictated by our rules); the author is of the view that such syntactic rules minimize the transitional effort required in pronouncing successive letters, tkereby reducing the total effort (Zipf 1969). Itis not difficult to test this statement for oneselfby violating the syntactic rules. However, further tkeoretical and experimental studies are required to substantiate this view. (vi) The above formalized rules (called phonological production rules) can now be summarized in algebraic forro thus: (+stands for concatenation). (a) ~+CV-~CV~ (b) CV+CV} V+CV -*CV+ CV LsL Lb+CV (c) N+CV~N+CV HSL (d) C. +CV.(~ CV(~)~C.+CVJ L (e) ~ + CV (~)-+TCV(~) ~+cv (~~~DCV(~) Automatic phonettc transcrtption of Tamil in Reman scrlpt 511

This generative approach to pronunciation has the advanatge of being able to handle variations of pronunciation from speaker to speaker satisfactorily-- phonetically different dialects of the sarne language could be easily accommodated by postulating different sets of rules operating on the same phonelogical representations (Fudge 1970; Kiparsky 1970). We propose to carry out these studies as soon as w• are able to generate speech from text using a computer and able to perform largo scale experimentation. (vil) Finally we remark on the linearization of Tamil script. Itis well known that all modero data communication system and digital information processing demand a linear sequence of symbols. Accordingly, for a language to be very efficient for typing, printing, computer processing, it is necessary that the script (vowels and consonants) as well as the modi¡ of markers used for representing various Cand CV should lie in the same line rather than above of below. Most of the Indian languages use diacritical markers which ate above, across or below. In Tamil, how- ever, the system is nearly linear. Ir is, therefore, vcry easy to modify Tamil script very slightly so as to linearize the modi¡ This suggestion is put forth and illustrated in figure 4. Here 1 to 5 are post-modifiers; 6, 7, 8 and 12 ar• pre-modifiers; and 9, I0, 11 are both pre- and post-modifiers. In fact, 9, I0 and 11 could further be modified to a two letter symbolism; but tl¨ would deviate from the traditional convention. The symbolism of figure 4 has been chosen after very many eareful considerations so as to not deviate very much from the existing system and at the sarne time is unique, unambiguous and stable towards various styles of hand-writing. Also, we use 39 symbols (all written in a linear faskion) that can be accommodated within 6 bits. When large amounts of data in Tamil language ate to be processed the linearization will be of immens• advantage. It is desirable that Tamil scholars consid•r this aspect and gire a serious consideration to a convenient linearisation scaxeme.

Vowel AA I II U UU E EE A! O OO AU CON Modifir tr I T -- I- 1 [ ITI lxrr [• LXLTI Number 1 2 3 4 5 6 7 8 9 10 11 12

~t~~£ ~[. t~ g~rr@ = [~rr t_.--;

Fig~e 4. Linear modifiers for Tamil seript

References

Arumuga Navalar 1968 Nannool g,andikai Urat (Madras: Arumuga Navalar Press) Burrow 1968 Collected papers on Dravidian Ltngulsttcs (Annamalai University) Cherry C 1961 On human communicatlon (New York: John Wiley and Sons) Fudge E C 1970 Pho.ology in New Horlzons in Lingutstlcs (Baltimore, Maryland: Penguin) Ganeshsundaram P C 1976 J. Indlan Inst. Sci. 58 225 Kiparsky P 1970 Htstortcal Ltngutstlcs (Baltimore, Maryland: Penguin) Mastica et al 1963 Bull. Central Inst. Eng[ish 3 Ramak¡ B S st al 1962 Relative efficlencles oflndian Languages (Bangalore: Indian Institute of Science) 512 E l# Kr~hnamurthy

RamasubramAniam N 1968 Tamll Oliyilakkanam (Bombay:TataInstituteofFundamental Research) Saran S 1969 Indian Nattonal Writing (Delhi: Munshiram Manoharlal) Varadarajan M 1963 Nool (Madras: Salva Siddhantha Kazhagam) White G M 1976 IEEE 9 40 Wirth N 1976 Algorithm, Data $tructures, Programs (N. J.: Prentice Hall) Zipf G K 1969 Human Behaviour and Principle of least effort (Cambridge Mass : Addison-Wesley)