From Pāṇinian Sandhi to Finite State Calculus
Malcolm D. Hyman∗ Max Planck Institute for the History of Science, Berlin
Abstract

The most authoritative description of the morphophonemic rules that apply at word boundaries (external sandhi) in Sanskrit is by the great grammarian Pāṇini (fl. 5th c. B.C.E.). These rules are stated formally in Pāṇini's grammar, the Aṣṭādhyāyī 'group of eight chapters'. The present paper summarizes Pāṇini's handling of sandhi, his notational conventions, and formal properties of his theory. An XML vocabulary for expressing Pāṇini's morphophonemic rules is then introduced, in which his rules for sandhi have been expressed. Although Pāṇini's notation potentially exceeds a finite state grammar in power, individual rules do not rewrite their own output, and thus they may be automatically translated into a rule cascade from which a finite state transducer can be compiled.

1 Sandhi in Sanskrit

Sanskrit possesses a set of morphophonemic rules (both obligatory and optional) that apply at morpheme and word boundaries (the latter are also termed pada boundaries). The former are called internal sandhi (< saṃdhi 'putting together'); the latter, external sandhi. This paper only considers external sandhi. Sandhi rules involve processes such as assimilation and vowel coalescence. Some examples of external sandhi are: na asti > nāsti 'is not', tat ca > tac ca 'and this', etat hi > etad dhi 'for this', devas api > devo 'pi 'also a god'.1

2 Sandhi in Pāṇini's grammar

Pāṇini's Aṣṭādhyāyī is a complete grammar of Sanskrit, covering phonology, morphology, syntax, semantics, and even pragmatics. It contains about 4000 rules (termed sūtra, literally 'thread'), divided between eight chapters (termed adhyāya). Conciseness (lāghava) is a fundamental principle in Pāṇini's formulation of carefully interrelated rules (Smith, 1992). Rules are either operational (i.e. they specify a particular linguistic operation, or kārya) or interpretive (i.e. they define the scope of operational rules).2 Rules may be either obligatory or optional.

A brief review of some well-known aspects of Pāṇini's grammar is in order. The operational rules relevant to sandhi specify that a substituend (sthānin) is replaced by a substituens (ādeśa) in a given context (Cardona, 1965b, 308). Rules are written using metalinguistic case conventions, so that the substituend is marked as genitive, the substituens as nominative, the left context as ablative (tasmāt), and the right context as locative (tasmin). For instance:

    8.4.62  jhayo ho 'nyatarasyām
            jhaY-ABL h-GEN optionally

This rule specifies that (optionally) a homogenous sound replaces h when preceded by a sound termed jhaY, i.e. an oral stop (Sharma, 2003, 783–784).

Pāṇini uses abbreviatory labels (termed pratyāhāra) to describe phonological classes. These labels are interpreted in the context of an ancillary text of the Aṣṭādhyāyī, the Śivasūtras, which enumerate a catalog of sounds (varṇasamāmnāya) in fourteen classes (Cardona, 1969, 6):

     1. a i u Ṇ             8. jh bh Ñ
     2. ṛ ḷ K               9. gh ḍh dh Ṣ
     3. e o Ṅ              10. j b g ḍ d Ś
     4. ai au C            11. kh ph ch ṭh th c ṭ t V
     5. h y v r Ṭ          12. k p Y
     6. l Ṇ                13. ś ṣ s R
     7. ñ m ṅ ṇ n M        14. h L

The final items (indicated here by capital letters) are markers termed it 'indicatory sound' and are not considered to belong to the class. A pratyāhāra formed from a sound and an it denotes all sounds in the sequence beginning with the specified sound and ending with the last sound before the it. Thus jhaY denotes the class of all sounds from jh through p (before the it Y): jh, bh, gh, ḍh, dh, j, b, g, ḍ, d, kh, ph, ch, ṭh, th, c, ṭ, t, k, p (i.e. all oral stops).3 Sūtra 8.4.62, as printed above, is by itself both elliptic and uninterpretable. Ellipses in sūtras are completed by supplying elements that occur in earlier sūtras; the device by which omitted elements can be inferred from preceding sūtras is termed anuvṛtti 'recurrence' (Sharma, 1987, 60). It will be noticed that no substituens is specified in 8.4.62; the substituens savarṇaḥ 'homogenous sound-NOM' (Cardona, 1965a) is supplied by anuvṛtti from sūtra 8.4.58 anusvārasya yayi parasavarṇaḥ. Still, the device of anuvṛtti is insufficient to specify the exact sound that must be in-

∗ This work has been supported by NSF grant IIS-0535207. Any opinions, findings, and conclusions or recommendations expressed are those of the author and do not necessarily reflect the views of the National Science Foundation. The paper has benefited from comments by Peter M. Scharf and by four anonymous referees.

1 The symbol ' (avagraha) does not represent a phoneme but is an orthographic convention to indicate the prodelision of an initial a-.

2 The traditional classification of rules is more fine-grained and comprises saṃjñā (technical terms), paribhāṣā (interpretive rules), vidhi (operational rules), niyama (restriction rules), pratiṣedha (negation rules), atideśa (extension rules), vibhāṣā (optional rules), nipātana (ad hoc rules), adhikāra (heading rules) (Sharma, 1987, 89).

are specified as Perl-compatible regular expressions (PCREs) (Wall et al., 2000).5 Sanskrit sounds are indicated in an encoding known as SLP1 (Sanskrit Library Phonological 1) (Scharf and Hyman, 2007). An encoding such as Unicode is not used, since Unicode represents written characters rather than speech sounds (Unicode Consortium, 2006). The SLP1 encoding facilitates linguistic processing by representing each Sanskrit sound with a single symbol (see fig. 1).6 The use of Unicode would be undesirable here, since (1) there would not be a one-to-one correspondence between character and sound, and (2) Sanskrit is commonly written in a number of different scripts (Devanāgarī, Tamil, Romanization, etc.).

The following is the XML representation of sūtra 8.3.23 mo 'nusvāraḥ, which specifies that a pada-final m is replaced by the nasal sound anusvāra (ṃ) when followed by a consonant (Sharma, 2003, 628):
[Figure 1: SLP1 symbols for the Sanskrit sounds; romanization on the left, SLP1 on the right]

    a  ā  i  ī  u  ū        a  A  i  I  u  U
    ṛ  ṝ  ḷ  ḹ              f  F  x  X
    e  ai o  au             e  E  o  O
    k  kh g  gh ṅ           k  K  g  G  N
    c  ch j  jh ñ           c  C  j  J  Y
    ṭ  ṭh ḍ  ḍh ṇ           w  W  q  Q  R
    t  th d  dh n           t  T  d  D  n
    p  ph b  bh m           p  P  b  B  m
    y  r  l  v              y  r  l  v
    ś  ṣ  s  h              S  z  s  h
    anusvāra = M; visarga = H

[...]

The source matches either short a or long ā (by the definition of the macro a), a word boundary, and then the pratyāhāra iK (a simple vowel other than a or ā). The macro ik is defined:

    [...] value="@(i)@(u)@(f)@(x)" ref="A.1.1.71"/>

and depends on the macro definitions:

    [...] ref="A.1.1.69"/>
    [...] ref="A.1.1.69"/>
    [...] ref="A.1.1.69"/>
    [...] ref="A.1.1.69"/>

The iK is stored in pattern memory, and the substituend (a-varṇa) is replaced by the output of calling the function gunate on the stored iK. A function is defined with an element

    [...] ref="A.6.1.87"/>

[...]tion: [...] i-, or u-varṇa, outputs the corresponding guṇa vowel; [...] homorganic semivowel (either r or l). The domain of gunate is {a, ā, i, ī, u, ū, ṛ, ṝ, ḷ, ḹ} and its range is {a, e, o, ar, al}.

Figure 2: The Chomsky hierarchy of languages (after Prusinkiewicz and Lindenmayer (1990, 3)). Nested classes, innermost to outermost: regular, context free, context sensitive, recursively enumerable.

4 Regular languages and regular relations

First, it is necessary to define a regular language. Let $\Sigma$ denote a finite alphabet and $\Sigma^{\epsilon}$ denote $\Sigma \cup \{\epsilon\}$ (where $\epsilon$ is the empty string). $\{e\}$ is a regular language where $e \in \Sigma^{\epsilon}$, and the empty language $\emptyset$ is a regular language. Given that $L_1$, $L_2$, and $L$ are regular languages, additional regular languages may be defined by three operations (under which they are closed): concatenation ($L_1 \cdot L_2 = \{xy \mid x \in L_1, y \in L_2\}$), union ($L_1 \cup L_2$), and Kleene closure ($L^{*} = \bigcup_{i=0}^{\infty} L^{i}$) (Kaplan and Kay, 1994, 338).

Regular relations are defined in the same fashion. An n-relation is a set whose members are ordered n-tuples (Beesley and Karttunen, 2003, 20). Then $\{e\}$ is a regular n-relation where $e \in \Sigma^{\epsilon} \times \ldots \times \Sigma^{\epsilon}$ ($\emptyset$ is also a regular n-relation). Given that $R_1$, $R_2$, and $R$ are regular n-relations, additional regular n-relations may be defined by three operations (under which they are closed): n-way concatenation ($R_1 \cdot R_2 = \{xy \mid x \in R_1, y \in R_2\}$), union ($R_1 \cup R_2$), and n-way Kleene closure ($R^{*} = \bigcup_{i=0}^{\infty} R^{i}$).

5 Finite state automata and regular grammars

A finite state automaton is a mathematical model that corresponds to a regular language or regular relation (Beesley and Karttunen, 2003, 44). A simple finite state automaton corresponds to a regular language and is a quintuple $\langle S, \Sigma, \delta, s_0, F \rangle$, where $S$ is a finite set of states, $\Sigma$ the alphabet of the automaton, $\delta$ is a transition function that maps $S \times \Sigma^{\epsilon}$ to $2^{S}$, $s_0 \in S$ is a single initial state, and $F \subseteq S$ is a set of final states (Aho et al., 1988, 114). A finite state automaton that corresponds to a regular relation is termed a finite state transducer (FST) and can be defined as a quintuple $\langle S, \Sigma \times \ldots \times \Sigma, \delta, s_0, F \rangle$, with $\delta$ being a transition function that maps $S \times \Sigma^{\epsilon} \times \ldots \times \Sigma^{\epsilon}$ to $2^{S}$ (Kaplan and Kay, 1994, 340).

A regular (or finite) grammar describes a regular language and is equivalent to a finite state automaton. In a regular grammar, all production rules have a single non-terminal on the left-hand side, and either a single terminal or a combination of a single non-terminal and a single terminal on the right-hand side. That is, all rules are of the form $A \rightarrow a$, $A \rightarrow aB$ (for a right regular grammar), or $A \rightarrow Ba$ (for a left regular grammar), where $A$ and $B$ are single non-terminals, and $a$ is a single terminal (or $\epsilon$). A regular grammar is the least powerful type of grammar in the Chomsky hierarchy (see fig. 2) (Chomsky, 1956). A context free grammar describes a context free language, a context sensitive grammar describes a context sensitive language, and an unrestricted grammar describes a recursively enumerable language. The hierarchy is characterized by proper inclusion, so that every regular language is context free, every context free language is context sensitive, etc. (but not every context free language is regular, etc.). A regular grammar cannot describe a context free language such as $\{a^{n}b^{n} \mid 1 \leq n\}$, which consists of the strings ab, aabb, aaabbb, aaaabbbb ... (Kaplan and Kay, 1994, 346).

Since Pāṇinian external sandhi can be modeled using finite state grammar, it is highly desirable to provide a finite state implementation, which is computationally efficient. Finite state machines are closed under composition, and thus sandhi operations may be composed with other finite state operations to yield a single network.

6 From rewrite rules to regular grammars

A string rewriting system is a system that can transform a given string by means of rewrite rules. A rewrite rule specifies that a substring $x_1 \ldots x_n$ is replaced by a substring $y_1 \ldots y_m$: $x_1 \ldots x_n \rightarrow y_1 \ldots y_m$, where $x_i, y_i \in \Sigma$ (and $\Sigma$ is a finite alphabet). Rewrite rules used in phonology have the general form

    $\phi \rightarrow \psi \,/\, \lambda \_\_ \rho$

Such a rule specifies that $\phi$ is replaced by $\psi$ when it is preceded by $\lambda$ and followed by $\rho$. Traditionally, the phonological component of a natural-language grammar has been conceived of as an ordered series of rewrite rules (sometimes termed a cascade) $W_1, \ldots, W_n$ (Chomsky and Halle, 1968, 20). Most phonological rules, however, can be expressed as regular relations (Beesley and Karttunen, 2003, 33). But if a rewrite rule is allowed to rewrite a substring introduced by an earlier application of the same rule, the rewrite rule exceeds the power of regular relations (Kaplan and Kay, 1994, 346). In practice, few such rules are posited by phonologists. Although certain marginal morphophonological phenomena (such as arbitrary center embedding and unlimited reduplication) exceed finite state power, the vast majority of (morpho)phonological processes may be expressed by regular relations (Beesley and Karttunen, 2003, 419).

Since phonological rewrite rules can normally (with the provisos discussed above) be reexpressed as regular relations, they may be modeled as finite state transducers (FSTs). FSTs are closed under composition; thus if $T_1$ and $T_2$ are FSTs, application of the composed transducer $T_1 \circ T_2$ to a string $S$ is equivalent to applying $T_1$ to $S$ and applying $T_2$ to the output of $T_1$:

    $T_1 \circ T_2 (S) = T_2(T_1(S))$

So if a cascade of rewrite rules $W_1, \ldots, W_n$ can be expressed as a series of FSTs $T_1, \ldots, T_n$, there is a single FST $T$ that is a composition $T_1 \circ \ldots \circ T_n$ and is equivalent to the cascade $W_1, \ldots, W_n$ (Kaplan and Kay, 1994, 364). $G = T_1 \circ \ldots \circ T_n$ constitutes a regular grammar. Efficient algorithms are known for compiling a cascade of rewrite rules into an FST (Mohri and Sproat, 1996).

7 An FST for Pāṇinian sandhi

The XML formalism for expressing Pāṇinian rules in § 3 contains a number of devices; it is not immediately evident how rules employing these devices might be compiled into an FST. A rule compiler, however, is described here that translates Pāṇinian rules expressed in the XML formalism into rewrite rules that can be automatically compiled into an FST using standard algorithms.

It is useful to begin with an instance of a Pāṇinian sandhi derivation, that of the string devo 'pi < devas api 'also a god'.

    FORM                    SŪTRA
    devas api
    deva$ api               8.2.66
    deva#u api              6.1.113
    dev!(gunate(u)) api     6.1.87
    devo 'pi                6.1.109

Sūtra 8.2.66 replaces a pada-final s with rU (here symbolized by $). Sūtra 6.1.113 replaces pada-final rU with u preceded by a boundary marker (#) when the next pada begins with a. Sūtra 6.1.87 has been discussed above. Sūtra 6.1.109 replaces pada-initial a with avagraha (represented by ' in SLP1) when the previous pada ends in the pratyāhāra eṄ (i.e. e or o).7

Although the PCREs in the XML format exceed the power of regular relations, and the implementation of mappings and functions is not obvious in a regular grammar, the rule compiler mentioned above is able to produce a cascade of rewrite rules that may be efficiently compiled into an FST. A consequence of the rule compilation strategy is that a single Pāṇinian rule may be represented as several rewrite rules. Where a rule makes use of pattern memory, which can contain n possible values, the rule is expanded into n rewrite rules. Mappings and functions are automatically applied where they occur in ψ (the substituens).

The following cascade represents the four sūtras exemplified above:

    s → $ / __ (␣|#)
    $ → #u / a __ (␣|#)a
    (a|A)(␣|#)(a|A) → a
    (a|A)(␣|#)(i|I) → e
    (a|A)(␣|#)(u|U) → o
    (a|A)(␣|#)(f|F) → ar
    (a|A)(␣|#)(x|X) → al
    a → ' / (e|o)(␣|#) __

Here the notation (x|y|z) expresses alternation; the rule matches either x, or y, or z. The symbol ␣ represents a space character; the symbol # represents a boundary that has been inserted by a rule.

These rules may be efficiently compiled into FSTs. An FST encoding the final rule is shown in fig. 3. In this transducer, s0 is the initial state and doubly-circled states are the members of F, the set of final states. The graphical representation of the FST has been simplified, so that symbols that are accepted on transition from si to sj share an arc between si and sj (in this case, the symbols are separated by commas). The notation x : y indicates the symbols on the upper and lower tapes of the transducer, respectively. If the transducer reads input from the upper tape and writes output to the lower tape, x : y indicates that if x is read on the upper tape, y is written on the lower tape. If the symbols on the upper and lower tapes are the same, a shorthand notation is used; thus x is equivalent to x : x. The special symbol ? indicates that any symbol is matched on either the upper or lower tape.

Figure 3: An FST for sūtra 6.1.109 (states s0, s1, s2; arc labels include e, o, ?, ␣, # and a:')

Since it is possible to translate each rule in the above cascade into an FST, and FSTs are closed under composition, it is possible to compose a single FST that implements the portion of Pāṇini's sandhi represented by the rules in the cascade above. A compiler developed by the author will be capable of compiling the entirety of Pāṇini's sandhi rules into a single FST. The FST is then converted to Java code using a toolkit written by the author. Compilation of the generated source code yields binary code that may be run portably using any Java Virtual Machine (JVM) (Lindholm and Yellin, 1999). Alternatively, for improved performance, the Java code may be compiled into native machine code using the GNU Compiler for Java (gcj).

7 Although avagraha is not a phoneme, it may be conceived of linguistically as a "trace" left after the deletion of a-. For this reason, it is represented in the SLP1 phonological coding.

8 Implications

While one of the guiding principles of Pāṇini's grammar is conciseness (lāghava), a computational implementation poses other demands, such as tractability and efficiency. Pāṇini's rules are formulated in terms of classes based on distinctive features and must be construed with the aid of 'interpretive' (paribhāṣā) rules, technical terms (saṃjñā), and other theoretical apparatus. However desirable the mathematical properties of a regular grammar may be, a grammar stated in such terms is at odds with Pāṇini's principles. Rather, it hearkens back to the vikāra system employed by earlier linguistic thinkers, in which individual segments are the target of specific rules (Cardona, 1965b, 311).

By way of contrast, Pāṇini states a rule economically in terms of the sound classes enumerated in the Śivasūtras. Thus 8.4.41 ṣṭunā ṣṭuḥ specifies the retroflexion of an s or dental stop either (1) before a pada-final -ṣ or (2) pada-finally before a ṭU (i.e. a retroflex stop).

With a little less economy, we can represent Pāṇini's rule by two rules in XML:

    [...] lcontext="z[@(wb)]" [...] ref="A.8.4.41"/>

[...] strings (sequences of segments). As far as the computational model is concerned, individual symbols are atomic, and no class relations between the symbols exist. The goal in this study has been to develop an intermediate representational structure, based on XML, that can faithfully encode some of the linguistically significant aspects of a portion of Pāṇini's grammar (the rules involved in external sandhi) and at the same time can be automatically translated into an efficient computational implementation.

9 Appendix: core rules for external sandhi

The following is a list of modeled vidhi rules (8.3.4: adhikāra, 8.4.44: pratiṣedha) for external sandhi. No account is taken here of pragṛhya rules that specify certain sounds as exempt from sandhi (since these rules refer to morphosyntactic categories). Various optional rules are not listed.

    6.1.73   che ca
    6.1.74   āṅmāṅośca
    6.1.132  etattadoḥ sulopo 'koranañsamāse hali
    8.2.66   sasajuṣo ruḥ
    8.2.68   ahan
    6.1.113  ato roraplutādaplute
    6.1.88   vṛddhireci
    6.1.87   ādguṇaḥ
    6.1.77   iko yaṇaci

References

George Cardona. 1965a. On Pāṇini's morphophonemic principles. Language, 41(2):225–237.
George Cardona. 1965b. On translating and formalizing Pāṇinian rules. Journal of the Oriental Institute, Baroda, 14:306–314.
George Cardona. 1969. Studies in Indian grammarians: I. The method of description reflected in the Śivasūtras. Transactions of the American Philosophical Society, 59(1):3–48.
Noam Chomsky and Morris Halle. 1968. The Sound Pattern of English. MIT Press, Cambridge, MA.
Noam Chomsky. 1956. Three models for the description of language. IRE Transactions on Information Theory, 2(3):113–124.
Ronald M. Kaplan and Martin Kay. 1994. Regular models of phonological rule systems. Computational Linguistics, 20(3):332–378.
Tim Lindholm and Frank Yellin. 1999. The Java™ Virtual Machine Specification. Addison-Wesley, Reading, MA, 2d edition.
Mehryar Mohri and Richard Sproat. 1996. An efficient compiler for weighted rewrite rules. In 34th Annual Meeting of the Association for Computational Linguistics, pages 231–238, Santa Cruz, CA. ACL.
Przemyslaw Prusinkiewicz and Aristid Lindenmayer. 1990. The Algorithmic Beauty of Plants. Springer, New York.
Peter M. Scharf and Malcolm D. Hyman. 2007. Linguistic issues in encoding Sanskrit. Unpublished manuscript, Brown University.
Rama Nath Sharma. 1987. The Aṣṭādhyāyī of Pāṇini, Vol. 1: Introduction to the Aṣṭādhyāyī as a Grammatical Device. Munshiram Manoharlal, New Delhi.
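
The rewrite cascade given in section 7 for devas api > devo 'pi can be simulated as a rough sketch with ordered regular-expression substitutions. This is only an illustration of applying the rules in sequence, each to the output of the one before (the T2(T1(S)) reading of composition); it is not the author's rule compiler and does not build a finite state transducer. The names `CASCADE`, `apply_cascade` and `trace` are hypothetical. Lookahead and lookbehind assertions are assumed here as a stand-in for the left and right contexts, and a plain space is assumed for the space symbol.

```python
import re

# Ordered rewrite rules transcribed from the cascade in section 7.
# Strings are in SLP1; "$" stands for rU and "#" for a boundary
# inserted by a rule, as in the paper. Lookarounds model the contexts
# (an assumption of this sketch, not the paper's implementation).
CASCADE = [
    (r"s(?=[ #])", "$"),            # 8.2.66:  s -> $ / __ (space|#)
    (r"(?<=a)\$(?=[ #]a)", "#u"),   # 6.1.113: $ -> #u / a __ (space|#)a
    (r"[aA][ #][aA]", "a"),         # 6.1.87, expanded into five rules
    (r"[aA][ #][iI]", "e"),
    (r"[aA][ #][uU]", "o"),
    (r"[aA][ #][fF]", "ar"),
    (r"[aA][ #][xX]", "al"),
    (r"(?<=[eo][ #])a", "'"),       # 6.1.109: a -> ' / (e|o)(space|#) __
]


def apply_cascade(text, cascade=CASCADE):
    """Apply each rule in order to the output of the previous rule."""
    for pattern, replacement in cascade:
        text = re.sub(pattern, replacement, text)
    return text


def trace(text, cascade=CASCADE):
    """Return the input followed by every form that a rule changed."""
    forms = [text]
    for pattern, replacement in cascade:
        rewritten = re.sub(pattern, replacement, forms[-1])
        if rewritten != forms[-1]:
            forms.append(rewritten)
    return forms


if __name__ == "__main__":
    for form in trace("devas api"):
        print(form)
```

Run on devas api, the trace passes through deva$ api, deva#u api and devo api before reaching devo 'pi; the third of these is the evaluated counterpart of the form the derivation table writes as dev!(gunate(u)) api. Because the cascade covers only the four sūtras exemplified, other inputs (for instance vowel lengthening in na asti) are outside what this sketch models.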