<<

Unification : Another look at "synthesis-by-rule"

John Coleman Experimental Laboratory Department of Langm~ge and Linguistic Science University of York Heslington YORK YOI 5DD United Kingdom e-maih JANET%UK.AC.YORK.VAX::J SC 1

(1) p --, p-/s__

else p --, ph/__V Transformational and "synthesis- (where V is any symbol) by-rule" Most current text-to-speech systems else p --~ p' (e.g. Allen et al. 1987; Hertz 1981~ 1982, forthcoming; Hertz et aL 1985) are, at heart, unconstrained string- Often, of course, grammars made with rules of this based transformational grammars. Generally, text- type may be (contingently) quite restricted. For in.- to-speech programs are implemented as the compo- stance, if the rules apply in a fixed order without sition of three non-invertible mappings: cyclicity, they may be compiled into a finite-state transducer (Johnson 1972). But in general there is no guarantee that a program which implements 1. grapheme to mapping (inverse spelling such a will halt. This would be pretty r ales + exceptions dictionary) disastrous in speech recognition, and is undesirable even in generation-based applications, such as text- 2. phoneme to mapping (pronunciation to-speech. However, this has not prevented the at> rules) pearance of a number of "linguistic rule compilers" 3. allophone to parameter mapping (interpolation such as Van Leenwen's (1987, 1989) and Hertz's sys.- rules) gems.

Tile basic operations of a transformational gram° ])'or example: mar -- deletion, insertion, permutation, and copying -- are apparently empirically instantiated by such ph] %, pit well-established phonological phenemona as , [p'] ~-- /p/ ~-- sip , , and coarticula- [p-] e/ spit tion. Copying (i): Assimilation ~ ,--- graphemes h denotes strong release of breath (aspiration) e.g. 1 ran [hi ran quickly [9 ] ' denotes slight/weak aspiration - denotes no aspiration Rule: n-~O/ {k,g} These mappings are usually defined using rules of the form A -+ B/U D e.g. (1), usually called [0] denotes back-of-tongue (velar) "context-sensitive", but which in fact define unre- nasal closure stricted rewriting systems, since B may be the empty string (Gazdar 1.987). It should be recalled that "if e.g. 2 sandwich [samwitJ'] all we can say about a grammar of a natural language is that it is an unrestricted rewriting system, we have said nothing of any interest" (Chomsky 1963:360). Rule: n -~ m/ {l),b,w etc.}

79 Copying (ii): Ooarticulation e.g. "chip" e.g. keep [~_] cool / \ c rt [k] Onset Rime ! ! \ V Nucleus Coda Rules: k-, +k /--i-back] / \ I I V Closure Friction Vowel Closure

v t ... 8h ... i ... p

denotes advanced articulation + Figure 1: Richer structure in phonological represen- denotes lip-rounding tations

denotes retracted articulation

Insertion: Epenthesis In partial recognition of some of these problems, pho- nologists have been attemptiug to reconstruct the e.g. mince [mints] transformational component as the epiphenomenal pence [pents] result of several interacting general "constraints". Numerous such "constraints" and ~principles" have been proposed, such as the Well-Formedness Con- Rule: ns --~ nts dition (Goldsmith 1976 and several subsequent for- Deletion: Elision nmlations), the Obligatory Contour Principle (Leben 1973), Cyclicity (Kaisse and Shaw 1985, Kiparsky 1985), Structure-Preservation (Kiparsky 1985), the e.g. sandwich [sanwitf] Elsewhere Condition (Kiparsky 1973) etc. While this line of research is in some respects conceptually Rule: nd --* n cleaner than primitive transformational grammars, there has been no demonstration that a "principle'- Permutation: Metathesis based phonology is indeed more restrictive than primitive transformational phonology in any compu- e.g. burnt [brunt] rationally relevant dimension.

Rule: ur-+ ru A declarative model of speech For the last few The problems inherent in this approach are many: years, I have been developing a "synthesis-by-rule" program which does not employ such string-to-string 1. Deletion rules can make Context- transformations (Coleman and Local 1987 forthcom- Se,lsitive grammars undecidable. (Salomaa ing; Local 1989 forthcoming; Coleman 1989). 1973:83, Levelt 1976:243, Lapointe 1977:228, Berwick and Weinberg 1984:127) The basic hypothesis of this (and related) research is that there is a trade-off between the richness of the 2. Non-monotonicity m~kes for computational rule component and the richness of the representa- complexity. tions (Anderson 1985). According to this hypothesis, the reason why transformational phonology needs to 3. There is no principled 1 way of limiting the do- use transformations is because its data structure, main of rule application to specific linguistic do- strings, is too simple. Consequently, it ought to be mains, such as . possible to considerably simplify or even completely 4. Using sequences as data-structures is really only eliminate the transformational rule component by plausible if all speech parameters change with using more elaborate data structures than just well- nlore-or-less equal regularity. ordered sequences of letters or feature-vectors. For instance if we use graphs {fig. 1) to represent phono- 1N.B. The use of labelled brackets to delimit domains is completely unrestricted mechanismfor partitioning strings. logical objects, then instead of copying, we can im-

80 Tongue-back: ANY Tongue-back: CLOSURE Tongue-back: CLOSURE I I I ~Ibngue-tip: CLOSURE Tongue-tip: ANY Tongue4ip: CLOSURE I I ¢~ / \ Nasality: + Nasality: - Nasality: + Nasality:-

k ,) k

Figure 2: Declarative characterisation of assimilation

plement harmony phenomena using the structure- Closure Friction Closure Friction ,dlaring technique. I I ~ / \ / Na~ality Non-nasality Nasality Non-nas. :(ncorpora~ing richer data-structures allows many if not all rewriting rules to be abandoned, to the extent I] S n t S J~hat the transformational rewrite-rule mechanism can be ditched, along with the problems it brings. Figure 3: Declarative characterisation of epenthesis Consider how the "processes" discussed above can be given a declarative (or "configurational") analy- sis. Closure Non-clo. Closure Non-clo. Allophony can be regarded as the different interpre- / \ / *~ I I tation of t:he same element in different structural con- Nasality Non-nas. N,~sality Non-nas. ~exts, rather than as involving several slightly differ- ent phonological objects instantiating each phoneme. n d w n w

Onset Coda Onset Figure 4: Declarative eharacterisation of elision i i / \ p p s p

[ph] [p,] [p] minor variations in the temporal coordination of in- dependent parameters (Jespersen 1933:54, Anderson Aspirated Slightly Unaspirated 1976, Mohanan 1986, Browman and Goldstein 1986) aspirated (fig. 3).

Assimilation can also be modelled non-destructively It has been demonstrated (Fourakis 1980, Kelly and by unification (fig. 2). Local 1989) that epenthetic elements are not phonet- ically identical to similar non-epenthetic elements. is simple to model if parametric pho- The transformational analysis, however, holds that netic representations may be glued together in par- the phonetic implementation of a segment is depen- allel, rather than simply concatenated. dendent on its features, not its derivatonal history may then overlaid over , rather than simply ('% [t] is a It] is a It]'), and thus incorrectly predicts concatenated to them (Ohman ]966, Perkell 1969, that an epenthetic [t] should be phonetically identi- Gay 1977, Mattingly 1981, Fowler 1983). If required, ca.1 to any other It]. ~his analysis can also be implemented in the phono- logical component, using graphs of the 'overlap' re- Elision is the inverse of epenthesis, and is thus in lation (Griffen 1985, Bird and Klein 1990): e.g.: some sense "the same" phenomenon, taking the "un- elided': form as more primitive than the "elided" ii tiLl aa / \ / \ / \ form, a decision which is entirely meaningless in the declarative account (fig. 4) k p k I k t Metathesi3 is another instance of "the same" phe- E~] [~] [_k] nomenon i.e. different temporal synchronisation of an invariant set of elements. Epenthesis, Elision and It is now common to analyse epenthesis, not as the Metathesis may all be regarded as instances of the insertion of a segment into a string, but as due to more general phenomenon of non-significant variabil- ity ill the timing of parallel events.

81 Figure 5: Phrase structure grammar of English phoneme strings Word [+inflected] -~ [-inflected] Inflection Inflection e.g. cat+s

Word Word Word Compounding e.g. black+bird [-in fleeted]-~ [-inflected] [-inflected] Word Prefix* Word * [-Latinate]-~ [-Latinate] [-inflected] [-Latinate] Word Stress Morphology o denotes complete constituent overlap [+Latinate]--* [+Latinate] o [+Latinate] Stress [+Latinate] --* Non_final_feet Foot

Foot Non_final_feet -~ [+initial] Foot* Sylab, (i ll b e) +heavy -heavy

Syllable Syllable Syllable -heavy, -heavy

Morphology Pre fix* Stem Suffix* [+Latinate] -* [+Latinate] [+Latinate] [+ Latinate] Syllable Rime [~hea,~y] -' (Onset) [aheavy]

Onset Affricate [avoi I -~ [avoi] Onset [-voi] -~ Aspirate

Onset ( Obstruence ) (Glide) [~voi] -* [~voi] Obstruence [-~oi] -~ ([sl), Closure (Either order)

Constraint: in onsets, [s] < Closure Rime Nucleus ( Coda ) [aheavy]--* [aheavy] [c~heavy] Nucleus -~ Peak Offglide

etc. etc.

82 U U motion of the synthesis parameters for specified in- \ / telwals of time. These parametric time-functions are r i" finally instantiated with actual numbers represent- ing times, in order to derive a complete matrix of As well as these relatively low-level phonological (parameter, value) pairs. phenomena, work in Metrical Phonology (Church 1985) and Dependency Phonology (Anderson and As well as being computationally "clean", this Jones 1974) has shown how stress assignment, a method of synthesis has the additional merit of being paradigm example for transformational phonology, genuinely non-segmental in (at least) two respects: can be given a declarative analysis. there are no segments in the phonological representa- tions, and there is no cross-parametric segmentation in the phonetic representations. The resulting speech Overview of text-to-parameter conversion in does not manifest the discontinuities and rapid cross- the YorkTalk system parametric changes which often cause clicks, pops, and the other disfiuencies which typify some syn- 1. Each symbol in the input text string is trans- thetic speech. On the contrary, the speech is fluent, lated into a column-vector of distinctive pho- articulate and very human-like. When the model netic features (nasal, vowel, tongue-back, etc.) is wrong in some respect, it sounds like a speaker Sequences of letters are thus translated into se- of a different language or dialect, or someone with quences of feature-structures. dysfluent speech. For all these reasons, the York- Talk model is attracting considerable interest in the 2. The sequence of feature-structures is parsed. speech technlogy industry and research commulfity, a This process translates the sequence into a circumstance which I hope will promote a widespread directed graph representing the of approach to computational phonology in constituent structure of the utterance. future. 3. The phonological structure is traversed and an interpretation function applied at each node to derive a phonetic parameter matrix. References Parsing is done using a Phrase Structure Grammar [1] Allen, J., S. Hunnicutt and D. Klatt. 1987./~¥om of phoneme strings. A very simplified version of such Text to Speech: The MITalk System. Cambridge a grammar is fig. 5. I have implemented several such University Press. grammars so far, including a DCG implementation and a PATR-II-like implementation. With one or [2] Anderson, J. and C. Jones. 1974. Three theses two simple extensions to the grammar formalism, it concerning phonological representations. Jour- is also possible to parse re-entrant (e.g. ambisyllabic) nal of Linguistics 10, 1-26. structures and other overlapping structures, such as [3] Anderson, S. R. 1976. Nasal Consonants and the those arising from bracketting "paradoxes". The re- internal Structure of Segments. Language 52.2 suiting graphs are thus not trees, but directed acyclic 326-344. graphs. [4] Anderson, S. R. 1985. Phonology in the Twen- tieth Century. University of Chicago Press. In computational syntactic theory, one of the main uses for the parse-tree of a string is to direct the [5] Bach, E. 1976. An extension of classical trans- construction of a compositional (Fregean) semantic formational granunar. Problems in Linguistic interpretation, according to the rule-to-rule hypoth- Metatheory, Proceedings of the 1976 Conference esis (Bach 1976). In the YorkTalk system, the same at Michigan State University, 183-224. approach is employed to assign a phonetic interpre- [6] Berwick, R. C. and A. S. Weinberg. 1984. tation to the phonological representation. A sec- The Grammatical Basis of Linguistic Perfor- ond, theory-internal motivation for constructing rich mance: Language Use and Acquisition. Cam- parse-graphs of the phonemic string is that it enables bridge, Massachusetts: M. I. T. Press. the phoneme string to be discarded completely, thus [7] Bird, S. and E. Klein. 1990. Phonological liberating the phonetic interpretation function from Events. To appear in Journal of Linguistics 26 the sequentiality and other undesirable properties of (1). seglnental strings. [8] Browman, C. P. and L. Goldstein. 1986. To- After the phonological graph has been constructed wards an articulatory phonology. Phonoloqy by the parser, a head-first graph-traversal algorithm Yearbook 3, 219-252. maps the (partial) phonological category of each [9] Chomsky, N. 1963. Formal Properties of Gram- node into equations describing the time-dependent mars. In R. D. Luce, R. R. Bush and

83 E. Galanter, eds. Handbook of Mathematical [23] Jespersen, O. 1933. Essentials of English Gram- Psychology Vol. II. New York: John Wiley. mar. London: George Allen and Unwin. [10] Church, K. 1985. Stress Assignment in Letter [24] Johnson, C. D. 1972. Formal Aspects of Phono- to Sound Rules for Speech Synthesis. In 28rd logical Description, Mouton. Annual Meeting of the Association .for Compu- [25] Kaisse, E. and P. Shaw. 1985. On the Theory of tational Linguistics Proceedings. Lexical Phonology. Phonology Yearbook 2, 1-30. [11] Coleman, J. S. and J. K. Local. 1987 forthcom- [26] Kelly, J. and J. K. Local. 1989. Doing Phonol- ing. Monostratal Phonology and Speech Synthe- ogy. Manchester University Press. sis. To appear in C. C. Mock and M. Davies (eds.) In press. Studies in Systemic Phonology [27] Kiparsky, P. 1973. ~Elsewhere" in Phonology. London: Francis Pinter. Indiana University Linguistics Club. [12] Coleman, J. S. 1989. The Phonetic Interpreta- [28] Kiparsky, P. 1985. Some Consequences of Lexi- tion of Headed Phonological Structures Con- cal Phonology. Phonology Yearbook 2, 82--136. raining Overlapping Constituents. ms. (Cur- [29] Lapointe, S. 1977. Recursiveness and deletion. rently submitted to Phonology) Linguistic Analysis 3: 227-265. [13] Fourakis, M. S. 1980. A Phonetic Study of Sono- [30] Leben, W. 1973. Suprasegmental Phonology. rant Clusters in Two Dialects of En- Ph.D. dissertation, M. I. T. glish. 1, Department of Research in Phonetics [31] Levelt, W. J. M. 1976. Formal grammars and Linguistics, Indiana University. the natural language user: a review. In A. Mar- [14] Fowler, C. A. 1983. Converging Sources of Ev- zollo, ed. Topics in Artificial Intelligence CISM idence on Spoken and Perceived Rhythms of courses and lecture notes no. 256. Springer. Speech: Cyclic Production of Vowels in Mono- [32] Local, J. K. 1989. Modelling assimilation in syllabic Stress Feet. Journal of Experimental non-segmental rule-free synthesis. To appear in Psychology: General Vol. 112, No. 3, 386-412. D. R. Ladd and G. Docherty, eds. Papers in [15] Gay, T. 1977. Articulatory Movements in VCV Laboratory Phonology H Cambridge University Sequences. Journal of the Acoustical Society of Press. America 62, 182-193. [33] Mattingly, I. G. 1981. Phonetic Representations [16] Gazdar, G. 1987. COMIT ==> * PATR II. In and Speech Synthesis by RUle. In T. Myers, TINLAP 3: Theoretical Issues in Natural Lan- J. Laver and J. Anderson, eds. The Cognitive guage Processing 3. Position Papers. 39-41. As- Representation of Speech. North-Holland. sociation for Computational Linguistics. [34] Mohanan, K. P. 1986. The Theory of Lezical [17] Goldsmith, J. 1976. Autosegmetttal Phonology. Phonology. D. Reidel. Indiana University Linguistics Club. [35] Ohman, S. E. G. 1966. Coarticulation in [18] Griffen, T. D. 1985. Aspects of Dynamic Phonol- VCV Utterances: Spectrographic Measure- ogy Amsterdam studies in the theory and his- ments. Journal of the Acoustical Society of tory of linguistic science. Series 4: Current is- America 39, 151-168. sues in linguistic theory, vol. 37: Benjamins. [36] Perkell, J. S. 1969. Physiology of Speech Pro- [19] Hertz, S. R. 1981. SRS text-to-phoneme rules: a duction: Results and Implications of a Quanti- three-level rule strategy. Proceedings of ICASSP tative Cineradiographic Study Cambridge, Mas- 81, 102-105. sachusetts: M. I. T. Press. [20] Hertz, S. R. 1982. From text to speech with [37] Salomaa, A. 1973. Formal Languages. New SRS. Journal of the Acoustical Society of Amer- York: Academic Press. ica 72(4), 1155-1170. [38] Van Leeuwen, H. C. 1987. Complementation in- [21] Herez, S. R., Kadin, J. and Karplus, K. 1985. troduced in linguistic rewrite rules. Proceedings The Delta rule development system for speech of the European Conference on Speech Technol- synthesis from text. Proceedings of the IEEE ogy 1987 Vol. 1, 292-295. 73(11), 1589-1601. [39] Van Leeuwen, H. C. 1989. A development tool [22] Hertz, S. R. forthcoming. The Delta program- for linguistic rules. Computer Speech and Lan- ming language: an integrated approach to non- guage 3, 83-104. linear phonology, phonetics and speech synthe- sis. In J. Kingston and M. Beckman, eds. Papers in Laboratory Phonology I: Between the Gram- mar and Physics of Speech. Cambridge Univer- sity Press.

6

84