Issues in TexttoSp eech Conversion for Mandarin

Chilin Shih Y Richard Sproat v B Ö

Sp eech Synthesis Research Department

Bell Lab oratories

Lucent Technologies

Mountain Avenue Ro om dd

Murray Hill NJ USA

fclsrwsgb elllabscom

Abstract

Research on texttosp eech TTS conversion for is a much younger en

terprise than comparable research for English or other Europ ean languages Nonetheless im

pressive progress has b een made over the last couple of decades and Mandarin Chinese systems

now exist which approach or in some ways even surpass in quality available systems for En

glish This article has two goals The rst is to summarize the published literature on Mandarin

synthesis with a view to clarifying the similarities or dierences among the various eorts One

prop erty shared by a great many systems is the dep endence on the syllable as the basic unit of

synthesis We shall argue that this prop erty stems b oth from the accidental fact that Mandarin

has a small numb er of syllable typ es and from traditional Sinological views of the linguistic

structure of Chinese Despite the p opularity of the syllable though there are problems with

using it as the basic synthesis unit as we shall show The second goal is to describ e in more

detail some sp eci problems in texttosp eech conversion for Mandarin namely text analysis

concatenative unit selection segmental duration and and intonation mo deling We illus

trate these topics by describing our own work on Mandarin synthesis at Bell Lab oratories The

pap er starts with an intro duction to some basic concepts in sp eech synthesis which is intended

as an aid to readers who are less familiar with this area of research

Keywords Chinese sp eech synthesis text analysis concatenative units duration tone and

intonation

PART ONE A GENERAL OVERVIEW

Intro duction

Texttosp eech synthesis TTS is the automatic conversion of a text into sp eech that resembles

as closely as p ossible a native sp eaker of the language reading that text In all current TTS systems

the text is in a digitally co ded format such as ASCI I for English or Big or GuoBiao GB for

Chinese We take it as axiomatic that a complete texttosp eech system for a language must b e

able to handle text as it is normally written in that language using the standard orthography Thus

a TTS system for Chinese that can only accept input in romanization is not a true TTS

system More generally text contains input b esides ordinary A typical Chinese newspap er

article may contain numb ers written in Arabic numerals p ercentages and dollar amounts non

Chinese words written in Roman orthography and so on A full texttosp eech system must b e

able to do something reasonable with these kinds of input also

A closer examination of the problem of converting from text into sp eech reveals that there are

two rather welldened subproblems The rst is textanalysis or alternatively linguistic analysis

The task here is to convert from an input text which is usually represented as a string of characters

into a linguistic representation This linguistic representation is usually a complex structure that

includes information on grammatical categories of words accentual or tonal prop erties of words

proso dic phrase information and of course pronunciation The second problem is sp eech

synthesis prop er namely the synthesis of a digitally co ded sp eech waveform from the internal

linguistic representation

The articial pro duction of sp eechlike sounds has a long history with do cumented mechanical

attempts dating to the eighteenth century The synthesis of sp eech by electronic means is obviously

much more recent with some early work by Stewart Stewart that allowed the pro duction of

static vowel sounds Two inventions b oth from Bell Lab oratories help ed initiate the mo dern work

on sp eech synthesis that eventually led to the sp eech synthesizers that we know to day The rst

was the Vo der by Homer Dudley which was an electronic instrument with keys that allowed

a skilled human op erator to select either a p erio dic voiced source for sonorant sounds or a

noise source for fricative sounds keys were also available for controlling a lter bank to alter the

sp ectrum of the vo cal tract to approximate dierent vowel sounds as well as some controls to

pro duce stops fundamental frequency F was controllable by a fo ot p edal Demonstrated at the

Worlds Fair the Vo der clearly demonstrated the p otential of the electronic pro duction of

sp eech The second invention was the sound sp ectrogram Potter et al which for the rst

time gave a representation of sp eech that allowed researchers to discover robust acoustic correlates

of articulatory patterns and made it p ossible to conceive of automatically generating sp eech by

rule This second invention in particular was instrumental in the development of the socalled

1

Of course it is p ossible to ho ok a TTS system up to an optical recognition system so that the system

can read from printed text but this do es not aect the p oint since the output of the OCR system will b e a digitall y

co ded text that serves as the input to the TTS system

2

For example there was a mechanical device attributed to von Kemp elen

3

Much of the historical data reviewed in this section is culled from Dennis Klatts review article Klatt

which is highly recommended for anyone wishing to learn more ab out the history and problems of sp eech synthesis

from its inception to the late eighties

sourcelter theory of sp eech pro duction Fant up on which most mo dern work on synthesis

dep ends simply put the sourcelter theory states that sp eech can b e mo deled as one or more

p erio dic or noise sources the glottis or in the case of fricatives a constriction which are then

ltered by an acoustic tub e namely the vo cal tract sometimes coupled with the nasal tract from

the p oint of the source onwards to the lips

The three mo dern approaches to synthesis namely articulatory synthesis formant synthesis

and concatenative synthesis have a roughly equally long history the earliest rulebased articulatory

synthesizers include Nakata and Mitsuoka Henke Coker the earliest rulebased

formant synthesizer was Kelly and Gerstman and the concept of diphone concatenation was

discussed in Peterson et al with the rst diphonebased synthesizer rep orted b eing Dixon

and Maxey We discuss these dierent approaches in somewhat more detail in the next

section

Textanalysis for sp eech synthesis has a much shorter history One of the rst systems for full

texttosp eech conversion for any language and certainly the b est known was the MITalk

American English system Although a full description was only published in Allen et al

the work was actually done in the seventies MITalk was capable not only of pronouncing

most ordinary words correctly but it also dealt sensibly with punctuation numb ers abbreviations

and so forth MITalk was also signicant in setting an imp ortant precedent the approach taken to

such problems as morphological analysis and graphemetophoneme conversion was linguisticall y

very well informed with the approach mo deled to a large extent on The Sound Pattern of English

Chomsky and Halle The input of linguistic knowledge into TTS systems has remained

high and few p eople would seriously try to construct a full working TTS system based on a pure

engineering approach Lamentably the comparable p oint cannot b e said of sp eech recognition

where to this day linguistics has had little practical impact and where the most sophisticated

linguistic mo dels incorp orated in many systems are pronunciation dictionaries which are often

derived from a TTS system Since MITalk full TTS systems have b een constructed for many

languages with several systems including DECtalk Lernhout Hauspie Berkeley Sp eech Infovox

and the system b eing multilingual

The purp ose of this pap er is to review the problems of texttosp eech conversion as it applies

particularly to the conversion of Chinese text into Mandarin sp eech Although we provide a synopsis

of the literature in Section most of this pap er constitutes an overview of our own work on

Mandarin synthesis at Bell Labs There are a couple of reasons for this fo cus First while there is

by now a sizable literature on Mandarin TTS few systems are discussed in sucient detail for us to

b e able to illustrate all of the issues involved in Mandarin TTS conversion by examples culled from

the published literature The second reason is that while no formal evaluations of our Mandarin

system are available informal assessments lead us to b elieve that it is among the b est in terms of

sp eech quality and our admittedly limited knowledge of other systems leads us further to b elieve

that our systems ability to deal reasonably with unrestricted text input is as go o d as that of any

other system Thus the Bell Labs Mandarin system is as reasonable a mo del as any up on which to

base the discussion in this pap er

Before we b egin examining issues p ertaining particularly to Mandarin however it is necessary

to review some basic concepts in sp eech synthesis for readers who are not already familiar with

the eld of synthesis

Review of Some Basic Concepts in Sp eech Synthesis

As noted ab ove there are three basic approaches to synthesizing sp eech namely articulatory syn

thesis formant synthesis and concatenative synthesis Due to the complexity of the rst only the

latter two approaches have so far proved commercially viable

With one exception waveform concatenation all approaches to synthesis are based on the

sourcelter mo del intro duced ab ove That is the synthesis metho d can b e broken down into two

comp onents consisting of a mo del of the source and a mo del of the vo cal tract transfer function

In normal sp eech b oth the source and the transfer function change over time the only exceptions

to this are very unnatural utterances such as steady state vowels said on a constant pitch with

no change in amplitude It is standard to approximate the continuous variation in the source and

lter by dividing each second of sp eech into a set of a few hundred discrete steadystate frames

During synthesis the source is ltered with the transfer function at each frame to pro duce the

output sp eech parameters for that frame Source mo dels include mo dels of the p erio dic vibration

at the glottis during p erio ds of voicing and mo dels of noisy supraglottal sources such as one nds

with fricatives like x or s Changes in the fundamental frequency of the voice due for example to

changes in phonological tones during the course of the utterance are also handled by the glottal

source mo del by merely increasing or decreasing the rate of the pulses

The main dierence b etween the three basic metho ds of synthesis has to do with the way in

which the sets of transfer functions for an utterance are computed Articulatory synthesis attempts

to solve the problem from rst principles as it were Computational mo dels of the articulators

the tongue the lips and so forth are constructed that allow the system to simulate the various

congurations that the human sp eech organs can attain during sp eech Each conguration gives

a particular shap e to the vo cal tract and from these shap es theories of acoustics are put to work

to compute the transfer function for that particular shap e The diculty is that b oth problems

namely mimicking human articulator motion and computing acoustic prop erties from vo cal tract

shap es are very hard As a result most work on articulatory synthesis has made simplifying

assumptions ab out physical prop erties of the vo cal tract and has dep ended on twodimensional

midsagittal mo dels Only recently have researchers started seriously to build threedimensional

mo dels of crucial comp onents such as the threedimensional tongue mo del discussed in Wilhelms

Tricarico and Perkell Even with such mo dels in place however our knowledge of articulatory

phonetics is still suciently primitive that it will b e many years b efore the articulators can b e

appropriately controlled using purely linguistic information such as what one might compute from

text

Formant synthesis bypasses the complications of mo deling articulator movements and com

puting the acoustic eects of those movements by computing sp ectral prop erties of the transfer

function directly from the linguistic representation using a set of carefully designed rules Consider

a word such as @ kafei coee Figure gives a sp ectrogram for an utterance of this word by a

male sp eaker of Beijing Mandarin the formants also called resonances show up clearly as roughly

horizontal dark bars in the sp ectrogram For example the vowel a has three formants in the lower

half of the sp ectrogram b etween ab out and seconds To synthesize this utterance using a

formant synthesizer one needs to mo del as closely as p ossible the changes in the frequency and

amplitude in the formants for the vowels a and ei including the shap e of the transitions of those

formants into and out of the consonants since these transitions are imp ortant in identifying the

Figure Sp ectrogram of an utterance of the word kafei coee by a male sp eaker of Beijing

Mandarin Note that kcl denotes the closure p ortion of the k

consonant The transitions for a out of k are particularly clear with a noticeable velar pinch

b eing present at the b eginning of the a in the second and third formants right b efore seconds

Designing a set of rules to do this is certainly easier than constructing an articulatory mo del but

it is still a dicult task

The simplest approach to synthesis bypasses most of these problems since it involves taking

real recorded sp eech cutting it into segments and concatenating these segments back together

during synthesis In the Bell Labs concatenative synthesis metho d for example recorded sp eech is

rst co ded using linear predictive coding LPC which removes the source information leaving one

with a sp ectral representation of the timevarying transfer function of the utterance Segments of

this LPC representation are then cut out based on cut p oints determined by examining a normal

sp ectrogram and stored in a table The size and typ e of unit whether it b e a whole syllable a half

or demisyllable phonetophone transition a diphone diers widely across systems as we

shall see more fully in the next section Let us assume for the sake of discussion a diphonic system

similar to the one used in the Bell Labs synthesis work For an utterance such as @ coee one

would need the sequence of units k where represents silence ka af fei and ei

The ka unit for instance would b e selected starting during the silence just b efore the burst of

the k and extending to the right until just after the a reaches a steady state In general of course a

unit needed to synthesize a particular word will most likely not have b een originally uttered as part

of that word So the ka unit used for @ coee might have come from d kapian card and

the af unit might have come from  Рmafan troublesome The art in concatenative synthesis

is to select units in such a way as to minimize discrepancies b etween adjacent units which come

originally from completely dierent utterances

While many concatenative systems are like the Bell Labs system in that the source and lter are

handled separately the very notion of cutting up sp eech segments and reconcatenating them during

synthesis naturally leads to the idea of bypassing the sourcelter mo del entirely and synthesizing

directly in the time domain that is by storing and concatenating waveform segments In principle

the sp eech quality of such systems should b e higher than that of LPCbased systems in precisely

the same way as recorded sp eech is always of higher quality than LPC co ded and resynthesized

sp eech This is not as easy as it sounds since simply taking the two waveforms for the ka of

d card and the af of  Рtroublesome and concatenating them together will lead to very

p o or results due to the abrupt discontinuity that will b e almost imp ossible to avoid Furthermore

it is usually desirable to b e able to change the fundamental frequency contour of the synthesized

utterance from whatever contour was used for the utterance one might want to change a lowtoned

utterance da hit into a falling toned utterance j da big This is also not trivial to do using

a waveform representation Nonetheless the development of such techniques as pitchsynchronous

overlap and add PSOLA Charp entier and Moulines has allowed for the pro duction of high

quality sp eech synthesizers based on waveform concatenation Still waveform metho ds tend to b e

much more sensitive to precise placement of cut p oints than are LPCbased systems and extreme

variations in pitch remain a problem for waveform metho ds One solution to the latter problem has

b een to simply record a given unit in all desirable tonal environments thus giving a high degree of

naturalness but at the cost of a larger set of units

This section has intro duced concepts from just one part of the TTS problem namely synthesis

prop er Nothing has b een said here ab out various approaches to textanalysis of tone and intona

tion or of segmental durations We b elieve however that our discussion of these topics in ensuing

sections will serve to illustrate the issues in those domains

A Synopsis of the Literature on Mandarin Synthesis

In order to b e able to place in a prop er p ersp ective the various pro jects on Mandarin TTS that

have b een and are b eing undertaken it is useful to consider some of the more salient parameters

that can b e used to describ e a TTS system The following list is certainly not exhaustive but it

do es hit the main p oints that are relevant for the Mandarin work

What basic synthesis metho d is used Is the system concatenative or rulebased

a If the system is concatenative what basic co ding scheme is used Is the scheme time

domain waveform eg PSOLA or frequencydomain eg LPC

b If the system is rulebased do the rules describ e formant tra jectories or is an articulatory

mo del presumed

Assuming the system is concatenative what typ e of unit is used are the units basically

diphones demisyllables syllables or some other unit

How sophisticated are the proso dic mo dels in particular the mo dels for duration and intona

tion For example is there any mo deling of tonalcoarticulation an imp ortant phenomenon

in Mandarin phonetics

Is the system a full texttosp eech system By rights a system can only call itself that if it is

capable of converting from an electronically co ded text written in the standard orthography

for the language in question

As we shall see most Mandarin systems are concatenative with an approximately equal split

as far as we can tell b etween time and frequencydomain co ding schemes most use syllables as

the basic unit most early systems have fairly simpleminded proso dic mo deling with improved

techniques b ecoming evident in later work nally only since the very late eighties have full text

tosp eech capabilities b een rep orted

Typ es of Synthesis and Size of Units

Concatenative synthesis Mandarin synthesis work based on concatenative mo dels dates to the

early eighties with the earliest examples b eing Lei Huang et al Zhou and Cole

Lin and Luo Zhou In this work the emphasis was placed on the ability to t the

system into the tight sp eed and space constraints of the machines available at that time As a

result the unit inventories tended to b e small with a favorite scheme b eing a pseudodemisyllabi c

initialnal onsetrime mo del a syllable such as kai op en would b e split into two units

namely the initial demisyllable k and the nal demisyllable ai Such a scheme allows one to

limit the numb er of units to around Huang et al Qin and Hu Shi but only

at a cost of naturalness and even intelligibil ity since no coarticulation eects across adjacent units

are mo deled One other inuence not mo deled by these and many other kinds of concatenative

inventories is the inuence of tone For example there may b e some correlation b etween vowel

quality and the fundamental frequency of the voice and this could lead to subtle problems if a

particular unit that is recorded in one tonal environment is used to synthesize output for a dierent

environment Still such tonal inuences on a unit are generally b elieved to b e quite small certainly

much smaller than the inuence of abutting units It is noteworthy however that one early system

rather directly captures this tonal inuence while failing to capture the certainly more imp ortant

crossunit inuence by using a set of pseudodemisyllabic units collected from all p ossible

tonal environments Lin and Luo

Intrasyllabic coarticulation can b e mo deled b etter by changing to a true demisyllabic scheme

In the system describ ed in Mao et al a set of nontonesensitive demisyllables was used

consisting of vowelsensitive consonantal initial demisyllables and vo calic nal demisyllables

In such a scheme a syllable such as kai op en would b e synthesized by the two units ka and

ai Another system rep orted in Chiou et al uses vowelsensitive initial demisyllables

Using demisyllables is clearly preferable over the earlier pseudodemisyllabic scheme in that one

can now mo del the inuence of the vowel on the consonant initial in a way that was not p ossible

in the earlier scheme

A logical endp oint of this progression of ever larger and more contextsensitive units are the

various syllablebased concatenative systems OuhYoung OuhYoung et al Liu et al

Liu et al Ju et al Chan and Chan Cai et al Chu and Lu

Hwang et al Syllablesiz ed units needless to say allow one to p erfectly mo del withinsyllable

coarticulation However they merely shift up one level the problem faced by pseudodemisyllabi c

systems in that merely having an inventory covering all of the syllable typ es of Mandarin do es not

allow one to mo del coarticulation eects that cross syllable b oundaries As a result many of these

systems must resort to inserting short silences b etween adjacent syllables to avoid other audible

artifacts that would arise if the syllables were directly concatenated But this in turn results in an

unnatural choppiness or even lowered intelligibi li ty in some instances One instance where one

might exp ect inserted silences to cause problems is in medial syllables that b egin with fricatives

4

One unique characteristic of this system is that the vowels are generated by a formant synthesizer

Part of the acoustic distinction b etween a fricative such as sh and an aricate such as ch is that the

latter b egins with a silence So it is likely that a fricative in a sequence like n $ sh where marks a

syllable b oundary will b e confused with an aricate n $ ch due to the inserted silence esp ecially

if the silence is suciently long this would lead to the misinterpretation of H ren shao there

are few p eople as H n ren chao p eople are noisy Indeed this is precisely what was observed in

Zhang et al wherein it was noted that the most frequent confusion in several waveform

based Mandarin systems was b etween voiceless fricatives and aspirated plosives or aricates The

cause we susp ect was not the waveformbased co ding schemes themselves but rather the silences

that were inserted b etween syllables in what almost certainly were largely syllablebased systems

There is at least one attempt to build a wordbased system Xu and Yuan where syllables

as well as frequent words are collected in the sp eech database which contains around units

The problem facing such systems is of the same nature as the syllablebased system there is no

way to mo del coarticulation eects b etween words

With resp ect to typ e of analysis metho d used linearpredictive and waveformbased approaches

are roughly evenly balanced in the literature with a slight preference for the former LPC systems

include Huang et al Lin and Luo OuhYoung et al Mao et al Qin and Hu

Liu et al as well as the Bell Labs systems PSOLA systems include Chu and Lu

Hwang et al Choi et al Cai et al

Rulebased Synthesis Rulebased formant synthesizers for Mandarin have an earlier vin

tage than concatenative systems with the earliest citation that we have b een able to nd b e

ing Suen Suen used a VOTRAX synthesizer to pro duce Mandarin sp eech from a pho

netic transcription coupled with ve tonal contours in simple citationform Intelligibil i ty was

evidently a problem as it was for any synthesizer at that time sub jects could hear the syn

thesized passage with an encouraging intelligibil ity score of over after less than an hour of

exp osure to the sounds Subsequent work in a similar vein is Lee et al Later rule

based synthesis work includes work at KTH Zhang Matsushita Javkin et al In

stitute of Acoustics Academia Sinica Beijing Lu et al Taiwan University Shie

Hsieh et al and the Chinese Academy of So cial Sciences Beijing Yang and Xu The

latter two systems share the prop erty of b eing syllablebased in a rather p eculiar sense rules are

used to generate parameters to synthesize single syllables and these syllables are then concatenated

together much as in a purely concatenative system In neither system are there rules to handle

crosssyllable coarticulation

Synopsis So all of the concatenative work and at least some of the rulebased work that we

have discussed so far can b e describ ed as syllablebased in that the basic units of the system either

are syllables simpliciter or else are some welldened constituents of a syllable where if any cross

unit coarticulation is mo deled it is only syl lableinternal coarticulation As we have argued this

predisp osition towards syllablebased systems is problematic from the p oint of view of pro ducing

naturalsounding sp eech One solution to this problem is to abandon syllablebased mo dels in favor

of units which can mo del b oth syllableinternal and crosssyllable coarticulation One such unit is

5

To b e fair we should note that Yang and Xu represents work that was presented by the authors themselves

as incomplete we b elieve that in subsequent work crosssyllable dep endencies have b een mo deled though we have

b een unable to nd a citation to such later work

the diphone In a purely diphonic system a disyllabic sequence such as ù kai men op en the

do or might b e constructed out of the units ka aim me en Greater or lesser quality can b e

obtained by contextualizing the diphonic units For example in the Bell Labs femalevoice

Mandarin synthesizer Shih and Lib erman the ka unit used in synthesizing kai op en was

the same as the unit used in synthesizing the ka of @ kafei coee But this is not completely

satisfactory since the a in kai is higher and more fronted than the a in ka Thus in the new Bell

Labs Mandarin synthesizer presented in more detail b elow and see also Ao et al the ka

units used for these two syllables are dierent Apart from our own systems the only other system

that we are aware of which is diphonebased is the Apple system describ ed in Choi et al

Mo dels of Proso dy

All synthesis systems strictly sp eaking have a proso dic mo del in the sense that fundamental

frequency duration and intensity must b e assigned in the pro duction of sp eech An extremely

simple system for Mandarin might assign to each syllable a tone shap e selected by the lexical

tone and assign constant duration and constant intensity to each phone The proso dic mo del in

early Mandarin synthesis systems did not go much b eyond this simple description Later systems

have fairly sophisticated mo dels predicting variations of tone and duration suitable for the context

However it is fair to say that there are at this p oint in time no texttosp eech systems for any

language that incorp orate fully adequate mo dels of proso dy

There are many causes which contribute to the diculty of proso dic mo deling Even for the

plain nonexpressive reading style there are many factors controlling proso dy There are also

immense variations among sp eakers and even b etween dierent renditions of the same text by

the same sp eaker these considerations make it hard to discover the robust underlying regularities

governing proso dy Mo deling of expressive reading styles is even harder esp ecially since no TTS

system can b e said to truly understand what it is reading thus imp ortant information such as

when to emphasize a word and how much emphasis to put in cannot generally b e reliably predicted

Mo dels of Duration Few pap ers on Mandarin synthesis systems rep ort on duration mo deling

In early systems most of the eort was channeled into building the acoustic inventory and to some

extent into constructing mo dels of tone Among the relatively few published rep orts on duration

mo deling several dierent approaches are taken The systems rep orted in Shih and Lib erman

Chiou et al Choi et al use handcrafted duration rules based on published rep orts

such as Ren Feng along with some educated guesses to bridge the gaps not covered in

the published phonetic work on duration In contrast the new Bell Labs Mandarin system Shih

and Ao Shih and Ao describ ed more fully in Section and the ChiaoTung University

system Hwang and Chen Hwang et al derive duration values from a lab eled sp eech

database The Bell Labs mo del is a parametric mathematical mo del the parameters of which

are trained based on duration values in the database the ChiaoTung system is based on neural

networks In our own exp erience there is drastic improvement in our current duration mo del over

the rulebased system and as a result b oth the intelligibi li ty and naturalness of synthesized

sp eech improves greatly The contrast of the two systems attests to the imp ortance of a go o d

duration mo del

The results of the ChiaoTung neuralnetworkbased system are go o d esp ecially in contrast

to a simple rulebased system And the advantages of using neural nets are familiar in general

one can achieve passable b ehavior by training the network on sucient amounts of appropriately

lab eled training data without having to do any analysis b eforehand of what kinds of factors are

likely to b e relevant and what their likely interactions are This is obviously much less work than a

sophisticated handcrafted ruleset such as the wellknown duration mo del for the MITalk system

rep orted in Allen et al which dep end up on a great deal of phonetic understanding on the

part of the rule writer And a neural net approach is also less onerous than a mathematical mo del

of the kind we describ e in Section However there are various reasons for preferring the approach

we have chosen over a neural net approach The main reason is that general statistical mo deling

metho ds such as neural nets or CART Breiman et al suer from sparse data problems in

that they are often unable to pro duce reasonable results in cases involving combinations of factors

that were unseen in the training data in contrast the duration mo del we present b elow is much

more robust in the face of such unseen factor combinations

Mo dels of Tone and Intonation The simplest tone mo del for Mandarin synthesizers just

sp ecies the F contours of Mandarin lexical tones Suen Lin and Luo Qin and Hu

Mao et al It was so on realized that even a few simple tonal coarticulation rules improve

the smo othness of the synthesized sp eech Many systems implemented the rules describ ed in Chao

including the neutral tone rule tone sandhi rules the half tone rule and the half tone

rule Zhang OuhYoung et al Some systems also include tone rules that are intended

to capture the tendency for pitch to drop as the sentence pro ceeds for example a tone following

a tone is replaced by a variant of tone with lower pitch Lee et al Liu et al

Chiou et al Lee et al

A system built at University College London Shi incorp orated the intonation mo del of

Garding where intonation eects are represented by grids which dene the global F range

and shap e Tones are then realized within the connes of the grids This system do es not have the

capability of predicting intonation eects automatically and requires annotated input The system

rep orted in Zhou and Cai also allows handannotation for stress

The system discussed in Chu and Lu follows the mo del of Shen where the intona

tion topline reects the level of emphasis with higher values corresp onding to stronger emphasis

The baseline is considered to reect the strength of syntactic b oundary with lower F value corre

sp onding to stronger b oundaries

A few systems attempt to capture tone and intonation variation using statistical mo dels Chang

et al a Chang et al b Chen et al a Wang et al including neural networks

Chen et al b Hwang and Chen Hwang and Chen Wu et al Chen et al

forthcoming In order to facilitate clustering analysis on tones these systems typically represent

tones as a fourdimensional feature vector obtained by four orthonormal p olynomials Chen and

Wang where the rst co ecient represents the F mean and the other three represent the

tone shap e After clustering analysis various numb ers of representative tone patterns are chosen

and stored in co deb o oks The mo dels are trained to match tone patterns with selected contextual

features As with the training of duration mo dels these statisticallybased metho ds are capable

of creating an impressive proso dic mo del in a short time without linguistic analysis However the

problem of unseen factor combinations which we saw in the discussion of duration mo deling is

even more serious in the mo deling of F This is evident in the rep ort of Chen et al a while

of the tone patterns chosen by the mo del match that of the natural sp eech in the training

data a very impressive score compared with the ceiling of obtained by comparing dierent

rep etitions of the sp eech database the p erformance declines drastically to for the test set

The Bell Labs system is similar to the systems of Zhang OuhYoung et al in

using a rulebased tone and intonation mo del but there are some notable dierences Following

Pierrehumb ert Lib erman and Pierrehumb ert tones are represented as abstract targets

and sentential eects are separated from tonal coarticulation rules We found exp erimentally that

the sentential lowering eect primarily aects tone height not tone shap e and the amount of

lowering can b e mo deled as an exp onential decay This pro cess can in principle create an innite

numb er of toneheight distinctions which is not something that can b e handled by tonal replacement

rules We also carried out phonetic studies involving all p ossible tonal combinations and derived a

set of tonal coarticulation rules that is more complete than previously rep orted results Shih

A crucial asp ect of our tonal analysis is the strict separation of phonological tone sandhi rules from

phonetic tonal coarticulation rules These two tonal phenomena seem to b e frequently confounded

in the literature but it is imp ortant to keep them distinct For example in the statistical approach

discussed in Chen et al a no distinction is made in the training data b etween surface tone

eg R mai buy versus a sandhied tone eg R s mai jiu buy wine which is identical to

tone This results in redundancy in that all the tone shap es for tone must also b e stored in the

co deb o ok under the entry for tone Furthermore while the statistical mo del do es capture lo cal

tone sandhi changes more or less by brute force it is p ossible to construct reasonable sentences

where the realization of an underlying tone which precedes another underlying tone dep ends

up on information that is many syllables to the right in the sentence Such cases could not b e

readily handled using a purely statistical approach that confounds phonological and phonetic tone

changes We discuss these examples and other details in the phonological analysis of Mandarin

third tone sandhi in Section we further discuss phonetic tonal rules in Section

Textanalysis Capabilities

Most early systems for Mandarin presumed an input in some form of phonetic transcription such

as pinyin For example the Bell Labs system allowed input in pinyin optionally annotated

with bracketings indicating syntactic constituent structure Such systems do not qualify as text

tosp eech systems in the strict sense since they do not allow input represented in the normal

orthography of the language

The rst system that contained at least some textanalysis capabilities and was thus able to

convert from normal Chinese orthography into Mandarin sp eech was the Telecommunications Labs

system Liu et al This system was eventually linked to an optical character recognition

OCR system for converting printed text into sound

Of course by the mid nineties there were several Mandarin systems that could read from elec

tronically co ded Chinese text including the Apple system Choi et al several Mainland

systems eg Cai et al and our own system which we describ e in more detail b elow

There remain however relatively few publications on the textanalysis problems for Chinese TTS

6

We use the notation to denote a sandhied tone

7

Such systems have b een available for English for many years see for example Kurzweil But the OCR

problem in English is not nearly as formidable as it is in Chinese

with our own publications in this area b eing arguably the most extensive This dearth of publica

tions mirrors the lamentable secondclass status which textanalysis matters have in the worldwide

synthesis community Although the word text is one of the two content words in the phrase textto

speech the amount of published material in this area comprises signicantly less than half of the

published material on synthesis as a whole

Summary

The nineties has seen a large increase in the numb er of rep orted Mandarin synthesis systems

with several systems b eing improved versions of previous systems as well as many completely new

systems Sp eech quality has improved dramatically b oth in terms of segmental quality and in

terms of more natural sounding proso dy Several newer systems are full texttosp eech systems in

the sense we have dened Finally there has b een a small amount of work on synthesis for other

Chinese languages for example Taiwanese synthesizers have b een built at Tsing Hua University

Wang and Hwang Bell Lab oratories and Telecommunication Labs a synthesizer

has b een built at Hong Kong University Lo and Ching

A summary of the Mandarin texttosp eech systems is given in Table listing the most salient

attributes of these systems Demo sentences of some of these systems will b e included in the

Bell Labs TTS web page currently under construction The address will b e httpwwwb ell

labscompro jecttts

There has b een very little work on formal evaluation of Mandarin synthesizers with the ex

ception of the Mainland assessment program discussed in Lu et al Zhang et al

Zhang where systems are judged on the base of syllable word sentence intelligibil i ty as

well as overall naturalness Systems with text analysis capability are further evaluated on their

p erformance on selected texts Eight systems were evaluated in the rep ort

The syllable as a basic unit of synthesis a critique

As we have seen the vast ma jority of Mandarin synthesizers are based on syllablesize d units In

this regard they dier sharply in design from for example our own Mandarin synthesizer Rather

than merely noting this distinction however it is instructive to ask why it is such a prevalent notion

that Mandarin synthesizers should b e based on syllablesized units rather than units of some other

size

There seem to b e two reasons for this prevalence The rst is that as a practical matter the

numb er of syllables in standard Mandarin seems suciently small around four hundred without

counting tonal distinctions that one can easily conceive of building an inventory of all syllable

typ es even if one allows for tones one would only need to deal with ab out fourteen hundred

typ es For English of course a syllablebased approach would b e far less palatable a dictionary

of somewhat over a hundred thousand words yields approximately fteen thousand syllable typ es

and this almost certainly do es not exhaust the set of p ossible typ es But while the contrast seems

8

Note that it is not only in the published literature where one nds a strong bias towards syllableba sed synthesis

On several o ccasions in informal discussions with Chinese sp eech researchers we have b een told that syllablebas ed

systems are clearly the correct approach The prejudice is a widespread one indeed The primacy of syllables is also

evident in work on sp eech recognition for Mandarin

Inst or author Synth Meth Unit Text Anal Proso dy date

Suen VOTRAX phoneme no tones

Lee etal VOTRAX phoneme no tones

Huang etal LPC pseudodemisyllabl e no tones

Zhou etal pseudodemisyllabl e no tones

Lin Luo LPC toned phoneme no no

Taiwan U LPC syllable no rules

KTH formant no rules

Bell Labs LPC diphone no rules

Taiwan U formant syllable no rules

Tsinghua Beijing LPC demisyllable no tones

Acad of So c Sci formant syllable no stat mo del

Qin Hu LPC pseudodemisyllabl e no tones

UC London formant pseudodemisyllabl e no rules

Matsushita formant pseudodemisyllabl e no mo del

Telecom Labs LPC syllable yes rulesstat mo del

Tsing Hua U hybrid demisyllable no rulesstat mo del

Telecom Labs LPC toned syllable yes rules

Chiao Tung U PSOLA syllable stat mo del

Chiao Tung U LPC syllable stat mo del

Hong Kong U LPC syllables no rules

Xu etal words yes rules

Bell Labs LPC diphone yes rulesstat mo del

Inst Acoustics formant yes rules

Apple Apple diphone yes rules

Tsinghua Beijing PSOLA toned syllable yes rules

Inst Acoustics PSOLA toned syllable yes rules

Cheng Kung U CELP toned syllable yes stat mo del

Table Mandarin TTS Systems

plain enough it is misleading for a couple of reasons Firstly the count of four hundred typ es

is certainly an undercount when one considers new syllable typ es created as a result of à er

suxation in northern dialects and the various reduced syllables that one encounters in uent

Mandarin sp eech Thus one nds that L Ì tamen they which would canonically b e transcrib ed

as tamn is often pronounced tam which is not a legal syllable according to the normal inventories

of Mandarin Certainly one may with some legitimacy exclude such reductions in a texttosp eech

system on the basis of the fact that they do not represent reading style But assuming that the

ultimate goal of sp eech synthesis is to pro duce naturalsounding sp eech of all styles synthesis of

Mandarin on the basis of a xed inventory of four hundred syllables is b ound to fail Secondly

syllablebased concatenative systems still have to deal with the problem of crosssyllable and cross

word coarticulation as we have already seen In a purely concatenative system it would certainly

b e theoretically p ossible to store contextually determined syllablesize d units but the numb er of

such units would inevitably b e huge thus defeating the numerical argument for using syllables that

was advanced in the rst place

The second reason for the prevalence of syllablebased systems relates to a much deep er issue

we b elieve For many centuries Chinese philological tradition has centered on the written character

r larger linguistic units had little formal status The reason for this is clear in that characters

most commonly corresp ond to single and vice versa though see DeFrancis for

some renement of this p oint Furthermore they are graphically rather distinct thus making

the morphological analysis of a segment of text transparent in a way that has no parallel in say

Europ ean languages In analyzing Chinese there was thus a very strong incentive to view the

character as the minimal unit of the language much as Blo omeld Blo omeld

later analyzed the morpheme as the minimal unit of meaning in all languages Since characters with

few exceptions corresp ond to single syllables it is a short step to viewing the syllable in Chinese

as b eing the basic unit of phonological analysis However a wealth of phonological evidence for

subsyllabic units in Mandarin in addition to the more practical problems with using syllables that

we have already discussed suggest that this step is misguided

PART TWO THE BELL LABS MANDARIN TTS SYSTEM

Overall Structure of the TTS System

The Bell Labs TTS system consists of ab out ten mo dules which are pip eline d together With

the exception of the American English system which was built much earlier and has a somewhat

dierent structure the software is multilingual in that none of the programs contain any language

sp ecic data thus the set of programs that we use for Mandarin have also b een used for other

languages including German Spanish Russian Romanian and Japanese All that is necessary

to switch languages is to provide an appropriate set of languagesp ecic tables for textanalysis

duration intonation and concatenative units

9

This statement is slightly inaccurate in that we currently have two comp eting mo dels of intonation One is based

on the Lib ermanPierrehumb ertstyl e tonal target mo del which we use for Mandarin and describ e b elow The other

one which is more similar to the Fujisaki mo del is used for our set of Europ ean languages none of which are tonal

or pitchaccent languages These involve two dierent intonation mo dules

The ten mo dules are as follows

The textanalysis mo dule gentex

The phrasing mo dule which simply builds phrases as sp ecied by gentex

The phrasal phonology mo dule which sets up phrasal accents

The duration mo dule

The intonation mo dule

The amplitude mo dule

The glottal source mo dule

The unit selection mo dule which selects the concatenative units from a table given the

phonetic transcription of the words

The diphone concatenation mo dule which p erforms the actual concatenation of the units

The synthesis mo dule which synthesizes the nal waveform given the glottal source param

eters the F values and the LPC co ecients for the diphones

In this pap er we will discuss the Mandarinsp ecic data that relates to mo dules and

With the exception of the unitconcatenation mo dule which outputs a stream of LPC param

eters and source state information for the waveform synthesizer and the synthesis mo dule itself

which reads the LPC and sourcestate stream each mo dule in the pip eline communicates with the

next via a uniform set of data structures Information is passed along on a sentencebysentence

basis and consists of a set of arrays of linguistic structures Each array is preceded by the following

information

The structures typ e T eg word syllable phrase phoneme concatenative unit

N the numb er of structures of typ e T for this sentence

S the size of an individual structure of typ e T

After each of these headers is an N S byte array consisting of N structures of typ e T

There are a numb er of advantages to this mo dular pip eline design The rst advantage is a

standard observation if the systems mo dules share a common interface then it is relatively easy for

several p eople to work on dierent mo dules indep endently of one another as long as they agree on

the inputoutput b ehavior of each mo dule and the manner in which the mo dules should relate to

one another Second the pip eline design makes it easy to cut o the pro cessing at any p oint in the

pip eline During the design of a segmental duration mo dule for example one might want to test

the system by running it over a large corpus of text and sampling the segmental durations that are

computed For that task one would not normally care ab out what happ ens in mo dules subsequent

to duration such as intonation amplitude or dyad selection In the Bell Labs TTS architecture

one can simply run all mo dules up to and including the duration mo dule and then terminate the

pip eline with a program that prints out information ab out segments and their durations Similarly

it is easy to insert into the pip eline programs that mo dify TTS parameters in various ways

Text Analysis

We will describ e the overall framework for textanalysis only briey here concentrating rather

on giving some more sp ecic details that are relevant to Mandarin A more detailed discussion

of the mo del is given in Sproat Sproat an indepth analysis of the approach as it

applies to Chinese wordsegmentation is given in Sproat et al While the discussion here

naturally fo cuses on Chinese the reader should b ear in mind that the same mo del has b een applied

in synthesizers for seven other languages b esides Mandarin Spanish Italian French Romanian

German Russian and Japanese

Lexical Mo dels

To see how the textanalysis mo dule works let us start with an example sentence namely v {

« C Shi Lixuan has two mice We presume that the correct segmentation of this sentence

into words is v { Shi Lixuan has classier « mouse and C and that the

pronunciation should b e shlxuan you liang zh laoshu One can view a sentence or a text

generally as consisting of one or more words coming from a mo del separated by word

In English word delimiters include whitespace as well as punctuation In Chinese of course space

is never used to delimit words We can mo del this formally by representing the interword space

in Chinese with the empty string this is then used in addition to p ossible punctuation marks

as a of words Each word mo del maps words in the text that p ossibly b elong to that

mo del to a lexical analysis of that word So a p ersonal name mo del maps the string v { to

the lexical analysis v { furnpg which simply tags the string as an unregistered prop er name

an ordinary word mo del maps to fvbg a mo del of numb ers maps into â fnumg liang

nally the ordinary word mo del maps into fcl g and « into « fncg where nc means

common noun Of course precisely b ecause Chinese do es not generally delimit words in text

most sentences contain p otential ambiguities in word segmentation In the particular sentence at

hand there are no other plausible segmentations than the one b eing considered but it is nonetheless

the case that lao old and « shu rat are b oth words in our dictionary so it is at least p ossible

to segment the text in a way that splits these two characters into separate words One way to deal

with this ambiguity is to assign costs or weights to the dierent p ossible analyses As describ ed

in more detail in Sproat et al individual words in the lexical mo dels are assigned a cost

that is equal to the estimate of the negative log unigram probability of that word o ccurring For

common words such as « it is p ossible to estimate this cost directly from a corpus For unseen

words including unseen p ersonal names like v { one must estimate the cost using a mo del

for p ersonal names we follow a mo del prop osed in Chang et al with some mo dications as

discussed in Sproat et al

The lexical mo dels are implemented as weighted nitestate transducers WFST Pereira et al

Pereira and Riley which are constructed using a lexical to olkit describ ed in Sproat

For ordinary words includin g morphologically derived words a weighted nitestate mor

phological mo del is constructed which simply lists all words in the lexicon along with words that

can b e derived with pro ductive axes such as the plural marker Ì men Weighted nitestate

10

Though the latter pap er describ es the problem in a slightly dierent way from the way it is describ ed here Note

that the cited pap er also gives a fairly extensive p erformance analysis for the segmentation

mo dels are also constructed for p ersonal names and foreign names in transliteration For example

p ersonal names are allowed to follow the pattern FAMILYGIVEN where FAMILY names b elong

to a small list of a few hundred names weighted by their frequency of o ccurrence and GIVEN

names can b e basically any character weighted with an estimate of its likeliho o d in a name

The construction of the numeral expansion transducer requires a little more explanation The

construction factors the problem of expanding numerals into two subproblems The rst of these

involves expanding the numeral sequence into a representation in terms of sums of pro ducts of

p owers of the base usually ten call this the Factorization transducer The second maps from

this representation into numb er names using a lexicon that describ es the mapping b etween basic

numb er words and their semantic value in terms of sums of pro ducts of p owers of the base call

this the Number Lexicon transducer A sample Numb er Lexicon fragment for Mandarin is shown in

Figure Note that unlike the case of ordinary words and names the weights here subscripted

for some entries are arbitrary since one do es not in general need to have an estimate of the

likeliho o d of a sequence like in order to b e sure that it should b e treated as a unit but whereas

the mapping from fg to G er is free the mapping from fg to â liang has a p ositive cost which

means that in general this will b e a disfavored expansion The entry T indexes the particular

pronunciation of T in this case san that should b e used another p ossible pronunciation is

san Similarly index resp ectively the pronunciations y y and y each of which

is appropriate in a dierent context see b elow on how these dierent pronunciations are selected

dep ending on the context

The rst stage of the expansion is language or at least languagearea dep endent since languages

dier on which p owers of the base have separate words see Hurford inter alia so Chinese

and Japanese have a separate word for whereas most Europ ean languages lack such a word

The full numeral expansion transducer is constructed by comp osing the Factorization transducer

with the transitive closure of the Numb er Lexicon transducer The construction of the numeral

expander for Mandarin is shown schematically in Figure

Additional Mandarinsp ecic cleanup is required for numeral expansion over and ab ove what

was said here For example the lexicon given in Figure coupled with the mo del just describ ed

would predict G erbai instead of â liangbai for and d ¯ ¯ yqian lng lng

y rather than d ¯ yqian lng y for This cleanup can b e accomplished by rewrite

rules such as those given b elow which guarantee that the allomorph of is used b efore

hundred and d thousand that â is similarly used instead of G in the same environments and

that strings of ¯ reduce to just one ¯

¯ ¯ Numb er Numb er

G â jd

jd

Other rules select the allomorphs for and â for G b efore U wan ten thousand and õ

y hundred million As shown in Kaplan and Kay given certain conditions sets of rewrite

rules of this kind can b e compiled into nitestate transducers The particular algorithm for achiev

ing this compilation that is used here is describ ed in Mohri and Sproat

11

The G â allomorphy rule is of course just a sp ecial instance of the measure wordinduced allomorphy discussed

b elow

¯

G

â

T

Q

d

Figure Mandarin numb er lexicon The symb ol sequence on the lefthand side of each row maps

to the symb ol sequence on the righthand side See the text for an explanation of the symb ols

and T Subscripted costs of mark some entries as b eing disfavored

Factorization )



Numb erLexicon +

T T Q

Figure Expansion of in Mandarin

Returning now to our example recall that we dened a sentence as consisting of one or more

words separated by word delimiters We can now dene this somewhat more formally as follows

Call the ordinary word WFST W the p ersonal name WFST P and the numeral expansion WFST

N Then a single text word can b e mapp ed to all p ossible lexical analyses by the union of these

transducers along with other lexical transducers or in other words W P N The interword

delimiter mo del for Chinese simply maps to a word b oundary denoted as and

punctuation marks such as C into various typ es of proso dic phrase b oundary let us denote this as

punc for a phrase b oundary For example we assume that p erio d C maps to an intonational

phrase b oundary i As we note b elow this presumes that punctuation always signals a proso dic

phrase b oundary whereas a proso dic phrase b oundary never o ccurs b etween two adjacent words

clearly this is wrong and this is something that we plan to improve up on in future work The

interword mo del I is thus given as punc Finally a sentence can b e dened as

W P N I or in other words one or more instances of a word separated by interword

12

We assume some familiarity on the part of the reader with the concept of prosodic phrase A particularly clearly

articulated theory of proso dic phrasing is describ ed in Pierrehumb ert and Beckman

material call this the lexical analysis WFST L Representing the input sentence as an unweighted

acceptor S we can then compose the sentence with L to yield the weighted lattice S L of all

p ossible lexical analyses One could then apply a Viterbi search to this lattice to nd the b est

analysis based purely up on the lexical costs

Language Mo dels

However it is often desirable as in the case of the example sentence at hand to do some analysis of

context in order to more nely hone the analyses represented in the lattice and p ossibly eliminate

some from further consideration Adopting the standard terminology from sp eech recognition we

term such contextual mo dels as language models The contextual mo del in the current Mandarin

system is rather simple dealing only with the allomorphy of numerals b efore classiers When two

o ccurs as a simple numeral ie not in a comp ound numeral like twenty two b efore a classier

such as Ó ge zh or ø tiao it should take on the form â liang rather than G er This can b e

accomplished by a lter that simply disallows G o ccurring as an expanded numeral tag fnumg

directly b efore a classier An expression of the following form where hanzi stands for any charac

ter and is conventionally any lab el in the alphab et of the transducer accomplishes this

 

G fnumghanzi fclg

Represented as a transducer T and comp osed with S L the language mo del will weed out

analyses that violate the constraint in the example sentence analyses including the sequence G

are ruled out Now B estP athS L T yields the desired result v { furnpg fvbgâ

fnumg fclg « fncgi Note that this analysis includes the information that the sentence is a

single proso dic phrase since the only phrase b oundary is the marker i mapp ed from C

Phonological Mo dels

Having singled out a lexical analysis it is then necessary to pro duce a phonological analysis For

the most part this can b e handled by a simple transducer that maps from the lexical representation

of the characters into their phonemic representation note that homographs such as @ ka in @

kafei coee but ga in @ ù gal curry are generally disambiguated as part of the lexical analysis

though see b elow The two main complications in the transduction to phonetic representation

involve two rules of tone sandhi namely third tone sandhi and the sp ecial sandhi rule for not

whereby the default fourthtone variant bu is changed to a rising tone variant bu b efore a following

falling tone

Third Tone Sandhi Consider rst third tone sandhi This rule which is describ ed in detail

in Shih changes a low tone into a rising tone when it precedes another tone We

notate this derived tone as

The rule applies cyclically applying to the smallest proso dic structure rst and to successively

13

In a rule remiscent of the rule for sp eakers of Mandarin in northern China often change the high tone of the

numerals C q seven and K ba eight into a rising tone b efore a following falling tone we have not implemented

this rule in the current version of the Mandarin synthesizer

larger containing structures the proso dic structures involved is related to the morphosyntactic

structure of the utterance by an algorithm describ ed more fully in Shih For the purp oses

of the present discussion note that the algorithm will follow the syntactic structure in grouping

p olysyllabic words together into a proso dic unit b efore it will group those words with other words

into a larger proso dic unit Thus in the utterance p « xiao laoshu little rat « rat will b e

a unit that is contained inside the whole unit that contains p « little rat similarly « will

b e a proso dic unit within the larger proso dic unit spanning the utterance « Ë laoshu sh rat

droppings

xiao laoshu xiao laoshu p « little rat

laoshu sh laoshu sh « Ë rat droppings

As one can see this dierence in structure leads to a dierence in the application of third tone

sandhi In the case of p « third tone sandhi applies rst to the smallest unit laoshu changing

the rst syllable to lao In the next larger unit the syllable xiao no longer precedes a low tone

so third tone sandhi do es not apply is blo cked In « Ë lao is changed to lao in the same

way but here the larger structure is leftbranching and the extra syllable is to the right Thus the

low tone of shu still precedes the low tone of sh and so this syllable changes to shu

One commonly held misunderstanding ab out third tone sandhi is that the surface tonal pattern

directly reects the syntactic structure and that the b oundary strength b etween a sandhinon

sandhi pair is always weaker than a nonsandhisandhi pair The following examples show that this

view is incorrect In these two contrasting examples the structures of the sentences are identical

but b ecause of the tonal dierence on the last syllable the tonal outputs are quite dierent Given

the same sub jectpredicate b oundary b etween L and mai in one case tone sandhi applies to the

sub ject Li while in another case tone sandhi applies to the verb mai This prop erty of tone

sandhi makes it dicult for statistical mo deling b ecause the space of p ossible structures is large

Note that the eect is nonlo cal in that the tone of nal syllable ultimately determines the tone of

the second syllable of the sentence Longer sentences displaying exactly the same prop erty could

easily b e constructed

LaoL mai haojiu LaoLi mai haojiu R n s Old Li buys go o d wine

vp

LaoL mai haoshu LaoL mai haoshu R n Ñ Old Li buys go o d b o oks

vp

Using a cyclic rule we can easily derive the desired output In the rst sentence Lao and hao

assume the sandhi tone while the most deeply emb edded structures Lao L and hao jiu are b eing

scanned The next level of scanning includes mai hao jiu since hao now has a sandhi tone this

destroys the sandhi environment for mai The nal round of scanning includes the whole sentence

with the syllable pair of interest b eing L mai Since the tone sandhi environment is met the rule

applies to pro duce Li despite the relatively strong b oundary b etween sub ject and predicate

For the second sentence hao keeps its third tone b ecause it is not followed by another third tone

Next mai assumes the sandhi tone since it precedes a third tone Finally when the syllable

pair L mai is b eing considered tone sandhi is no longer applicable for L since the tone sandhi

environment was destroyed by the previous application to mai

The current implementation of third tone sandhi in the synthesizer do es not build deep proso dic

structures but rather relies on the observation in Shih that one p ossible pattern for Sandhi

application is a rstpass application within proso dic words and a secondpass phrasal application

This is implemented with the following two rules which apply in the order written the rst of

which is written so as to apply only within words the second which applies across words We

presume a phonetic representation where tone symb ols are written b efore the asso ciated syllable

fPhonemeg

fPhonemeg

These rules are compiled into a transducer using the rewrite rule

Sandhi for bu not The sp ecial sandhi rule for not is dierent from b oth third tone sandhi

on the one hand and the tonal variation for one on the other in that it makes reference to b oth

lexical prop erties ie the fact that the morpheme is and not some homophonous morpheme

such as bu cloth and to phonological prop erties of the following syllable ie that it has a

falling tone Such a rule could b e implemented as a twolevel rule Koskenniemi but we have

chosen the equivalent though formally dierent route of representing the underlying pronunciation

of with a sp ecial tonal symb ol and then having a rule which expresses as b efore

a following and otherwise as Thus Ý bu kan not lo ok b ecomes bu kan

[

The simple transducer that maps from characters to their basic pronunciations is comp osed

with the transducers implementing the tone sandhi rules just describ ed to pro duce a trans

ducer P that maps from lexical representations into pronunciations The entire transduction

from text into phonemic representation can now b e dened as B estP athB est P at hS L

T P For the input sentence v { « C this pro duces the phonetic transcription

S li xYeN yO lyaG Z lW Su i using the phonetic transcription scheme

intro duced in Section Once the b est paths of the lexical and phonological lattices have b een

built the information represented in those paths is placed into internal structures that are then

passed to the next mo dule of TTS the mo dule which builds proso dic phrases based on information

ab out the lo cation of those phrases as provided by the textanalysis mo dule

Future Work

Two ma jor deciencies in the current textanalysis mo dels for Mandarin are the prediction of

proso dic phrase b oundaries and the treatment of homographs

Proso dic phrase b oundary prediction for the Bell Labs American English system rep orted in

Wang and Hirschb erg dep ends up on a Classication and Regression Tree CART mo del

trained on a tagged corpus to predict the presence of proso dic phrase b oundaries The factors used

in the CART mo del include information on the part of sp eech of words surrounding the putative

b oundary as well as the distance of that b oundary from the nearest punctuation mark The mo del

is able to predict phrase b oundaries even when there is no punctuation to signal the presence of

those b oundaries only whitespace b etween the adjacent words An analogous solution could b e

used for Chinese Indeed as is shown in Sproat and Riley it is p ossible to compile a CART

tree into a weighted nitestate transducer so the information incorp orated in the decision tree

could b e used directly in the textanalysis system we have describ ed

As noted ab ove homographic characters are to some extent disambiguated using lexical infor

mation However there are many cases where lexical constraints are lacking or at least to o weak

to b e reliable Consider the word P which is tonghang if it is a noun meaning colleague but

if it is a verb meaning travel together In principle a reliable partofsp eech analyzer tongxng

could help disambiguate these cases So in a phrase like b P º ù zai tonghang jian de

pngjia in colleague among s evaluation evaluation amongst colleagues with the pronunciation

tonghang the frame b amongst should in principle b e a clear indicator that the contained

phrase is a noun But the case is p erhaps less clear for a phrase like P ylu tongxng

one way togethertravel traveling together all the way with the pronunciation tongxng here

lu could in principle b e either a noun or a measure word and in either case purely grammatical

constraints do not really suce to rule out the nominal interpretation of P In any event it has

b een our exp erience with English that partofsp eech algorithms of the kind rep orted in Church

in general p erform very badly at disambiguating homographs like wound ie waund as

the past tense of wind wund as a noun A partial solution to the problem for English has

b een discussed in Yarowsky The metho d starts with a training corpus containing tagged

examples in context of each pronunciation of a homograph Signicant lo cal evidence eg ngrams

containing the homograph in question that are strongly asso ciated to one or another pronuncia

tion and widecontext evidence ie words that o ccur anywhere in the same sentence that are

strongly asso ciated to one of the pronunciations are collected into a decision list wherein each

piece of evidence is ordered according to its strength log likeliho o d of each pronunciation given the

evidence A novel instance of the homograph is then disambiguated by nding the strongest piece

of evidence in the context in which the novel instance o ccurs and letting that piece of evidence

decide the matter The metho d has b een shown to have excellent p erformance in disambiguating

English homographs and some exploratory analysis using the metho d for Mandarin also seems

very promising In the Mandarin examples discussed ab ove P is correctly classied as an

instance of the xng pronunciation due to the preceding way which is statistically a reliable

indicator of that pronunciation similarly b P º ù is correctly assigned the pronunciation

hang b ecause of the following character among As with the case of decision trees it should

b e p ossible to represent the decision list as a WFST in which case the results of training can b e

used directly in the textanalysis mo del discussed in this section

Acoustic Inventory

Our synthesizer is an LPCbased concatenative system The sp eaker up on which the synthesizer is

based is a male sp eaker of Beijing Mandarin Sp eech segments are stored in a table and selected

for synthesis at run time These units are then concatenated together with some sp eech signal

manipulation to achieve the desired length pitch and amplitude Most of the units are diphones

but a few triphones are included to avoid problematic diphones As we have already noted in

Section the use of diphones rather than syllables distinguishes our system from most other

systems for Mandarin

Sound Inventory

Our current synthesizer distinguishes phones for Mandarin plus extra ones to read English

letters The Mandarin phones include initial consonants glides consonantal endings

14

Still this is not completely reliable since a deverbal noun could also app ear in this p osition b « zai xngzou

jian while going

15

Some English letter names have b een incorp orated into Chinese and can eectively b e said to b e part of the mo dern

language X eksguang Xrays OK PC In addition a high p ercentage of Chinese text contains untransliterated

single vowels and diphthongs These are listed in Table together with ÷ µ pinyin and ` µ Å

zhuy n fuh ao ZYFH notations Note that the corresp ondence b etween our system and pinyin

or ZYFH is not onetoone due to the quasiphonemic rather than phonetic nature of these

transliteration schemes and to phonetic ambiguities within the sp elling systems For the sake

of clarity we present pinyin sp ellings in italics and present our internal representation in square

brackets

The consonant inventory is uncontroversial Except for the addition of v which is used as a

allophone of w in front of nonback vowels by northern sp eakers our system is consistent with the

traditional description of Mandarin sound inventory There are two vo calic endings i and u which

are not listed in our system b ecause we treat diphthongs as whole units Furthermore Mandarin

has three onglides y w and yu which have similar formant target values to their corresp onding

high vowels i u and u

The numb er of Mandarin vowels has b een matter of heavy debate due to strong coarticulation

eects b etween vowels and surrounding phonetic material For example most traditional descrip

tions consider the three variants of a manifested in pinyin ba ban and bian to b e the same

phoneme Such a phonemicallyoriented categorization is not suitable for a sp eech synthesizer

which must capture phonetic reality So we separate the a sounds in ba and ban and represent them

with dierent symb ols a and At the same time we merge the a in bian with the phone e

since there is no acoustic dierence b etween the two Likewise the letter i in pinyin represents

three dierent sounds in zi zhi and di which are also represented separately in our system

The four diphthongs ei ai ao and ou are represented as whole units in our system

Diphone Units

A complete sp eech synthesizer must b e able to handle all linguisticall y p ossible phonetophone

transitions It is easy to calculate the required diphones in Mandarin b ecause the inventory of

syllables is rather uncontroversial at least at the phonemic level the same thing cannot b e

said for example ab out English One requires all syllable internal units including consonant

glide consonantvowel glidevowel and vowelending plus crosssyllable units which include the

transitions b etween all units that can end a syllable vowel ending or silence and all units that

can b egin a syllable silence consonant glide or vowel The numb er of required diphones can b e

drastically reduced by combining units that are acoustically indistinguishabl e For example given

two nasal endings N and G n and ng and ten stopsaricates p t k b d g z c zh ch we

in theory will need twenty diphonic units But in fact all Mandarin stopsaricates are voiceless

and b egin with a section of silent closure so we can simply collect two units N G and

use them for all of the stops and aricates One can go further in eliminating vowelstop and

vowelaricate units if the size of the inventory is an issue this was what was done in the Bell

Labs Mandarin system However this do es result in a slight loss of quality since there is of course

some coarticulation b etween the vowel and the following consonant we have therefore kept these

foreign usually English words Our textanalysis mo dule actually contains a dictionary giving pronunciations of

ab out twenty thousand frequent English words but we currently do not have the acoustic units necessary to handle

such words elegantly within the system

16

Northern Mandarin also has a retroex à r ending resulting from suxation in addition to the nasal endings

n and ng

Bell Labs System Pinyin ZYFH

Consonant p p u

t t y

k k

b b t

d d x

g g

z z

j j

Z zh

c c

C ch

q q

f f w

s s

x x

S sh

h h

n n z

m m v

l l

r r

v wei

Ending N in

G ng

L yar

Glide y y xie

w w shua

Y yu juan

Vowel i i

ci

shi

u u

U u

e xie

E ke ken keng

o duo

R er

a a

O ou

A ei

I ai

W ao

an

Table Phonemes

units in the current version

Another time and space saving strategy we adopted for the version was to eliminate

glides from the sound inventory using the corresp onding high vowels in their place Ab out eighty

diphones can b e eliminated that way We decided to revive the glides for the current version again

to improve quality at a cost of a larger inventory Interestingly the problem in combining glides

and vowels is not in formant incompatibility The ma jor problem lies in the diculty in mo deling

crosssyllable units with withinsyllable unit and vice versa For example the vowel to vowel unit

uA is typically pronounced with a glottal stop at the syllable b oundary If such a unit is used for

the glide to vowel unit wA for the syllable wei the presence of the glottal stop will b e noticeable

So care must b e taken to collect units without intervening glottal stops But if one do es that

then one has the opp osite problem in that some syllableb oundaries may b e hard to detect in the

synthetic output as a result sequences like Q ywu ysh one ve one ten systematically

were hard to understand in the version since the rst three syllables were blurred together

The version contained diphones note that this numb er is only marginally higher than

the syllable count The current Bell Labs Mandarin system contains diphones plus a few

triphones The units include diphones needed to read English letters

Diphoneset Expansion and Triphones

A diphonic system works well when the edges of concatenating units are similar Moreover in the

case of vowels and glides the formant tra jectories should match as well This ideal situation may

not materialize for two reasons First there is a lot of variation in sp eech pro duction in that two

recordings of the same phone in the same context may nonetheless have enough dierences to cause

problems in diphone concatenation thus two recordings by the same sp eaker of the word cat may

b e suciently dierent that there will b e a noticeable discontinuity if one concatenates the ka

unit from the rst utterance with the at unit of the second This problem can b e minimized

by making a large numb er of recordings usually with a sucient numb er of utterance one is able

to nd at least one unit that matches reasonably well with other units with which it must b e

concatenated

The second problem is systematic coarticulation that intro duces strong coloring to some diphone

units making them p o or concatenative units with other normal units This problem cannot b e

solved by simply making further recordings since the problem relates to fundamental phonetic

prop erties of the language and not accidental dierences in pronunciation Even if a sp eaker may

o ccasionally pro duce the unit without the coarticulation eect so that a particular diphone falls

into the acceptable range for concatenation such units may not b e accepted as a natural rendition

by listeners Two solutions to this problem are rstly to increase the set of phones so as to arrive

at a b etter classication of the phonetic variation and secondly to construct triphonic units We

will see instances of b oth these solutions in the treatment of units containing the central vowel

There are two problematic areas in Mandarin for diphone collection Shih The rst of

these involves glides Glides are short with fastmoving formant tra jectories their formant values

b eing heavily inuenced by the following vowel A consonantglide unit such as dw taken from

the o context may not connect well with wA One easy x is to collect triphones such as dwo

and dwA However the connectivity problem here is less severe than the problem with wEN

to b e discussed b elow so we have chosen not to implement these sequences as triphones in the E, E(N), E(G) E(N), (w)E(N) 500 2000 3500 500 2000 3500

1 2 3 4 5 1 2 3 4 5

Figure Coarticulation eects on E

present version

The second problem involves the central vowel e as in ô ge brother which we represent as

E It is not surprising that the central vowel or the schwa is a problematic sound in English

since schwa in English is always unstressed It is interesting that the central vowel presents a

problem to Mandarin as well since in Mandarin the central vowel E can b e fully stressed Still

even stressed central vowels show a considerable amount of coarticulation with preceding glides and

with following nasal endings Figure shows the ma jor coarticulation eects on E Tra jectories

of the rst three formants of E in dierent contexts are compared The ve p oints on the xaxis

corresp ond to the and p oints of the vowel duration Averaged formant

frequencies are plotted on the yaxis In the lefthand panel solid lines are the average values of

F F and F of the all the Es in our database Dotted line are the average formant values

of all the Es in the EN context central vowel followed by an alveolar nasal co da and dashed

lines are the average formant values of all the Es in the EG context central vowel followed

by a velar nasal co da The F of E in the EN context is signicantly higher than that of the

average E while the F of E in the EG context is signicantly lower than that of the average

E The directions of the coarticulation eects are exp ected and are p erfectly consistent with the

anticipated tongue p ositions of the following sound However the coarticulation eects start early

already present into the vowel Furthermore the magnitude of the eects are strong The

dierence in the F of E in EN and EG ranges from Hz at the p oint to Hz at

the p oint If such predictable discrepancies of the same phonemic sound are not taken care

of it can severely hamp er the quality of a concatenative sp eech synthesizer Take pEN pen as an

example It is synthesized by connecting two units pE and EN If the unit pE is taken from

the syllable pEG its F will b e to o low for EN On the other hand if the unit pE is taken

from the EN context then its F will b e to o high and its rising F pattern incompatible with

EG To ensure smo oth connection E in the EN context should b e considered a separate sound

from the rest of the E samples

The righthand panel of Figure shows additional coarticulation eects caused by the back

rounded onglide w The solid lines plots the tra jectories of the rst three formants of all the Es

in the EN context in contrast to the formant tra jectories of Es in the wEN context which

are ploted in dotted lines Note that a preceding w eectively lowers F for the rst of

the vowel resulting in a drastic rising slop e of F Fast moving formant tra jectories like this are

incompatible with a steady state vowel making wEN a p o or source for a generic diphone unit

EN and making wEN a go o d candidate for a triphonic entry Indeed our study shows that

wEN is the only triphone that is absolutely needed for Mandarin

Thus based on the coarticulation eects just discussed we collected a triphone unit wEN

and two sets of diphone units for E one set b eing taken from and used in the N context the

other set b eing taken from and used in all other contexts

Duration

The duration mo dule assigns a duration value to each phoneme The value is estimated from

various contextual factors by multiplicative mo dels that are tted from a sp eech database collected

sp ecically for the study of duration The general design of the mo del and research metho dology

were rep orted in van Santen and the Mandarin study was rep orted in more detail in Shih

and Ao As we explained in Section the diculty in proso dic mo deling lies in the large

numb er of factor combinations We approach this problem from b oth ends maximizing the coverage

in our database on the one hand and estimating unseen data p oints on the other hand

The rst stage taken in pro ducing a duration mo del involves constructing a set of sentences to

b e read by the sp eaker which cover all the desired factor combinations This preparation involves

the following steps

Automatically transcrib e short paragraphsized sentences in the online text database into

broad phonetic transcription using the textanalysis mo del describ ed in Section

Co de each segment with relevant factors including the identity of the segment and its tone

the category of the previous segment and tone the category of the next segment and tone the

syllable typ e and the distance from the b eginning and end of the word phrase and utterance

We then combine the co ded factors into factor triplets each containing the identity of the

current segment its tone and another factor

Feed the co ded factor triplets to a greedy algorithm Cormen et al In each round a

sentence with the largest numb er of unseen factor triplets is chosen until all factor triplets

are covered or the numb er of sentences has reached a preset limit In our case a total of

sentences consisting of segments covering all the factor triplets that are present in the

input were chosen from a p o ol of sentences

Record the sentences by the same male Beijing sp eaker used for the concatenative unit

inventory

In some cases the sp eaker uttered the sentence in a dierent way from what was exp ected

from the automatic transcription This required some reco ding of the transcriptions in order

to reect what was actually said Also added was information on proso dic phrasing and

three levels of prominence neither of which was generally predictable via the automatic

transcription

17

Our system contains a numb er of other triphones but those are due to accidental bad connection of diphones

rather than consistent unavoidabl e coarticulatio n eects

18

The online corpus used for duration mo deling was the ROCLING newspap er corpus

o E i U u e

R A O I F W a

Table Intrinsic Duration of Vowels in Msec

h f S x s

Table Intrinsic Duration of Fricatives in Msec

This pro cedure resulted in a data matrix including every phone in the database its co ded

contextual factors and of course its duration Preliminary exploratory data analysis was p erformed

on this data matrix and data p oints were separated into classes based on this analysis so as to

minimize factor interaction within class For example we know that nal lengthening aects onset

and co da consonants in dierent ways so onset and co da consonants are separated into dierent

classes for mo deling The eventual classes that were decided up on were vowel glide closure of

stop and aricate burst and aspiration of stop and aricate fricative sonorant consonant and

co da Multiplicative mo dels are tted for each class separately to predict the duration

We compare near minimal pairs in the database and interp olate the values for missing cells

Below is a muchsimplied example of the estimation of missing values namely a stressed S in a

certain segmental tonal and proso dic environment If the durations of unstressed f S and s

in the same environment are and msec resp ectively we hyp othesize that the relative

scale of these three sounds is in the ratio and that the scale should hold under the

inuence of stress as well If the stressed f and stressed s in the same contextual environment

are and msec resp ectively we deduce that stress in that environment has the eect of

increasing the duration value by so the value of the missing stressed S should b e msec

Our data show very clear patterns of such single factor independence and this is an imp ortant

theoretical assumption in estimating duration values of missing data p oints

Among the fourteen factors that we investigated the identity of the following tone is the only

one that shows nearly no eect on any classes of sound The factors that consistently have a strong

eect on all classes of sounds are the identity of current segment and prominence All the other

factors have some eect on some classes but not on others

The intrinsic duration IDur of vowels fricatives burstsaspirations and closures estimated

from our database are summarized in Table Table Table and Table resp ectively

At runtime segment identities and relevant factors are computed and factor co ecients multi

plied to derive the estimated duration For example if a sentence contain a single sound i its

duration will b e computed as follows

b d g Z z j p t k C c q

Table Intrinsic Duration of BurstAspiration in Msec

Ccl ccl qcl zcl tcl jcl Zcl kcl dcl gcl p cl b cl

Table Intrinsic Duration of Closure in Msec

D ur msec

i

I D ur msec

i

F F

tone pr ev iousphonen ull

F F

pr ev ioustonenu ll nextphonenull

F F

nexttonenull str ess

F

pr eceding sy llablein w or d

F

f ollow ing sy llableinw or d

F

pr eceding sy llablein phr ase

F

f ollow ing sy llablein phr ase

F

pr eceding sy llable inu tter ance

F

f ollow ing sy llableinu tter ance

F

sy llty pev

Intonation

The intonation mo dule computes F values for the synthetic sp eech Since Mandarin is a tone

language one imp ortant task is to capture tones and tonal variations

We start out with a basic representation of Mandarin tones derived from text analysis which

are then passed to a set of tonal coarticulation rules that take care of changes that result from the

mechanics of tonal concatenation At this stage we derive a sequence of tonal targets that is devoid

of sentence intonation and emphatic expression but otherwise represents smo othly connected tones

Eects of sentence intonation and emphasis can then b e added We have implemented declination

and downstep two ma jor factors that causes pitch to drop over the course of an utterance We

plan to add phrasing and emphasis eects in the near future

Mandarin has four tones plus the p ossibility that a syllable is reduced and carries no lexical

tones a phenomenon traditionally referred to as neutral tone or tone Table lists the tone

names their description and the tonal target we assign to the tones The letters H M L represent

Tone Symb ol Tone Tone Tone Tone Tone

Tone Name High Level Rising Fallingrising Falling Neutral

Targets HH MH M L M H L M

Table Mandarin Lexical Tones

Tone 1 CV HH

Tone 2 CV MH

Tone 3 CV M L

Tone 4 CV H+ L

Tone 0 CV

M

Figure Alignment of Tonal Targets

high mid and low targets resp ectively Figure shows the schematic alignment of targets to

syllables

The phonetic realization of the four lexical tones are relatively stable and is adequately reected

in their names The rst target of tone is placed near the center of the rime to free the rst p ortion

for coarticulation with the preceding tone The rst target of tone is higher than the normal H

level hence the notation H It is placed somewhat into the rime to create the p eak asso ciated

with tone

The phonetic shap e of the neutral tone is unstable and it primarily dep ends on the tonal

identity of the preceding tone In general neutral tones following tones and have a falling

shap e but neutral tones following tone have a rising shap e For the four lexical tones we assign

multiple tonal targets that span the duration of the rime which ensures the stability of the tone

shap es The single M target of tone is asso ciated to a p oint near the end of the rime The single

target captures the unstable tone shap e its asso ciation near the end of the rime accounting for

the fact that the shap e is largely determined by the preceding tone When the nal target of the

preceding tone is L as in the case of tone the pitch rises toward M but when the nal target of

the preceding tone is H as in the case of tone and tone the pitch falls to M A neutral tone

following a tone requires sp ecial treatment A nonnal tone diers from a nal one in that it

ends in a M target rather than L Given that the predicted shap e of a neutral tone following tone

is level not the observed falling This will b e handled by the coarticulation rules

Tones in connected sp eech deviate from their canonical forms considerably The changes may

come from tonal coarticulation Shih Shen b Xu or phrasal intonation Shen

Shih Shih Garding as well as various lowering eects Our rules are primarily

based on our own instrumental studies Both tonal target and target alignment may b e changed by

tonal coarticulation rules but sentence level scaling is accomplished by changing the pitch height

of H M and L targets

The most imp ortant tonal coarticulation rules are listed b elow with brief descriptions

If tone is not sentencenal change its L target to M

Our phonetic data shows that nonnal tone consistently falls to M rather than to L So

the ma jority of tone actually have the tonal targets HM This rule replaces a more limited

rule that changes a tone only b efore another tone used in many systems Zhang

Shen b Lee et al Chiou et al Hsieh et al apparently following

the early rep ort in Chao Listening tests show that restricting the rule to the tone

context results in overarticulation of tone in other tonal contexts and the result of our

more general version of the rule is more natural

Given a tone if the fol lowing tone starts with a H target then delete the H of tone

Comparing tone b efore a H target and tone b efore a L target the rising slop e b efore a H

target is more gradual and the pitch do es not reach the exp ected H at the end of the tone

syllable This is not a simple dissimilation eect The desired pitch contour can b e generated

easily by assuming that the H target of tone is absent

If a tone fol lows a tone change its M target to L

Tone should have a low falling pitch when it follows a tone Since nonnal tone ends

in M assigning a L target on the tone syllable will generate the desired falling contour

If the fol lowing tone start with a H target then the L target of tone is raised L L

M L

The L target of tone is raised when the following target is H This is an assimilation eect

If tone is sentencenal add a M target assign its value to be M M H M

It is a wellknown phenomenon that tone in the sentencenal p osition or in isolation has

a fallingrising shap e while other tone have a lowfalling shap e hence the name half tone

Chao This rule adds the rising tail in the sentencenal condition The pitch rises

ab ove the M p oint but not as high as H

Coarticulation rules that are often found in other systems Hsieh et al Lee et al

but are excluded here are sandhi rules which are handled in the textanalysis comp onent and

various lowering rules which we attribute to two separate eects of declination and downstep

Downstep Lib erman and Pierrehumb ert is a lowering eect triggered by L or nonhigh

targets For example in a combination H L H the second H is lower than the rst H Declination

in the early literature Lieb erman O hman refers to the general downtrend of pitch over

the course of an utterance including downstep Following Lib erman and Pierrehumb ert

we mo dify the denition of declination as the lowering eect after the downstep eect has b een

factored out Now Lib erman and Pierrehumb ert argued strongly that once downstep and other

eects like nal lowering were factored out there is in fact no evidence for a residual declination

factor Contra this p osition we nd considerable evidence of declination in Chinese In sentences

totally lacking nonhigh targets namely sentences consisting of tone syllables there is still a

gradual lowering of pitch

Our default setting of declination is to lower the pitch of subsequent syllables by of the

distance b etween H and M We implement downstep as an exp onential decay that asymptotes to

M a where a is a constant that we set initially at Hz

H H d H M a

i

The symb ol d denotes the downstep factor that varies with tonal target identity i Our inves

tigation shows that L target causes more downstep on the following tone than M Tone with a L

target causes the most amount of downstep tone and with M value cause less downstep Tone

lacking a L or M target causes no downstep

Emphasis causes expansion of pitch range and compression of pitch range after the emphasis

The eect is most notable in the raising of H targets One dierence in our ndings from previous

rep orts Zhang Garding is in the scop e of the emphasis eect We nd that emphasis is

not strictly a lo cal eect manifested only on the emphasized syllable Pitch raising due to emphasis

starts from the emphasized syllable but the duration of the eect dep ends on tonal comp osition

Pitch returns to normal when it hits the rst L or M target which includes the M or L target of

tone if this tone is emphasized If there is a long sequence of H targets following the elevated

pitch of the emphasized syllable the unemphasized H targets remain H for a while and drift down

in a similar fashion as declination but at a faster rate The emphasis eect do es not cross phrase

b oundaries so if a sp eaker pauses after the emphasized word the pitch will return to normal

immediately

This separation of tonal coarticulation factors from intonational factors reects the way we

b elieve intonation works in a tone language On the surface there are in principle an innite variety

of tonal values But underlyingly there are clearly only ve tonal categories Each of these is related

to a tonal template which captures the invariant asp ects of the tone The tonal coarticulation rules

ensure that tonal connections are smo oth but at this stage pitch values are still represented in

relative terms as shades of H M and L We then assign values to the targets and adjust their values

according to the intonation eects that are suitable for the context such as question intonation

declarative intonation emphasis and paragraph mo deling with appropriate lowering or raising

eects Although the broad characteristics of many of these intonation eects are known Chao

Shih Shen a Tseng the level of details and control that are required for a text

tosp eech system remain to b e worked out Our design makes it easy to add layers of intonational

eects when the research results b ecome available

Conclusions

As we b elieve this pap er has made clear there has b een substantial progress on Mandarin text

tosp eech synthesis over the last two decades starting from the mid seventies to the present We

have progressed from systems which could transform annotated phonetic transcriptions into barely

intelligibl e sp eech to systems which can take Chinese written in normal Chinese orthography and

transform it into sp eech that is highly intelligibl e though certainly still mechanical in quality

Where do we go from here To a large extent the problems that remain to b e solved for

Mandarin are the same as the problems that are faced by synthesizers for any language Text

analysis capabilities are still rudimentary with particular diculties arising with the prediction of

proso dic phrase b oundaries and the selection of appropriate levels of prominence for words

The selection of appropriate intonational contours is also a dicult problem It is interesting to

note though that Mandarin is to some extent easier in this regard than say English In English

essentially none of the F variation in a sentence is lexically determined all of it b eing determined

by the selection of pragmatically appropriate accent typ es In Mandarin on the other hand a great

deal of the F variation is lexically determined with phraselevel intonation b eing resp onsible not

so much for the overall shap e of the F pattern as for the relative values of the lexically determined

p eaks and valleys This is not to marginalize the imp ortance of work on Mandarin intonation but

it do es mean that it is easier in Mandarin than in English to pro duce a reasonablesounding F

contour

A third ma jor area for improvement is the quality of the actual sp eech output Concatenative

systems continue to b e plagued by subtle problems relating to slight incompatibilities in concate

nated units and further research is needed on improved metho ds for selecting and cutting units

There is also a rip e area of research in what are sometimes termed hybrid synthesis metho ds

whereby the formant structure of adjacent units can b e manipulated during the concatenation

pro cess so as to pro duce smo other transitions or even to implement phonetic reductions or coar

ticulation for example with adequate abilities to mo dify units on the y one could p ossibly avoid

the increase in the numb er of E units discussed in Section The synthesis metho ds themselves

require further work Waveformbased approaches are showing more and more promise and may

eventually b ecome more desirable than LPCbased metho ds Finally the b est synthesis metho d in

the world will not result in naturalsounding sp eech if adequate linguistic controls of various asp ects

of voice quality are not available To take just one example it was shown in Pierrehumb ert and

Talkin that intervo calic h in English can trigger a variety of glottal states ranging from a

full aspiration to slight glottalization or even complete lenition dep ending up on the proso dic con

ditions It certainly seems to b e the case that no commercially available TTS systems implement

this variation

To quality of output one can also add variety as a goal Most concatenative systems are based

on one or at most two voices in the latter case the voices are usually male and female as in the

case of the Bell Labs American English system By mo difying such parameters as the pitch range

the vo cal tract length and the sp eech rate it is p ossible to some extent to mo dify these voices

so that they sound like dierent p eople but the results of these mo dications are usually more

comical than realistic Two approaches to improving this situation suggest themselves The rst

is to provide a set of to ols that makes it virtually painless to record a new voice and build an

inventory for that voice Systems for automatically cutting diphones from recorded sp eech given

an orthographic transcription of that sp eech have b een rep orted but we have not seen evidence

that these approaches are yet capable of pro ducing an inventory that has the same quality as one

that involves at least some handwork Nonetheless we exp ect matters to improve in this area

in the next few years The second is to investigate whether one can design transformations that

are capable of mapping an inventory of one sp eaker onto an inventory that sounds plausibly like

another sp eaker either an actual p erson or a ctitious one

We exp ect that synthesizers will continue to improve in all of these areas though it will in

our view b e many years b efore TTS systems b ecome indistinguishabl e from humans and thus

pass the Turing test for sp eech synthesis One thing that is certain is that sp eech synthesis will

rapidly b ecome a technology that is familiar not only to a select group of researchers computer

acionados and p eople with sp ecial disabilities but indeed to the public at large The applications

of TTS are numerous Systems for English have b een used as aids for the handicapp ed including

readers for the blind and sp eech output devices for p eople who have lost their ability to sp eak the

Cambridge University physicist Stephen Hawking b eing the most famous example of this they

have b een used as email readers allowing one to access ones email over the telephone they have

b een used in socalled reversedirectory applications where a caller uses a telephone touchtone

pad to key in a telephone numb er or a p ostal co de and the system reads back the name and address

of the entry or entries that match the numb er there is interest in using TTS systems for reading

directions as part of navigational systems for drivers and they have b een used in such futuristic

applications as sp eechtosp eech translation Mandarin TTS systems have b een applied in some

of these domains and their use in various applications will surely increase over the coming decade

Undoubtedly there will also b e a demand for synthesizers in other Chinese languages As we have

mentioned there has b een some work on Taiwanese and Cantonese and we certainly exp ect that

work on regional languages will continue

Acknowledgments

We acknowledge BDC Taiwan for the Behavior ChineseEnglish Electronic Dictionary used in

our word segmenter and ROCLING for providing us with the text database used in our duration

study J S Chang kindly provided us with his database of p ersonal names We also wish to

thank Rob French for manually segmenting our acoustic inventory database and Jan van Santen

for duration analysis to ols as well as general advice on durationrelated issues Benjamin Ao also

contributed substantially to the work on duration mo deling Mike Tanenblatt was resp onsible for

pro ducing robust versions of the multilingual intonation and duration mo dules Finally we extend

our thanks to Shengli Feng for providing the voice for our synthesizer

REFERENCES

Jonathan Allen M Sharon Hunnicutt and Dennis H Klatt From Text to Speech The MITtalk

19

For example Silverman discussed the Apple work in this area at the ESCAIEEE Workshop on Sp eech

Synthesis in New Paltz New York

20

Our Mandarin and English systems are b eing used in a limiteddomai n sp eech translation system under devel

opment at ATT Research

System Cambridge University Press Cambridge UK

Benjamin Ao Chilin Shih and Richard Sproat A CorpusBased Mandarin TexttoSp eech Synthe

sizer In Proceedings of the International Conference on Spoken Language Processing Yoko

hama Japan pp ICSLP

Leonard Blo omeld Language Holt Rinehart and Winston New York

Leo Breiman Jerome H Friedman Richard A Olshen and Charles J Stone Classication and

Regression Trees Wadsworth Bro oks Pacic Grove CA

Lianhong Cai Hao Liu and Qiaofeng Zhou Design and Achievement of a Chinese TexttoSp eech

System under Windows Microcomputer Vol

NgorChi Chan and Chorkin Chan Proso dic Rules for Connected Mandarin Synthesis Journal of

Information Science and Engineering Vol pp

YuehChin Chang YiFan Lee BangEr Shia and HsiaoChuan Wang Statistical Mo dels for the

Chinese TexttoSp eech System In Proceedings of the nd European Conference on Speech

Communication and Technology pp

YuehChin Chang BangEr Shia YiFan Lee and HsiaoChuan Wang A Study on the Proso dic

Rules for Chinese Sp eech Synthesis In International Conference on Computer Processing of

Chinese and Oriental Languages pp

JyunSheng Chang ShunDe Chen Ying Zheng XianZhong Liu and ShuJin Ke LargeCorpus

Based Metho ds for Chinese Personal Name Recognition Journal of Chinese Information Pro

cessing Vol No pp

Yuen Ren Chao Tone and Intonation in Chinese Bul letin of the Institute of History and Philology

Academia Sinica Vol No pp

Yuen Ren Chao A Grammar of Spoken Chinese University of California Press

F Charp entier and E Moulines Pitchsynchronous Waverform Pro cessing Techniques for Text

toSp eech Synthesis Using Diphones Speech Communication Vol No pp

SinHorng Chen and Yih Ru Wang Vector Quantization of Pitch Information in Mandarin Sp eech

IEEE Transactions on Communications Vol pp

SinHorng Chen Saga Chang and SuMin Lee A Statistical Mo del Based Fundamental Frequency

Synthesizer for Mandarin Sp eech Journal of the Acoustical Society of America Vol No

pp

SinHorng Chen Shaw Hwa Hwang and ChunYu Tsai A First Study of Neural Net Based

Generation of Proso dic and Sp ectral Information for Mandarin TexttoSp eech In Proceedings

of IEEE ICASSP Vol pp

SinHorng Chen ShawHwa Hwang and YihRu Wang An RNNbased Proso dy Information

Synthesizer for Mandarin TexttoSp eech IEEE Audio and Speech Processing forthcoming

HongBin Chiou HsiaoChuan Wang and YuehChin Chang Synthesis of Mandarin Sp eech Based

on Hybrid Concatenation Computer Processing of Chinese and Oriental Languages Vol

No pp

John Choi HsiaoWuen Hon JeanLuc Lebrun SunPin Lee Gareth Loudon VietHoang Phan

and Yogananthan S Yanhui A Software Based High Performance Mandarin TexttoSp eech

System In Proceedings of ROCLING VII pp

Noam Chomsky and Morris Halle The Sound Pattern of English Harp er and Row New York

Min Chu and Shinan Lu High Intelligibil i ty and Naturalness Chinese TTS System and Proso dic

Rules In Proceedings of the XIII International Congress of Phonetic Sciences Sto ckholm

pp

Kenneth Church A Sto chastic Parts Program and Noun Phrase Parser for Unrestricted Text In

Proceedings of the Second Conference on Applied Natural Language Processing Morristown

pp Asso ciation for Computational Linguistics

Cecil Coker Sp eech Synthesis with a Parametric Articulatory Mo del In Speech Symposium

Kyoto Reprinted in Flanagan James and Rabiner Lawrence eds

Dowden Hutchinson and Ross Stroudsburg PA

T H Cormen C E Leiserson and R L Rivest Introduction to Algorithms The MIT Press

Cambridge Massachussetts

John DeFrancis The University of Hawaii Press Honolulu

N R Dixon and H D Maxey Terminal Analog Synthesis of Continuous Sp eech Using the Diphone

Metho d of Segment Assembly IEEE Transactions on Audio Electroacoustics Vol AU pp

Gunnar Fant Acoustic Theory of Speech Production Mouton The Hague

Long Feng Beijinghua Yuliu Zhong Sheng Yun Diao de Shichang Duration of Consonants Vowels

and Tones in Beijing Mandarin Sp eech In Beijinghua Yuyin Shiyanlu Acoustic Studies of

Beijing Mandarin pp Beijing University Press

Eva Garding Sp eech Act and Tonal Pattern in Constancy and Variation

Phonetica Vol pp

W L Henke Preliminaries to Sp eech Synthesis Based up on an Articulatory Mo del In Proceedings

of the IEEE Conference on Speech Communication Process pp IEEE

ChingJiang Hsieh ChiuYu Tseng and LinShan Lee An Improved Chinese TexttoSp eech Sys

tem Based on Formant Synthesis Techniques Computer Processing of Chinese and Oriental

Languages Vol No pp

TaiYi Huang CaiFei Wang and YohHan Pao A Chinese TexttoSp eech Synthesis System

Based on an InitialFinal Mo del Computer Processing of Chinese and Oriental Languages

Vol No pp

James Hurford The Linguistic Theory of Numerals Cambridge University Press Cambridge

Shaw Hwa Hwang and SinHorng Chen Neural Network Synthesiser of Pause Duration for Man

darin TexttoSp eech Electronic Letters Vol No pp

Shaw Hwa Hwang and SinHorng Chen Neuralnetworkbased F TexttoSp eech Synthesiser for

Mandarin IEE proceedings Vision Image and Signal Processing Vol No pp

Shaw Hwa Hwang and SinHorng Chen A Proso dic Mo del of Mandarin Sp eech and its Application

to Pitch Level Generation for TexttoSp eech In Proceedings of IEEE ICASSP Vol pp

ShawHwa Hwang YihRu Wang and SinHorng Chen A Mandarin TexttoSp eech System In

Proceedings of ICSLP

Hector Javkin Kazue Hata Lucio Mendes Steven Pearson Hisayo Ikuta Abigail Kaun Gregory

DeHaan Alan Jackson Beatrix Zimmermann Tracy Wise Caroline Henton Merrilyn Gow

Kenji Matsui Nariyo Hara Masaaki Kitano DerHwa Lin and ChunHong Lin A Multi

lingual TexttoSp eech System In Proceedings of ICASSP pp

GwoHwa Ju WernJun Wang ChiShi Liu and WenHsing Lai Yi Yi Duomaichong Jifa

Yuyin Bianmaqi Wei Jiagou zhi Jishi Zhongwen Wenju Fan Yuyin Xitong A Realtime Imple

mentation of Chinese TexttoSp eech System Based on the Multipulse Excited Sp eech Co der

TL Technical Journal Vol No pp

Ronald Kaplan and Martin Kay Regular Mo dels of Phonological Rule Systems Computational

Linguistics Vol pp

J Kelly and L Gerstman An Articial Talker Driven from Phonetic Input Journal of the

Acoustical Society of America Vol Supplement p S

Dennis Klatt Review of TexttoSp eech Conversion for English Journal of the Acoustical Society

of America Vol pp

Kimmo Koskenniemi TwoLevel Morphology a General Computational Model for WordForm

Recognition and Production PhD thesis University of Helsinki Helsinki

R Kurzweil The Kurzweil Reading Machine ATechnical Overview In Science Technology and

the Handicapped M R Redden and W Schwandt editors pp Washington American

Asso ciation for the Advancement of Science Rep ort R

Samuel C Lee Shilin Xu and Bailing Guo Micro computergenerated Chinese Sp eech Computer

Processing of Chinese and Oriental Languages Vol No pp

LinShan Lee ChiuYu Tseng and Ming Ouhyoung The Synthesis Rules in a Chinese Textto

Sp eech System IEEE Transactions on Acoustics Speech and Signal Processing Vol No

pp

LinShan Lee ChiuYu Tseng and ChingJiang Hsieh Improved Tone Concatenation Rules in

a Formantbased Chinese TexttoSp eech System IEEE Transactions on Speech and Audio

Processing Vol No pp

S M Lei Digital Synthesis of Mandarin Speech Using its Special Characteristics Masters thesis

National Taiwan University

Mark Y Lib erman and Janet B Pierrehumb ert Intonational Invariance under Changes in Pitch

Range and Length In Language Sound Structure M Arono and R Oehrle editors pp

Cambridge Massachusetts MIT Press

Philip Lieb erman Intonation Perception and Language MIT Press Cambridge Massachusetts

Wen C Lin and TzienTsai Luo Synthesis of Mandarin by Means of Chinese Phonemes and

Phonemespairs JIFH Computer Processing of Chinese and Oriental Languages Vol No

pp

ChiShi Liu MingFei Tsai YenHuei Hsyu ChanChou Lu and ShiowMin Yu Yi Xianxing Yuce

Bianma Wei Hechengqi de Zhongwen Wenju Fan Yuyin Xitong A Chinese TexttoSp eech

Based on LPC Synthesizer TL Technical Journal Vol No pp

ChiShi Liu GwoHwa Ju WernJun Wang HsiaoChuan Wang and WenHsing Lai A New

Sp eech Synthesizer for TexttoSp eech System Using Multipulse Excitation with Pitch Predic

tor In International Conference on Computer Processing of Chinese and Oriental Languages

pp

W K Lo and P C Ching Phonebased Sp eech Synthesis with Neural Network and Articulatory

Control In Proceedings of ICSLP

Shinan Lu Shiqian Qi and Jialu Zhang Hecheng Yanyu Zirandu de Yanjiu Study on the Natu

ralness of Synthetic Sp eech Acta Acustica Vol No pp

Yuhang Mao Jinfa Huang and Guozhen Zhang A Chinese Mandarin Sp eech Output System

European Conference on Speech Technology Vol pp

Mehryar Mohri and Richard Sproat An Ecient Compiler for Weighted Rewrite Rules In th

Annual Meeting of the Association for Computational Linguistics Morristown pp

Asso ciation for Computational Linguistics

K Nakata and T Mitsuoka Phonemic Transformation and Control Asp ects of Synthesis of Con

nected Sp eech Journal of Radio Research Laboratories Vol pp

Ming OuhYoung ChinJiang Shie ChiuYu Tseng and LinShan Lee A Chinese TexttoSp eech

System Based up on a Syllable Concatenation Mo del In IEEE ICASSP pp

Ming OuhYoung A Chinese TexttoSpeech System Masters thesis National Taiwan University

Fernando Pereira and Michael Riley Sp eech Recognition by Comp osition of Weighted Finite Au

tomata CMPLG archive pap er

Fernando Pereira Michael Riley and Richard Sproat Weighted Rational Transductions and their

Application to Human Language Pro cessing In ARPA Workshop on Human Language Tech

nology pp Advanced Research Pro jects Agency March

Gordon E Peterson William SY Wang and Eva Sivertsen Segmentation Techniques in Sp eech

Synthesis Journal of the Acoustical Society of America Vol pp

Janet Pierrehumb ert and Mary Beckman Japanese Tone Structure The MIT Press Cambridge

Massachussetts

Janet Pierrehumb ert and David Talkin Lenition of h and Glottal Stop In Papers in Laboratory

Phonology II G Do cherty and R Ladd editors pp Cambridge University Press

Janet Pierrehumb ert The Phonology and Phonetics of English Intonation PhD thesis MIT

Ralph K Potter George A Kopp and Harriet C Green Visible Speech van Nostrand New York

S Ohman Word and Sentence Intonation A Quantitative Mo del Technical rep ort Department

of Sp eech Communication Royal Institute of Technology KTH

Delin Qin and N C Hu A New Metho d of Synthetic Chinese Sp eech on a Unlimited Vo cabulary

In Cybernetics and Systems Proceedings of the Ninth European Meeting on Cybernetics and

Systems Research pp

Hongmo Ren Linguistically Conditioned Duration Rules in a Timing Mo del for Chinese In UCLA

Working Papers in Phonetics Ian Maddieson editor pp UCLA

Jiong Shen Beijinghua Shengdiao de Yinyu he Yudiao Pitch Range and Intonation of the Tones

of Beijing Mandarin In Beijing Yuyin Shiyan Lu Acoustic Studies of Beijing Mandarin pp

Beijing University Press

XiaoNan Susan Shen The Prosody of Mandarin Chinese University of California Press

XiaoNan Susan Shen Tonal Coarticulation in Mandarin Journal of Phonetics Vol pp

Bo Shi A Chinese Sp eech Synthesisbyrule System In Speech Hearing and Language Work in

Progress UCL Number pp University College London

C T Shie A Chinese TexttoSpeech System Based on Formant Synthesis Masters thesis Na

tional Taiwan University

Chilin Shih and Benjamin Ao Duration Study for the ATT Mandarin TexttoSp eech System

In Proceedings of The Second ESCAIEEE Workshop on Speech Synthesis New Paltz NY

USA pp ESCAAAAIIEEE

Chilin Shih and Benjamin Ao Duration Study for the Bell Lab oratories Mandarin TexttoSp eech

System In Progress in Speech Synthesis Jan van Santen Richard Sproat Joseph Olive and

Julia Hirschb erg editors New York Springer

Chilin Shih and Mark Y Lib erman A Chinese Tone Synthesizer Technical rep ort ATT Bell

Lab oratories

Chilin Shih The Prosodic Domain of Tone Sandhi in Chinese PhD thesis University of California

San Diego

Chilin Shih The Phonetics of the Chinese Tonal System Technical rep ort ATT Bell Lab orato

ries

Chilin Shih Tone and Intonation in Mandarin In Working Papers of the Cornel l Phonetics

Laboratory Number Stress Tone and Intonation pp Cornell University

Chilin Shih Study of Vowel Variations for a Mandarin Sp eech Synthesizer In EUROSPEECH

pp

Richard Sproat and Michael Riley Compilation of Weighted FiniteState Transducers from Decision

Trees In th Annual Meeting of the Association for Computational Linguistics Morristown

pp Asso ciation for Computational Linguistics

Richard Sproat Chilin Shih William Gale and Nancy Chang A Sto chastic FiniteState Word

Segmentation Algorithm for Chinese Computational Linguistics Vol No

Richard Sproat A FiniteState Architecture for Tokenization and GraphemetoPhoneme Conver

sion for Multilingual Text Analysis In Proceedings of the EACL SIGDAT Workshop Susan

Armstrong and Evelyne Tzoukermann editors Dublin Ireland pp Asso ciation for

Computational Linguistics

Richard Sproat Multilingual Text Analysis for TexttoSp eech Synthesis In Proceedings of the

ECAI Workshop on Extended Finite State Models of Language Budap est Hungary Eu

rop ean Conference on Articial Intelligence

J Q Stewart An Electrical Analog of the Vo cal Tract Nature Vol pp

Ching Y Suen Computer Synthesis of Mandarin In IEEE ICASSP pp

ChiuYu Tseng An Acoustic Phonetic Study on Tones in Mandarin Chinese PhD thesis Brown

University

Jan P H van Santen Assignment of Segmental Duration in TexttoSp eech Synthesis Computer

Speech and Language Vol No pp

Michelle Q Wang and Julia Hirschb erg Automatic Classication of Intonational Phrase Bound

aries Computer Speech and Language Vol pp

HsiaoChuan Wang and TaiHwei Hwang An Initial Study on the Sp eech Synthesis of Taiwanese

Hokkian Computer Processing of Chinese and Oriental Languages Vol No pp

WernJun Wang WenHsing Lai GwoHwa Ju ChunHsiao Lee and ChiShi Liu Knowledgebased

Proso dic Rule Generation in Mandarin Synthesizer In Workshop Notes of IEEE International

Workshop on Intel ligent Signal Processing and Communication Systems pp

Reiner WilhelmsTricarico and Joseph Perkell Biomechanical and Physiologically Based Sp eech

Mo deling In Progress in Speech Synthesis Jan van Santen Richard Sproat Joseph Olive and

Julia Hirschb erg editors pp New York Springer

C H Wu C H Chen and S C Juang Yi CELP Wei Jichu zhi Wenju Fan Yuyin Zhong Yunlu

Xunxi zhi Chansheng Yu Tiaozheng An CELPbased Proso dic Information Mo dication and

Generation of Mandarin TexttoSp eech In ROCLING VIII pp

Jun Xu and Baozong Yuan New Generation of Chinese TexttoSp eech System In Proceedings

TENCON IEEE Region Conference on Computer Communication Control and Power

Engineering pp

Yi Xu Contextual Tonal Variation in Mandarin Chinese PhD thesis The University of Connecti

cut

Shunan Yang and Yi Xu An Acousticphonetic Oriented System for Synthesizing Chinese Speech

Communication Vol pp

David Yarowsky Homograph Disambiguation in TexttoSp eech Synthesis In Progress in Speech

Synthesis Jan van Santen Richard Sproat Joseph Olive and Julia Hirschb erg editors pp

New York Springer

Jialu Zhang Shiqian Qi and Ge Yu Assessment Metho ds of Sp eech Synthesis Systems for Chinese

In Proceedings of the XIII International Congress of Phonetic Sciences pp

Jialu Zhang Acoustic Parameters and Phonological Rules of a TexttoSp eech System for Chinese

In IEEE ICASSP pp

Jialu Zhang Hanyu Yuyin Xingneng Pingce Jieguo Results of Chinese Sp eech Synthesizer Evalu

tion China Computerworld pp March

Qiaofeng Zhou and Lianhong Cai Simulation of Stress in Chinese TTS System Unpublished

manuscript

KeCheng Zhou and Trevor Cole A Chip Designed for Chinese TexttoSp eech Synthesis Journal

of Electrical and Electronics Engineering Australia Vol No pp

KeCheng Zhou A Chinese TexttoSp eech Synthesis System Using the Logarithmic Magnitude

Filter Journal of Electrical and Electronics Engineering Australia Vol No pp