<<

. Acoust. Soc. Jpn. () 20, 3 (1999)

Machine readable system for Chinese dialects spoken in Taiwan

Chiu-yu Tseng and Fu-chiang Chou

Institute of Linguistics (Preparatory Office), Academia Sinica, Nankang Taipei 115, Taiwan, Republic of China

(Received 4 August 1998)

Since existing ASCII versions of phonetic transcription systems appear to aim at transcribing European languages only, they prove to be insufficient to accommodate based tonal languages such as Chinese. An ASCII encoding of phonetic transcription system for speech database of Chinese was designed. Major characteristics of the proposed design are : 1. Though based on the principles of the International Phonetic Alphabet (IPA), the current design includes two levels of transcription, namely, segmental and prosodic. The conse- quence is a more elaborate system than the IPA or its equivalent. 2. The proposed system specifically aims to transcribe three major Chinese dialects spoken in Taiwan, namely, Mandarin, Taiwanese and Hakka. The consequence is a more language-dependent system rather than a general system as the IPA. Keywords : Chinese, , Transcription, PACS number : 43. 72.-

display the full range of IPA symbols, or one may 1. INTRODUCTION design' an alphabetic and/or numeric representation A well-transcribed speech database is crucial to of the IPA symbols instead. There exist many both basic speech research and the development of alphabetic equivalent systems for speech databases related speech technologies. When constructing a such as ARPABET and KLATTBET (for American transcription system, there are many levels of tran- English), Edinburgh' Machine Readable Phonetic scription that may be considered and included in the Alphabet (for. British English), SAMPA3,4) (for system. For example, orthographic transcription, European languages) and OGI's WORLDBET (for citation-phonemic representation, broad phonetic all languages).5) Since we set the prerequisite of transcription, narrows phonetic transcription, our system for three specific Chinese dialects, name- acoustic-phonetic transcription, prosodic transcrip- ly, the three major Chinese dialects spoken in Ta- tion and even representation of non-linguistic but iwan, it seems feasible to avoid possible notational significant phenomena.1) The nature of our complexity by keeping all symbols distinct across proposed system is an ASCII version of broad the intended target languages as did OGI's WORLD phonetic transcription system with additional BET. The other reason is that each phonetic sym- prosodic labels. bol in WORLDBET is represented as two ASCII The citation-phonemic level of transcription con- characters. In order to maintain the two-character tains the output string derived from the format, an extra space is required after those sym- orthographic form (by lexical access, by letter-to- bols that have only one character. While this sound rules, or both).2) There are various alterna- process may be convenient to the computer, it could tives to represent with symbols. One be somewhat less convenient to human transcribers. may develop a platform that possesses the facility to Therefore, we chose to adopt the SAMPA system

215 J. Acoust. Soc. Jpn. (E) 20, 3 (1999)

originally designed for major European languages, events, such as F0 peaks. The former theoretical and adjusted it to a language-specific set of alpha- orientation, i.e., the use of boundaries, resulted in betic phoneme symbols for the target Chinese dia- approaches that used categories lects spoken in Taiwan. proposed by Ref. 6) to process suprasegmental infor- Though based on the IPA, the proposed system mation, such as intonation phrase, phonological chose not to provide symbols to represent informa- phrase, phonological word, , and syllable. tion as would the of the IPA. The result Alternatively, it could mark the more traditional is a less sophisticated system for segmental symbols ; units of "minor tone-unit" and "major tone-unit," something we define as a broad phonetic system. as in the MARSEC database.7 The latter theoreti- That is, though the system aimed at enabling a cal orientation, i.e., the occurrence of isolated transcriber to represent tokens of speech from our prosodic events, resulted in the marking of these informants, the transcription is nonetheless occurrences of high and low tones of various kinds. restricted to phonetic categories rather than labeling The recently formulated ToBI transcription system8) finer phonetic details. Consequently, the tran- is the most well-known system of this kind, and scriber would somewhat be forced to label the apparently works for non-tonal languages such as speech tokens with a set of phonetic symbols avail- English and Japanese, where the prosodic units are able without the tools to represent IPA annotated at the "break index" level rather than the information. Our rationale is that since ortho- "tone" level. We chose to adopt a ToBI-like frame- graphic text is readily available for our working work for the prosodic transcription of our system, speech data, we have all the information necessary but needed to modify it to suit our target languages. to determine what the intended speech should be. Needless to say, the modification is somewhat elabo- The task of our transcriber is to listen to the col- rate due to the intrinsic differences between intona- lected speech data and make direct reference between tion languages and tonal languages. the intended citation string from text to the actual The paper is organized as followed Section 2 utterance produced. In some sense, the transcrip- describes the reported broad phonetic transcrip- tion can be seen as part transcription and part tion. ; Section 3 the prosodic transcription. The mapping to text. Moreover, the designed broad experiments are described in Section 4. phonetic representation at this stage also does not 2. BROAD PHONETIC possess the capability to label common phenomena of running speech, such as place assimilation, conso- TRANSCRIPTION nant deletion and reduction. We are fully To establish a broad phonetic transcription sys- aware that once running speech is incorporated into tem for a specific language, one needs to first define our speech database, we will need to expand the the phoneme set and corresponding symbols. The system to accommodate running speech transcrip- most likely choice would be the IPA symbols, which tion in the future. We will also need to include has long been established as the system that is tonal notations in our system in the future since all sufficiently equipped to represent sound systems of three Chinese dialects possess lexical tones at the known languages of the world. Unfortunately, syllabic level. there is no uniformly implemented coding standard Leaving tonal representations for future work, we for the IPA symbols until efforts such as ISO 10646/ chose to focus our attention to transcribe and label is accepted internationally. The reported our speech database at the prosodic level. Since system is designed to function with the SAMPA there are less clear acoustic correlates to prosodic guidelines, but tailored to suit the target languages phenomena, this level is less straightforward than mentioned. phonemic annotation, and the units segmented un- SAMPA is a phonetic transcription coding that avoidably tend to tilt towards the particular chosen uses normal ASCII characters as replacements for theoretical framework. A basic distinction may be IPA symbols. Mapping of the IPA symbols onto drawn between a prosodic labeling system that ASCII codes was set up, using up to range 37-126, annotates the boundaries of units (analogous to the 7-bit printable ASCII characters only. Originally method used in phonemic annotation) and a system designed as a phonemic/broad-phonetic transcrip- that annotates the occurrence of isolated prosodic tion system for European languages, both SAMPA

216 . TSENG and . CHOU : MACHINE READABLE PHONETIC TRANSCRIPTION FOR CHINESE "p" notation to represent the voicelessbilabial pair and subsequently proposed -SAMPA (Extended SAMPA) represent an international collaborative in Mandarin. Voicing is therefore not represented. basis for a standard machine-readable encoding of All other notations for the remaining obstruents phonetic notations. Guidelines for coding (map- follow the same principle. This somewhat arbi- ping) of languages to be transcribed are also avail- trary use of notation is maintained when the system able. The reported system was designed by follow- is expanded to include Taiwanese and Hakka where ing these guidelines, especially maintaining the fea- voicing contrasts also exist. That is, though the ture that the resulted transcription is uniquely pars- voiced counterparts do exist for bilabials and velars able as specified in SAMPA. As SAMPA in Taiwanese and Hakka, making it a three-way specified, we also observed the IPA transcription contrast instead, "p" and "" still represent the convention of leaving no space between a string of voiceless bilabial pair which differs in aspiration successive symbols unless syllable boundaries occur. while an upper case "B" is used to represent the However, we note that the basic form of SAMPA voiced bilabial. The same rule also applies to velar "" and postalveolar "DZ" was designed as a phonemic or broad-phonetic tran- . scription system at the segmental level. That leaves Another kind of contrast that exists in the tonal and prosodic notations inadequate for our and in Mandarin is "retroflex". The purpose. We therefore designed a parallel set of notation "'" is used to represent "retroflex". For prosodic notations on separate levels of representa- the voiceless alveolo-palatal which in IPA tion, which we believe enhances the function and is a curly-tailed c, we propose it to be transcribed as "\" ; its aspirated counterpart "s\" capacity significant for the proposed system. More . This use of details of the design of the prosodic aspects of the notation "\" follows the principles of extension in proposed system will be discussed in later sections of SAMPA. However, the original contrast of voic- this paper. As we intend to expand the system the ing between "s" and "z" is replaced to represent future, we also hope that our current system will be aspiration in our system. Again, this principle is adopted as the standardized international ASCII constant throughout our system. As for the only system for the three Chinese target languages. voiced in Mandarin, namely, the voiced The principles of SAMPA are quite simple all retroflex fricative, the upper case "i" is used to IPA symbols that coincide with lower-case letters of denote voicing. This is also constant throughout a regular keyboard remain the our system. Table 1 shows the mapping of conso- same ; all other symbols are re-coded within the nants of all three Chinese dialects. ASCII range 37-126. In the process of developing : our system, we began by considering Mandarin at There are eight vowels in Mandarin. Seven of first, then expanded it by adding extra symbols for them already exist in SAMPA. The only not-so- Taiwanese and Hakka. The mapping is described common one is an apical vowel. The notation "U" as follows : is used to represent the apical vowel. This vowel : occurs only after 3 fricatives (s, Z) and 4 We began our design with Mandarin, the official affricates (dz,dz' ts, ts'), and does not occur with Chinese spoken language, and then expanded it to other consonants. When the consonant is include Taiwanese and Hakka. As a result, the retroflexed, this vowel is also retroflexed and is order of target languages affects the outcome of the represented by adding retroflex after the apical system. The consonants of Mandarin Chinese are vowel as in "C". As for Taiwanese and Hakka, traditionally considered to comprise 18 obstruents three more notations were added. The notation "~" represents of the preceding vowel ; (6 , 6 affricates and 6 fricatives) and 3 sono- rants (2 nasals, 1 liquid). Unlike English in which the notation "?" ; and the notation " }" the obstruents are classified in pairs differing in both the preceding stop consonants unreleased as in "p }, voicing and aspiration, all Mandarin obstruents are }, }." Other features in Taiwanese include the voiceless differing only in aspiration, except for one syllabic bilabials and velar nasals. The syl- voiced retroflex fricative. In order to accommodate labification is represented by adding a "^" after the SAMPA principle to adapt the keyboard and to these as in "^, ^" respectively. use one symbol for minimal pairs, we adopt the "b" Table 2 shows the mapping of vowels of all three

217 J. Acoust. Soc. Jpn. (E) 20, 3 (1999)

Table 1 The mapping between IPA and SAMPA-T Table 2 The mapping between IPA and SAMPA-T for consonants. (T mean character in Taiwanese, for vowels and some special endings. (T mean means Hakka) character in Taiwanese, H means Hakka. SE1: finals with a vowel-nasalizing ending, SE, : finals with a stop ending, SE, : final with a glottal stop ending)

sub-unit has a unique code (k, Mc, Fc) and the code of a syllable (Sylc) can be calculated according to this formula :

Chinese dialects. With this scheme, each syllable can be represented The system proposed above can be used for two by an unsigned short integer (0-32, 767), a basic kinds of labeling. The basic form is to form a unit in computer. This coding is straightforward citation-phonemic string in a syllabic unit to repre- and quite self-explanatory. The contextual infor- sent the pronunciation of each syllable. The other mation can also be extracted from the code directly. form is to use the symbols for the broad phonetic This is very useful for the programming in speech transcription in segmental labeling. Additionally, research. Table 3 is the coding for each sub-unit. we designed a coding system for used in For tonal representations, we adopt the Chinese these dialects. Each syllable is divided into three conventional use of numerical system in the sub-units, namely, initial, medial and final. Each phonemic level which may appear to be arbitrary for

218 C. TSENG and F. CHOU : MACHINE READABLE PHONETIC TRANSCRIPTION FOR CHINESE

Table 3 Coding for initial sub-units, medial sub- offered a slightly more elaborate scale of break units and final sub-units. indices from 0 to 5. As a result, the following six boundaries were proposed instead, i.e., reduced syl- labic boundary (0), normal syllabic boundary (1), minor-phrase boundary (2), major-phrase boundary (3), breath group boundary (4), and prosodic group boundary (5). The speech segments between the break indices then form a set of five units, namely, prosodic units, minor prosodic phrase, major prosodic phrase, breath group and prosodic group.9) The most noted difference between our system and ToBI lies in the tonal and prosodic tiers. ToBI was originally designed for English, an intonation lan- guage. It consists of labels for distinctive pitch events, transcribed as a sequence of high (H) and low () tones marked with diacritics to indicate their intonation functions. Whereas when dealing with tonal languages, the interaction between lexical tone and intonation, both of which involve deliber- ate manipulation of fundamental frequency patterns and therefore both of which are pitch events, is not only more complex but also not well understood, users who are unfamiliar with tones of Chinese. yet. It is thus more difficult to construct notations But until more dialects are included, the numerical like the tone tier in ToBI for these languages. We system seems most easily acceptable and therefore proposed to label the speech data in more detail at should be adequate for the time being. the prosodic domain while leaving the tonal aspects for the time being. Our reason was again the fact 3. PROSODIC TRANSCRIPTION that text for our speech data was readily available The prosodic transcription of our system is re- for reference. When the break indices of an utter- presented on a separate level following ToBI-like ance is labeled, the volume, rate, pitch level and notations. ToBI (for Tones and Breaks Indices) is pitch range of each prosodic unit can be calculated a system for transcribing intonation patterns and and labeled in an automatic way according to the other aspects of the prosody of English utterances. physical signals. The perceived changes of above It was devised by a group of speech scientists from parameters can be manually added on when the various disciplines (electrical engineering, psychol- transcriber has noted the significant changes in the ogy, linguistics, etc.) in pursuit of diverse research utterances. Another subjective added-on parame- purposes. Various technological goals that includ- ter is the level of emphasis. This level of informa- ed sharing prosodically transcribed databases across tion is perceptual in nature and may or may not research sites also desired a commonly agreed-upon correspond directly to the physical signals. A tran- standard of prosodic elements. The tone and scriber is asked to decide the emphasis level of the break-index tiers represent the core prosodic part of speech tokens based on subjective perception. the ToBI system. The difference in the break-index When the contents of the prosodic transcription is tier between ToBI and the prosodic level of our decided, a more standardized method of representa- system is rather little. In the ToBI system, the tion would be the next feasible step. We believe the break-index tier marks the prosodic grouping within Java Speech Markup Languages (JSML)10) could be an utterance by labeling the end of each word for its a good choice for this purpose. There are two subjective strength in association with the following elements in JSML, namely, empty elements and word on a scale from 0 (for the strongest perceived container elements. An empty element has only conjoining) to 4 (for the most disjointed bound- one tag and is suitable for the representation of aries). Our system followed the same rationale but break indices. A container element has a balanced

219 J. Acoust. Soc. Jpn. (E) 20, 3 (1999)

Table 4 The meaning of levels for the prosodic tags.

start tag and end tag and is suitable for the represen- to discuss the standard used for transcription. tation of the other factors. These tags are inserted After several such sessions, a set of one hundred into the phonemic representation of the syllables sentences was labeled by each transcriber for com- sequence to form the prosodic transcription. For parison. The comparison was focused on the con- example : sistency of break indices. The transcription tool is dz•_in1 tien1 tien1 ts•_i4 h @ n3 hau3 The proposed segmental system appears sufficient The example shows that "dz•_inl tienl" is a minor for transcribing speech variations. Major variation prosodic phrase and "tienl ts•_i4" is emphasized at a of actual speech comes from the fact that a large moderate level. When the break index is "1"(nor- number of Mandarin speakers in Taiwan are bilin- mal syllabic boundary), it will not be marked to gual. They are native speakers of Taiwanese who reduce the number of tags used. In other words, acquired Mandarin to the point that Mandarin normal syllabic boundary will be unmarked. This could be their stronger language. The carry-over Qf unmarked convention is held constant for all other their Taiwanese to Mandarin could be prosodic parameters whenever the perceived level is quite pronounced ; their Mandarin speech 'contains the normal one. The meaning of the levels for each segmental and suprasegmental variations that Man- marker are listed in Table 4. darin phonology would not accommodate. One of our aims was to be able to transcribe these varia- 4. EXPERIMENTS tions that definitely occurred in our speech data and The proposed system is evaluated on the basis of subsequently collapsing them in the development of a Mandarin speech corpus that is designed to be our labeling tools for Mandarin. Our system was phonetically and prosodically rich. There are designed to represent the presence and absence of the about 600 short paragraphs in the corpus. To test features that correspond to these variations. For the segmental labeling system, the major task was to example, there are three contrastive pairs of fricative verify the capability of transcription of speech varia- in Mandarin that are distinguished by the presence tions at the segmental level. To test the prosodic or absence of retroflex. But for native speakers of labeling system, the major task was to define a the Taiwanese where no retroflex occur, retroflex standard for the transcriber and at the same time usually does not occur when they speak Mandarin. maintain the consistency between the transcribers. This phenomenon can be easily distinguished by the Another important issue that should be included in presence of symbol " " which represents the feature the investigation was the convenience factor for both retroflex, for examples, "dz" vs. "dz'", "ts" vs. "ts'", humans and the computer. The corpus was labeled and "s" vs. "s.'". Another similar example is the by two transcribers. At first, the two transcribers phenomenon of rounding. Native speakers of Ta- labeled a small set of identical speech data in order iwanese tend to pronounce the "uo" or "ou" combi-

220 C. TSENG and F. CHOU : MACHINE READABLE PHONETIC TRANSCRIPTION FOR CHINESE

Fig. 1 An example of the segmental and prosodic transcription. nation with less rounding on the "u" part. To major function of the break indices is to the transcribe this variation would be to drop the "u" in speech flow into smaller units in the hope to form a the transcription. For examples, intended "uo" hierarchical structure of prosody. We proposed a could be realized as "", intended "iou" realized as top-down spotting procedure for the labeling of "io" . While conventional syllable based coding break indices. At first we spotted all the breath system uses all but intended syllables as coding groups (B4) in an entire paragraph, then search the units, in the case of Mandarin a total of 408 syllables prosodic groups (B5) among these boundaries. only, such realized variations in actual speech could The second step was to spot the prosodic changes not be accounted for. Our proposed transcription within a breath group. The change that ac- system makes it possible to include actual speech companied a short pause was marked as B3 ; the variations to be accounted for in the recognition others were marked as B2. The last step was to spot process. For example, the syllable "sual" is not a the reduced syllabic boundaries (BO) that ac- legitimate syllable in Mandarin phonology, but companied the reduced syllables. The unmarked native speakers of Taiwanese pronounce it as an boundaries were normal syllabic boundaries (B 1). un-retroflexed substitution of "s'ual " (brush). The details are described below: That is, the syllable "sual" could in fact be an 1. B4 and B5 : actual realization of an intended "s'ual" (brush). B4 is used to indicate the boundary of breath Our coding scheme provided different codes for group that was originally proposed by Lieberman.11) these variations, for example 18360 for "s'ual" and This boundary is a physiological effect that is 17360 for "sual." These codes are internal repre- caused by breathing exemplified by decrease at the sentation of Mandarin syllables in the computer and levels of pitch and energy. To detect the reset of therefore will not cause any extra burden on the pitch and energy is no difficult task. However, it is human users. This ability to represent phonetic less distinct to detect B5. In writing, a paragraph variation in actual speech is very useful for speech can be identified not by but by a distinct recognition and synthesis. For example, in order format that involves specific spacing at the begin- for the computer to better recognize Mandarin ning, leaving off the remaining of a line, and begin- speech, we are able to collapse the actual realization ning a new line with specific spacing again. The of speech with their intended target to boost the same phenomenon occurs in reading out para- robustness of our recognition program. graphs. Our question is : what would the cue for For the prosodic transcription, the standard for the marking of such a boundary be? In our obser- the labeling of break indices was evaluated. The vation, it is marked by the lengthening of the pause

221 J. Acoust. Soc. Jpn. (E) 20, 3 (1999) between the two breath groups. In our definition, In our design, we also intend to spot the reduced this "paragraphing" in the reading out process is syllables in contraction, a phenomenon that occurs termed prosodic group, a unit in speaking that is frequently in spontaneous speech. (However, our equal or larger than a breath group. In our experi- transcription showed that the collected read speech ments, the transcribers were asked to spot the corpus almost does not contain such examples. prosodic group according to their perception not by One reason could be the somewhat careful speech measuring the duration of pause. Our purpose is to style of our informants.) find the correlation between perception and the Table 5 is the comparison of the break indices prosodic parameters. labeled by two transcribers. Statistical analyses of 2. B2, B3 : the pauses are shown in Table 6. The left panel of After marking the B4 and B5, a paragraph was Table 5 represents independent labeling results of segmented into many breath groups. The tran- the proposed criteria ; the right panel represents the scribers were asked to detect irregular boundaries labeling results of the same set of data after the within a breath group. The perceived boundaries transcribers compared notes of criteria used. We may be caused by sudden changes in pitch, duration find while consistency between transcribers increases and energy, or it may be caused by the insertion of after discussion, the types of less identifiable cate- short pause. The boundaries that are perceived by gories still maintains. Most of the inconsistency the pause are marked as B3, and the others are occurred in B1 vs. B2 and B4 vs. B5. A total of 204 marked as B2. boundaries were labeled as B1 by transcriber A, but 3. B0: labeled as B2 by transcriber B. Furthermore, 48

Table 5 The break indices labeled by two transcribers (A and B) before (the left) and after (the right) the exchange of notes for labeling.

Table 6 The Mean and Std (in ms) for the pause of different break indices before (the left) and after (the right) the exchange of notes for labeling.

222 C. TSENG and F. CHOU : MACHINE READABLE PHONETIC TRANSCRIPTION FOR CHINESE boundaries were labeled as B5 by transcriber A, but Extension of SAMPA, Internet WWW page, at labeled as B4 by transcriber B. This could mean URL :

223