
Development of a Corpus for Southern Thai Dialect Speech Recognition: Design and Text Preparation

Sittichok Aunkaew†, Montri Karnjanadecha‡, Wutiwiwatchai*
†‡Department of Computer Engineering, Faculty of Engineering, Prince of Songkla University, Hatyai, 90112
*National Electronics and Computer Technology Center (NECTEC), 112 Phahonyothin Rd., Klong Nueng, Klong Luang, Pathumthani 12120, Thailand

Abstract

The Southern Thai dialect differs from standard Thai enough that communication between users of the dialects can be problematic. This paper describes our progress on the development of a corpus, and offers resources, for the Southern Thai dialect. The existing LOTUS corpus for standard Thai speech recognition, developed by NECTEC, is unable to fulfill our needs since the Southern Thai dialect differs from standard Thai in many ways, including pronunciation, lexicon, and grammar. Thus, our aim is to design and prepare transcriptions of recorded read speech and broadcast news to build a Southern Thai Dialect Continuous Speech Recognition corpus.

Keywords: Southern Thai Dialect, Speech Recognition, Speech Corpus, Language Resources

1 Introduction

Southern Thai Dialect (STD) or Dambro (Thai pronunciation: [pʰaːsǎː tʰajtâːj]) is a member of a subgroup of the Tai group of the Tai–Kadai language family [1][2]. Figure 1 shows the family tree of the Tai-Kadai language. STD is spoken in the 14 southern provinces of Thailand and also in Bang Saphan. A small number of Tai speakers can also be found in some border states [3][4][5]. The map in Figure 2 shows the distribution of the Tai-Kadai language. The written form of STD utilizes the standard Thai alphabet, which is mainly used by the Central Thai Dialect (CTD). There are three main differences between CTD and STD: 1) pronunciation, 2) the lexicon, and 3) grammar [4][6]. These differences are significant.

Figure 1. The Southern Thai Dialect in the Tai-Kadai language family (adapted from [1])

Figure 2. Distribution of the Tai-Kadai language family [5]

In a broad sense, STD is the native dialect for the South of Thailand, but there are some local differences in the dialect. For example, people in Phuket and Phang–Nga speak somewhat differently from those in Songkhla. The Central Southern Thai Dialect (CSTD), which is commonly known as the Nakorn Si Thammarat dialect [4][7], is spoken in Chumphon, Krabi, Nakorn Si Thammarat, Narathiwat, Pattani, Phang–Nga, Phatthalung, Phuket, Ranong, Satun, Songkhla, Surat Thani, Trang, and Yala. Our corpus is targeted at CSTD, with the vocabulary and dictionary obtained from the studies described in [8][9].

Building an automatic speech recognition system for STD is challenging due to the existence of different regional variants of Thai. One important resource is the Large vOcabulary Thai continUous Speech Recognition (LOTUS) corpus developed by the National Electronics and Computer Technology Center (NECTEC) [10]. However, the LOTUS corpus was designed for CTD speech recognition, so it cannot be fully utilized for STD. Also, there are no publicly available resources for building a complete STD speech recognition system, which prompted our work.

Three resources are necessary for building a STD speech recognition system. Section 2 examines the STD characteristics, section 3 discusses sentence selection, and section 4 describes the resulting speech corpus consisting of transcribed read speech and broadcast news recordings. Conclusions are drawn in section 5.

2 STD Characteristics

When STD is written with the standard Thai alphabet, it consists of 24 vowels: 18 single vowels and 6 diphthongs, shown in Table 1. There are seven tones (see Figure 3), and 27 single consonants, with 18 units commonly appearing in all dialects (*) and 9 appearing only in some minor local dialects, as shown in Table 2. There are 16 clustered consonants made up of 6 commonly appearing units (*) and 10 units appearing only in some minor local dialects, as shown in Table 3 [4][7][8][9].

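Table 1. Comparison of vowels between IPA and the Thai language

| Tongue height | Notation | Front (short) | Front (long) | Central (short) | Central (long) | Back (short) | Back (long) |
|---|---|---|---|---|---|---|---|
| Close | IPA | i | i: | ɯ | ɯ: | u | u: |
| Close | Thai | i (อิ) | i: (อี) | i (อึ) | i: (อื) | u (อุ) | u: (อู) |
| Mid | IPA | e | e: | ɤ | ɤ: | o | o: |
| Mid | Thai | e (เอะ) | e: (เอ) | ə (เออะ) | ə: (เออ) | o (โอะ) | o: (โอ) |
| Open | IPA | ɛ | ɛ: | a | a: | ɔ | ɔ: |
| Open | Thai | ɛ (แอะ) | ɛ: (แอ) | a (อะ) | a: (อา) | ɔ (เอาะ) | ɔ: (ออ) |
| Diphthongs | IPA | iaʔ | ia | ɯaʔ | ɯa | uaʔ | ua |
| Diphthongs | Thai | ia (เอียะ) | iia (เอีย) | va (เอือะ) | vva (เอือ) | ua (อัวะ) | uua (อัว) |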

/1/ high–rising–falling [453], such as […:¹]; high–rising [45], such as [khat¹].
/2/ high–level (falling at the end) [443], such as [kha:²], [kha:t²].
/3/ mid–rising–falling [342], such as [pi:³]; mid–rising [34], such as [pat³].
/4/ mid–level [33], such as [pa:⁴], [pa:t⁴].
/5/ low–rising–falling [231], such as [kha:⁵].
/6/ low–rising [23], such as [kha:⁶], [kha:t⁶].
/7/ low–falling [21], such as [kha:⁷], [khat⁷].

Figure 3. Seven tones in the Central Southern Thai dialect (adapted from [7], [8] and [9])

Table 2. Comparison of single consonants between IPA, CTD, and STD

| IPA | CTD | STD | Example |
|---|---|---|---|
| b | b [บ] | b [บ]* | [baj³] = [leaf] |
| d | d [ฎ, ด] | d [ฎ, ด]* | [daj³] = [ladder] |
| f | f [ฝ, ฟ] | f [ฝ, ฟ] | [faj¹] = [mole] |
| h | h [ห, ฮ] | h [ห, ฮ]* | [haj¹] = [jar] |
| j | j [ญ, ย] | j [ญ, ย]* | […:n⁵] = [flabby] |
| k | k [ก] | k [ก]* | [kaj⁴] = [chicken] |
| kh | kh [ข, ค, ฆ] | kh [ข, ค, ฆ]* | [khom⁵] = [sharp] |
| l | l [ล, ฬ] | l [ล, ฬ]* | [la:n⁵] = [yard] |
| m | m [ม] | m [ม]* | [ma:⁵] = [come] |
| n | n [ณ, น] | n [ณ, น]* | [na:⁵] = [field] |
| ŋ | ŋ [ง, หง] | ŋ [ง, หง] | [ŋa:⁵] = [sesame] |
| ɲ | – | ñ [หญ, ย, ญ] (nasal) | [ñaj¹] = [big] |
| p | p [ป] | p [ป]* | [paj³] = [go] |
| ph | ph [ผ, พ, ภ] | ph [ผ, พ, ภ]* | [phaŋ⁵] = [break] |
| r | r [ร] | r [ร] | [ra:n⁵] = [prune/cut off] |
| s | s [ซ, ศ, ษ, ส] | s [ซ, ศ, ษ, ส]* | [saj¹] = [wear] |
| t | t [ฏ, ต] | t [ฏ, ต]* | [taj³] = [climb] |
| th | th [ฐ, ฑ, ฒ, ถ, ท, ธ] | th [ฐ, ฑ, ฒ, ถ, ท, ธ]* | [tha:ʔ⁶] = [pan] |
| tɕ | c [จ] | c [จ]* | [cɔ:k⁴] = [glass] |
| tɕʰ | ch [ฉ, ฌ, ช] | ch [ฉ, ฌ, ช]* | [chom⁵] = [praise] |
| w | w [ว] | w [ว]* | [wa:n⁵] = [bottom] |
| ʔ | ʔ [อ] | ʔ [อ]* | [ʔɔ:k⁴] = [out] |
| – | – | bh [บ] (aspirated) | [bha⁷ra⁶] = [Budu] |
| – | – | dh [ด] (aspirated plosive) | [dhaŋ⁵] = [nose bridge] |
| – | – | ʔ˜ [อ] (nasal) | [ʔ˜ɔ:k⁴] = [White-breasted Sea Eagle] |
| – | – | ʔj [อย] | [ʔjɔ:k³] = [sit] |
| – | – | g [ฆ] | [gom⁵] = [pour the water from the boiled rice] |

Table 3. Comparison of clustered consonants between CTD and STD

| CTD | STD | CTD | STD | CTD | STD | CTD | STD | STD |
|---|---|---|---|---|---|---|---|---|
| pr [ปร] | pr [ปร] | tr [ตร] | tr [ตร] | kw [กว] | kw [กว]* | – | mr [มร] | ʂtr [สต] |
| pl [ปล] | pl [ปล]* | thr [ทร] | – | khr [คร] | khr [คร] | – | ml [มล] | |
| phr [พร] | phr [พร] | kr [กร] | kr [กร] | khl [คล] | khl [คล]* | – | br [บร] | |
| phl [พล] | phl [พล]* | kl [กล] | kl [กล]* | khw [ขว, คว] | khw [ขว, คว, ฟ, ฝ]* | – | bl [บล] | |

The syllable is the basic pronunciation unit in STD, with each one containing at least a single consonant (C), a vowel (V), and a tone (T). The STD inherent syllable types are CVVᵀ, CVCᵀ, CVVCᵀ, CCVCᵀ, CCVVᵀ, and CCVVCᵀ [6][7]. Examples include “เล /le:⁵/” (which means “sea”), “แล็บ /lɛ:p⁶/” (“nail”), “โรงบาล /ro:ng⁵ ba:n³/” (“hospital”), and “โพรก /phro:k⁶/” (“tomorrow”).
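As a minimal sketch of how the syllable types above could be checked mechanically, the following Python snippet derives the consonant/vowel shape of a syllable from its parts and tests it against the types listed in the text. The function names, the representation of a syllable as (initial, vowel length, final), and the `CLUSTERS` set (an illustrative subset of the STD clustered consonants in Table 3, written in Latin symbols) are assumptions made for illustration and are not part of the corpus tools.

```python
from typing import Optional

# Syllable types listed in the text (the tone T is implicit in every type).
LISTED_TYPES = {"CVV", "CVC", "CVVC", "CCVC", "CCVV", "CCVVC"}

# Assumed, illustrative subset of the STD clustered initials in Table 3.
CLUSTERS = {"pr", "pl", "phr", "phl", "tr", "kr", "kl", "kw",
            "khr", "khl", "khw", "mr", "ml", "br", "bl"}


def syllable_shape(initial: str, long_vowel: bool, final: Optional[str]) -> str:
    """Return a shape such as 'CVV' or 'CCVVC' (tone omitted)."""
    onset = "CC" if initial in CLUSTERS else "C"
    nucleus = "VV" if long_vowel else "V"
    coda = "C" if final else ""
    return onset + nucleus + coda


def is_listed_type(shape: str) -> bool:
    """True if the shape is one of the syllable types listed in the text."""
    return shape in LISTED_TYPES
```

Under these assumptions, /le:⁵/ has the shape CVV and /phro:k⁶/ has the shape CCVVC, both of which appear among the listed types.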

3 Sentence Selection

The maturity and scientific basis of a speech corpus is of great importance in speech technology [10]. It must provide all of the linguistic features and prosody for the language, and meet the design requirements of speech recognition systems. A high-quality, large-scale, and diversified corpus will significantly promote the development and application of speech recognition technology.

Corpuses used for speech recognition must include training and testing sets for the acoustic model. Building the corpus will include corpus design, text and sentence selection, speaker selection, recording device and room preparation, recording, and speech labeling.

CSTD text is written like Thai, but some words are homographs rather than homophones, which is inconvenient during transcription. For easier processing of dialect text, we describe STD characters, vowels, and tones with Latin characters, in the same way as in the LOTUS corpus. In addition, we use extra characters for sentence transcription, which are shown in Tables 4, 5 and 6.

Table 4. Mapping phonemes in LOTUS and CSTD

| LOTUS | CSTD (Ci) | Cf | Thai |
|---|---|---|---|
| b | b | p^ | บ |
| d | d | t^ | ฎ, ด |
| f | – | f^ | ฝ, ฟ |
| h | h | – | ห, ฮ |
| j | j | j^ | ญ, ย |
| k | k | k^ | ก |
| kh | kh | k^ | ข, ค, ฆ |
| l | l | n^, l^ | ล, ฬ |
| m | m | m^ | ม |
| n | n | n^ | ณ, น |
| ng | ng | ng^ | ง, หง |
| – | jh | – | หญ, ย (Nasal) |
| p | p | p^ | ป |
| ph | ph | p^ | ผ, พ, ภ |
| r | r | n^ | ร |
| s | s | t^, s^ | ซ, ศ, ษ, ส |
| t | t | t^ | ฏ, ต |

Table 5. Mapping phonemes in LOTUS and CSTD

| LOTUS | CSTD (Ci) | Cf | Thai |
|---|---|---|---|
| th | th | t^ | ฐ, ฑ, ฒ, ถ, ท, ธ |
| c | c | t^ | จ |
| ch | ch | t^, ch^ | ฉ, ฌ, ช |
| w | w | w^ | ว |
| z | z | y^ | อ |
| – | zj | – | อย |
| – | bh | – | บ (Plosive) |
| – | dh | – | ด (Plosive) |
| kr | kr | – | กร |
| kl | kl | – | กล |
| kw | kw | – | กว |
| khl | khl | – | ขล, คล |
| khr | khr | – | คร |
| khw | khw | – | ขว, คว, ฝ, ฟ |
| pr | pr | – | ปร |
| pl | pl | – | ปล |
| phr | phr | – | พร |
| phl | phl | – | พล |
| tr | tr | – | ตร |
| thr | thr | – | ทร |
| – | mr | – | มร |
| – | ml | – | มล |
| br | br | – | บร |
| bl | bl | – | บล |
| fr | fr | – | ฟร |
| fl | fl | – | ฟล |
| dr | dr | – | ดร |

Table 6. The vowel phonemes of both corpuses

| LOTUS | CSTD | Thai |
|---|---|---|
| i, ii | i, ii | อิ, อี |
| v, vv | v, vv | อึ, อื |
| e, ee | e, ee | เอะ, เอ |
| q, qq | q, qq | เออะ, เออ |
| o, oo | o, oo | โอะ, โอ |
| x, xx | x, xx | แอะ, แอ |
| a, aa | a, aa | อะ, อา |
| @, @@ | @, @@ | เอาะ, ออ |
| ia, iia | ia, iia | เอียะ, เอีย |
| va, vva | va, vva | เอือะ, เอือ |
| ua, uua | ua, uua | อัวะ, อัว |
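The final-consonant (Cf) column of Table 4 can be read as a lookup from a LOTUS consonant symbol to the symbol or symbols used when that consonant closes a syllable. The sketch below encodes a few rows of the table in this way. The dictionary name `FINAL_SYMBOLS`, the helper names, and the choice of rows are illustrative assumptions; the values are copied from the Cf column as printed.

```python
# Excerpt of Table 4: LOTUS consonant symbol -> final-consonant symbol(s) (Cf).
# An empty tuple means the table shows no final form ("–") for that row.
FINAL_SYMBOLS = {
    "b": ("p^",),
    "d": ("t^",),
    "h": (),
    "k": ("k^",),
    "kh": ("k^",),
    "l": ("n^", "l^"),
    "m": ("m^",),
    "n": ("n^",),
    "ng": ("ng^",),
    "r": ("n^",),
    "s": ("t^", "s^"),
}


def final_forms(symbol: str) -> tuple:
    """Return the final-consonant symbols listed for a LOTUS symbol.

    Raises KeyError for symbols outside this illustrative excerpt.
    """
    return FINAL_SYMBOLS[symbol]


def is_final(symbol: str) -> bool:
    """Final consonants carry a trailing '^' in the Cf column."""
    return symbol.endswith("^")
```

A lookup of this kind makes the neutralization in the table explicit: for example, both `k` and `kh` share the single final form `k^`, while `l` lists two possible finals.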

3.1 Text Selection

The recorded sentences were collected from a STD dictionary [8][9], research papers, social network groups and webpages focused on STD, and by converting text from the LOTUS corpus. Symbols such as the comma (,) and exclamation mark (!) were easily removed from the selected sentences. However, one complication for STD is how to separate a paragraph into sentences and segment a sentence into words. Word and sentence definitions are major issues for the dialect.

Currently, we spend a lot of time manually segmenting the sentences because the existing Thai grapheme–to–phoneme (G2P) software [11] does not support many words from the STD vocabulary. For example:

ป้าวหนวยหนีสวย
→ ป้า|ว|หน|วย|หนี|สวย| (incorrect)
→ ป้าว|หนวย|หนี|สวย| (correct)
(Meaning: “This bag is beautiful.”)

3.2 Defining The Transcription

Our STD transcription uses either /Ci–V–T/ or /Ci–V–Cf–T/, where Ci denotes the initial consonant (single and clustered/double consonants), V is the vowel (short and long vowels), Cf represents the final consonant (single consonants), and T denotes one of the seven tones. The mappings are shown in Tables 4, 5, and 6, while T appears in Figure 3. For instance, the STD word “เบีย” (which means “money”) is transcribed as /b–iia–z^–4/, and “โรงบาล” (“hospital”) is transcribed as /r–oo–ng^–5 b–aa–n^–3/. Since consonants can be either initial or final, every final consonant is suffixed by a '^' symbol to differentiate it from an initial consonant.

The following example illustrates the transcription of the CTD sentence “กระเป๋าใบนีสวย” into CSTD:

ป้าว|หนวย|หนี|สวย → p–aa–w^–4|n–uua–j^–1|n–ii–z^–2|s–uua–j^–1|

The “|” symbol is used to separate adjacent words.

4 STD Speech Corpus

We are currently collecting speech corpus data. The steps in the process are described in this section. Once sentence and speaker selection are completed, the speech data is recorded in the sound–proof room shown in Figure 4. A dynamic close-talk microphone (TEXLEX H–41) and a dynamic unidirectional microphone (SONY F–720) are used for the recording, and the waveform is sampled at 48 kHz and saved in 16-bit PCM waveform format. Each audio file is also down–sampled to 16 kHz and similarly saved. We plan to collect speech samples from 100 native STD speakers (50 males and 50 females) who are students, staff, officers, and lecturers at Rajamangala University. By the end of the project we expect to have collected approximately 100 hours of recordings.

Figure 4. The sound–proof room

In addition to in-house recordings, broadcast news speech is being collected from STD–speaking television channels. We record each day’s news broadcast using a DTECH UTV332 USB TV box, tuned to a local cable channel. Each video recording lasts between 25 minutes and an hour, and the recorded files are saved in MPEG–2 format, each approximately 255 MB in size. To prepare the data for transcription, the audio information is extracted from the MPEG–2 file, and each sentence is segmented manually. These segments are saved as 16 kHz samples in 16-bit PCM wave files. We have already recorded 35 hours of news, and hope to reach a total of 50 hours.

Transcription involves the labeling of each speech sentence, using the open source tool WaveSurfer [14]. Figure 5 shows a screenshot taken while an utterance was being transcribed.

Figure 5. Speech labeling using WaveSurfer

5 Conclusions and Future Work

We have presented the design of a STD corpus containing read speech and broadcast news, and described its current state. We plan to complete the recording of the read speech by 100 speakers, totaling 100 hours of data, by December 2013. At present 35 hours of audio/video data taken from broadcast news have been recorded, and we expect to have collected 50 hours by October 2013. Our future work includes the finalization of sentence selection, recording of more read speech, and the completion of speech transcription.

Acknowledgment

The authors would like to thank NECTEC for supplying information on their LOTUS corpus, Dr. Chantas Tongchuay for linguistic suggestions about the Southern Thai dialect, Ajarn Somporn Khumtong for pointing us towards useful STD […], and Dr. Keerati Inthavisas for his review of this manuscript.

References

[1] Lewis, M. Paul, Gary F. Simons, and Charles D. Fennig (eds.). Ethnologue: Languages of the World, 17th edition. Dallas, Texas: SIL International, 2013.
[2] John, F. Hartmann. “A Model for The Alignment of Dialects in Southwestern Tai,” JSS, 68(1):72–86, 1980.
[3] Ruangsuk K. and Kalaya T. “Tonal Borderline between Central Thai and Southern Thai: Variation by Age Group,” Songklanakarin J. of Social Sciences & Humanities, 15(5):723–737, 2009.
[4] Chantas T. Phasa Lae Wattanatam Pak-Dai; Language and Culture. Odeon Store, 1993.
[5] Wikipedia. Southern Thai language. http://en.wikipedia.org/wiki/Southern_Thai_language, 2013.
[6] Wilaisak K. Phasa Thai thin. Kasetsart University, Bangkok, 2001.
[7] Khamnuan N. Current Thai Dialects in Nakorn Si Thammarat. Srinakharinwirot University, Songkhla, 1987.
[8] The Southern Thai Studies Foundation. Southern Thai Dialect Dictionary 2525, 4th ed. The Southern Thai Studies Foundation, Songkhla, The Institute for Southern Thai Studies, Thaksin University, 2003.
[9] Nakorn Si Thammarat, Rajabhat University. Southern Thai Dialect Dictionary 2550. Mind Media, Bangkok, 2008.
[10] Kasuriya, S., Sornlertlamvanich, V., Cotsomrong, P., Kanokphara, S., and Thatphithakkul, N. “Thai Speech Corpus for Speech Recognition,” Proc. of Oriental-COCOSDA 2003, Singapore, October, 2003.
[11] Tarsaku, P., Sornlertlamvanich, V., and Thongprasirt, R. “Thai Grapheme-to-Phoneme using Probabilistic GLR Parser,” Proc. Eurospeech, 2:1057–1060, 2001.
[12] Syntrillium Software. Cool Edit Pro, v2.1, 2003.
[13] Freemake Software. Freemake Video Converter, v4.0.3.1, 2013.
[14] Karnjanadecha M., Kimsawad P., Chukumnird W., and Vaithayavanich K. “An Automatic Speech Transcriber for the Thai Speech Corpus Project,” Proc. of the 3rd International Symposium on Communications and Information Technology, 2:551–556, Songkhla, Thailand, Sept. 3–5, 2003.