Development of a Corpus for Southern Thai Dialect Speech Recognition: Design and Text Preparation
Sittichok Aunkaew †, Montri Karnjanadecha ‡, Chai Wutiwiwatchai *
†‡ Department of Computer Engineering, Faculty of Engineering, Prince of Songkla University, Hatyai, Songkhla 90112, Thailand
* National Electronics and Computer Technology Center (NECTEC), 112 Phahonyothin Rd., Klong Nueng, Klong Luang, Pathumthani 12120, Thailand

Abstract

This paper describes our progress on the development of a corpus, and of accompanying language resources, for the Southern Thai dialect. The existing LOTUS corpus for standard Thai speech recognition, developed by NECTEC, cannot fulfill our needs because the Southern Thai dialect differs from standard Thai in many ways, including pronunciation, lexicon, and grammar. Our aim is therefore to design and prepare transcriptions of recorded read speech and broadcast news in order to build a Southern Thai Dialect continuous speech recognition corpus.

Keywords: Southern Thai Dialect, Speech Recognition, Speech Corpus, Language Resources

1 Introduction

Southern Thai Dialect (STD) or Dambro (Thai pronunciation: [pʰaːsǎː tʰajtâːj]) is a member of the Thai language subgroup of the Tai group of the Tai–Kadai language family [1][2]. Figure 1 shows the family tree of the Tai–Kadai language family. STD is spoken in the 14 southern provinces of Thailand and also in Amphoe Bang Saphan, in the central province of Prachuap Khiri Khan. A small number of Tai speakers can also be found in some border states of Malaysia, such as Kedah, Kelantan, Penang, and Perak [3][4][5]. The map in Figure 2 shows the distribution of the Tai–Kadai language family.

Figure 1. The Southern Thai Dialect in the Tai–Kadai language family (adapted from [1])

Figure 2. Distribution of the Tai–Kadai language family [5]

The written form of STD uses the standard Thai alphabet, which is mainly used by the Central Thai Dialect (CTD). There are three main differences between CTD and STD: 1) pronunciation, 2) the lexicon, and 3) grammar [4][6]. These differences are significant enough that mutual intelligibility between users of the dialects can be problematic.

In a broad sense, STD is the native dialect of the South of Thailand, but there are some local differences within the dialect. For example, people in Phuket and Phang–Nga speak somewhat differently from those in Songkhla and Phatthalung. The Central Southern Thai Dialect (CSTD), commonly known as the Nakorn Si Thammarat dialect [4][7], is spoken in Chumphon, Krabi, Nakorn Si Thammarat, Narathiwat, Pattani, Phang–Nga, Phatthalung, Phuket, Ranong, Satun, Songkhla, Surat Thani, Trang, and Yala. Our corpus is targeted at CSTD, with the vocabulary and dictionary obtained from the studies described in [8][9].

Building an automatic speech recognition system for STD is challenging due to the existence of different regional variants of Thai. One important existing resource is the Large vOcabulary Thai continUous Speech Recognition (LOTUS) corpus developed by the National Electronics and Computer Technology Center (NECTEC) [10]. However, the LOTUS corpus was designed for CTD speech recognition, so it cannot be fully utilized for STD. In addition, there are no publicly available resources for building a complete STD speech recognition system, which prompted our work.

Three resources are necessary for building an STD speech recognition system. Section 2 examines the STD characteristics, section 3 discusses sentence selection, and section 4 describes the resulting speech corpus, which consists of transcribed read speech and broadcast news recordings. Conclusions are drawn in section 5.

2 STD Characteristics

When STD is written with the standard Thai alphabet, it has 24 vowels: 18 single vowels and 6 diphthongs, shown in Table 1. There are seven tones (see Figure 3) and 27 single consonants, of which 18 appear commonly in all dialects (*) and 9 appear only in some minor local dialects, as shown in Table 2. There are 16 clustered consonants, made up of 6 commonly appearing units (*) and 10 units appearing only in some minor local dialects, as shown in Table 3 [4][7][8][9].

Table 1. Comparison of vowels between IPA and the Thai language

| Tongue height | Notation | Front | Central | Back |
|---|---|---|---|---|
| Close | IPA | i, i: | ɯ, ɯː | u, u: |
| Close | Thai | i (อิ), i: (อี) | i (อึ), i: (อื) | u (อุ), u: (อู) |
| Mid | IPA | e, e: | ɤ, ɤ: | o, o: |
| Mid | Thai | e (เอะ), e: (เอ) | ə (เออะ), ə: (เออ) | o (โอะ), o: (โอ) |
| Open | IPA | ɛ, ɛ: | a, a: | ɔ, ɔ: |
| Open | Thai | ɛ (แอะ), ɛ: (แอ) | a (อะ), a: (อา) | ɔ (เอาะ), ɔ: (ออ) |
| Diphthongs | IPA | iaʔ, ia | ɯaʔ, ɯa | uaʔ, ua |
| Diphthongs | Thai | iia (เอียะ), ia (เอีย) | vva (เอือะ), va (เอือ) | uua (อัวะ), ua (อัว) |

/1/ high–rising–falling [453], such as [kha:¹]; high–rising [45], such as [khat¹].
/2/ high–level (falling at the end) [443], such as [kha:²], [kha:t²].
/3/ mid–rising–falling [342], such as [pi:³]; mid–rising [34], such as [pat³].
/4/ mid–level [33], such as [pa:⁴], [pa:t⁴].
/5/ low–rising–falling [231], such as [kha:⁵].
/6/ low–rising [23], such as [kha:⁶], [kha:t⁶].
/7/ low–falling [21], such as [kha:⁷], [khat⁷].

Figure 3. Seven tones in the Central Southern Thai dialect (adapted from [7], [8] and [9])

Table 2.
Comparison of single consonants between IPA, CTD, and STD

| IPA | CTD | STD | Example |
|---|---|---|---|
| b | b [บ] | b [บ]* | [baj³] = [leaf] |
| d | d [ฎ, ด] | d [ฎ, ด]* | [daj³] = [ladder] |
| f | f [ฝ, ฟ] | f [ฝ, ฟ] | [faj¹] = [mole] |
| h | h [ห, ฮ] | h [ห, ฮ]* | [haj¹] = [jar] |
| j | j [ญ, ย] | j [ญ, ย]* | [ja:n⁵] = [flabby] |
| k | k [ก] | k [ก]* | [kaj⁴] = [chicken] |
| kh | kh [ข, ค, ฆ] | kh [ข, ค, ฆ]* | [khom⁵] = [sharp] |
| l | l [ล, ฬ] | l [ล, ฬ]* | [la:n⁵] = [yard] |
| m | m [ม] | m [ม]* | [ma:⁵] = [come] |
| n | n [ณ, น] | n [ณ, น]* | [na:⁵] = [field] |
| ŋ | ŋ [ง, หง] | ŋ [ง, หง] | [ŋa:⁵] = [sesame] |
| ɲ | – | ñ [หญ, ย, ญ] (nasal) | [ñaj¹] = [big] |
| p | p [ป] | p [ป]* | [paj³] = [go] |
| ph | ph [ผ, พ, ภ] | ph [ผ, พ, ภ]* | [phaŋ⁵] = [break] |
| r | r [ร] | r [ร] | [ra:n⁵] = [prune/cut off] |
| s | s [ซ, ศ, ษ, ส] | s [ซ, ศ, ษ, ส]* | [saj¹] = [wear] |
| t | t [ฏ, ต] | t [ฏ, ต]* | [taj³] = [climb] |
| th | th [ฐ, ฑ, ฒ, ถ, ท, ธ] | th [ฐ, ฑ, ฒ, ถ, ท, ธ]* | [tha:ʔ⁶] = [pan] |
| tɕ | c [จ] | c [จ]* | [cɔ:k⁴] = [glass] |
| tɕʰ | ch [ฉ, ฌ, ช] | ch [ฉ, ฌ, ช]* | [chom⁵] = [praise] |
| w | w [ว] | w [ว]* | [wa:n⁵] = [bottom] |
| ʔ | ʔ [อ] | ʔ [อ]* | [ʔɔ:k⁴] = [out] |
| – | – | bh [บ] (aspirated plosive) | [bha⁷ra⁶] = [Budu] |
| – | – | dh [ด] (aspirated plosive) | [dhaŋ⁵] = [nose bridge] |
| – | – | ʔ˜ [อ] (nasal) | [ʔ˜ɔ:k⁴] = [White-breasted Sea Eagle] |
| – | – | ʔj [อย] | [ʔjɔ:k³] = [sit] |
| – | – | g [ฆ] | [gom⁵] = [pour the water from the boiled rice] |

Table 3. Comparison of clustered consonants between CTD and STD

| CTD | STD | CTD | STD | CTD | STD | CTD | STD | STD |
|---|---|---|---|---|---|---|---|---|
| pr [ปร] | pr [ปร] | tr [ตร] | tr [ตร] | kw [กว] | kw [กว]* | – | mr [มร] | ʂtr [สต] |
| pl [ปล] | pl [ปล]* | thr [ทร] | – | khr [คร] | khr [คร] | – | ml [มล] | |
| phr [พร] | phr [พร] | kr [กร] | kr [กร] | khl [คล] | khl [คล]* | – | br [บร] | |
| phl [พล] | phl [พล]* | kl [กล] | kl [กล]* | khw [ขว, คว] | khw [ขว, คว, ฟ, ฝ]* | – | bl [บล] | |

The syllable is the basic pronunciation unit in STD, with each one containing at least a single consonant (C), a vowel (V), and a tone (T). The STD inherent syllable types are CVVᵀ, CVCᵀ, CVVCᵀ, CCVCᵀ, CCVVᵀ, CCVCᵀ, and CCVVCᵀ [6][7]. Examples include “เล /le:⁵/” (which means “sea”), “แล็บ /lɛ:p⁶/” (“nail”), “โรงบาล /ro:ng⁵ ba:n³/” (“hospital”), and “โพรก /phro:k⁶/” (“tomorrow”).

3 Sentence Selection

The maturity and scientific basis of a speech corpus are of great importance in speech technology [10]. A corpus must provide all of the linguistic features and prosody of the language, and must meet the design requirements of speech recognition systems. A high-quality, large-scale, and diversified corpus will significantly promote the development and application of speech recognition technology.

Corpora used for speech recognition must include training and testing sets for the acoustic model. Building the corpus involves corpus design, text and sentence selection, speaker selection, recording device and room preparation, recording, and speech labeling.

CSTD text is written like Thai, but some words are homographs rather than homophones, which is inconvenient during transcription. For easier processing of dialect text, we describe STD characters, vowels, and tones with Latin characters, in the same way as in the LOTUS corpus. In addition, we use extra characters for sentence transcription, which are shown in Tables 4, 5 and 6.

Table 4. Mapping phonemes in LOTUS and CSTD

| LOTUS | CSTD (Ci) | Cf | Thai |
|---|---|---|---|
| b | b | p^ | บ |
| d | d | t^ | ฎ, ด |
| f | – | f^ | ฝ, ฟ |
| h | h | – | ห, ฮ |
| j | j | j^ | ญ, ย |
| k | k | k^ | ก |
| kh | kh | k^ | ข, ค, ฆ |

Table 5. Mapping phonemes in LOTUS and CSTD

| LOTUS | CSTD (Ci) | Cf | Thai |
|---|---|---|---|
| th | th | t^ | ฐ, ฑ, ฒ, ถ, ท, ธ |
| c | c | t^ | จ |
| ch | ch | t^, ch^ | ฉ, ฌ, ช |
| w | w | w^ | ว |
| z | z | y^ | อ |
| – | zj | – | อย |
| – | bh | – | บ (Plosive) |
| – | dh | – | ด (Plosive) |
| kr | kr | – | กร |
| kl | kl | – | กล |
| kw | kw | – | กว |
| khl | khl | – | ขล, คล |
| khr | khr | – | คร |
| khw | khw | – | ขว, คว, ฝ, ฟ |
| pr | pr | – | ปร |
| pl | pl | – | ปล |
| phr | phr | – | พร |
| phl | phl | – | พล |
| tr | tr | – | ตร |
| thr | thr | – | ทร |
| – | mr | – | มร |
| – | ml | – | มล |
| br | br | – | บร |
| bl | bl | – | บล |
| fr | fr | – | ฟร |
| fl | fl | – | ฟล |
| dr | dr | – | ดร |

Table 6.