Development of a Corpus for Southern Thai Dialect Speech Recognition: Design and Text Preparation

Sittichok Aunkaew†, Montri Karnjanadecha‡, Chai Wutiwiwatchai*

†‡Department of Computer Engineering, Faculty of Engineering, Prince of Songkla University, Hatyai, Songkhla 90112, Thailand
*National Electronics and Computer Technology Center (NECTEC), 112 Phahonyothin Rd., Klong Nueng, Klong Luang, Pathumthani 12120, Thailand
[email protected]†,
[email protected]‡,
[email protected]*

Abstract

[...] enough that mutual intelligibility between users of the dialects can be problematic. This paper describes our progress on the development of a corpus, and offers language resources, for the Southern Thai dialect. The existing LOTUS corpus for standard Thai speech recognition, developed by NECTEC, cannot fulfill our needs because the Southern Thai dialect differs from standard Thai in many ways, including pronunciation, lexicon, and grammar. Thus, our aim is to design and prepare transcriptions of recorded read speech and broadcast news to build a Southern Thai Dialect Continuous Speech Recognition corpus.

Keywords: Southern Thai Dialect, Speech Recognition, Speech Corpus, Language Resources

1 Introduction

Southern Thai Dialect (STD) or Dambro (Thai pronunciation: [pʰaːsǎː tʰajtâːj]) is a member of the Thai language subgroup of the Tai group of the Tai-Kadai language family [1][2]. Figure 1 shows the family tree of the Tai-Kadai language family. STD is spoken in the 14 southern provinces of Thailand and also in Amphoe Bang Saphan, in the central province of Prachuap Khiri Khan. A small number of Tai speakers can also be found in some border states of Malaysia, such as Kedah, Kelantan, Penang, and Perak [3][4][5].

Figure 1. The Southern Thai Dialect in the Tai-Kadai language family (adapted from [1])