A Free Kazakh Speech Database and a Speech Recognition Baseline

Proceedings of APSIPA Annual Summit and Conference 2017 (APSIPA ASC 2017), 12-15 December 2017, Malaysia
978-1-5386-1542-3 @2017 APSIPA

Ying Shi∗, Askar Hamdulla†, Zhiyuan Tang∗, Dong Wang∗‡, Thomas Fang Zheng∗
∗ Center for Speech and Language Technologies, Research Institute of Information Technology, Department of Computer Science and Technology, Tsinghua University, China
† Key Laboratory of Signal and Information Processing, Xinjiang University
‡ Corresponding Author
E-mail: [email protected]

Abstract: Automatic speech recognition (ASR) has improved significantly for major languages such as English and Chinese, partly due to the emergence of deep neural networks (DNN) and large amounts of training data. For minority languages, however, progress lags far behind the mainstream. A particular obstacle is that there are almost no large-scale speech databases for minority languages, and the few that exist are held by a handful of institutes as private property: they are far from open and standard, and very few are free. Besides speech databases, phonetic and linguistic resources are also scarce, including phone sets, lexicons, and language models.

In this paper, we publish a speech database in Kazakh, a major minority language in western China. Accompanying this database, a full set of phonetic and linguistic resources is also published, with which a full-fledged Kazakh ASR system can be constructed. We describe the recipe for constructing a baseline system and report our present results. The resources are free for research institutes and can be obtained by request. The publication is supported by the M2ASR project, funded by NSFC, which aims to build multilingual ASR systems for minority languages in China.

I. INTRODUCTION

In recent years, automatic speech recognition (ASR) has improved significantly, partly due to the emergence of deep neural networks (DNN) and increasing amounts of training data [1], [2], [3]. For major languages such as Chinese and English, large-scale ASR systems have been deployed by several big companies to provide ubiquitous service via numerous applications [4], [5].

These achievements, however, are seen far less in minority languages. There are many reasons, including complex economic and educational issues, but one obstacle is the lack of open, free, and standard resources. For many minority languages, there are almost no large-scale speech databases, and the few that exist are held by several institutes as private property: they are far from open and standard, and very few are free. Besides speech databases, phonetic and linguistic resources are also scarce, including phone sets, lexicons, and language models.

Recently, we started a multilingual minor-lingual ASR (M2ASR) project, supported by the National Natural Science Foundation of China (NSFC). The project is a three-party collaboration among Tsinghua University, Northwest National University, and Xinjiang University. The aim of this project is to construct speech recognition systems for five minority languages in China (Tibetan, Mongolian, Uyghur, Kazakh and Kirgiz). However, our ambition goes beyond that scope: we hope to construct a full set of linguistic and speech resources for these five languages, and to make them open and free for research purposes. We call this the M2ASR Free Data Program. All the data resources, including the ones published in this paper, are released through the website of the project¹.

In this paper, we report our progress on Kazakh resource construction. We will release a large-scale speech database and associated resources, including the phone set, lexicon and language model (LM). The speech database consists of about 78 hours of speech signals recorded by 96 speakers. These resources include everything required to establish a Kazakh ASR system. We will also publish a Kazakh ASR baseline system, so that researchers working on Kazakh ASR have a benchmark against which to evaluate their research.

The rest of the paper is organized as follows: Section II presents some work on free speech databases, and Section III describes the characteristics of Kazakh. Section IV presents the database that we will release. The construction of the Kazakh ASR baseline is reported in Section V. The conclusions and future work are presented in Section VI.

II. FREE DATABASES FOR ASR

There are several well-known speech databases, mostly in English, such as WSJ [6], Switchboard [7], and TIMIT [8]. For Chinese, the RAS 863 corpus [9] is the most widely used. These databases can be publicly obtained, though most of them are distributed under commercial licences, which are often expensive.

There have been some initial attempts towards free speech databases. For example, the EU-supported AMI/AMIDA project released all of its data (both speech and video recordings of meetings), and the VoxForge project² provides a platform for publishing GPL-based annotated speech audio. OpenSLR is another platform for publishing open resources for speech and language research, including LibriSpeech³. The Sprakbanken database⁴ provides data for Swedish, Norwegian, and Danish. In Chinese, CSLT@Tsinghua University has released a free database, THCHS30 [10], that contains about 30 hours of read speech. A Kaldi recipe was also released to build a complete Chinese ASR system with THCHS30.

For minority languages, public and free databases are still rare. The AP16-OLR challenge⁵ released a database consisting of seven oriental languages: Mandarin, Cantonese, Indonesian, Japanese, Russian, Korean, and Vietnamese. This database is free for the challenge participants, but still expensive for other researchers. Recently, the Babel project has released several very cheap speech databases for minority languages, including Assamese, Bengali, Cantonese, Georgian, Pashto, Tagalog, and Turkish. Each of these databases costs only a few dollars and is nearly free.

Last year, CSLT published a free speech database for Uyghur, THUGY20 [11]. This database consists of 20 hours of speech signals recorded by more than 340 speakers. Accompanying this database, a complete set of resources that can be used to construct a full-fledged Uyghur ASR system was also published. The database and the associated resources can be downloaded from CSLT's website⁶, and the recipe can be downloaded from GitHub⁷.

Supported by the M2ASR project, we will publish another set of free databases and associated resources. Currently, the data ready for publication include the phone set, lexicon and speech data of Tibetan, as well as the phone set and lexicon of Mongolian. The Kazakh speech database and the associated resources are part of the release. More information about the release can be found on the M2ASR project website.

¹ http://m2asr.cslt.org
² http://www.voxforge.org/
³ http://www.openslr.org/12/
⁴ http://www.nb.no/sbfil/talegjenkjenning/

III. KAZAKH LANGUAGE

This section reviews some properties of Kazakh and the difficulties they pose for ASR.

A. Characters and phones

Kazakh is one of the Turkic languages and belongs to the Altaic language family. Most of its speakers reside in Kazakhstan, Mongolia, and the Xinjiang Uygur Autonomous Region in China. There are three written forms of the Kazakh alphabet: Arabic, Latin and Cyrillic. The Arabic characters are mostly used by the Kazakh people living in China. The Cyrillic characters are widely used by the Kazakh people who live in Kazakhstan and Mongolia. The Latin characters were used by Kazakh people in China from 1964 to 1984, but have since been replaced by Arabic characters. With the spread of online communication tools (e.g., WeChat), the younger generation has become accustomed to Latin characters, which are easier to type.

In most cases, the pronunciation of Kazakh is highly regular, following the rule of "reading as writing". This means that the characters and phones are largely the same, and so whenever [...]

Fig. 1 shows all the Kazakh letters in the three written forms. There are 33 phones in Kazakh: 9 vowels and 24 consonants. The vowels can be further categorized into two groups, 4 front vowels and 5 back vowels, as shown in Table I. Note that the character 'v' (in Latin) shown in Fig. 1 is only a functional letter that indicates a change in the pronunciation of the following vowel character (see below); it is not pronounced by itself.

Latin   Arabic   Cyrillic
N       ڭ        Ң
n       ن        Н
m       م        М
l       ل        Л
q       ق        Қ
k       ك        К
y       ي        Й
z       ز        З
j       ج        Ж
E       ە        Е
d       د        Д
G       ع        Ғ
g       گ        Г
V       ۆ        В
b       ب        Б
A       ءا       Ә
a       ا        А
v       ء        None
i       ءى       І
e       ى        Ы
x       ش        Ш
c       چ        Ч
H       ھ        Һ
h       ح        Х
f       ف        Ф
U       ءۇ       Ү
u       ۇ        Ұ
w       ۋ        У
t       ت        Т
s       س        С
r       ر        Р
p       پ        П
O       ءو       Ө
o       و        О

Fig. 1. Kazakh characters in Arabic, Latin and Cyrillic.

TABLE I
FRONT VOWELS AND BACK VOWELS IN KAZAKH.

Type          Written form
Front vowel   A(va)  O(vo)  U(vu)  i(ve)  -
Back vowel    a      o      u      e      E

B. Vowel harmony in Kazakh

According to Kramer [12], "Vowel harmony is the phenomenon where potentially all vowels in adjacent moras or syllables within a domain like the phonological or morphological word systematically agree with each other with regard to one or more articulatory features." This means that vowel harmony sets a constraint on which vowels may be found near each other in a word or word group.

Vowel harmony may cause the true pronunciation to depart from the spelling, which in turn introduces errors in the lexicon. These errors can be corrected by applying the vowel harmony rules. Unfortunately, the vowel harmony rules of Kazakh have not yet been fully investigated, and only two rules are known to us.
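The role of the functional letter 'v' and the front/back pairs of Table I can be illustrated with a small letter-to-phone converter for Latin-form spellings. This is only a minimal sketch under stated assumptions, not the lexicon tool used for the released resources: the function name latin_to_phones and the table FRONT_FROM_BACK are hypothetical, each remaining Latin letter is assumed to map to one phone ("reading as writing"), 'v' is assumed to stand immediately before the vowel it modifies, and no vowel harmony correction is applied. The input strings are illustrative letter sequences, not entries from the released lexicon.

```python
from typing import List

# Front vowels written as 'v' + back vowel letter, taken from Table I:
# A(va), O(vo), U(vu), i(ve). The dictionary name is an assumption.
FRONT_FROM_BACK = {"a": "A", "o": "O", "u": "U", "e": "i"}


def latin_to_phones(word: str) -> List[str]:
    """Map a Latin-form spelling to a phone sequence (hypothetical helper).

    Assumes one phone per letter, except that the functional letter 'v'
    is silent and turns the next vowel letter into its front counterpart.
    """
    phones: List[str] = []
    i = 0
    while i < len(word):
        ch = word[i]
        if ch == "v":
            # 'v' is not pronounced; it only changes the following vowel.
            nxt = word[i + 1] if i + 1 < len(word) else ""
            if nxt not in FRONT_FROM_BACK:
                raise ValueError(
                    f"'v' at position {i} is not followed by a, o, u or e"
                )
            phones.append(FRONT_FROM_BACK[nxt])
            i += 2
        else:
            # "Reading as writing": the letter itself names the phone.
            phones.append(ch)
            i += 1
    return phones


if __name__ == "__main__":
    for spelling in ("vat", "bala", "kvel"):
        print(spelling, "->", " ".join(latin_to_phones(spelling)))
```

A converter of this kind would only give the spelling-based pronunciation; entries affected by vowel harmony would still need to be corrected by separate rules.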
