Free Linguistic and Speech Resources for Tibetan
Total Page:16
File Type:pdf, Size:1020Kb
Proceedings of APSIPA Annual Summit and Conference 2017 12 - 15 December 2017, Malaysia Free Linguistic and Speech Resources for Tibetan Guanyu Li*, Hongzhi Yu†, Thomas Fang Zheng‡, Jinghao Yan§,Shipeng Xu£ *Key Laboratory of National language intelligent processing, Gansu Province, Northwest Minzu University, Lanzhou, China E-mail: [email protected] Tel: +86-13809316272 †Key Laboratory of National language intelligent processing, Gansu Province, Northwest Minzu University, Lanzhou, China E-mail: [email protected] Tel: +86-13909318273 ‡Center for Speech and Language Technologies, Tsinghua University, Beijing, China E-mail: [email protected] Tel: +86-13801012234 §Key Laboratory of National language intelligent processing, Gansu Province, Northwest Minzu University, Lanzhou, China E-mail: [email protected] Tel: +86-18067528276 £Key Laboratory of National language intelligent processing, Gansu Province, Northwest Minzu University, Lanzhou, China E-mail: [email protected] Tel: +86-18394170747 Abstract— Tibetan is an important low-resource language in (Tibetan, Mongolia, Uyghur, Kazak and Kirgiz), and make China. A key factor that hinders the speech and language the resources open and free for research purposes. In this research for Tibetan is the lack of resources, particularly free paper, we report our progress on Tibetan resource ones. This paper describes our recent progression on Tibetan construction, including the phone set, the lexicon, resource construction supported by the NSFC M2ASR project, transcriptions and speech databases. All the resources are including the phone set, lexicon, as well as the transcription of available on the project webpage (http://m2asr.cslt.org), and a large scale speech corpus. Following the M2ASR free data can be obtained by either free download or delivery on program, all the resources are publicly available and free for request. researchers. We also release a small Tibetan speech database Note that there are 3 Tibetan dialect areas in China: U- that can be used to build a proto type Tibetan speech Tsang, Amdo, Kham. People in the three areas use the same recognition system. written form, but pronounce very differently. In U-Tsang, the most popular dialect is the Lhasa Tibetan, and in Amdo, I. INTRODUCTION the Xiahe Tibetan (or Labrang Tibetan) is the mostly Tibetan language is a key member in the family of minor influential. The M2ASR project focuses on the two dialects languages in China. It belongs to the Sino-Tibetan language as they are spoken by most of Tibetan people. In the family, the Tibeto-Burman subgroup. The speakers are following sections, we will first briefly summarize the about 6 million people, mainly distributed in China (Tibet, written and pronunciation system of Tibetan, and then Qinghai, Gansu, Sichuan and Yunnan provinces), India, propose our work on resource construction for the two Bhutan, and Nepal. Compared to the major languages such dialects respectively. as English and Mandarin, the research for Tibetan is far II. CHARACTERS, SYLLABLES AND WORDS IN from extensive, on both linguistics and speech processing. TIBETAN A key factor that hinders the research is that the resources are very limited and far from being standard. For example, Tibetan scripts are written in alphabets. From view of the lexicon is still in a small scale, and large-scale speech written form, there are 30 consonant letters and 4 vowel databases are very rare. Most seriously, most of the signs in Tibetan (note all dialects are the same in writing). resources are held by individual institutes, with very limited Each syllable is a combination of several consonant letters sharing and openness. and a vowel sign. Words are comprised of one or several The Multilingual Minorlingual Automatic Speech syllables. In the Tibetan script, syllables and words are Recognition (M2ASR) project aims to change the situation. written from left to right, and are separated by the same An ambition of this project is to construct a full set of delimiter “﹒” (called ཙག (/tsheg/) in Tibetan). language and speech resources for 5 minor languages Supported by the National Natural Science Foundation of China under Project (Grant No.61633013) and the Natural Science Foundation of Gansu Province(Grant No. 1506RJZA268) 978-1-5386-1542-3@2017 APSIPA APSIPA ASC 2017 Proceedings of APSIPA Annual Summit and Conference 2017 12 - 15 December 2017, Malaysia Each syllable involves a radical consonant letter, and Tibetan. The phone clusters of the consonants are presented other consonant letters could be appended to the radical in Table Ⅰ, where the consonant /f/ only appears in foreign consonant as superscript, subscript, prescript, postscript and words. The vowels in Lhasa Tibetan are listed in table Ⅱ post-postscript to form a syllable (Fig 1). A syllable must [1][2][3]. There is a long vowel form for each of the 8 contain a vowel sign, but a vowel sign corresponds to a cardinal vowels; 5 vowels (/ɛ/, /e/, /ø/, /i/ and /y/) have a sound /a/ can be omitted. In general, the vowel glottalized form and 3 vowels (/e/, /i/ and /y/) have a signs ◌ི, ◌ུ, ◌ེ, ◌ོ sound /i/, /u/, /e/, /o/ respectively, but nasalized form. Additionally, there are 4 consonants (/k/, exceptions also exist, as their pronunciations can be /m/ and /p/) that can be augmented to the end of a vowel to changed following some regular rules. Note that in all the form a coda, and there are two compound vowels: /au/ and dialects of Tibetan, two syllables may be pronounced the /iu/. same but each syllable has only a single pronunciation. In Table Ⅰ Clusters Of Lhasa Tibetan Consonants other words, there are many homophones but no polyphones Apical Labial bilabial labiodental retroflex PalatalVelar glottal in Tibetan. alveolar velar unaspirated p t c K ʔ Plosive voiceless aspirated pʰ tʰ cʰ kʰ unaspirated f ʦ ʈʂ ʨ Affricate voiceless aspirated ʦʰ ʈʂʰ ʨʰ Fricative unaspirated s ʂ ɕ h nasal voiced unaspirated m n ȵ Ŋ approximant voiced unaspirated j w Lateral fricative ɬ Lateral approximant voiced unaspirated l Fig 1 Constitution Of Syllable trill voiced unaspirated r The radical consonant, the prescript and superscript Table Ⅱ Vowels Of Lhasa Tibetan consonants together form the initial part of a syllable, and characters Post consonant tongue vowels long glottalized nasalized the vowel sign, the postscript and post-postscript consonants position rounded k m p u ŋ altogether form the final part. In ancient Tibetan, there are front low a a: ak am ap au aŋ medium many consonant compositions. These consonant front low ɛ ɛ: ɛˀ medium compositions are largely preserved in the modern Xiahe front high e e: eˀ ẽ em ep eŋ front medium rounded ø ø: øˀ dialect, however in the Lhasa dialect, most consonant high front high i i: iˀ ĩ ik im ip iu iŋ compositions sound just like a single constant. Another front high rounded y y: yˀ ỹ distinction between the Lhasa dialect and the Xiahe dialect back medium rounded o o: ok om op oŋ is that the former is tonal (four tones in total: 43, 44, 12 and back high rounded u u: uk um up uŋ 113[1]) while the latter is toneless. For these reasons, the two dialects sound very different and should be treated For speech recognition, we made a slight modification to different in resource construction. construct the ASR phone set. The first modification is that we merge a cardinal vowel with its corresponding long III. RESOURCES CONSTRUCTION FOR LHASA vowel form. The second modification is to treat /au/ and TIBETAN /iu/ as two single phones rather than splitting them into In this section, we describe our work on resources ingredient phones. To ease the text processing with construction for the Lhasa dialect. The resources include computers, all these phones (IPAs) are transformed into the phones set, the syllable lexicon, the word lexicon, text Latin letter, as shown in Table Ⅲ. database and speech database. Table Ⅲ Phones And Latin Transformation Of Lhasa Tibetan A. Phone set IPA Latin IPA Latin IPA Latin IPA Latin IPA Latin IPA Latin IPA Latin IPA Latin The phone set involves small and distinct pronunciation c c l l s s ŋ ng ʨ tx ɛ Ec øˀ edb ỹ yy units. These units are related to the consonants and vowels cʰ ch m m t t ȵ nn ʨʰ txh ɛˀ Ecb i i o o in the written form, and more reflect the true pronunciation. We follow the seminal work by Ge Sang Ju Mian [1] and h h n n tʰ th ɕ x ˀ ab e E iˀ ib u u define 29 consonants and 8 cardinal vowels in Lhasa j j p p tʂ ts ʂ ss f f eˀ Eb ĩ ii au au 978-1-5386-1542-3@2017 APSIPA APSIPA ASC 2017 Proceedings of APSIPA Annual Summit and Conference 2017 12 - 15 December 2017, Malaysia k k pʰ ph ʦʰ tsh ʈʂ ds ɬ lh ẽ Ee y y iu iu is also added to the word lexicon for it is often used to represent the foreign pronunciation /hua/. kʰ kh r r w w ʈʂʰ dsh a a ø Ed yˀ yb D. Text and transcription B. Syllable lexicon We collected a 100M text database of Tibetan for A syllable lexicon involves a set of syllables whose language modeling. From this huge database, we also pronunciations are defined. There are more than 8000 extract a transcription that will be used to collect the initial possible syllables in Tibetan, including syllables for foreign speech database that involves 50 hours of speech signals. words. We construct the syllable lexicon by constructing a The sources of the text include Tibetan newspapers, official text corpus involving 420,000 sentences (including both documents and sentences of spoken Tibetan. The sentences written and spoken), and then selected the most frequent in the text database are transformed into phones and syllables from this corpus.