Toward Asian Speech Translation: The Development of Speech and Text Corpora for the Vietnamese Language

Thang Tat Vu 1,2, Khanh Tang Nguyen 2, Le Thanh Ha 2, Mai Chi Luong 2, Satoshi Nakamura 1
1 NICT - National Institute of Information and Communications Technology, Japan
2 IOIT - Institute of Information Technology, Vietnam
[email protected], {ntkhanh, htle, lcmai}@ioit.ac.vn, [email protected]

Abstract

This paper presents the development of the Vietnamese speech and text resources required for building speech-to-speech translation systems. The speech corpora for LVCSR consist of read-speech, broadcast, and BTEC Test corpora. The corpus for speech synthesis was collected from two broadcaster voices. The monolingual corpus consists of more than 8 million sentences extracted from a Vietnamese news website over the years 2003-2009. In addition, a multilingual corpus of 100,000 sentences was constructed. We also conducted several experiments on these corpora and achieved reasonable results: a BLEU score of 0.5826 for MT, a WER of 13.85% for SR, and high-quality synthetic speech.

1 Introduction

Vietnamese is the national and official language of Vietnam. It is the mother tongue of 86% of the Vietnamese population and is used as a second language by the ethnic minorities of Vietnam. Vietnamese has been identified as part of the Mon-Khmer branch of the Austro-Asiatic language family, a family that includes Khmer, spoken in Cambodia, and various tribal and regional languages, such as the Munda languages spoken in eastern India, as well as others in southern China. More recently, since the Muong language was found to be more closely related to Vietnamese, the Vietic group was established as a new subgroup of Mon-Khmer; it also includes other minority languages that share the characteristics of the Viet and Muong languages [1].

In 2008, Vietnamese became one of the languages of the A-STAR project (Asian Speech Translation Advanced Research). This project is an Asian consortium that is expected to advance the state of the art in multilingual man-machine interfaces in the Asian region [2]. Its basic infrastructure is intended to accelerate the development of large-scale spoken language corpora in Asia and to facilitate the development of related fundamental information communication technologies.

Research in Vietnamese speech and language processing has been underdeveloped because the required corpora are unavailable. In this paper, we present our efforts to develop the corpora required for building speech-to-speech translation systems between Vietnamese and another language, for instance English or any other Asian language in the A-STAR project. This work is the result of the collaboration between NICT from Japan and IOIT from Vietnam within the A-STAR project.

A speech-to-speech translation system is a complex application that contains three important components: speech recognition (SR), machine translation (MT) of text, and speech synthesis (SS). Each component, moreover, needs corresponding resources for its training and testing phases. In particular, the SR component requires a speech corpus for acoustic modeling and a monolingual text corpus for language modeling. The MT component needs not only a monolingual corpus for language modeling, but also a multilingual corpus for training translation models. A speech corpus is also required for the SS component. Therefore, we are in the process of developing these resources.

In the next section, we briefly review Vietnamese phonology and the word-to-phoneme tool used to generate the pronunciation dictionary. Section 3 discusses our efforts in building text corpora, including both monolingual and bilingual corpora. Section 4 describes the speech corpora. A read-speech corpus, a broadcast corpus, and a BTEC Test corpus were constructed for SR purposes, and a separate read-speech corpus was collected for SS purposes. In Section 5, we present several experiments on these resources; they give preliminary results for Vietnamese SR, SS, and MT systems. Conclusions are drawn in Section 6.

2 Pronunciation Dictionary

Table 1.
Structure of Vietnamese syllable

    Tone (6)
    [Initial] (22)   Final (155)
                     [Onset] (1)   Nucleus (16)   [Coda] (8)

Word-to-phoneme (W2P) conversion is very important for SR, SS, and MT systems, since a pronunciation dictionary is indispensable for constructing resources and for further research on speech and language. However, the conversion is not straightforward, because a phoneme can be written in many forms in Vietnamese text; for instance, the diphthong /ie/ can appear as "ia", "iê", "yê", or "ya". Moreover, the same letter, such as "a", is used for several phonemes, such as /ɛ̆/, /a/, and /ă/. Therefore, the syllable structure needs to be carefully considered to determine the pronunciation of Vietnamese words.

2.1 Syllable structure

Vietnamese is a syllabic tonal language. The total number of pronounceable syllables is about 19,000, but only around 6,500 syllables are used in practice, or 2,400 if tone is ignored [4, 5]. In Vietnamese text, a word is often composed of several syllables separated by blank spaces. The word-to-phoneme process, therefore, consists of several syllable-to-phoneme processes.

The structure of a Vietnamese syllable is depicted in Table 1, which also shows the number of phonemes in each part of the syllable. A syllable is the combination of Tone, Initial, Onset, Nucleus, and Coda. The Initial, Onset, and Coda are optional, and each is a consonant or a semi-vowel. The Nucleus is always present and is a vowel or a diphthong. Therefore, a syllable usually appears in the form CwVC, where C is a consonant, V is a vowel, and w is an optional semi-vowel.

2.2 W2P tool

The Vietnamese W2P tool is a rule-based system that generates word pronunciations. It converts a syllable to phonemes in three steps: (1) identify the tone mark and obtain the syllable without tone, (2) decompose the syllable into its parts: Initial, Onset, Nucleus, and Coda, and (3) determine the phoneme associated with each part.

Realize tone marks
Tone is a supra-segmental feature, and it is very important in building Vietnamese words. There are six lexical tones in Vietnamese writing. Except for the level tone, each tone is indicated by one of five marks placed above or below a character. The mark /`/ is used for the Falling tone, /~/ for the Broken tone, /?/ for the Curve tone, /'/ for the Rising tone, and /./ for the Drop tone. Identifying and removing tone marks is a simple process for the W2P tool.

W2P rules
There are several rules for decomposing the syllable into parts and for determining the exact phoneme in each part. We summarize these rules in four tables in Appendix A.

Since foreign words commonly appear in Vietnamese text, the W2P tool should handle their pronunciations. While constructing the monolingual corpus described in the next section, we extracted the foreign words that appear more often than a certain threshold. The pronunciations of these words were constructed manually and are referenced by the W2P tool.

3 Text Resources

3.1 Mono-lingual Text Corpus

The Vietnamese text corpus was collected using web crawling techniques on a Vietnamese news website, www.tuoitre.com.vn, which includes economy news, entertainment, sports, culture, general articles, and others. Since the web pages follow a standard format, we can correctly extract the content of the articles while leaving out redundant information such as HTML tags and advertisement text.

Normalization is conducted subsequently, with tasks such as converting the raw text to UTF-8 encoding, standardizing the position of tone marks, splitting paragraphs into sentences, converting digits to numbers written out in text, and tokenizing the sentences.

After normalization, we had acquired more than 5 gigabytes of news from the years 2003 to 2008. The number of sentences is more than 8 million. Subsequently, we extracted a vocabulary consisting of the words that appear in the text more often than a certain threshold. About 40,000 vocabulary entries were extracted. The threshold is applied in order to remove misspelled and uncommon words.

Since it is common for foreign words to appear in Vietnamese text, the vocabulary contains hundreds of common words from foreign languages. We manually constructed the pronunciation dictionary for these words, which is used by the W2P tool as described in the previous section.

3.2 Bi-lingual Text Corpus

There is no publicly available bilingual corpus for the Vietnamese NLP research community. An English-Vietnamese parallel corpus built using web mining techniques has been reported, but its quality is not clear [6].

In constructing a bilingual corpus, it is very important to follow the intended usage of the corpus. Our corpus is intended for general-purpose use. However, it is very difficult to construct a corpus that meets all theoretical requirements. We collect the bilingual text according to the following principles:
1. The bilingual text should primarily serve Vietnamese-to-English MT, with Vietnamese as the source language, while trading off against the quality of English-to-Vietnamese translation.
2. We collect news reportage and translate it into English. In addition, translated scientific documents, government policy papers and their translations, and some books or essays with good translations are also included.
3. Text should have been published in recent years.

Currently, the number of sentence pairs in the multilingual corpus is up to 100 thousand.

4 Speech Resources

4.1 Corpora for Speech Recognition

Our target is to record a large Vietnamese speech corpus including read-speech, broadcast, and BTEC Test speech corpora.

4.1.1 Read-Speech Corpus

Sentence Selection
From the monolingual corpus, we extracted phonetically balanced sentences using the greedy search algorithm [3]. The number of selected sentences is 9200. This set covers all tri-phones and all Vietnamese syllables.

Recording
Recording was done in a soundproof room, using a Sennheiser HMD410-6 microphone, with a 48 kHz sampling rate and 16 bits per sample. Sentences were prompted to the speaker for reading. An instructor monitored and facilitated the recording.

A total of 42 speakers (21 males and 21 females) were involved in the recording of about 30 hours of read speech, with each speaker contributing 460 utterances. It takes about three hours to finish the recordings for each speaker. The current corpus contains 19,320 utterances.

4.1.2 Broadcast corpus

This corpus consists of story readings, the VOV mailbag, news reports, and colloquy from the radio station "Voice of Vietnam".

To build this corpus, we downloaded sound files from the internet and converted them all into wave format with a 16 kHz sampling rate. Then, a silence detector was used to cut the long audio files into many small utterances. Each utterance usually contains approximately 10 syllables. Six people listened to about 50,000 utterances, and more than 23,000 good utterances were selected for transcription.

The radio station has a limited number of broadcasters. The corpus contains only about 30 broadcasters and visitors, and their voices do not cover all variations of Vietnamese. The good news is that the corpus covers all phonemes and most Vietnamese syllables. The number of distinct syllables with tone is around 5000, and without tone the number is about 2100.

4.1.3 BTEC Test corpus

The ATR basic travel expression corpus (BTEC) has served as the primary source for developing broad-coverage speech translation systems [2]. Bilingual travel experts collected the sentences from Japanese/English sentence pairs in travel domain "phrasebooks". Under the A-STAR project, there are also plans to collect synonymous sentences for Vietnamese following the BTEC standard. Currently, the BTEC test set has been selected for recording as the BTEC Test corpus. There were 42 speakers (21 males and 21 females), and each speaker uttered the same 510 BTEC sentences, resulting in 21,420 utterances.

4.2 Corpus for Speech Synthesis

The sentence selection process was the same as that used for the SR corpus. The difference here is that, since a single speaker has to utter all of the selected sentences, we limited the number of sentences and the utterance length. From the text data of the monolingual corpus described above, we extracted only 3060 sentences.

The recording conditions were the same as those used to record the SR read-speech corpus. The goal is about 6 hours of read speech from one male and one female broadcaster with standard Northern dialect. Each speaker contributes about 3 hours of speech, or 3060 sentences. Each speaker takes about 20 hours to finish the recordings.

Table 2. Vietnamese phonemes in IPA

    Type                   List                                                                 Num
    Initial                b, m, f, v, t, tʰ, d, n, z, ʐ, s, ʂ, c, ʈ, ɲ, l, k, χ, ŋ, ɣ, h, ʔ    22
    Onset                  u̯                                                                    1
    Nucleus  vowel         i, ɯ, u, e, ɤ, ɤ̆, o, ɛ, ɛ̆, a, ă, ɔ, ɔ̆                                16
             diphthong     ie, uo, ɯɤ
    Coda     consonant     m, n, ŋ, p, t, k                                                     8
             semi-vowel    u̯, i̯

Table 3. Consonant classification
Places: Labial, Labio-dental, Dental-Alveolar, Palatal, Velar, Glottal

    Manner                           Phonemes (in order of place)
    Stop, voiceless unaspirated      p, t, ʈ, c, k, ʔ
    Stop, voiceless aspirated        tʰ
    Stop, voiced                     b, d
    Nasal                            m, n, ɲ, ŋ
    Fricative, voiceless             f, s, ʂ, χ, h
    Fricative, voiced                v, z, ʐ, ɣ
    Approximant                      u̯, l, i̯

Table 4. Vowel classification

    Mouth manner    Front unrounded    Back unrounded    Back rounded
    Close           i                  ɯ                 u
    Open-mid        e                  ɤ, ɤ̆              o
    Open            ɛ, ɛ̆               a, ă              ɔ, ɔ̆
    Diphthongs      ie, uo, ɯɤ

5 Experiments

5.1 Machine Translation Experiment

Statistical machine translation (SMT) has dominated the machine translation field. Here we present a preliminary experiment on English-Vietnamese translation with one of the most popular SMT frameworks, MOSES [10].

Preparing tools and data
In Vietnamese, words are composed of several syllables separated by blank spaces in text. Hence, the basic step in Vietnamese language processing is to segment text into words. Tools such as JVnSegmenter [7] are available for this purpose.

Training models
The language model is trained using SRILM [8] on the monolingual corpus described in Section 3.1. We used a 15,000 sentence-pair corpus related to the legal domain, taken from the corpus in Section 3.2, to build the translation model. Word alignment is obtained using GIZA++ [9], and the phrase-based MOSES framework [10] is used for training the models and for decoding. More than 500 English sentences in the same domain were selected for testing. Using the phrase-based method, the BLEU score is 0.5826.

5.2 Speech Recognition Experiment

The Vietnamese LVCSR system was developed using the ATR SR toolkit. The Broadcast corpus was used as the training data, and the BTEC Test corpus was used as the test set.

Pronunciation Dictionary
Based on the W2P tool described in Section 2, we built a pronunciation dictionary that also includes some common foreign words.

Feature Extraction
A Hamming window was applied to each 20 ms frame with a 10 ms overlap. The feature vector for each frame was 25-dimensional, including 12-order MFCC, delta of MFCC, and log power.

Segmentation
Acoustic models from our previous study [4] were applied for the forced alignment process, which makes labeling efficient and less time-consuming.

Acoustic modeling
A shared-state HMM topology was obtained using the algorithm based on the MDL optimization criterion. All tri-phone models were generated with five Gaussian mixtures per state.

Language Modeling
The word bigram and trigram models were trained using text that included both the monolingual corpus and the BTEC sentences.

Recognition Accuracy
Using a trigram language model and five Gaussian mixtures per HMM state, the Vietnamese speech recognizer achieved a WER of 13.85% on the BTEC Test corpus.

5.3 Speech Synthesis Experiment

In this experiment, we apply the HMM-based speech synthesis (HSS) approach [11] to Vietnamese synthesis, using the male voice of the synthesis corpus.

Segmentation
We apply forced alignment using the SR model from the previous section. After the segmented labels were manually revised, the labels annotated with the contextual information described in Appendix B were generated automatically.

Decision tree
In order to carry out decision tree-based context clustering, a set of questions was defined to cluster the phonemes. They were derived from the phonetic characteristics summarized in Tables 2, 3, and 4. Afterwards, these questions were extended to include all the contextual information described in Appendix B.

Objective Test
Figure 1 shows a comparison of F0 patterns and spectrograms between synthesized and original speech signals for a sentence that is not included in the training database but was uttered by the speaker who recorded the database. The generated spectrogram and F0 contour are quite close to the natural pattern.

Figure 1: Pitch contours and spectrograms of (a) natural, and (b) synthetic speech.

Intelligibility Test
This test is used to estimate the intelligibility of synthetic speech signals. We conducted the test with ten subjects who had normal hearing. For each subject, forty short sentences, from three to seven words in length, were randomly selected and synthesized. The subjects were asked to listen to each utterance only once and write down what they heard, in order to avoid training effects in determining words. The result shows that 100% of the utterances were intelligible to the subjects.

MOS Test
As a further subjective evaluation, MOS tests were used to measure the quality of synthetic speech in comparison with natural speech. The rating levels were bad (1), poor (2), fair (3), good (4), and excellent (5). In this test, eighty sentences were randomly selected. With two types of audio, the number of stimuli was 160. We conducted the test with twelve subjects who had normal hearing. The speech signals were played in random order. The MOS results in Table 5 imply that the quality of natural speech is between good and excellent, and the quality of synthetic speech is between fair and good.

Table 5. Results of the MOS test

                        Mean opinion score
    Synthetic speech    3.23
    Natural speech      4.55

6 Conclusion

The development of speech and text corpora for the Vietnamese language was presented. The speech corpora for speech recognition consist of read-speech, broadcast, and BTEC Test corpora. The corpus for synthesis was collected from the voices of one male and one female broadcaster. The monolingual text corpus consists of more than 8 million sentences extracted from a Vietnamese news website over the years 2003-2009. In addition, a bilingual corpus of 100 thousand sentences was constructed, and a rule-based W2P tool was developed to generate the lexicon required by the MT, SR, and SS engines. We conducted several experiments with these corpora and achieved reasonable results for the MT, SR, and SS engines: a BLEU score of 0.5826 in the MT experiment, a WER of 13.85% in the SR experiment, and high quality and intelligibility of synthetic speech in both objective and subjective tests. These results serve as baseline results for further research in Vietnamese speech and language processing.

Appendix A
Vietnamese Initial, Onset, Nuclei and Coda in International Phonetic Alphabet

Table A1. Initial phonemes

    IPA    Vietnamese text                Example
    b      b                              bun bã
    d      đ                              đy đà
    t      t                              tan tác
    tʰ     th                             th ơm th o
    ʈ      tr                             tr c tr c
    c      ch                             ch ch ó t
    k      k (before /i, e, ɛ/)           kiêu kỳ
           c (before /u, o, a, ɤ, ɯ/)     cu cnh
           q (before /u̯/)                 quây qun
    m      m                              mưt mà
    n      n                              no nê
    ɲ      nh                             nh anh nh n
    ŋ      ngh (before /i, e, ɛ/)         ngh i, ngh ê
           ng                             ng ng ày
    f      ph                             ph t ph i
    v      v                              vi vã
    s      x                              xa xôi
    z      d                              d dãi
           gi                             gi i gi ang
           g (before /i, ie/)             gì, gìn, ging
    l      l                              long lanh
    ʂ      s                              sm s a
    ʐ      r                              ra rung
    χ      kh                             kh ông kh í
    ɣ      gh (before /i, e, ɛ/)          gh , gh i, gh e
           g                              gà gô
    h      h                              hi h
    p      p                              sa-pa, pê-đan

Table A2. Onset phonemes

    IPA    Vietnamese text           Example
    u̯      o (before /a, ă, ɛ/)      ho h on
           u                         huy t un

Table A3. Nucleus phonemes

    IPA    Vietnamese text                    Example
    i      y (before /u/)                     su y, ngu y i
           i                                  tinh t ích
    e      ê                                  bnh b ch
    ɛ      e                                  ngh e, v e
    ɛ̆      a (before ch /k/, nh /ŋ/)          sách, x anh
    u      u                                  súng, v ui
    o      ô                                  ô-tô, l ô nh ô
    ɔ      o                                  cn c on
    ɔ̆      o (before /k, ŋ/)                  vòng, t óc
    ɯ      ư                                  rng r c
    ɤ      ơ                                  nơm n p
    ɤ̆      â                                  ân c n
    a      a                                  lan c an
    ă      ă                                  mt ăn n ăn
           a (before u /u̯/, y /i̯/)            lau t ay
    ie     ia (no onset, no coda)             kia, th ìa, b ia
           iê (no onset)                      tiê n t in
           yê (after /u̯/ before /u̯, i̯/)       uyn chu yn
           ya (after /u̯/)                     khu ya
    uo     ua (no coda)                       vua ch úa
           uô                                 mun, tut
    ɯɤ     ưa (no coda)                       mưa v a
           ươ                                 ươ ng b ưng

Table A4. Coda phonemes

    IPA    Vietnamese text            Example
    p      p                          lp cp
    t      t                          lt nh t
    m      m                          đom dó m
    n      n                          mà n, s ơn
    k      ch (after /i, e, ɛ̆/)       thí ch s ch
           c                          đưc vi c
    ŋ      nh (after /i, e, ɛ̆/)       mì nh, bá nh
           ng                         vùng vng
    u̯      o (after /ɛ, a/)           le o ca o
           u                          kê u c u
    i̯      y (after /ɤ̆, ă/)           mâ y ba y
           i                          nó i r i

Appendix B
Vietnamese contextual information

• Phoneme level:
  - Two preceding, current, two succeeding phonemes
  - Position in current syllable (forward, backward)
• Syllable level:
  - Tone types of two preceding, current, two succeeding syllables
  - Number of phonemes in preceding, current, succeeding syllables
  - Position in current word (forward, backward)
  - Stress-level
  - Distance to {previous, succeeding} stressed syllable
• Word level:
  - Part-of-speech of {preceding, current, succeeding} words
  - Number of syllables in {preceding, current, succeeding} words
  - Position in current phrase
  - Number of content words in current phrase {before, after} current word
  - Distance to {previous, succeeding} content words
  - Interrogative flag for the word
• Phrase level:
  - Number of {syllables, words} in {preceding, current, succeeding} phrases
  - Position of current phrase in utterance
• Utterance level:
  - Number of {syllables, words, phrases} in the utterance

References

[1] SEAlang, "Project: Mon-Khmer languages. The Vietic Branch," http://sealang.net/mk/vietic.htm
[2] S. Nakamura, E. Sumita, T. Shimizu, S. Sakti, S. Sakai, J. Zhang, A. Finch, N. Kimura, and Y. Ashikari, "A-STAR: Asia speech translation consortium," in Proc. ASJ Autumn Meeting, Japan, 2007.
[3] J. Zhang, S. Nakamura, "An efficient algorithm to search for a minimum sentence set for collecting speech database," in Proc. ICPhS, Barcelona, Spain, 2003, pp. 3145–3148.
[4] T.T. Vu, T.K. Nguyen, H.S. Le, C.M. Luong, "Vietnamese tone recognition based on MLP neural network," Proc. Oriental COCOSDA, 2008.
[5] T.T. Vu, D.T. Nguyen, M.C. Luong, J.P. Hosom, "Vietnamese large vocabulary continuous speech recognition," Proc. INTERSPEECH, 2005, pp. 1689-1692.
[6] B.V. Dang, Q.B. Ho, "Automatic construction of English-Vietnamese parallel corpus through web mining," Proc. RIVF, 2007.
[7] T.C. Nguyen, H.X. Phan, "JVnSegmenter: A Java-based Vietnamese word segmentation tool," 2007, http://jvnsegmenter.sourceforge.net/
[8] A. Stolcke, "SRILM – an extensible language modeling toolkit," in Proc. ICSLP, 2002.
[9] F. J. Och, "GIZA++: Training of statistical translation models," http://www.fjoch.com/GIZA++.html
[10] P. Koehn et al., "Moses: Open source toolkit for statistical machine translation," ACL, 2007.
[11] K. Tokuda, H. Zen, A.W. Black, "An HMM-based speech synthesis system applied to English," Proc. of 2002 IEEE SSW, Sept. 2002.
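As an illustration of the first two W2P steps described in Section 2.2 (identifying the tone mark, then decomposing the toneless syllable), the following is a minimal Python sketch. It is not the authors' tool. It assumes that tone marks can be separated with Unicode normalization, and the function names `strip_tone`, `split_syllable`, and `analyse` are hypothetical. The orthographic initials and final consonants are taken from the "Vietnamese text" columns of Tables A1 and A4. As a simplification, the semi-vowel Onset and any semi-vowel Coda are left inside the middle (vowel) part, and step (3), the mapping to IPA, is omitted.

```python
import unicodedata

# Combining tone marks mapped to the tone names used in Section 2.2.
# A syllable without any of these marks carries the level tone.
TONE_MARKS = {
    "\u0300": "falling",  # grave accent
    "\u0303": "broken",   # tilde
    "\u0309": "curve",    # hook above
    "\u0301": "rising",   # acute accent
    "\u0323": "drop",     # dot below
}

# Orthographic forms from Tables A1 and A4, longest first so that
# "ngh" is tried before "ng" and "n".
INITIALS = sorted(
    ["b", "đ", "t", "th", "tr", "ch", "k", "c", "q", "m", "n", "nh", "ngh",
     "ng", "ph", "v", "x", "d", "gi", "g", "l", "s", "r", "kh", "gh", "h", "p"],
    key=len, reverse=True)
FINAL_CONSONANTS = sorted(["p", "t", "m", "n", "ch", "c", "nh", "ng"],
                          key=len, reverse=True)
VOWEL_LETTERS = set(unicodedata.normalize("NFC", "aăâeêioôơuưy"))


def strip_tone(syllable):
    """Step 1: return (toneless syllable, tone name)."""
    # NFD splits each letter into a base letter plus combining marks,
    # so the tone mark can be removed without touching ê, ơ, ă, etc.
    decomposed = unicodedata.normalize("NFD", syllable.lower())
    tone = "level"
    kept = []
    for ch in decomposed:
        if ch in TONE_MARKS:
            tone = TONE_MARKS[ch]
        else:
            kept.append(ch)
    return unicodedata.normalize("NFC", "".join(kept)), tone


def split_syllable(toneless):
    """Step 2 (simplified): return (initial, middle, final consonant)."""
    initial = ""
    for cand in INITIALS:
        rest = toneless[len(cand):]
        # Accept an initial only if a vowel letter follows it, so that
        # "gin" is split as g + in rather than gi + n.
        if toneless.startswith(cand) and rest and rest[0] in VOWEL_LETTERS:
            initial = cand
            break
    rest = toneless[len(initial):]
    coda = ""
    for cand in FINAL_CONSONANTS:
        if rest.endswith(cand) and len(rest) > len(cand):
            coda = cand
            break
    middle = rest[:len(rest) - len(coda)]
    return initial, middle, coda


def analyse(syllable):
    """Run steps 1 and 2 and return (initial, middle, coda, tone)."""
    toneless, tone = strip_tone(syllable)
    return split_syllable(toneless) + (tone,)
```

For example, `analyse("tiếng")` returns `('t', 'iê', 'ng', 'rising')`. A full implementation would still need the context-dependent rules of Tables A2 and A3 to separate the Onset and Nucleus and to select the phoneme for each part.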