TOWARDS LANGUAGE PRESERVATION: PRELIMINARY COLLECTION AND VOWEL ANALYSIS OF INDONESIAN ETHNIC SPEECH DATA

Auliya Sani1,2, Sakriani Sakti1, Graham Neubig1, Tomoki Toda1, Adi Mulyanto2, Satoshi Nakamura1

1 Nara Institute of Science and Technology, Japan
2 Bandung Institute of Technology, Indonesia
{auliya-f,ssakti,neubig,tomoki,s-nakamura}@is.naist.jp
{13509067@std,adi@}.stei.itb.ac.id

ABSTRACT

Multilingualism in Indonesia is gradually approaching a state of catastrophe. Although several projects have been initiated for cultural preservation, technology that could support communication between elders and younger people within indigenous communities, as well as with people outside the community, is still very rare in Indonesia. This paper presents the first step in the long-term development of a speech-to-speech translation system from ethnic languages to English/Indonesian, namely the collection and analysis of Indonesian ethnic speech corpora. Here, we first focus on the two largest ethnic groups in Indonesia: Javanese and Sundanese.

Index Terms: Language preservation, Indonesian ethnic languages, speech data collection, vowel analysis

1. INTRODUCTION

Indonesia is an archipelago comprising approximately 17,500 islands inhabited by hundreds of ethnic groups, with a population of more than 237 million according to the 2010 census (Badan Pusat Statistik, the Central Bureau of Statistics, http://bps.go.id). The two largest ethnic groups are the Javanese and the Sundanese, who live on Java Island. Different ethnic groups speak different languages. One of the bridges that binds the people together is the use of Bahasa Indonesia, the national language. It is a common language formed from hundreds of languages spoken in the Indonesian archipelago, and was coined by Indonesian nationalists in 1928. It further became a symbol of national identity during the struggle for independence in 1945. Compared to most other languages, which have a high density of native speakers, only a small proportion of Indonesia's large population speaks Bahasa Indonesia as a mother tongue, while the great majority speak it as a second language with varying degrees of proficiency. To promote the use of the national language, the government campaigns strongly for the use of Indonesian in daily life.

On the other hand, the global, borderless economy and information and communication technologies have a great impact on the way people communicate. People have to be able to communicate well with others who speak different languages. As an international language, English has become the most spoken language in the world, with more than 1.8 billion speakers. Thus, in modern Indonesia, along with the campaign for Bahasa Indonesia, English is also promoted starting in primary education.

Although using a common language, such as Indonesian as the official language of Indonesia or English as a world language, helps the Indonesian people to face globalization, multilingualism in Indonesia faces a state of catastrophe. Currently, of 726 languages, 146 are endangered, that is, at risk of falling out of use, generally because there are few surviving speakers. If a language loses all of its native speakers, it becomes extinct. In the near future, more and more languages will be endangered.

Several projects have been initiated for cultural preservation, which can prevent endangered languages from being lost; examples include holding language congresses, documenting words, and requiring public servants to speak in ethnic languages on a given day. Nevertheless, technology that can support communication between elders and younger people within indigenous communities, as well as with people outside the community, is still very rare in Indonesia. As a result, indigenous communities still face isolation due to language and cultural barriers. Our long-term goal is to establish an infrastructure for speech-to-speech translation from ethnic languages to English/Indonesian (see Fig. 1). This technology enables communication between two people who speak different languages. Therefore, speech translation technology is significant for indigenous communities in Indonesia as a way to overcome language barriers, cross the cultural gap, and face globalization.

Fig. 1. Speech translation between ethnic languages and English/Indonesian.

This paper presents the first step towards developing this speech technology, namely the collection and analysis of Indonesian ethnic speech corpora. As a preliminary study, we start with the two largest ethnic groups in Indonesia: Javanese and Sundanese. Even though these languages are not yet endangered, the number of speakers of Javanese and Sundanese has fallen considerably in recent years. In the next section, we briefly describe the Javanese and Sundanese languages. The existing Indonesian data is described in Section 3, and the current development of the Javanese and Sundanese speech corpora is described in Section 4. In Section 5, an analysis of vowels in standard Indonesian, Javanese, and Sundanese is presented. Finally, we draw our conclusions in Section 6.

2. JAVANESE AND SUNDANESE LANGUAGE CHARACTERISTICS

2.1. Written Script

Although the Indonesian language is infused with highly distinctive accents from different ethnic languages, there are many similarities in patterns across the archipelago. Modern Indonesian is derived from a Malay dialect, which was the lingua franca of Southeast Asia. In the earliest records, Malay inscriptions are syllable-based; modern Indonesian, however, is phonetic-based and written in Roman script. It uses only 26 letters, as in the English/Dutch alphabet.

On the other hand, some ethnic groups in Indonesia still use their own scripts in daily life. The two largest ethnic groups in Indonesia, the Javanese and the Sundanese, belong to that category, and their scripts are still taught as a subject in elementary school. The Javanese script is called Aksara Hanacaraka and the Sundanese script is called Aksara Sunda; Aksara means script in Indonesian. These scripts derive from the Pallawa script of South India. Hanacaraka consists of 20 basic letters (consonants with the vowel a) called Carakan, 20 letters called Pasangan that give the bare consonant of each Carakan letter, 5 vowels, characters for numbers and for foreign words and honorifics, and Sandhangan. To make a different syllable, a Sandhangan is added to change the phoneme. Fig. 2 shows the Carakan letters of the Javanese script (see the official site of Aksara Jawa, http://hanacaraka.fateback.com/). Hanacaraka is already included in Unicode (A980-A9DF). Like Hanacaraka, Aksara Sunda has basic letters, vowels, punctuation to change the phoneme, and basic punctuation. Ngalagena, the basic letters of Aksara Sunda, have been registered in Unicode (1B80-1BBF). Ngalagena is shown in Fig. 3 [1]. Hanacaraka and Aksara Sunda are described in the Unicode Standard Ver. 6.0.

Fig. 2. Javanese script: Carakan letters.

Fig. 3. Sundanese script: Ngalagena letters.

2.2. Phoneme Set

The Indonesian phoneme set consists of 10 vowels (including diphthongs) and 22 consonants [2]. The vowels include /a/ (like “a” in “father”), /i/ (like “ee” in “knee”), /u/ (like “oo” in “moon”), /e/ (like “e” in “bed”), /@/ (a schwa sound, like “e” in “lantern”), /o/ (like “o” in “boss”), and four diphthongs, /ay/, /aw/, /oy/, and /ey/.

The Javanese and Sundanese phoneme sets are similar to that of Indonesian. The Sundanese phoneme set contains 7 vowels and 21 consonants. Sundanese has no diphthongs, but has an additional vowel, /eu/. Sundanese originally lacked the consonants /f/ and /z/; under the influence of Indonesian, however, it now includes these consonants as well. Still, Sundanese does not have /kh/ and /sy/ as Indonesian does. Like Sundanese, Javanese has no diphthongs and previously did not have /f/ and /z/. In addition, Javanese has /E/ and /O/ (like “a” in “saw”). In total, the Javanese phoneme set contains 8 vowels and 23 consonants.

Unlike Indonesian and Sundanese, Javanese has many reading rules, especially for vowels. For example, when the letter “a” lies at the end of a word, it forms an open syllable and is sometimes pronounced /O/. On the other hand, if the letter “a” is followed by a consonant (closed syllable), it is pronounced /a/.

The vowel articulation pattern is reflected in the first two resonances of the vocal tract, F1 for height and F2 for backness. The comparison of Indonesian, Javanese, and Sundanese vowel articulation patterns is shown in Fig. 4 (left side). Consonants are produced by changing the articulation area, the articulation style, and the vocal cord condition. The comparison of Indonesian, Javanese, and Sundanese consonant articulatory patterns can be seen in Fig. 4 (right side).

Fig. 4. Articulatory pattern for vowels and consonants of Javanese, Sundanese, and Standard Indonesian.

3. EXISTING INDONESIAN DATA RESOURCE

Indonesian speech corpora were developed by the R&D Division of PT. Telekomunikasi Indonesia (R&D TELKOM) in collaboration with ATR as a continuation of the APT (Asia Pacific Telecommunity) project [3, 4].

A raw text source for the daily news task had already been generated by an Indonesian student [5]. The source was compiled from “KOMPAS” and “TEMPO”, which are currently the longest and most widely read Indonesian newspaper and magazine, respectively. This source consists of more than 3160 articles, with around 600,000 sentences. R&D TELKOM further processed the raw text source to generate a clean text corpus.

From this raw text data, phonetically-balanced sentences were selected using the greedy search algorithm [6], producing a total of 3168 sentences. Then, clean and telephone speech were recorded simultaneously, at sampling frequencies of 16 and 8 kHz, respectively, by R&D TELKOM in Bandung, Java Island, Indonesia. There were a total of 400 speakers (200 males and 200 females). Four main accents were covered: Batak, Java, Sunda, and standard Indonesian (without accent). Each speaker uttered 110 sentences, resulting in a total of 44,000 speech utterances, which amounted to around 43.35 hours of speech. In these experiments, we use the clean speech data with standard Indonesian, Javanese, and Sundanese accents.

4. COLLECTION OF JAVANESE AND SUNDANESE SPEECH DATA

4.1. Text Corpus

Two documents were collected from news publications: the online collection of Mangle (Majalah Sunda Online, http://www.majalah-mangle.com) for Sundanese and Djaka Lodang magazine for Javanese. The initial forms of these documents contain numbers, punctuation, abbreviations, acronyms, names, and foreign words. These documents were then converted by:
• converting all upper case letters into lower case
• removing punctuation
• changing numbers into words
• selecting short sentences (max. 20 words per sentence)

From this text data, we then selected phonetically-balanced sentences using the greedy search algorithm [6]; this produced a total of 230 sentences per language, as shown in Table 1.

Table 1. Javanese and Sundanese Text Corpora
Attributes                 Javanese   Sundanese
Number of Sentences        230        230
Number of Words            2999       4262
Vocabulary Size            1529       1728
Number of Name Words       24         19
Number of Foreign Words    2          3

4.2. Speech Recording

For recording, 20 native speakers were selected: 10 native speakers (5 males and 5 females) of Java ethnicity and another 10 native speakers (5 males and 5 females) of Sunda ethnicity. Each speaker was asked to read the prepared text of 230 sentences. Speech was recorded in a quiet room in two different places, in Indonesia and in Japan. The recording equipment was a Sony ECM-674 and a Sennheiser HMD 280 Pro. Speech was recorded into WAV files at a 44.1 kHz sampling frequency with a 16-bit sample size. Because the texts were taken from two different ethnic languages, the file names have to be distinguished. For each speech file, the label is EEEXXX F/M L C news YYYY.wav where:
• EEE is the code for the ethnic language, Jaw for Java and Snd for Sunda,
• XXX is the speaker number (here 001-010),
• F is for a female speaker and M for a male speaker,
• L is another code for the ethnic language, J for Java and S for Sunda,
• C news means that the speech was produced by reading news text clearly, and
• YYYY is the utterance number.

5. VOWEL ANALYSIS OF JAVANESE AND SUNDANESE

As described in Section 2.2, the vowel articulatory patterns of Indonesian, Javanese, and Sundanese differ. This section describes an experiment comparing Indonesian, Javanese, and Sundanese vowels based on the data that has been collected. Here, Praat (Praat: Doing Phonetics by Computer, http://www.fon.hum.uva.nl/praat/) is used to create formant charts. Two speakers (one female and one male) with heavy accents were chosen from each language. Then, the segmented phoneme data from the female and male speakers was processed separately. We selected only phonemes with a preceding nasal context, which is common in these languages. From F1 and F2, the formant charts were obtained (similar to the experiments described in [7]). In all charts, Indonesian vowels are marked in black, while Javanese vowels are marked in blue. In the first two charts, Sundanese vowels are marked in purple.

First, we examine /e/, /@/, /E/ among the Javanese vowels and /e/, /@/, /eu/ among the Sundanese vowels. For comparison, we also include the vowels /e/ and /@/ from standard Indonesian. The first chart (Fig. 5) was obtained from female speaker data and the second chart (Fig. 6) from male speaker data. These charts show the locations of the vowels on F1 and F2 scales in Hertz. From both charts we can see the difference in vowel location for each language. The lower the F1 value, the higher (closer) the vowel. For both /e/ and /@/, the Sundanese vowels lie above those of the other languages, followed by the Javanese vowels, with the Indonesian vowels at the bottom. Even for the same vowels, the formant values of each language are located in slightly different areas. A reasonable explanation for this phenomenon is the speaking style of each ethnic group: they use different dialects when speaking (further details can be found in [8]). Furthermore, Javanese /E/ lies in a mid position close to Indonesian /e/, while Sundanese /eu/ lies in a central-high position. Overall, however, the vowel pattern is similar to the articulatory pattern for vowels in Fig. 4.

Fig. 5. Vowel distribution of /e/, /@/, /E/, and /eu/ depending on F1 and F2 for Javanese, Sundanese, and Indonesian female speakers. (Axes: F2 in Hz, 3000 to 500; F1 in Hz, 200 to 1000.)

Fig. 6. Vowel distribution of /e/, /@/, /E/, and /eu/ depending on F1 and F2 for Javanese, Sundanese, and Indonesian male speakers. (Axes: F2 in Hz, 3000 to 500; F1 in Hz, 200 to 1000.)

Next, we examine /a/, /o/, /O/ among the Javanese vowels, in comparison with the vowels /a/ and /o/ from standard Indonesian. The first chart (Fig. 7) was obtained from female speaker data and the second chart (Fig. 8) from male speaker data. As before, both charts show that the formant values of /o/ from Javanese are generally lower, meaning that the vowels lie higher than those of Indonesian speech. On the other hand, Javanese /O/ lies in a mid position close to Indonesian /o/. However, the formant values of /a/ from Javanese and standard Indonesian lie in almost the same region.

Fig. 7. Vowel distribution of /a/, /o/, and /O/ depending on F1 and F2 for Javanese and Indonesian female speakers. (Axes: F2 in Hz, 3000 to 500; F1 in Hz, 400 to 1200.)

Fig. 8. Vowel distribution of /a/, /o/, and /O/ depending on F1 and F2 for Javanese and Indonesian male speakers. (Axes: F2 in Hz, 3000 to 500; F1 in Hz, 400 to 1200.)

Comparing the charts obtained from female and male speakers, the positions of the vowels differ. However, anatomically related acoustic differences between the speech of males and females are well established in the literature [9, 10]. The vocal tract of adult females is shorter than that of adult males, which results in higher formant frequencies, alongside a typically higher F0.

6. CONCLUSION

We have presented the development of an Indonesian ethnic speech corpus, which includes the Javanese and Sundanese languages. An experiment analyzing the vowels of Javanese, Sundanese, and standard Indonesian has also been carried out. The results reveal that even for the same vowels, the formant values of each language are located in slightly different areas. In particular, the Javanese and Sundanese vowels tend to be acoustically higher than those of standard Indonesian. In future work, we will utilize these ethnic language corpora and study how to build a speech-to-speech translation system for Indonesian ethnic languages rapidly by using existing Indonesian speech recognition.

7. REFERENCES

[1] I. Baidillah, U.A. Darsa, O. Abdurahman, T. Permadi, G. Gunardi, A. Suherman, T. Ampera, H.S. Purba, D.T. Nugraha, and D. Sutisna, Direktori Aksara Sunda untuk Unicode, Government of West Java, Bandung, Indonesia, 2008.

[2] H. Alwi, S. Dardjowidjojo, H. Lapoliwa, and A.M. Moeliono, Tata Bahasa Baku Bahasa Indonesia (Indonesian Grammar), Balai Pustaka, Jakarta, Indonesia, 2003.

[3] S. Sakti, E. Kelana, H. Riza, S. Sakai, K. Markov, and S. Nakamura, “Recent progress in developing Indonesian large-vocabulary corpora and LVCSR system,” in Proc. MALINDO, Cyberjaya-Selangor, Malaysia, 2008, pp. 40–45.

[4] S. Sakti, P. Hutagaol, A.A. Arman, and S. Nakamura, “Indonesian speech recognition for hearing- and speaking-impaired people,” in Proc. ICSLP, Jeju Island, Korea, 2004, pp. 1037–1040.

[5] F. Tala, A Study of Stemming Effects on Information Retrieval in Bahasa Indonesia, Ph.D. thesis, The Information and Language System (ILPS) Group, Informatics Institute, University of Amsterdam, Amsterdam, Netherlands, 2003.

[6] J. Zhang and S. Nakamura, “An efficient algorithm to search for a minimum sentence set for collecting speech database,” in Proc. ICPhS, Barcelona, Spain, 2003, pp. 3145–3148.

[7] H. Wang and V.J. Van Heuven, “Acoustical analysis of English vowels produced by Chinese, Dutch, and American speakers,” in Linguistics, edited by J. M. van de Weijer and B. Los (Benjamins, Amsterdam/Philadelphia), 2006, pp. 237–248.

[8] E.V. Zanten and V.J.V. Heuven, “The Indonesian vowels as pronounced and perceived by Toba Batak, Sundanese and Javanese speaker,” Bijdragen tot de Taal-, Land- en Volkenkunde, vol. 140, no. 4, pp. 497–521, 1984.

[9] G. Fant, “A note on vocal tract size factors and non-uniform F-pattern scalings,” STL-QPSR, vol. 7, no. 4, pp. 22–30, 1966.

[10] R.L. Diehl, B. Lindblom, K.A. Hoemeke, and R.P. Fahey, “On explaining certain male-female differences in the phonetic realization of vowel categories,” Journal of Phonetics, vol. 24, pp. 187–208, 1996.
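
As a supplementary illustration of the Unicode ranges quoted in the written-script discussion (A980-A9DF for Hanacaraka and 1B80-1BBF for Aksara Sunda), the following minimal sketch classifies characters by block. The names `SCRIPT_BLOCKS`, `script_of`, and `script_counts` are hypothetical helpers introduced here for illustration only; they are not part of the corpus tooling.

```python
# Code point ranges as quoted in the text for the two scripts.
SCRIPT_BLOCKS = {
    "Javanese": (0xA980, 0xA9DF),   # Hanacaraka
    "Sundanese": (0x1B80, 0x1BBF),  # Aksara Sunda
}


def script_of(char):
    """Return the script whose block contains `char`, or None."""
    code_point = ord(char)
    for name, (start, end) in SCRIPT_BLOCKS.items():
        if start <= code_point <= end:
            return name
    return None


def script_counts(text):
    """Count how many characters of `text` fall in each script block."""
    counts = {}
    for char in text:
        name = script_of(char)
        if name is not None:
            counts[name] = counts.get(name, 0) + 1
    return counts
```

A check of this kind could be used, for example, to detect whether a collected text is written in an ethnic script or in Roman script before further processing.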
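
The text preparation steps listed for the Javanese and Sundanese text corpus (lower-casing, removing punctuation, spelling out numbers, keeping sentences of at most 20 words) and the subsequent greedy selection could be sketched roughly as follows. This is only an illustration under stated assumptions: `number_to_words` is a caller-supplied, language-specific function that is not given in the paper, the set of letters in a sentence is used as a crude stand-in for its phonetic units, and `greedy_select` is a generic greedy coverage heuristic rather than the algorithm of [6].

```python
import re


def normalize(sentence, number_to_words):
    """Apply the cleaning steps in the order listed in the text."""
    text = sentence.lower()                       # upper case -> lower case
    text = re.sub(r"[^\w\s]", " ", text)          # remove punctuation
    text = re.sub(r"\d+", lambda m: number_to_words(m.group()), text)
    return " ".join(text.split())                 # collapse whitespace


def prepare(sentences, number_to_words, max_words=20):
    """Normalize sentences and keep only the short, non-empty ones."""
    cleaned = (normalize(s, number_to_words) for s in sentences)
    return [s for s in cleaned if 0 < len(s.split()) <= max_words]


def units(sentence):
    """Assumed stand-in for phonetic units: the set of letters used."""
    return {ch for ch in sentence if ch.isalpha()}


def greedy_select(sentences, max_sentences):
    """Repeatedly pick the sentence that adds the most uncovered units."""
    covered, chosen = set(), []
    remaining = list(sentences)
    while remaining and len(chosen) < max_sentences:
        best = max(remaining, key=lambda s: len(units(s) - covered))
        if not units(best) - covered:
            break                                 # nothing new to gain
        chosen.append(best)
        covered |= units(best)
        remaining.remove(best)
    return chosen
```

With a real phoneme lexicon, `units` would return phonemes or phoneme sequences instead of letters, and `max_sentences` would be set to the target corpus size (230 in Table 1).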
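
The file naming scheme described for the speech recordings (ethnic code, speaker number, gender, language letter, the C news marker, and utterance number) lends itself to a small parser. The sketch below assumes that the fields are joined by underscores, as in `Snd003_F_S_C_news_0123.wav`; the separator is not recoverable from the text, so this is an assumption, and `Recording` and `parse_label` are hypothetical names.

```python
import re
from dataclasses import dataclass

# Assumed layout: EEEXXX_G_L_C_news_YYYY.wav (underscore separators assumed).
LABEL = re.compile(r"^(Jaw|Snd)(\d{3})_([FM])_([JS])_C_news_(\d{4})\.wav$")
ETHNIC_LETTER = {"Jaw": "J", "Snd": "S"}
LANGUAGE = {"Jaw": "Javanese", "Snd": "Sundanese"}


@dataclass(frozen=True)
class Recording:
    language: str   # "Javanese" or "Sundanese"
    speaker: int    # speaker number, 1-10 in the corpus
    gender: str     # "F" or "M"
    utterance: int  # utterance number


def parse_label(name):
    """Parse a file name into its fields, rejecting inconsistent labels."""
    match = LABEL.match(name)
    if match is None:
        raise ValueError(f"unrecognized label: {name}")
    ethnic, speaker, gender, letter, utterance = match.groups()
    if ETHNIC_LETTER[ethnic] != letter:
        raise ValueError(f"language codes disagree in: {name}")
    return Recording(LANGUAGE[ethnic], int(speaker), gender, int(utterance))
```

Checking that the two language codes agree is a design choice made here; it catches mislabeled files early, since both codes carry the same information.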
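
The vowel comparison reads vowel height off F1: the lower the F1 value, the higher the vowel. Assuming the formant measurements have been exported as (language, vowel, F1, F2) tuples, which is a hypothetical layout and not a format described in the paper, the per-language comparison could be summarized as in the sketch below. The function names are illustrative, and no measured values from the paper are reproduced here.

```python
from collections import defaultdict
from statistics import mean


def mean_formants(measurements):
    """Average F1 and F2 (Hz) for every (language, vowel) pair.

    `measurements` is an iterable of (language, vowel, f1_hz, f2_hz).
    """
    grouped = defaultdict(list)
    for language, vowel, f1, f2 in measurements:
        grouped[(language, vowel)].append((f1, f2))
    return {
        key: (mean(f1 for f1, _ in values), mean(f2 for _, f2 in values))
        for key, values in grouped.items()
    }


def rank_by_height(means, vowel):
    """Languages ordered from the highest vowel (lowest mean F1) down."""
    rows = [(f1, language)
            for (language, v), (f1, _) in means.items() if v == vowel]
    return [language for _, language in sorted(rows)]
```

Ranking on mean F1 alone mirrors the way the charts are read in the text; a fuller analysis would also compare F2 and the spread of the tokens.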