Arxiv:2011.02128V1 [Cs.CL] 4 Nov 2020
Total Page:16
File Type:pdf, Size:1020Kb
Cross-Lingual Machine Speech Chain for Javanese, Sundanese, Balinese, and Bataks Speech Recognition and Synthesis Sashi Novitasari1, Andros Tjandra1, Sakriani Sakti1;2, Satoshi Nakamura1;2 1Nara Institute of Science and Technology, Japan 2RIKEN Center for Advanced Intelligence Project AIP, Japan fsashi.novitasari.si3, tjandra.ai6, ssakti,[email protected] Abstract Even though over seven hundred ethnic languages are spoken in Indonesia, the available technology remains limited that could support communication within indigenous communities as well as with people outside the villages. As a result, indigenous communities still face isolation due to cultural barriers; languages continue to disappear. To accelerate communication, speech-to-speech translation (S2ST) technology is one approach that can overcome language barriers. However, S2ST systems require machine translation (MT), speech recognition (ASR), and synthesis (TTS) that rely heavily on supervised training and a broad set of language resources that can be difficult to collect from ethnic communities. Recently, a machine speech chain mechanism was proposed to enable ASR and TTS to assist each other in semi-supervised learning. The framework was initially implemented only for monolingual languages. In this study, we focus on developing speech recognition and synthesis for these Indonesian ethnic languages: Javanese, Sundanese, Balinese, and Bataks. We first separately train ASR and TTS of standard Indonesian in supervised training. We then develop ASR and TTS of ethnic languages by utilizing Indonesian ASR and TTS in a cross-lingual machine speech chain framework with only text or only speech data removing the need for paired speech-text data of those ethnic languages. Keywords: Indonesian ethnic languages, cross-lingual approach, machine speech chain, speech recognition and synthesis. 1. Introduction abau, Bataks, Bugisnese, Balinese, Acehnese, Sasak, Makasarese, Lampungese, and Rejang (Lauder, 2005). The Indonesia, which has some of the world’s most diverse reli- remaining 713 languages have a total population of only gions, languages, and cultures (Abas, 1987; Bertand, 2003; 41.4 million speakers, and the majority of these have very Hoon, 2006), consists of approximately 17,500 islands with small numbers of speakers (Riza, 2008). For example, 386 300 ethnic groups and 726 native languages (Tan, 2004). languages are spoken by 5,000 or less; 233 have 1,000 Roughly ten percent of the world’s languages are spoken speakers or less; 169 languages have 500 speakers or less; in Indonesia, making it one of the most multilingual na- and 52 have 100 or less (Gordon, 2005). These lan- tions in the world. In the amidst of such a large number of guages are facing various degrees of language endanger- Bahasa Indonesia local languages, , the national language, ment (Crystal, 2000). Several attempts have addressed pre- Ba- functions as a bridge that connects Indonesian people. serving ethnic languages, including national projects on the hasa Indonesia is a unity language, which was coined by use of ethnic languages in schools. Unfortunately, the avail- Indonesian nationalists in 1928 and became a symbol of able technology that could support communication within national identity during the struggle for independence in indigenous communities as well as with people outside the 1945. Since then, the Indonesian language is increasingly villages is limited. Indigenous communities face the digi- being spoken as a second language by the majority of its tal divide and isolation due to cultural barriers. Languages population. The decision to choose Indonesian as a unity continue to be threatened. language is one great success story of national language policy (Sneddon, 2003; Paauw, 2009). Speech-to-speech translation (S2ST) technology (Naka- arXiv:2011.02128v1 [cs.CL] 4 Nov 2020 Worldwide globalization is encouraging people to learn and mura, 2009; Sakti et al., 2013), which is innovative and speak languages that are prominent in global communities. essential, enables people to communicate in their native Bahasa Indonesia is now more commonly spoken as a first languages. S2ST recognized the speech of the source lan- language. Some younger Indonesians are also speaking En- guage into the text, translate the text to the target lan- glish as a second language. Although using Bahasa In- guage, and synthesizes back to speech waveforms. This donesia as the unity language is helping them face glob- overall technology involves research in machine transla- alization, multilingualism in Indonesia faces a catastrophe. tion (MT), automatic speech recognition (ASR), and text- The number of speakers of Indonesian ethnic languages is to-speech synthesis (TTS). However, the advanced devel- decreasing. It is predicted that Indonesia might shift from opment of these technologies relies heavily on supervised a multilingual nation to a monolingual society, threatening training and a broad set of language resources, including the existence of ethnic languages (Cohn and Ravindranath, speech and corresponding transcriptions of the source and 2014). target languages. Unfortunately, the amount of available Indonesian ethnic language data is limited, and preparing a Among its 726 ethnic languages, only thirteen have more large amount of paired speech and text data is expensive. than a million speakers, accounting for about 70% of the total population in Indonesia. These ethnic languages in- Recently, a machine speech chain framework was proposed clude Javanese, Sundanese, Malay, Madurese, Minangk- for the semi-supervised development of ASR and TTS sys- tems (Tjandra et al., 2019b). This framework was mo- of the Austronesian language family. The only difference tivated by the human speech chain mechanism (Denes et is that Indonesia (a former Dutch colony) adopted the Van al., 1993), which is a feedback loop phenomenon between Ophuysen orthography in 1901; Malaysia (a former British speech production and a hearing system that occurs when colony) adopted the Wilkinson orthography in 1904. In humans speak. In fact, humans do not separately learn 1972, the governments of Indonesia and Malaysia agreed to speak and listen using supervised training with a large to standardize “improved” spelling, which is now in effect amount of paired data. By simultaneously listening and on both sides. Even so, modern Indonesian and modern speaking, they monitor their own volume and articulation Malaysian are as different from one another as are Flemish and gradually improve their speaking capability, making it and Dutch (Tan, 2004). consistent with their intentions. Many words in the Indonesian vocabulary reflect the histor- In the machine speech chain, both ASR and TTS compo- ical influence of the various colonial cultures that occupied nents are pre-trained in supervised training using a lim- and influenced the archipelago. Indonesian words have bor- ited amount of labeled data. By establishing a feedback rowed heavily from Indian Sanskrit, Chinese, Arabic, Por- closed-loop between the ASR (listening component) and tuguese, Dutch, and English (Jones, 2007). Unlike Chinese, the TTS (speaking component), both components can as- it is not a tonal language; it has no declensions or conjuga- sist each other in unsupervised learning. Therefore, they tions. It has no changes in nouns or adjectives for different can be trained without requiring a large number of speech- gender, number, or case. Verbs do not take different forms text paired data. Previous machine speech chain studies to show number, person, or tense. A time adverb or ques- (Tjandra et al., 2019b; Tjandra et al., 2018; Tjandra et al., tion word can be placed at either the front or the end of 2019a), however, only utilized the framework for monolin- sentences. Since plural is often expressed by reduplication, gual model training and unsupervised training with a large Indonesian sentences have reduplication words. It is also number of unlabeled or unpaired data. The framework a member of the agglutinative language family, denoting a remains unutilized for a cross-lingual task with a limited complex range of prefixes and suffixes that are attached to number of unpaired data, such as ethnic languages. base words that can result in very long words (Sakti et al., In this work, we utilize the machine speech chain frame- 2004). work in a cross-lingual way to construct speech recognition The standard Indonesian language is mostly used in such and synthesis for the following four ethnic Indonesian lan- formal written settings as books, newspapers, and televi- guages: Javanese, Sundanese, Balinese, and Bataks. Al- sion/radio news broadcasts. Although the earliest records though scant research has addressed the development of in Malay inscriptions are syllable-based and written in Ara- ASR for Indonesian ethnic languages, one study developed bic script, modern Indonesian is phonetic-based written in ASR for those languages using a statistical approach with Roman script (Alwi et al., 2003). It only uses 26 letters, as a hidden Markov model and a Gaussian mixture model in the English/Dutch alphabet. (HMM-GMM) (Sakti and Nakamura, 2014). However, no ASR construction with a sequence-to-sequence deep- 2.2. Indonesian Ethnic Languages learning approach has been made. Furthermore, no pre- In this study, we chose to work with four ethnic languages: vious TTS study exists for Indonesian ethnic languages. Javanese, Sundanese, Balinese, and Bataks. Since a large The previous study also still requires