
Open Comput. Sci. 2016; 6:219–225

Research Article Open Access

Yasser Mohseni Behbahani*, Bagher Babaali, and Mussa Turdalyuly

Persian sentences to phoneme sequences conversion based on recurrent neural networks

*Corresponding Author: Yasser Mohseni Behbahani: Speech Processing Laboratory of the Sharif University of Technology, Iran, E-mail: [email protected]
Bagher Babaali: Department of Computer Science of the University of Tehran, Iran, E-mail: [email protected]
Mussa Turdalyuly: Institute of Information and Computational Technologies, Almaty, Kazakhstan, E-mail: [email protected]

DOI 10.1515/comp-2016-0019
Received June 21, 2016; accepted August 2, 2016

Abstract: Grapheme to phoneme conversion is one of the main subsystems of Text-to-Speech (TTS) systems. Converting a sequence of written words to its corresponding phoneme sequence is more challenging for the Persian language than for other languages, because in the standard orthography of this language the short vowels are omitted and the pronunciation of a word depends on its position in the sentence. Common approaches used in Persian commercial TTS systems have several modules and complicated models for natural language processing and homograph disambiguation, which make the implementation harder and reduce the overall precision of the system. In this paper we define grapheme-to-phoneme conversion as a sequential labeling problem and use modified Recurrent Neural Networks (RNN) to create a smart and integrated model for this purpose. The recurrent networks are modified to be bidirectional and equipped with Long-Short Term Memory (LSTM) blocks to acquire most of the past and future contextual information for decision making. The experiments conducted in this paper show that, in addition to having a unified structure, the bidirectional RNN-LSTM has a good performance in recognizing the pronunciation of Persian sentences, with a precision of more than 98 percent.

1 Introduction

The conversion of written text to a speech waveform is called "Text-to-Speech (TTS)", and it has been developed rapidly for many languages in the past decade [1–3]. TTS systems are widely used on different platforms including personal computers, smart phones, and internet-based services. This technology is generally divided into two main subsystems: Natural Language Processing (NLP) and speech synthesis [4, 5]. The first subsystem extracts predefined linguistic and contextual characteristics from the input text; in TTS systems the output of the NLP stage is a pronunciation form represented by a sequence of phonemes. The second subsystem takes the sequence of phonemes and extracts the voice parameters which are then used to synthesize the speech signal. The basic units of written text and pronunciation are called grapheme and phoneme, respectively; hence, extracting a sequence of phonemes from written text is called Grapheme-to-Phoneme conversion (GTP).

The natural language processing subsystem depends highly on the structure of the language the TTS system is being developed for. In a TTS system, the written form of a sentence is first converted to a phoneme sequence which shows how the written text must be pronounced. Extracting these phonemes from the written form of words is a challenging task. The first and most common way to convert a grapheme sequence to a phoneme sequence is to use a lexicon [6–8]. In this method a large table of all possible written words of a language with their corresponding phoneme sequences is used. The number of words in the lexicon is limited, while the vocabulary of each language is gradually growing. As a result, the pronunciation of unfamiliar words outside the training dataset is a potential problem for lexicon-based TTS systems. Moreover, finding the pronunciation of a word in the lexicon can require an exhaustive search and affects the overall performance of the system. In addition to these drawbacks, processing the Arabic-script-based languages (such as Persian) is more complex than processing languages with a Roman script (like English):

– There are 32 letters in the Persian alphabet. Its phonemic system has 23 consonant phonemes and 6 vowels (3 short vowels /a/=' ـَ ', /e/=' ـِ ', /o/=' ـُ ', and 3 long vowels /i/='ی', /u/='و', /ā/='ا'). Omission of the vowels in the standard orthography of the Persian language causes homograph ambiguity. There are many Persian words with the same written form but different pronunciations and meanings. For example, the words 'مرد' (man) and 'مرد' (died), or 'نشست'

(sat down, meeting) and 'نشست' (did not wash) have the same written form when their short vowels (/a/, /e/, /o/) are omitted. The sentence in which these words are used determines their senses and pronunciations. To distinguish them, a homograph disambiguation unit is required in the GTP system.
– In the Persian language the role and pronunciation of a word change based on its position in the sentence. For instance, the pronunciation of a word might end up with the short vowel /e/ (Kasr-e-Ezafe).

These problems reduce the flexibility and efficiency of the lexicon in grapheme to phoneme conversion for Persian TTS more than for other languages. Persian commercial TTS products consist of several modules connected to each other sequentially, in such a way that the error of each module propagates through the subsequent ones and has a negative influence on their performance. As shown in Fig. 1, these systems use a huge lexicon that has several pronunciation forms for each word according to its part of speech. This means that the system must also determine the part of speech of a word in the sentence in order to disambiguate homograph words. Unlike English, Persian has a scrambling structure: depending on the style of the writer and the writing technique, the position and order of words can be altered. Hence determining the part of speech is not an easy task, and any mistake can degrade the final precision of the system. Furthermore, this unit must discern whether a word is pronounced with Kasr-e-Ezafe.

Figure 1: The general framework of grapheme-to-phoneme approaches based on lexicon

In this paper we define the process of converting graphemes to phonemes in the Persian language as a sequence to sequence mapping problem. In contrast to the previous approaches, which utilize several complex modules, we pursue a comprehensive and integrated model using Recurrent Neural Networks (RNN) that takes Persian sentences (sequences of letters) as input and converts them to sequences of phonemes. The experiments in this research work show that recurrent networks have high precision in this task. The rest of this paper is organized as follows. In section 2, we introduce the problem of sequence to sequence learning; the recurrent network, as a method for modeling sequential labeling problems, is discussed in this section. Moreover, in section 2 we describe the LSTM units that increase the memory of an RNN so that it can model long-term patterns. In section 3, first an optimized implementation of the LSTM architecture (the CURRENNT toolkit) is introduced, and then the experiments conducted with this toolkit are presented. Section 4 is dedicated to the conclusion.

2 Sequential labeling

Feed-forward deep neural networks have been shown to be a powerful method for modeling different problems of artificial intelligence, such as speech recognition [9, 10] as well as document analysis and image recognition [11, 12], within the past few years. These networks can model a huge amount of training data in parallel. Nonetheless, deep neural networks are limited to problems in which the length and dimensions of the input and output data are known, and they are unable to model sequential labeling problems with variable input length. Recently, many challenges like question answering, speech recognition, document analysis, and grapheme to phoneme conversion can (or must) be defined as sequence to sequence mapping problems.
Hence, sequential learning is an important approach in the domain of artificial intelligence, and many applications such as natural language processing and the processing of DNA sequences are based on it. Depending on the application and problem, sequential labeling is categorized into 4 different groups [13]: sequential prediction, sequential generation, sequential recognition, and sequential decision making. It is essential to understand the structure and mathematical definitions of these groups in order to use sequential labeling properly.

Neural networks can be used for sequential processing in two ways. The first way is to eliminate the element of time using the sliding window technique [14] and divide the input sequence into overlapping windows. The optimum width of the window depends on the nature of the problem, and this technique is highly sensitive to time shifts in the input sequence; hence, the performance of sliding windows is not satisfying in sequential learning [15]. The second way is to use recurrent connections in the neural network and define the problem as a mapping between two temporal sequences. This approach led to the creation of recurrent neural networks. In this paper we use the recurrent neural network for Persian grapheme to phoneme conversion.
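As a small illustration of the sliding-window framing mentioned above, the following sketch splits a toy grapheme-index sequence into overlapping fixed-width frames; the window width and hop size are arbitrary choices for the example, not values from the paper.

```python
import numpy as np

def sliding_windows(sequence, width, hop):
    """Split a 1-D sequence into overlapping windows of a fixed width.

    This removes the temporal dimension: each window becomes one
    fixed-size input vector for a feed-forward network.
    """
    seq = np.asarray(sequence)
    starts = range(0, len(seq) - width + 1, hop)
    return np.stack([seq[s:s + width] for s in starts])

# Example: a toy grapheme-index sequence framed into width-5 windows.
frames = sliding_windows([3, 17, 8, 25, 4, 11, 30, 2], width=5, hop=1)
print(frames.shape)  # (4, 5)
```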

2.1 Recurrent Neural Networks

The structure of an RNN is like that of other neural networks, except that it has at least one cyclical synaptic connection in its hidden neurons to save and use the pattern information of sequences. Unlike traditional Feed-forward Neural Networks (FNN) and the Multilayer Perceptron (MLP), an RNN not only maps input to output but also takes the entire history of the input sequence into account. In recurrent networks learning is a temporal process, and the values of the neurons in the output layer depend on the memory of the inputs. In the basic architecture, these networks can be considered as an MLP in which each neuron in the hidden layers is equipped with a feedback loop. The learning procedure is done by the same back-propagation algorithms used in feed-forward networks. Fig. 2 shows the structure of recurrent neural networks.

Figure 2: The general structure of recurrent neural networks

Like feed-forward networks, recurrent networks have input, hidden, and output layers. Imagine we have K, M, and L neurons in the input, hidden, and output layers, respectively. The activation vectors of these layers are given in (1), (2), and (3), respectively:

x[n] = \{x_1[n], x_2[n], \ldots, x_K[n]\}^t    (1)

h[n] = \{h_1[n], h_2[n], \ldots, h_M[n]\}^t    (2)

y[n] = \{y_1[n], y_2[n], \ldots, y_L[n]\}^t    (3)

where t denotes transpose. The connection weights of these layers are given in (4):

w^{in} = (w^{in}_{ij})_{M \times K}, \quad w^{h} = (w^{h}_{ij})_{M \times M}, \quad w^{out} = (w^{out}_{ij})_{L \times M}    (4)

In the forward pass, the activation value of neuron j of the hidden layer is obtained using (5) and (6):

h'_j[n] = \sum_{k=1}^{K} \big(w^{in}_{jk} \times x_k[n]\big) + \sum_{m=1}^{M} \big(w^{h}_{jm} \times h_m[n-1]\big)    (5)

h_j[n] = \theta_j(h'_j[n])    (6)

where \theta_j is the nonlinear activation function of neuron j in the hidden layer. The value of output neuron l is calculated using (7):

y_l[n] = \sum_{m=1}^{M} \big(w^{out}_{lm} \times h_m[n]\big)    (7)

Even though recurrent networks have a good performance in solving sequential labeling problems, the standard structure of these networks has three drawbacks [16, 17]:
– An RNN does not use the future context to label the current input of the sequence.
– An RNN has limited capability to save and use pattern information over durations of more than several time steps.
– Non-parallel weight updating in the RNN model makes the training procedure of these networks very slow and time consuming.

These drawbacks prevented recurrent networks from being widely used by scientists in the fields of sequential labeling and temporal classification.
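For concreteness, a minimal NumPy sketch of the forward pass defined by equations (5)–(7); the dimensions, random weights, and the tanh activation are illustrative assumptions of this sketch, not values from the paper.

```python
import numpy as np

def rnn_forward(x_seq, w_in, w_h, w_out, theta=np.tanh):
    """Forward pass of a simple recurrent network, following (5)-(7).

    x_seq : sequence of input vectors x[n] with K entries each
    w_in  : M x K input weights, w_h : M x M recurrent weights,
    w_out : L x M output weights.
    """
    M = w_h.shape[0]
    h_prev = np.zeros(M)                    # h[n-1] for the first time step
    outputs = []
    for x_n in x_seq:
        h_pre = w_in @ x_n + w_h @ h_prev   # eq. (5)
        h_n = theta(h_pre)                  # eq. (6)
        y_n = w_out @ h_n                   # eq. (7)
        outputs.append(y_n)
        h_prev = h_n
    return np.array(outputs)

# Toy usage with K=4 inputs, M=8 hidden units, L=3 outputs.
rng = np.random.default_rng(0)
x_seq = rng.normal(size=(10, 4))
y = rnn_forward(x_seq, rng.normal(size=(8, 4)),
                rng.normal(size=(8, 8)), rng.normal(size=(3, 8)))
print(y.shape)  # (10, 3)
```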

2.2 Bidirectional Recurrent Neural Networks

In order to make a decision and map each element of the input sequence to the output, recurrent networks exploit only the previous pattern information and neglect the later context. The future pattern of the input sequence carries as much valuable information as its past pattern does. To resolve this problem, at the same time as the previous pattern is circulated in the network, the future context must be presented to it in both the training and testing stages. There exist several ways to use the future pattern information of a sequence, among which the approach proposed by Schuster and Paliwal [18] is the most suitable; Baldi used their approach to analyze bioinformatics information [19, 20]. In this approach, two parallel recurrent networks are defined with a joint output layer. The input sequence is given forwards to one network and backwards to the other. In this architecture, the recurrent network can use both the past and the future pattern to make a decision and map the current portion of the input sequence to the output. Fig. 3 depicts the structure of the bidirectional recurrent neural network.
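To illustrate this mechanism, the following minimal sketch runs one recurrent pass over the sequence in each direction and combines the two hidden states in a joint output layer. The tanh activation, random weights, and the summation form of the joint layer are illustrative assumptions; the paper itself relies on the CURRENNT implementation described in section 3.

```python
import numpy as np

def brnn_forward(x_seq, w_in_f, w_h_f, w_in_b, w_h_b, w_out_f, w_out_b):
    """Bidirectional RNN: two hidden state sequences, one joint output.

    The forward network sees x[1..N] in order, the backward network sees
    it reversed; the joint output at step n combines both hidden states.
    """
    def hidden_states(xs, w_in, w_h):
        h, states = np.zeros(w_h.shape[0]), []
        for x_n in xs:
            h = np.tanh(w_in @ x_n + w_h @ h)
            states.append(h)
        return states

    h_fwd = hidden_states(x_seq, w_in_f, w_h_f)
    h_bwd = hidden_states(x_seq[::-1], w_in_b, w_h_b)[::-1]
    # Joint output layer: sum of the two projected hidden states (illustrative).
    return np.array([w_out_f @ hf + w_out_b @ hb
                     for hf, hb in zip(h_fwd, h_bwd)])

rng = np.random.default_rng(1)
x_seq = rng.normal(size=(6, 4))                     # N=6 steps, K=4 inputs
y = brnn_forward(x_seq, rng.normal(size=(8, 4)), rng.normal(size=(8, 8)),
                 rng.normal(size=(8, 4)), rng.normal(size=(8, 8)),
                 rng.normal(size=(3, 8)), rng.normal(size=(3, 8)))
print(y.shape)                                      # (6, 3)
```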

Figure 3: The structure of bidirectional recurrent neural networks

2.3 Long-Short Term Memory Blocks

Recurrent networks can hold the pattern information of a sequence only for a short period of time. The pattern information used by a recurrent network is limited to a few time steps, and information farther away than about ten steps cannot be considered in the current decision: the effect of a time step and the relevant pattern information of a certain input quickly become traceless and after 10 steps have no role in the decision making and mapping of the current input [15, 16]. By analyzing the error stream in back-propagation, Hochreiter et al. found that in standard recurrent networks the effect of the propagated error decreases exponentially over time [21]. Several attempts had been made to increase the memory of these networks, but the most effective approach was the LSTM architecture proposed by Hochreiter and Schmidhuber [22]. Like flip-flop units in digital computers, Hochreiter used memory blocks with feedback connections. Each block consists of several memory cells equipped with feedback and with input, output, and forget gates. These gates are responsible for writing, reading, and resetting the memory cells of the block, and they enable the network to save and access pattern information over long periods of time. In the LSTM architecture, the traditional neurons in the hidden layer are replaced by these memory blocks. Fig. 4 shows a schematic of an LSTM memory block with one memory cell.

Figure 4: A long-short term memory block with one memory cell
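To make the gating mechanism concrete, here is a minimal, self-contained sketch of one time step of an LSTM memory cell. It follows the standard formulation of Hochreiter and Schmidhuber [22]; the random weights and the omission of biases and peephole connections are simplifying assumptions of this sketch, not details taken from the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_n, h_prev, c_prev, W):
    """One time step of an LSTM memory cell.

    W is a dict of weight matrices; each gate sees the current input and
    the previous hidden state. Biases are omitted to keep the sketch short.
    """
    z = np.concatenate([x_n, h_prev])
    i = sigmoid(W["i"] @ z)          # input gate: write to the cell
    f = sigmoid(W["f"] @ z)          # forget gate: reset the cell
    o = sigmoid(W["o"] @ z)          # output gate: read from the cell
    g = np.tanh(W["g"] @ z)          # candidate cell content
    c = f * c_prev + i * g           # memory cell keeps long-term state
    h = o * np.tanh(c)               # block output
    return h, c

rng = np.random.default_rng(2)
K, M = 4, 8                          # input size, number of cells
W = {k: rng.normal(size=(M, K + M)) for k in "ifog"}
h, c = np.zeros(M), np.zeros(M)
for x_n in rng.normal(size=(5, K)):  # run a toy 5-step sequence
    h, c = lstm_step(x_n, h, c, W)
print(h.shape, c.shape)              # (8,) (8,)
```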

Using the network in both directions and improving the memory of the neurons results in the Bidirectional Recurrent Neural Network with Long-Short Term Memory (BRNN-LSTM), which uses more pattern information than standard recurrent networks. Graves implemented the structure of BRNN-LSTM as a CPU-based, single-threaded library called RNNLib [16, 23].

3 Experiments and results

Bidirectional recurrent neural networks equipped with LSTM memory blocks are highly accurate for different sequential labelling problems, such as speech and handwriting recognition [16, 23].

Table 1: Grapheme of letters and their corresponding pronunciation candidates

Letter    Pronunciation Candidates
پ         -p-p/-pe-po-pp-pp/-ppe-ppo-pen-pon-
گ         -g-g/-ge-go-gg-gg/-gge-ggo-gen-gon-
ا         -/-e-o-a-@-NoPro-@/-@e-@o-@a-a@-aye-@aye-alef-
س         -s-s/-se-so-ss-ss/-sse-sso-sen-son-sin-

However, there are time dependencies among the elements of a sequence in the training procedure, and a parallel implementation of recurrent networks is not possible. As a result, these networks had not attracted researchers in these fields, because the training procedure is very slow and time consuming. The machine learning and signal processing laboratory of the University of Munich has resolved this problem [17]. The researchers in this laboratory have dramatically accelerated the training procedure by using a mini-batch learning method. Their approach is implemented as a C++ GPU-based toolkit called CURRENNT, in which the training data is divided into small segments and the weights are updated in a parallel mode. In addition, CURRENNT can be run on a CPU in sequential mode, but to fully use the technology embedded in this toolkit, computers equipped with NVIDIA GPUs are recommended. The toolkit can be used on both the Linux and Windows operating systems. The structure of the model is specified by the user in the JavaScript Object Notation (JSON) format, and the training, validation, and test data are given to CURRENNT in the NetCDF format. In addition to BRNN-LSTM, this toolkit can be used to create feed-forward networks or a combination of feed-forward networks and BRNN-LSTM.

In this research several experiments were conducted to convert grapheme sequences to phoneme sequences for the Persian language using CURRENNT. To train the model two corpora are used: FarsDat [24], with more than 500,000 words, and the labeled section of the Persian written corpus [25], which has over 10 million labeled words. The goal is to map a sequence of graphemes to a sequence of phonemes. The grapheme of each letter is represented by a number as a one-dimensional input feature vector.
Hence, the input layer of the network has only one neuron, while the output layer has 180 neurons. Each output neuron is dedicated to one of the pronunciation candidates of the Persian language. It is worth mentioning that there are approximately 230 pronunciation candidates in Persian, but some of them are rarely used (or not used at all). Table 1 shows some letters (the input of the model) and their corresponding pronunciation candidates (the output of the model).

In this paper we have two major experiments. In both experiments the Persian written corpus is used: 80%, 10%, and 10% of this corpus are dedicated to the training, validation, and test sets, respectively. In the first experiment we considered each sentence as a separate sequence. In the second experiment, every 10 sentences of the corpus are set as one input sequence, and one neuron is added to the output layer to indicate the end of a sentence. The results are shown in Tables 2 and 3. After each epoch the errors on the validation and test sets are reduced and the weights converge. The better results in the second experiment were expected, because its model is trained with more pattern information than the model in the first experiment (10 sentences compared to one sentence).

As in speech recognition systems, to calculate the error the actual output phoneme sequence is compared with the expected (true) phoneme sequence, and the numbers of insertions, deletions, and substitutions are counted. The error is calculated according to (8):

Error(\%) = \frac{True - (Ins + Del + Sub)}{Total}    (8)

In this equation Total is the total number of phonemes, and True, Ins, Del, and Sub are the numbers of true, inserted, deleted, and substituted phonemes, respectively. Fig. 5 shows the structure of the BRNN-LSTM used in the experiments. The output neuron with the maximum value is chosen as the pronunciation of the current input letter. Since the consonants are given in the input sequence, the competition must be only among the neurons corresponding to the current letter, and this local maximum determines the pronunciation of that letter. For instance, when the letter 'غ' is given to the model, only the values of the output neurons of its pronunciation candidates -q-q/-qe-qo-qq-qq/-qqe-qqo-qen-qon-qeyn- are compared with each other.

Figure 5: The structure of the neural network in the experiments
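To make this decoding rule concrete, the following minimal sketch restricts the argmax to the output neurons of the current letter's pronunciation candidates. The candidate table and the network output used here are small hypothetical stand-ins, not the 180-unit inventory of the actual model.

```python
import numpy as np

# Hypothetical mapping from a letter to the indices of its candidate
# output neurons in the network's output layer.
CANDIDATE_NEURONS = {
    "s": [0, 1, 2],    # e.g. -s-, -se-, -so-
    "g": [3, 4, 5],    # e.g. -g-, -ge-, -go-
}

def decode_letter(letter, output_vector):
    """Pick the pronunciation of one input letter.

    Only the output neurons belonging to this letter's candidates compete;
    the local maximum among them determines the pronunciation.
    """
    candidates = CANDIDATE_NEURONS[letter]
    return max(candidates, key=lambda idx: output_vector[idx])

y_n = np.array([0.1, 0.7, 0.2, 0.9, 0.05, 0.3])  # toy network output at step n
print(decode_letter("s", y_n))  # 1 -> the second candidate of 's' wins,
                                # even though neuron 3 has a larger value
```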

Table 2: Training the model with one-sentence sequences

Epochs    Training Error (%)    Validation Error (%)    Test Error (%)
1         24.80                 15.27                   15.32
20        3.23                  3.14                    3.16
40        1.10                  1.18                    1.14

Table 3: Training the model with ten-sentence sequences

Epochs    Training Error (%)    Validation Error (%)    Test Error (%)
20        1.39                  1.44                    1.39
40        1.18                  1.24                    1.20
60        1.07                  1.14                    1.10
80        0.99                  1.07                    1.03
100       0.94                  1.02                    0.98
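For reference, the insertion, deletion, and substitution counts that enter (8) come from an edit-distance alignment between the reference and the recognized phoneme sequences, as in speech recognition scoring. The sketch below shows this standard counting on two hypothetical romanized phoneme sequences; it is not the paper's own scoring code.

```python
def align_counts(reference, hypothesis):
    """Count correct phonemes, insertions, deletions, and substitutions
    between a reference and a recognized phoneme sequence, using a
    standard edit-distance alignment. These counts are what the error
    definition in (8) is applied to.
    """
    R, H = len(reference), len(hypothesis)
    # dp[i][j] = (edits, correct, ins, dels, subs) for reference[:i] vs hypothesis[:j]
    dp = [[None] * (H + 1) for _ in range(R + 1)]
    dp[0][0] = (0, 0, 0, 0, 0)
    for i in range(1, R + 1):
        e, c, ins, dels, subs = dp[i - 1][0]
        dp[i][0] = (e + 1, c, ins, dels + 1, subs)      # deletions only
    for j in range(1, H + 1):
        e, c, ins, dels, subs = dp[0][j - 1]
        dp[0][j] = (e + 1, c, ins + 1, dels, subs)      # insertions only
    for i in range(1, R + 1):
        for j in range(1, H + 1):
            match = reference[i - 1] == hypothesis[j - 1]
            e, c, ins, dels, subs = dp[i - 1][j - 1]
            diag = (e + (0 if match else 1),
                    c + (1 if match else 0),
                    ins, dels, subs + (0 if match else 1))
            e, c, ins, dels, subs = dp[i - 1][j]
            up = (e + 1, c, ins, dels + 1, subs)        # delete reference[i-1]
            e, c, ins, dels, subs = dp[i][j - 1]
            left = (e + 1, c, ins + 1, dels, subs)      # insert hypothesis[j-1]
            dp[i][j] = min(diag, up, left)              # fewest edits wins
    _, correct, ins, dels, subs = dp[R][H]
    return correct, ins, dels, subs, R

ref = "n e sh a s t".split()      # hypothetical reference phonemes
hyp = "n a sh a s".split()        # hypothetical recognized phonemes
print(align_counts(ref, hyp))     # (4, 0, 1, 1, 6): 4 correct, 1 deletion, 1 substitution
```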

4 Conclusion

In this paper we first reviewed the phonological characteristics of the Persian language as well as its structure of articulation. We then found that the performance of current grapheme to phoneme systems is highly influenced by the phonological structure of Persian. Therefore, we defined the problem of grapheme to phoneme conversion as supervised sequential labeling. We chose a BRNN equipped with LSTM memory blocks to model the labeling problem; BRNN-LSTM has been shown to have excellent performance in processing sequential data and learning long-term pattern information. We used CURRENNT, which is a GPU-based optimized implementation of recurrent neural networks. Our experimental results showed more than 98% accuracy in Persian grapheme to phoneme conversion. Moreover, the model is integrated and has none of the complexities of the previous methods.

Acknowledgement: The authors would like to thank the NVIDIA Company for the donated GPU (Tesla K40). All experiments in this paper were conducted using this GPU.

References

[1] The Festival speech synthesis system, http://www.cstr.ed.ac.uk/projects/festival/
[2] Cepstral text-to-speech system, http://www.cepstral.com/en/demos
[3] HMM-based speech synthesis system (HTS), http://hts.sp.nitech.ac.jp/
[4] Onaolapo J., Idachaba F., Badejo J., Odu T., Adu O., A simplified overview of text-to-speech synthesis, Proceedings of the World Congress on Engineering, 2014, 1
[5] Isewon I., Oyelade J., Oladipupo O., Design and implementation of text to speech conversion for visually impaired people, International Journal of Applied Information Systems, 2014, 7(2), 25-30

[6] Latacz L., Mattheyses W., Verhelst W., Speaker-specific pronunciation for speech synthesis, Text, Speech, and Dialogue, Springer, 2013, 501-508
[7] Bagshaw P. C., Phonemic transcription by analogy in text-to-speech synthesis: Novel word pronunciation and lexicon compression, Computer Speech & Language, 1998, 12(2), 119-142
[8] Chen G., Khudanpur S., Povey D., Trmal J., Yarowsky D., Yilmaz O., Quantifying the value of pronunciation lexicons for keyword search in low-resource languages, Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference, IEEE, 2013, 8560-8564
[9] Dahl G. E., Yu D., Deng L., Acero A., Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition, Audio, Speech, and Language Processing, IEEE Transactions, 2012, 20(1), 30-42
[10] Hinton G., Deng L., Yu D., Dahl G. E., Mohamed A., Jaitly N., Senior A., Vanhoucke V., Nguyen P., Sainath T. N., et al., Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups, Signal Processing Magazine, IEEE, 2012, 29(6), 82-97
[11] Krizhevsky A., Sutskever I., Hinton G. E., Imagenet classification with deep convolutional neural networks, Advances in Neural Information Processing Systems, Curran Associates, Inc., 2012, 1097-1105
[12] Ciresan D., Meier U., Schmidhuber J., Multi-column deep neural networks for image classification, Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference, IEEE, 2012, 3642-3649
[13] Sun R., Giles C. L., Sequence learning: from recognition and prediction to sequential decision making, IEEE Intelligent Systems, 2001, 16(4), 67-70
[14] Bastiaans M. J., On the sliding-window representation in digital signal processing, Acoustics, Speech and Signal Processing, IEEE Transactions, 1985, 33(4), 868-873
[15] Graves A., Schmidhuber J., Framewise phoneme classification with bidirectional LSTM and other neural network architectures, Neural Networks, 2005, 18(5), 602-610
[16] Graves A., et al., Supervised sequence labelling with recurrent neural networks, Springer, 2012, 385
[17] Weninger F., Introducing CURRENNT: The Munich open-source CUDA recurrent neural network toolkit, Journal of Machine Learning Research, 2015, 16, 547-551
[18] Schuster M., Paliwal K. K., Bidirectional recurrent neural networks, Signal Processing, IEEE Transactions, 1997, 45(11), 2673-2681
[19] Baldi P., Brunak S., Frasconi P., Soda G., Pollastri G., Exploiting the past and the future in protein secondary structure prediction, Bioinformatics, 1999, 15(11), 937-946
[20] Baldi P., Brunak S., Frasconi P., Pollastri G., Soda G., Bidirectional dynamics for protein secondary structure prediction, Sequence Learning, Springer, 2001, 80-104
[21] Hochreiter S., Bengio Y., Frasconi P., Schmidhuber J., Gradient flow in recurrent nets: the difficulty of learning long-term dependencies, 2001
[22] Hochreiter S., Schmidhuber J., Long short-term memory, Neural Computation, 1997, 9(8), 1735-1780
[23] Graves A., Supervised sequence labelling with recurrent neural networks, PhD thesis, Technical University Munich, 2008, http://d-nb.info/99115827X
[24] Bijankhan M., Sheikhzadegan J., Roohani M., FarsDat, the speech database of Farsi spoken language, Proceedings of the Australian Conference on Speech Science and Technology, 1994
[25] Bijankhan M., Sheykhzadegan J., Bahrani M., Ghayoomi M., Lessons from building a Persian written corpus: Peykare, Language Resources and Evaluation, 2011, 45(2), 143-164