
Open Comput. Sci. 2016; 6:219–225

Research Article Open Access

Yasser Mohseni Behbahani*, Bagher Babaali, and Mussa Turdalyuly

Persian sentences to phoneme sequences conversion based on recurrent neural networks

*Corresponding Author: Yasser Mohseni Behbahani: Speech Processing Laboratory of the Sharif University of Technology, Iran, E-mail: [email protected]
Bagher Babaali: Department of Computer Science of the University of Tehran, Iran, E-mail: [email protected]
Mussa Turdalyuly: Institute of Information and Computational Technologies, Almaty, Kazakhstan, E-mail: [email protected]

DOI 10.1515/comp-2016-0019
Received June 21, 2016; accepted August 2, 2016

Abstract: Grapheme to phoneme conversion is one of the main subsystems of Text-to-Speech (TTS) systems. Converting a sequence of written words to its corresponding phoneme sequence is more challenging for the Persian language than for other languages, because in the standard orthography of this language the short vowels are omitted and the pronunciation of a word depends on its position in the sentence. Common approaches used in Persian commercial TTS systems have several modules and complicated models for natural language processing and homograph disambiguation, which make the implementation harder and reduce the overall precision of the system. In this paper we define grapheme-to-phoneme conversion as a sequential labeling problem and use modified Recurrent Neural Networks (RNN) to create a smart and integrated model for this purpose. The recurrent networks are modified to be bidirectional and equipped with Long-Short Term Memory (LSTM) blocks to acquire most of the past and future contextual information for decision making. The experiments conducted in this paper show that, in addition to having a unified structure, the bidirectional RNN-LSTM has a good performance in recognizing the pronunciation of Persian sentences, with a precision of more than 98 percent.

1 Introduction

The conversion of written text to a speech waveform is called "Text-to-Speech (TTS)", and it has been developed rapidly for many languages in the past decade [1–3]. TTS systems are widely used on different platforms including personal computers, smart phones, and internet-based services. This technology is generally divided into two main subsystems: Natural Language Processing (NLP) and speech synthesis [4, 5]. The first subsystem extracts predefined linguistic and contextual characteristics from the input text; in TTS systems the output of the NLP stage is a pronunciation form represented by a sequence of phonemes. The second subsystem takes the sequence of phonemes and extracts the voice parameters which are then used to synthesize the speech signal. The basic units of written text and pronunciation are called grapheme and phoneme, respectively; hence, extracting a sequence of phonemes from written text is called Grapheme-to-Phoneme conversion (GTP).

The natural language processing subsystem depends highly on the structure of the language the TTS system is being developed for. In a TTS system, the written form of a sentence is first converted to a phoneme sequence which shows how the written text must be pronounced. Extracting these phonemes from the written form of words is a challenging task. The first and most common way to convert a grapheme sequence to a phoneme sequence is to use a lexicon [6–8]. In this method a large table of all possible written words of a language with their corresponding phoneme sequences is used. The number of words in the lexicon is limited, while the vocabulary of each language is gradually growing. As a result, the pronunciation of unfamiliar words outside the training dataset is a potential problem for lexicon-based TTS systems. Moreover, finding the pronunciation of a word in the lexicon can require an exhaustive search and affects the overall performance of the system. In addition to these drawbacks, processing the Arabic-script-based languages (such as Persian) is more complex than processing languages with a Roman script (like English):

– There are 32 letters in the Persian alphabet. Its phonemic system has 23 consonant phonemes and 6 vowels (3 short vowels /a/=' ـَ ', /e/=' ـِ ', /o/=' ـُ ', and 3 long vowels /i/='ی', /u/='و', /ā/='ا'). Omission of the vowels in the standard orthography of the Persian language causes homograph ambiguity. There are many Persian words with the same written form but different pronunciations and meanings. For example, the words 'مرد' (man) and 'مرد' (died), or 'نشست'

(sat down, meeting) and 'نشست' (did not wash) have the same written form when their short vowels (/a/, /e/, /o/) are omitted. The sentence in which these words are used determines their senses and pronunciations. To distinguish them, a homograph disambiguation unit is required in the GTP system.
– In the Persian language the role and pronunciation of a word change based on its position in the sentence. For instance, the pronunciation of a word might end up with the short vowel /e/ (Kasr-e-Ezafe).

These problems reduce the flexibility and efficiency of the lexicon in grapheme to phoneme conversion for Persian TTS more than for other languages. Persian commercial TTS products consist of several modules connected to each other sequentially, in such a way that the error of each module propagates through the subsequent ones and has a negative influence on their performance. As shown in Fig. 1, these systems use a huge lexicon that has several pronunciation forms for each word according to its part of speech. This means that the system must also determine the part of speech of a word in the sentence in order to disambiguate homograph words. Unlike English, Persian has a scrambling structure: depending on the style of the writer and the writing technique, the position and order of words can be altered. Hence determining the part of speech is not an easy task, and any mistake can degrade the final precision of the system. Furthermore, this unit must discern whether a word is pronounced with Kasr-e-Ezafe.

Figure 1: The general framework of grapheme-to-phoneme approaches based on lexicon

In this paper we define the process of converting graphemes to phonemes in the Persian language as a sequence to sequence mapping problem. In contrast to the previous approaches, which utilize several complex modules, we pursue a comprehensive and integrated model using Recurrent Neural Networks (RNN) that takes Persian sentences (sequences of letters) as input and converts them to sequences of phonemes. The experiments in this research work show that recurrent networks have high precision in this task. The rest of this paper is organized as follows. In section 2, we introduce the problem of sequence to sequence learning; the recurrent network, as a method for modeling sequential labeling problems, is discussed in this section. Moreover, in section 2 we describe the LSTM units that increase the memory of an RNN so that it can model long-term patterns. In section 3, first an optimized implementation of the LSTM architecture (the CURRENNT toolkit) is introduced, and then the experiments conducted with this toolkit are presented. Section 4 is dedicated to the conclusion.

2 Sequential labeling

Feed-forward deep neural networks have been shown to be a powerful method for modeling different problems of artificial intelligence, such as speech recognition [9, 10] as well as document analysis and image recognition [11, 12], within the past few years. These networks can model a huge amount of training data in parallel. Nonetheless, deep neural networks are limited to problems in which the length and dimensions of the input and output data are known, and they are unable to model sequential labeling problems with variable input length. Recently, many challenges like question answering, speech recognition, document analysis, and grapheme to phoneme conversion can (or must) be defined as sequence to sequence mapping problems.
Hence, sequential learning is an important approach in the domain of artificial intelligence, and many applications such as natural language processing and the processing of DNA sequences are based on it. Depending on the application and problem, sequential labeling is categorized into 4 different groups [13]: sequential prediction, sequential generation, sequential recognition, and sequential decision making. It is essential to understand the structure and mathematical definitions of these groups in order to use sequential labeling properly.

Neural networks can be used for sequential processing in two ways. The first way is to eliminate the element of time using the sliding window technique [14] and divide the input sequence into overlapping windows. The optimum width of the window depends on the nature of the problem, and this technique is highly sensitive to time shifts in the input sequence; hence, the performance of sliding windows is not satisfying in sequential learning [15]. The second way is to use recurrent connections in the neural network and define the problem as a mapping between two temporal sequences. This approach led to the creation of recurrent neural networks. In this paper we use the recurrent neural network for Persian grapheme to phoneme conversion.
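As a small illustration of the sliding-window framing mentioned above, the following sketch splits a toy grapheme-index sequence into overlapping fixed-width frames; the window width and hop size are arbitrary choices for the example, not values from the paper.

```python
import numpy as np

def sliding_windows(sequence, width, hop):
    """Split a 1-D sequence into overlapping windows of a fixed width.

    This removes the temporal dimension: each window becomes one
    fixed-size input vector for a feed-forward network.
    """
    seq = np.asarray(sequence)
    starts = range(0, len(seq) - width + 1, hop)
    return np.stack([seq[s:s + width] for s in starts])

# Example: a toy grapheme-index sequence framed into width-5 windows.
frames = sliding_windows([3, 17, 8, 25, 4, 11, 30, 2], width=5, hop=1)
print(frames.shape)  # (4, 5)
```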

2.1 Recurrent Neural Networks

The structure of an RNN is like that of other neural networks, except that it has at least one cyclical synaptic connection in its hidden neurons to save and use the pattern information of sequences. Unlike traditional Feed-forward Neural Networks (FNN) and the Multilayer Perceptron (MLP), an RNN not only maps input to output but also takes the entire history of the input sequence into account. In recurrent networks learning is a temporal process, and the values of the neurons in the output layer depend on the memory of the inputs. In the basic architecture, these networks can be considered as an MLP in which each neuron in the hidden layers is equipped with a feedback loop. The learning procedure is done by the same back-propagation algorithms used in feed-forward networks. Fig. 2 shows the structure of recurrent neural networks.

Figure 2: The general structure of recurrent neural networks

Like feed-forward networks, recurrent networks have input, hidden, and output layers. Imagine we have K, M, and L neurons in the input, hidden, and output layers, respectively. The activation vectors of these layers are given in (1), (2), and (3), respectively:

x[n] = \{x_1[n], x_2[n], \ldots, x_K[n]\}^t    (1)

h[n] = \{h_1[n], h_2[n], \ldots, h_M[n]\}^t    (2)

y[n] = \{y_1[n], y_2[n], \ldots, y_L[n]\}^t    (3)

where t denotes transpose. The connection weights of these layers are given in (4):

w^{in} = (w^{in}_{ij})_{M \times K}, \quad w^{h} = (w^{h}_{ij})_{M \times M}, \quad w^{out} = (w^{out}_{ij})_{L \times M}    (4)

In the forward pass, the activation value of neuron j of the hidden layer is obtained using (5) and (6):

h'_j[n] = \sum_{k=1}^{K} \big(w^{in}_{jk} \times x_k[n]\big) + \sum_{m=1}^{M} \big(w^{h}_{jm} \times h_m[n-1]\big)    (5)

h_j[n] = \theta_j(h'_j[n])    (6)

where \theta_j is the nonlinear activation function of neuron j in the hidden layer. The value of output neuron l is calculated using (7):

y_l[n] = \sum_{m=1}^{M} \big(w^{out}_{lm} \times h_m[n]\big)    (7)

Even though recurrent networks have a good performance in solving sequential labeling problems, the standard structure of these networks has three drawbacks [16, 17]:
– An RNN does not use the future context to label the current input of the sequence.
– An RNN has limited capability to save and use pattern information over durations of more than several time steps.
– Non-parallel weight updating in the RNN model makes the training procedure of these networks very slow and time consuming.

These drawbacks prevented recurrent networks from being widely used by scientists in the fields of sequential labeling and temporal classification.
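For concreteness, a minimal NumPy sketch of the forward pass defined by equations (5)–(7); the dimensions, random weights, and the tanh activation are illustrative assumptions of this sketch, not values from the paper.

```python
import numpy as np

def rnn_forward(x_seq, w_in, w_h, w_out, theta=np.tanh):
    """Forward pass of a simple recurrent network, following (5)-(7).

    x_seq : sequence of input vectors x[n] with K entries each
    w_in  : M x K input weights, w_h : M x M recurrent weights,
    w_out : L x M output weights.
    """
    M = w_h.shape[0]
    h_prev = np.zeros(M)                    # h[n-1] for the first time step
    outputs = []
    for x_n in x_seq:
        h_pre = w_in @ x_n + w_h @ h_prev   # eq. (5)
        h_n = theta(h_pre)                  # eq. (6)
        y_n = w_out @ h_n                   # eq. (7)
        outputs.append(y_n)
        h_prev = h_n
    return np.array(outputs)

# Toy usage with K=4 inputs, M=8 hidden units, L=3 outputs.
rng = np.random.default_rng(0)
x_seq = rng.normal(size=(10, 4))
y = rnn_forward(x_seq, rng.normal(size=(8, 4)),
                rng.normal(size=(8, 8)), rng.normal(size=(3, 8)))
print(y.shape)  # (10, 3)
```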

2.2 Bidirectional Recurrent Neural Networks

In order to make a decision and map each element of the input sequence to the output, recurrent networks exploit only the previous pattern information and neglect the later context. The future pattern of the input sequence carries as much valuable information as its past pattern does. To resolve this problem, at the same time as the previous pattern is circulated in the network, the future context must be presented to it in both the training and testing stages. There exist several ways to use the future pattern information of a sequence, among which the approach proposed by Schuster and Paliwal [18] is the most suitable; Baldi used their approach to analyze bioinformatics information [19, 20]. In this approach, two parallel recurrent networks are defined with a joint output layer. The input sequence is given forwards to one network and backwards to the other. In this architecture, the recurrent network can use both the past and the future pattern to make a decision and map the current portion of the input sequence to the output. Fig. 3 depicts the structure of the bidirectional recurrent neural network.
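To illustrate this mechanism, the following minimal sketch runs one recurrent pass over the sequence in each direction and combines the two hidden states in a joint output layer. The tanh activation, random weights, and the summation form of the joint layer are illustrative assumptions; the paper itself relies on the CURRENNT implementation described in section 3.

```python
import numpy as np

def brnn_forward(x_seq, w_in_f, w_h_f, w_in_b, w_h_b, w_out_f, w_out_b):
    """Bidirectional RNN: two hidden state sequences, one joint output.

    The forward network sees x[1..N] in order, the backward network sees
    it reversed; the joint output at step n combines both hidden states.
    """
    def hidden_states(xs, w_in, w_h):
        h, states = np.zeros(w_h.shape[0]), []
        for x_n in xs:
            h = np.tanh(w_in @ x_n + w_h @ h)
            states.append(h)
        return states

    h_fwd = hidden_states(x_seq, w_in_f, w_h_f)
    h_bwd = hidden_states(x_seq[::-1], w_in_b, w_h_b)[::-1]
    # Joint output layer: sum of the two projected hidden states (illustrative).
    return np.array([w_out_f @ hf + w_out_b @ hb
                     for hf, hb in zip(h_fwd, h_bwd)])

rng = np.random.default_rng(1)
x_seq = rng.normal(size=(6, 4))                     # N=6 steps, K=4 inputs
y = brnn_forward(x_seq, rng.normal(size=(8, 4)), rng.normal(size=(8, 8)),
                 rng.normal(size=(8, 4)), rng.normal(size=(8, 8)),
                 rng.normal(size=(3, 8)), rng.normal(size=(3, 8)))
print(y.shape)                                      # (6, 3)
```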

Figure 3: The structure of bidirectional recurrent neural networks

2.3 Long-Short Term Memory Blocks

Recurrent networks can hold the pattern information of a sequence only for a short period of time. The pattern information used by a recurrent network is limited to a few time steps, and information farther away than about ten steps cannot be considered in the current decision: the effect of a time step and the relevant pattern information of a certain input quickly become traceless and after 10 steps have no role in the decision making and mapping of the current input [15, 16]. By analyzing the error stream in back-propagation, Hochreiter et al. found that in standard recurrent networks the effect of the propagated error decreases exponentially over time [21]. Several attempts had been made to increase the memory of these networks, but the most effective approach was the LSTM architecture proposed by Hochreiter and Schmidhuber [22]. Like flip-flop units in digital computers, Hochreiter used memory blocks with feedback connections. Each block consists of several memory cells equipped with feedback and with input, output, and forget gates. These gates are responsible for writing, reading, and resetting the memory cells of the block, and they enable the network to save and access pattern information over long periods of time. In the LSTM architecture, the traditional neurons in the hidden layer are replaced by these memory blocks. Fig. 4 shows a schematic of an LSTM memory block with one memory cell.

Figure 4: A long-short term memory block with one memory cell
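To make the gating mechanism concrete, here is a minimal, self-contained sketch of one time step of an LSTM memory cell. It follows the standard formulation of Hochreiter and Schmidhuber [22]; the random weights and the omission of biases and peephole connections are simplifying assumptions of this sketch, not details taken from the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_n, h_prev, c_prev, W):
    """One time step of an LSTM memory cell.

    W is a dict of weight matrices; each gate sees the current input and
    the previous hidden state. Biases are omitted to keep the sketch short.
    """
    z = np.concatenate([x_n, h_prev])
    i = sigmoid(W["i"] @ z)          # input gate: write to the cell
    f = sigmoid(W["f"] @ z)          # forget gate: reset the cell
    o = sigmoid(W["o"] @ z)          # output gate: read from the cell
    g = np.tanh(W["g"] @ z)          # candidate cell content
    c = f * c_prev + i * g           # memory cell keeps long-term state
    h = o * np.tanh(c)               # block output
    return h, c

rng = np.random.default_rng(2)
K, M = 4, 8                          # input size, number of cells
W = {k: rng.normal(size=(M, K + M)) for k in "ifog"}
h, c = np.zeros(M), np.zeros(M)
for x_n in rng.normal(size=(5, K)):  # run a toy 5-step sequence
    h, c = lstm_step(x_n, h, c, W)
print(h.shape, c.shape)              # (8,) (8,)
```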

Using the network in both directions and improving the memory of the neurons results in the Bidirectional Recurrent Neural Network with Long-Short Term Memory (BRNN-LSTM), which uses more pattern information than standard recurrent networks. Graves implemented the structure of BRNN-LSTM as a CPU-based, single-threaded library called RNNLib [16, 23].

3 Experiments and results

Bidirectional recurrent neural networks equipped with LSTM memory blocks are highly accurate for different sequential labelling problems, such as speech and handwriting recognition [16, 23].

Table 1: Grapheme of letters and their corresponding pronunciation candidates

Letter    Pronunciation Candidates
پ         -p-p/-pe-po-pp-pp/-ppe-ppo-pen-pon-
گ         -g-g/-ge-go-gg-gg/-gge-ggo-gen-gon-
ا         -/-e-o-a-@-NoPro-@/-@e-@o-@a-a@-aye-@aye-alef-
س         -s-s/-se-so-ss-ss/-sse-sso-sen-son-sin-

However, there are time dependencies among the elements of a sequence in the training procedure, and a parallel implementation of recurrent networks is not possible. As a result, these networks had not attracted researchers in these fields, because the training procedure is very slow and time consuming. The machine learning and signal processing laboratory of the University of Munich has resolved this problem [17]. The researchers in this laboratory have dramatically accelerated the training procedure by using a mini-batch learning method. Their approach is implemented as a C++ GPU-based toolkit called CURRENNT, in which the training data is divided into small segments and the weights are updated in a parallel mode. In addition, CURRENNT can be run on a CPU in sequential mode, but to fully use the technology embedded in this toolkit, computers equipped with NVIDIA GPUs are recommended. The toolkit can be used on both the Linux and Windows operating systems. The structure of the model is specified by the user in the JavaScript Object Notation (JSON) format, and the training, validation, and test data are given to CURRENNT in the NetCDF format. In addition to BRNN-LSTM, this toolkit can be used to create feed-forward networks or a combination of feed-forward networks and BRNN-LSTM.

In this research several experiments were conducted to convert grapheme sequences to phoneme sequences for the Persian language using CURRENNT. To train the model two corpora are used: FarsDat [24], with more than 500,000 words, and the labeled section of the Persian written corpus [25], which has over 10 million labeled words. The goal is to map a sequence of graphemes to a sequence of phonemes. The grapheme of each letter is represented by a number as a one-dimensional input feature vector.
Hence, the input layer of the network has only one neuron, while the output layer has 180 neurons. Each output neuron is dedicated to one of the pronunciation candidates of the Persian language. It is worth mentioning that there are approximately 230 pronunciation candidates in Persian, but some of them are rarely used (or not used at all). Table 1 shows some letters (the input of the model) and their corresponding pronunciation candidates (the output of the model).

In this paper we have two major experiments. In both experiments the Persian written corpus is used: 80%, 10%, and 10% of this corpus are dedicated to the training, validation, and test sets, respectively. In the first experiment we considered each sentence as a separate sequence. In the second experiment, every 10 sentences of the corpus are set as one input sequence, and one neuron is added to the output layer to indicate the end of a sentence. The results are shown in Tables 2 and 3. After each epoch the errors on the validation and test sets are reduced and the weights converge. The better results in the second experiment were expected, because its model is trained with more pattern information than the model in the first experiment (10 sentences compared to one sentence).

As in speech recognition systems, to calculate the error the actual output phoneme sequence is compared with the expected (true) phoneme sequence, and the numbers of insertions, deletions, and substitutions are counted. The error is calculated according to (8):

Error(\%) = \frac{True - (Ins + Del + Sub)}{Total}    (8)

In this equation Total is the total number of phonemes, and True, Ins, Del, and Sub are the numbers of true, inserted, deleted, and substituted phonemes, respectively. Fig. 5 shows the structure of the BRNN-LSTM used in the experiments. The output neuron with the maximum value is chosen as the pronunciation of the current input letter. Since the consonants are given in the input sequence, the competition must be only among the neurons corresponding to the current letter, and this local maximum determines the pronunciation of that letter. For instance, when the letter 'غ' is given to the model, only the values of the output neurons of its pronunciation candidates -q-q/-qe-qo-qq-qq/-qqe-qqo-qen-qon-qeyn- are compared with each other.

Figure 5: The structure of the neural network in the experiments
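To make this decoding rule concrete, the following minimal sketch restricts the argmax to the output neurons of the current letter's pronunciation candidates. The candidate table and the network output used here are small hypothetical stand-ins, not the 180-unit inventory of the actual model.

```python
import numpy as np

# Hypothetical mapping from a letter to the indices of its candidate
# output neurons in the network's output layer.
CANDIDATE_NEURONS = {
    "s": [0, 1, 2],    # e.g. -s-, -se-, -so-
    "g": [3, 4, 5],    # e.g. -g-, -ge-, -go-
}

def decode_letter(letter, output_vector):
    """Pick the pronunciation of one input letter.

    Only the output neurons belonging to this letter's candidates compete;
    the local maximum among them determines the pronunciation.
    """
    candidates = CANDIDATE_NEURONS[letter]
    return max(candidates, key=lambda idx: output_vector[idx])

y_n = np.array([0.1, 0.7, 0.2, 0.9, 0.05, 0.3])  # toy network output at step n
print(decode_letter("s", y_n))  # 1 -> the second candidate of 's' wins,
                                # even though neuron 3 has a larger value
```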

Table 2: Training the model with one-sentence sequences

Epochs    Training Error (%)    Validation Error (%)    Test Error (%)
1         24.80                 15.27                   15.32
20        3.23                  3.14                    3.16
40        1.10                  1.18                    1.14

Table 3: Training the model with ten-sentence sequences

Epochs    Training Error (%)    Validation Error (%)    Test Error (%)
20        1.39                  1.44                    1.39
40        1.18                  1.24                    1.20
60        1.07                  1.14                    1.10
80        0.99                  1.07                    1.03
100       0.94                  1.02                    0.98
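For reference, the insertion, deletion, and substitution counts that enter (8) come from an edit-distance alignment between the reference and the recognized phoneme sequences, as in speech recognition scoring. The sketch below shows this standard counting on two hypothetical romanized phoneme sequences; it is not the paper's own scoring code.

```python
def align_counts(reference, hypothesis):
    """Count correct phonemes, insertions, deletions, and substitutions
    between a reference and a recognized phoneme sequence, using a
    standard edit-distance alignment. These counts are what the error
    definition in (8) is applied to.
    """
    R, H = len(reference), len(hypothesis)
    # dp[i][j] = (edits, correct, ins, dels, subs) for reference[:i] vs hypothesis[:j]
    dp = [[None] * (H + 1) for _ in range(R + 1)]
    dp[0][0] = (0, 0, 0, 0, 0)
    for i in range(1, R + 1):
        e, c, ins, dels, subs = dp[i - 1][0]
        dp[i][0] = (e + 1, c, ins, dels + 1, subs)      # deletions only
    for j in range(1, H + 1):
        e, c, ins, dels, subs = dp[0][j - 1]
        dp[0][j] = (e + 1, c, ins + 1, dels, subs)      # insertions only
    for i in range(1, R + 1):
        for j in range(1, H + 1):
            match = reference[i - 1] == hypothesis[j - 1]
            e, c, ins, dels, subs = dp[i - 1][j - 1]
            diag = (e + (0 if match else 1),
                    c + (1 if match else 0),
                    ins, dels, subs + (0 if match else 1))
            e, c, ins, dels, subs = dp[i - 1][j]
            up = (e + 1, c, ins, dels + 1, subs)        # delete reference[i-1]
            e, c, ins, dels, subs = dp[i][j - 1]
            left = (e + 1, c, ins + 1, dels, subs)      # insert hypothesis[j-1]
            dp[i][j] = min(diag, up, left)              # fewest edits wins
    _, correct, ins, dels, subs = dp[R][H]
    return correct, ins, dels, subs, R

ref = "n e sh a s t".split()      # hypothetical reference phonemes
hyp = "n a sh a s".split()        # hypothetical recognized phonemes
print(align_counts(ref, hyp))     # (4, 0, 1, 1, 6): 4 correct, 1 deletion, 1 substitution
```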

4 Conclusion

In this paper we first reviewed the phonological characteristics of the Persian language as well as its structure of articulation. We then found that the performance of current grapheme to phoneme systems is highly influenced by the phonological structure of Persian. Therefore, we defined the problem of grapheme to phoneme conversion as supervised sequential labeling. We chose a BRNN equipped with LSTM memory blocks to model the labeling problem; BRNN-LSTM has been shown to have excellent performance in processing sequential data and learning long-term pattern information. We used CURRENNT, which is a GPU-based optimized implementation of recurrent neural networks. Our experimental results showed more than 98% accuracy in Persian grapheme to phoneme conversion. Moreover, the model is integrated and has none of the complexities of the previous methods.

Acknowledgement: The authors would like to thank the NVIDIA Company for the donated GPU (Tesla K40). All experiments in this paper were conducted using this GPU.

References

[1] The Festival speech synthesis system, http://www.cstr.ed.ac.uk/projects/festival/
[2] Cepstral text-to-speech system, http://www.cepstral.com/en/demos
[3] HMM-based speech synthesis system (HTS), http://hts.sp.nitech.ac.jp/
[4] Onaolapo J., Idachaba F., Badejo J., Odu T., Adu O., A simplified overview of text-to-speech synthesis, Proceedings of the World Congress on Engineering, 2014, 1
[5] Isewon I., Oyelade J., Oladipupo O., Design and implementation of text to speech conversion for visually impaired people, International Journal of Applied Information Systems, 2014, 7(2), 25-30

[6] Latacz L., Mattheyses W., Verhelst W., Speaker-specific pronunciation for speech synthesis, Text, Speech, and Dialogue, Springer, 2013, 501-508
[7] Bagshaw P. C., Phonemic transcription by analogy in text-to-speech synthesis: Novel word pronunciation and lexicon compression, Computer Speech & Language, 1998, 12(2), 119-142
[8] Chen G., Khudanpur S., Povey D., Trmal J., Yarowsky D., Yilmaz O., Quantifying the value of pronunciation lexicons for keyword search in low-resource languages, Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference, IEEE, 2013, 8560-8564
[9] Dahl G. E., Yu D., Deng L., Acero A., Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition, Audio, Speech, and Language Processing, IEEE Transactions, 2012, 20(1), 30-42
[10] Hinton G., Deng L., Yu D., Dahl G. E., Mohamed A., Jaitly N., Senior A., Vanhoucke V., Nguyen P., Sainath T. N., et al., Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups, Signal Processing Magazine, IEEE, 2012, 29(6), 82-97
[11] Krizhevsky A., Sutskever I., Hinton G. E., Imagenet classification with deep convolutional neural networks, Advances in Neural Information Processing Systems, Curran Associates, Inc., 2012, 1097-1105
[12] Ciresan D., Meier U., Schmidhuber J., Multi-column deep neural networks for image classification, Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference, IEEE, 2012, 3642-3649
[13] Sun R., Giles C. L., Sequence learning: from recognition and prediction to sequential decision making, IEEE Intelligent Systems, 2001, 16(4), 67-70
[14] Bastiaans M. J., On the sliding-window representation in digital signal processing, Acoustics, Speech and Signal Processing, IEEE Transactions, 1985, 33(4), 868-873
[15] Graves A., Schmidhuber J., Framewise phoneme classification with bidirectional LSTM and other neural network architectures, Neural Networks, 2005, 18(5), 602-610
[16] Graves A., et al., Supervised sequence labelling with recurrent neural networks, Springer, 2012, 385
[17] Weninger F., Introducing CURRENNT: The Munich open-source CUDA recurrent neural network toolkit, Journal of Machine Learning Research, 2015, 16, 547-551
[18] Schuster M., Paliwal K. K., Bidirectional recurrent neural networks, Signal Processing, IEEE Transactions, 1997, 45(11), 2673-2681
[19] Baldi P., Brunak S., Frasconi P., Soda G., Pollastri G., Exploiting the past and the future in protein secondary structure prediction, Bioinformatics, 1999, 15(11), 937-946
[20] Baldi P., Brunak S., Frasconi P., Pollastri G., Soda G., Bidirectional dynamics for protein secondary structure prediction, Sequence Learning, Springer, 2001, 80-104
[21] Hochreiter S., Bengio Y., Frasconi P., Schmidhuber J., Gradient flow in recurrent nets: the difficulty of learning long-term dependencies, 2001
[22] Hochreiter S., Schmidhuber J., Long short-term memory, Neural Computation, 1997, 9(8), 1735-1780
[23] Graves A., Supervised sequence labelling with recurrent neural networks, PhD thesis, Technical University Munich, 2008, http://d-nb.info/99115827X
[24] Bijankhan M., Sheikhzadegan J., Roohani M., FarsDat, the speech database of Farsi spoken language, Proceedings of the Australian Conference on Speech Science and Technology, 1994
[25] Bijankhan M., Sheykhzadegan J., Bahrani M., Ghayoomi M., Lessons from building a Persian written corpus: Peykare, Language Resources and Evaluation, 2011, 45(2), 143-164