Web Application for Romanian Language Phonetic Transcription

MACRo 2017 - 6th International Conference on Recent Achievements in Mechatronics, Automation, Computer Science and Robotics

József DOMOKOS 1, Zsolt Attila SZAKÁCS 2

Department of Electrical Engineering, Faculty of Technical and Human Sciences, Sapientia University, Tg. Mureş, 1e-mail: [email protected], 2e-mail: [email protected]

Manuscript received September 10, 2017, revised October 19, 2017.

Abstract: This paper presents a Romanian language phonetic transcription web service and application built using Java technologies on top of Phonetisaurus G2P, a Weighted Finite State Transducer (WFST)-driven grapheme-to-phoneme conversion toolkit. We used the NaviRO Romanian language pronunciation dictionary for WFST model training and the MIT Language Modeling (MITLM) toolkit to estimate the required joint sequence n-gram language model. Dictionary evaluation tests are also included in the paper. The service can be accessed for educational, research and other non-commercial usage at http://users.utcluj.ro/~jdomokos/naviro/.

Keywords: Romanian language, phonetic transcription, grapheme-to-phoneme (G2P) conversion, letter-to-sound mapping, letter-to-phoneme (L2P) transcription, web service

1. Introduction

In the past decades, Romanian was described in the literature as a resource-scarce language [1][2][3][4] because of its limited and mostly inconsistent presence on the web, the lack of linguistic expertise, and the lack of electronic resources for natural language processing, such as electronic dictionaries and transcribed speech data. In the last few years, several Romanian automatic speech recognition (ASR) systems have been reported [2][6][7][8][9]. Significant results have also been achieved in the field of speech synthesis [3][10][11][12][13]. As a result of these research activities, Romanian is nowadays considered a low-resourced language from the point of view of language resources [14].

DOI: 10.1515/macro-2017-0001

Despite the improvements made in the fields of ASR and text-to-speech (TTS) synthesis, the large vocabulary problem for the Romanian language is far from being solved. The fundamental problem that has inhibited the development of large vocabulary tasks in both ASR and TTS systems is the absence of freely available speech and language resources for Romanian. Although the volume of language resources developed by Romanian research groups has been growing over the years, the results are not widely available [14]. Making Romanian language resources freely available is the main motivation for further work in this domain.

Pronunciation dictionaries are very useful language resources for spoken language technology. They are widely used in ASR and TTS applications because they are required for tasks such as automated segmentation of speech at the phonetic level, and because predicting the pronunciation of a written word is an important subtask of all speech production systems [4].

Several methods and tools have been reported in this field by different research groups, but unfortunately, in most cases, researchers have no direct access to the resources described. A letter-to-phone conversion system based on a neural approach was reported in [15]. The authors designed an accurate g2p converter for unrestricted TTS synthesis in Romanian, based on a parallel architecture of neural networks. The percentage of correctly transcribed words is very high on a small hand-built test set. Paper [16] reports a rule-based g2p conversion tool for Romanian, along with experiments conducted on two different sets of words. A database of 4779 words was transcribed with more than 95% accuracy using a set of 102 letter-to-sound rules. Using the same rules, a larger set of 15599 words was transcribed with 91.46% accuracy.

Another group of researchers presents an enhanced version of the rule-based grapheme-to-phoneme transcription system in [17]. Some natural language processing tools are provided as VoiceForge projects at the Research Institute for Artificial Intelligence of the Romanian Academy (RACAI) [18], and are also available on the MetaShare website. The project provides a set of tools for Romanian TTS synthesis. According to the authors, the service performs the following TTS-specific tasks: text normalization, part-of-speech tagging, lexical stress assignment, phonetic transcription, syllabification and voice synthesis. Paper [19] describes the RACAI TTS entry for the Blizzard Challenge 2013. Article [20] presents the Bermuda component of the TTS toolbox. Bermuda performs phonetic transcription of out-of-vocabulary words using a Maximum Entropy classifier and a custom-designed algorithm.

Speech processing tools were also applied recently to investigate morpho-phonetic trends in contemporary spoken Romanian [21], with the objective of improving the pronunciation dictionary and the acoustic models of a speech recognition system. Finally, a multilingual grapheme-to-phoneme conversion tool is available on GitHub [22]. Multilingual G2P is based on espeak [23] and can be used for several languages. By default, the transcriber performs grapheme-to-phoneme conversion for Brazilian Portuguese.

The main objective of this article is to present the developed web application. The rest of the paper is organized as follows. Section 2 presents the overall system architecture, including the phonetic transcription web service and the JSP client application; Section 3 introduces the phonetic transcription dictionary together with the grapheme and phoneme sets used; Section 4 presents the dictionary testing and evaluation; Section 5 summarizes our work, followed by acknowledgments and references.

2. System architecture

Fig. 1 shows the overall system design and deployment architecture. The Romanian language phonetic transcription web service is built using Java technologies on top of the Phonetisaurus Weighted Finite State Transducer (WFST)-driven phoneticizer application [24]. We use Ubuntu Linux shell scripts to invoke the phonetisaurusg2p application from the Java web service to perform g2p transcription. To make the phonetic transcription application accessible to anyone on the Internet, the Java-based phonetic transcription web service is deployed on a Glassfish 4.1 Application Server.

Clients can access the web service from any Internet browser using the phonetic transcription Java Server Pages (JSP) client application. Alternatively, any software application can access the web service directly through HTTP requests.

The phonetic transcription JSP application provides an easy-to-use interface for entering a Romanian word to be transcribed (see Fig. 2). The JSP interface passes the word to the web service, which calls the phonetisaurusg2p application using native Ubuntu Linux shell command invocation. The web service then reads the results from the phonetisaurusg2p output text file and passes them to the client side through the JSP interface (see Fig. 2).

The phonetic transcription JSP client application also supports the transcription of multiple words through a text file upload facility. File transfer is implemented using Java Servlet technology. The phonetisaurusg2p application is called to perform phonetic transcription until the end of the text file is reached, and the JSP interface then displays the transcription results to the user. Multiple word transcription returns the transcribed word list both in the interface and as a downloadable text file (see Fig. 3).
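The multiple word workflow described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the service's Java implementation: the wrapper script name `./transcribe.sh`, its single-argument calling convention, and the helper names `run_g2p`, `transcribe_words` and `format_results` are all hypothetical, and the sketch reads the transcription from standard output rather than from an output text file as the actual service does.

```python
import subprocess
from typing import Callable, Iterable, List, Tuple

# Hypothetical wrapper script; the paper does not name the actual
# shell script or list its arguments.
G2P_COMMAND = ["./transcribe.sh"]


def run_g2p(word: str) -> str:
    """Call the external G2P command for one word.

    Assumption: the script takes the word as its only argument and
    prints the phoneme sequence on standard output.
    """
    result = subprocess.run(
        G2P_COMMAND + [word],
        capture_output=True,
        text=True,
        check=True,  # raise if the external command fails
    )
    return result.stdout.strip()


def transcribe_words(
    lines: Iterable[str],
    g2p: Callable[[str], str] = run_g2p,
) -> List[Tuple[str, str]]:
    """Transcribe one word per line until the input is exhausted."""
    results = []
    for line in lines:
        word = line.strip()
        if not word:
            continue  # skip blank lines in the uploaded file
        results.append((word, g2p(word)))
    return results


def format_results(results: List[Tuple[str, str]]) -> str:
    """Build the downloadable text: one 'word<TAB>phonemes' pair per line."""
    return "\n".join(f"{word}\t{phonemes}" for word, phonemes in results)
```

Passing the transcriber as a parameter keeps the loop independent of how the external tool is invoked, so the same function can be exercised with a stand-in transcriber when Phonetisaurus is not installed.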

Figure 1: Phonetic transcription system architecture and deployment.

Figure 2: Example interface for the Phonetic transcription JSP application.

Figure 3: Interface example for multiple word transcription.

3. The phonetic transcription dictionary

We use the NaviRO Romanian language pronunciation dictionary, previously presented in [25], for WFST model training. The NaviRO pronunciation dictionary contains 138500 Romanian words from the DexOnline [26] dictionary. Initially, rules learned from a small hand-built set of 2383 transcribed words were used. Later, the dictionary was extended using a slightly modified version of the Dictionary Maker [27] application and checked by human users. Dictionary development was briefly presented in [4].

The grapheme set used consists of the 31 standard graphemes of the Romanian alphabet according to the second edition of DOOM, the spelling, orthoepic and morphological dictionary of the Romanian language. Table 1 shows the grapheme set used, and Table 2 contains the phoneme set used. Both sets are standard in the literature, although there are some minor disagreements among linguists.

Table 1: The used grapheme set.

Grapheme set
a ă â b c d e
f g h i î j k
l m n o p q r
s ş t ţ u v w
x y z

The Romanian phonetic inventory presented in Table 2 is collected from existing printed linguistic resources, as described in [4][5][10].

Table 2: The used phoneme set.

Phoneme  Type       Usage
a        vowel      used as Romanian a (mama - m a m a)
@        vowel      used as Romanian ă (casă - k a s @)
1        vowel      used as Romanian â and î (cât - k 1 t, în - 1 n)
e        vowel      used as Romanian e (mere - m e r e)
i        vowel      used as Romanian i (pin - p i n)
i_0      vowel      used as weak i at the end of words (lupi - l u p i_0)
o        vowel      used as Romanian o (morcov - m o r k o v)
u        vowel      used as Romanian u (pur - p u r)
e_X      vowel      used for marking the Romanian diphthong ea (canapea - k a n a p e_X a)
j        semivowel  used as i when pronounced as a semivowel (croitor - k r o j i t o r)
o_X      vowel      used for marking the Romanian diphthong oa (amploare - a m p l o_X a r e)
b        consonant  used as Romanian b (bob - b o b)
d        consonant  used as Romanian d (dar - d a r)
g        consonant  used as Romanian g (gram - g r a m)
k        consonant  used as Romanian c (carne - k a r n e)
p        consonant  used as Romanian p (pupăză - p u p @ z @)
t        consonant  used as Romanian t (tata - t a t a)
m        consonant  used as Romanian m (mama - m a m a)
n        consonant  used as Romanian n (nene - n e n e)
l        consonant  used as Romanian l (letal - l e t a l)
f        fricative  used as Romanian f (far - f a r)
v        fricative  used as Romanian v (var - v a r)
s        fricative  used as Romanian s (sat - s a t)
S        fricative  used as Romanian ș (șanț - S a n ts)
z        fricative  used as Romanian z (zahăr - z a h @ r)
h        fricative  used as Romanian h (har - h a r)
r        consonant  used as Romanian r (rar - r a r)
Z        consonant  used as Romanian j (jale - Z a l e)
ts       consonant  used as Romanian ț (țara - ts a r a)
tS       consonant  used for the Romanian ce, ci grapheme group (cireșe - tS i r e S e, cer - tS e r)
dZ       consonant  used for the Romanian ge, gi grapheme group (aborigen - a b o r i dZ e n, gin - dZ i n)
sil      silence    the phonetic null character
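One practical use of a fixed phoneme inventory is checking that every symbol in a transcription belongs to it. The following Python sketch encodes the 32 symbols of Table 2 and reports unknown symbols; the names `PHONEME_SET` and `unknown_symbols` are hypothetical, and the sketch assumes that phonemes in a transcription are separated by single spaces, as in the examples of Table 2.

```python
# The phoneme symbols listed in Table 2 (32 symbols including "sil").
PHONEME_SET = frozenset({
    # vowels and the semivowel
    "a", "@", "1", "e", "i", "i_0", "o", "u", "e_X", "o_X", "j",
    # consonants
    "b", "d", "g", "k", "p", "t", "m", "n", "l", "r", "Z",
    "ts", "tS", "dZ",
    # fricatives
    "f", "v", "s", "S", "z", "h",
    # silence
    "sil",
})


def unknown_symbols(transcription: str) -> list:
    """Return the space-separated symbols that are not in PHONEME_SET."""
    return [s for s in transcription.split() if s not in PHONEME_SET]


if __name__ == "__main__":
    print(unknown_symbols("k a n a p e_X a"))  # all symbols are known
    print(unknown_symbols("c a r n e"))        # "c" is a grapheme, not a phoneme symbol
```

A check of this kind is cheap to run over a whole dictionary and catches symbols that slipped in from the grapheme side, such as `c` in place of `k`.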

4. Dictionary evaluation

We recall here the evaluation tests of the generated pronunciation dictionary, since the quality and consistency of the dictionary are crucial for its use in ASR and TTS systems. We used the NaviRO Romanian language pronunciation dictionary [4] for WFST model training. We used 80 % of the dictionary (110,800 words) to train the Phonetisaurus WFST-driven grapheme-to-phoneme converter, and retained 10 % of the dictionary for development and 10 % for testing purposes.

The joint sequence n-gram model needed by Phonetisaurus was estimated using the MIT Language Modeling toolkit [28]. A statistical n-gram language model describes the conditional probability of a phoneme given its history, that is, the n - 1 previous phonemes. Such a model is typically estimated from a large text corpus; in this case, the estimation was based on the aligned training dictionary. We used MITLM to train a statistical model with a history of length n = 7, using modified Kneser-Ney smoothing as the probability smoothing algorithm. The resulting language model was stored in ARPA MIT format.
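The 80/10/10 partition of the dictionary can be illustrated with a short Python sketch. The paper does not state how entries were assigned to the three parts, so the random shuffle, the fixed seed, and the function name `split_dictionary` are assumptions made only for this example.

```python
import random
from typing import List, Sequence, Tuple


def split_dictionary(
    entries: Sequence,
    train_frac: float = 0.8,
    dev_frac: float = 0.1,
    seed: int = 0,
) -> Tuple[List, List, List]:
    """Split entries into training, development and test parts.

    Assumption: entries are shuffled randomly before splitting; the
    remainder after the training and development parts is the test part.
    """
    items = list(entries)
    random.Random(seed).shuffle(items)  # reproducible shuffle
    n_train = round(len(items) * train_frac)
    n_dev = round(len(items) * dev_frac)
    train = items[:n_train]
    dev = items[n_train:n_train + n_dev]
    test = items[n_train + n_dev:]
    return train, dev, test


if __name__ == "__main__":
    # 138500 placeholder entries stand in for the dictionary words.
    train, dev, test = split_dictionary(range(138500))
    print(len(train), len(dev), len(test))
```

With 138500 entries this gives 110800 training entries and 13850 entries each for development and testing, matching the sizes reported in this section.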

Using the training part of the dictionary, we performed alignment, n-gram estimation, and conversion of the language model to a weighted finite state transducer. At grapheme level, the reference contains a total of T = 117737 tokens (graphemes). After generating pronunciations for the word list included in the test dataset, we found M = 117485 matches, S = 157 substitutions, I = 131 insertions and D = 95 deletions. Thus, 99.79 % of the tokens were matched, calculated as the ratio of matched to total tokens, M/T. The token error rate (ER), calculated as (S+I+D)/T, was 0.33 %, and the accuracy, computed as 1.0 - ER, was 99.67 % [10].

At word level, there were a total of N = 13850 sequences (words), of which C = 13507 were correctly generated and E = 343 contained errors. Therefore, the word error rate, computed as E/N, was 2.48 %, and the transcription accuracy at word level, computed as 1.0 - E/N, was 97.52 % [10].
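The reported percentages follow directly from the counts given above, as the following Python sketch shows. The function names `token_metrics` and `word_metrics` are hypothetical; the sketch also uses the fact that the number of reference tokens equals matches plus substitutions plus deletions, which holds for the reported counts (117485 + 157 + 95 = 117737).

```python
def token_metrics(matches: int, substitutions: int,
                  insertions: int, deletions: int) -> dict:
    """Token-level match rate, error rate and accuracy."""
    # Every reference token is matched, substituted or deleted.
    total = matches + substitutions + deletions
    error_rate = (substitutions + insertions + deletions) / total
    return {
        "total": total,
        "match_rate": matches / total,
        "error_rate": error_rate,
        "accuracy": 1.0 - error_rate,
    }


def word_metrics(correct: int, errors: int) -> dict:
    """Word-level error rate and accuracy."""
    total = correct + errors
    error_rate = errors / total
    return {
        "total": total,
        "error_rate": error_rate,
        "accuracy": 1.0 - error_rate,
    }


if __name__ == "__main__":
    tokens = token_metrics(117485, 157, 131, 95)
    words = word_metrics(13507, 343)
    for name, value in (("token match rate", tokens["match_rate"]),
                        ("token error rate", tokens["error_rate"]),
                        ("token accuracy", tokens["accuracy"]),
                        ("word error rate", words["error_rate"]),
                        ("word accuracy", words["accuracy"])):
        print(f"{name}: {100 * value:.2f} %")
```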

5. Conclusion

After creating and publishing the NaviRO pronunciation dictionary, we created a freely available phonetic transcription web application for Romanian words. We believe that the web service will help the Romanian speech technology community transcribe out-of-vocabulary words for both ASR and TTS purposes. The evaluation results at both phoneme and word level show that the dictionary is consistent enough to be used in ASR and TTS systems. The developed web service JSP client is freely available for educational, research and non-commercial usage at http://users.utcluj.ro/~jdomokos/naviro/. Requests for direct access to the web service are encouraged.

Acknowledgements

The publication of these conference proceedings is supported by Sapientia University.

References

[1] D. Cristea, C. Forăscu, “Linguistic resources and technologies for Romanian language” in Computer Science Journal of Moldova, 14 (1), 2006, pp. 34–73.
[2] C. Burileanu, V. Popescu, A. Buzo, C. S. Petrea, D. Ghelmez-Hane, “Spontaneous speech recognition for Romanian in spoken dialogue systems” in Proceedings of the Romanian Academy, 11 Series A(1), 2010, pp. 83–91.
[3] A. Stan, J. Yamagishi, S. King, M. Aylett, “The Romanian speech synthesis (RSS) corpus: Building a high-quality HMM-based speech synthesis system using a high sampling rate” in Speech Communication, 53(3), 2010, pp. 442–450.
[4] J. Domokos, O. Buza, G. Toderean, “100K+ words, machine-readable, pronunciation dictionary for the Romanian language” in Proceedings of the 20th European Signal Processing Conference (EUSIPCO), 2012, pp. 320–324.
[5] J. Domokos, O. Buza, G. Toderean, “Automated grapheme-to-phoneme conversion system for Romanian” in Proceedings of the 6th Speech Technology and Human-Computer Dialogue Conference (SpeD), 2011, pp. 105–110.
[6] D.-P. Munteanu, C.-I. Vizitiu, “Robust Romanian language automatic speech recognizer based on multistyle training” in WSEAS Transactions on Computer Research, 3(2), 2008, pp. 98–109.
[7] D. Militaru, I. Gavăt, O. Dumitru, T. Zaharia, S. Segarceanu, “ProtoLOGOS, system for Romanian language automatic speech recognition and understanding (ASRU)” in Proceedings of the 5th Conference on Speech Technology and Human–Computer Dialogue (SpeD), 2009, pp. 21–32.
[8] A. Buzo, H. Cucu, C. Burileanu, “Text spotting in large speech databases for under-resourced languages” in Proceedings of the 7th International Conference Speech Technology and Human–Computer Dialogue (SpeD), 2013, pp. 77–82.
[9] J. Domokos, S. László, O. Buza, G. Toderean, “Romanian language voice browsing for web applications using grapheme level acoustic modeling” in Advanced Engineering Forum, Vol. 8–9, 2013, pp. 29–36.
[10] J. Domokos, O. Buza, G. Toderean, “Romanian phonetic transcription dictionary for speeding up language technology development” in Language Resources and Evaluation, Vol. 49, Issue 2, 2015, pp. 311–325.
[11] A. Stan, M. Giurgiu, “Romanian Language Statistics and Resources for Text-To-Speech Systems” in Proceedings of the 9th IEEE International Symposium on Electronics and Telecommunications (ISETC), 2010, pp. 381–384.
[12] D. Burileanu, C. Negrescu, M. Surmei, “Recent Advances In Romanian Language Text-To-Speech Synthesis” in Proceedings of the Romanian Academy, Series A, Volume 11, Number 1/2010, pp. 92–99.
[13] A. Stan, O. Watts, Y. Mamiya, M. Giurgiu, R. A. J. Clark, J. Yamagishi, S. King, “TUNDRA: A Multilingual Corpus of Found Data for TTS Research Created with Light Supervision” in Proc. Interspeech, 2013, pp. 2331–2335.
[14] P. G. Zălhan, “Building a LVCSR System For Romanian: Methods And Challenges” in Journal of Public Administration, Finance and Law (JOPAFL), Issue 10, 2016, pp. 181–191.
[15] D. Burileanu, M. Sima, A. Neagu, “A phonetic converter for speech synthesis in Romanian” in Proceedings of the XIVth Congress on Phonetic Science (ICPhS), vol. 1, San Francisco, 1999, pp. 503–506.
[16] Ṣ.-A. Toma, D. Munteanu, “Rule-based automatic phonetic transcription for the Romanian language” in Proceedings of the first International Conference on Future Computational Technologies and Applications, 2009, pp. 682–686.
[17] A. Ṣaupe, M. Duma, G. C. Silaghi, M. A. Ordean, M. Ordean, “Enhanced rule-based phonetic transcription for the Romanian language” in Proceedings of the 13th International Symposium on Symbolic and Numeric Algorithms for Scientific Computation (SYNASC), 2009, pp. 401–406.
[18] RACAI Text Tools webpage: http://www.racai.ro/en/tools/text/
[19] T. Boroș, R. Ion, Ș. D. Dumitrescu, “The RACAI Text-to-Speech Synthesis System” in Blizzard Challenge, 2013.
[20] T. Boroș, D. Ștefănescu, R. Ion, “Bermuda, a data-driven tool for phonetic transcription of words” in Natural Language Processing for Improving Textual Accessibility (NLP4ITA) Workshop Programme, 2012, pp. 35–39.
[21] I. Vasilescu, B. Vieru, L. Lamel, “Exploring Pronunciation Variants For Romanian Speech-To-Text Transcription” in Proceedings of 4th International Workshop on Spoken Language Technologies for Under-resourced Languages (SLTU), 2014, pp. 161–168.
[22] Multilingual Grapheme to Phoneme project on GitHub: https://github.com/jcsilva/multilingual-g2p
[23] Espeak webpage: http://espeak.sourceforge.net/
[24] Phonetisaurus, a WFST-based Phoneticizer webpage: https://github.com/AdolfVonKleist/Phonetisaurus
[25] NaviRO Pronunciation Dictionary homepage: http://users.utcluj.ro/~jdomokos/naviro/
[26] Dexonline webpage: https://dexonline.ro/
[27] Dictionary Maker application webpage: https://sourceforge.net/projects/dictionarymaker/
[28] MIT Language Modeling Toolkit webpage: https://github.com/mitlm/mitlm