NLP Applications of Sinhala: TTS & OCR

Ruvan Weerasinghe, Asanka Wasala, Dulip Herath and Viraj Welgama
Language Technology Research Laboratory, University of Colombo School of Computing,
35, Reid Avenue, Colombo 00700, Sri Lanka
{arw,raw,dlh,wvw}@ucsc.cmb.ac.lk

Abstract

This paper brings together the practical applications and the evaluation of the first Text-to-Speech (TTS) system for Sinhala, built using the Festival framework, and an Optical Character Recognition (OCR) system for Sinhala.

1 Introduction

The Language Technology Research Laboratory (LTRL; see http://www.ucsc.cmb.ac.lk/ltrl) of the University of Colombo School of Computing (UCSC) was established in 2004, evolving from work on local language computing in Sri Lanka carried out by academics of the university since the early 1990s.

Under the scope of the laboratory, numerous Natural Language Processing projects are being carried out with the relevant national bodies, international technology partners, local industry and wider regional collaborators, particularly within the PAN Localization Initiative (see http://www.panl10n.net). The Sri Lankan component of the PAN Localization Project concentrated on developing some of the fundamental resources needed for language processing and some software tools for immediate deployment at the end of the project. Among the resources produced are a Sinhala language corpus of 10 million words and a trilingual Sinhala-English-Tamil lexicon. The two main software tools developed are a Sinhala Text-to-Speech (TTS) system and an Optical Character Recognition (OCR) system for recognizing text in commonly used Sinhala publications.

This paper focuses primarily on the end-user applications developed under the above project: the Sinhala TTS system and the OCR system. It describes the practical applications of these tools and evaluates them in the light of experience gained so far.

The rest of this paper is organized as follows: Section 2 gives an overview of the Sinhala TTS system; Section 3 describes the Sinhala OCR system. A summary, along with future research directions and improvements, is given in the last section.

2 Sinhala Text-to-Speech System

Sighted computer users spend a lot of time reading items on-screen to carry out regular tasks such as checking email, filling out spreadsheets, gathering information from the internet, preparing and editing documents, and much more. Visually impaired people, however, cannot perform these tasks without assistance from others or without using assistive technologies.

A text-to-speech (TTS) system takes computer text and converts it into audible speech (Dutoit, 1997). With a TTS engine, an application, and basic computer hardware, one can listen to computer text instead of reading it. A screen reader (Screen Reader, 2007) is a piece of software that attempts to identify and read aloud what is being displayed on the screen. The screen reader reads aloud text within a document, and it also reads aloud information within dialog boxes and error messages. In other words, the primary function of any screen reading system is to become the “eye” of the visually impaired computer user. These technologies enable blind or visually impaired people to do things that they could not previously do by themselves. As such, text-to-speech synthesizers make information accessible to the print disabled.

Within Sri Lanka, there is a great demand for a TTS system in local languages, particularly a screen reader or web browser for visually impaired people. In the case of the Tamil language, work done in India could be used directly. Until the LTRL initiatives at UCSC were launched in 2004, no viable TTS system had been developed for Sinhala, the mother tongue of 74% of Sri Lankans (Karunatillake, 2004).

A project to develop a ‘commercial grade’ Sinhala text-to-speech system was launched at UCSC in 2004. Later, it was extended to develop a screen reader which can be used by visually impaired persons for reading Sinhala texts.

The Sinhala TTS system was implemented based on the Festival speech synthesizer (Taylor et al., 1998). The Festival system is an open-source, stable and portable multilingual speech synthesis system developed at the Center for Speech Technology Research (CSTR), University of Edinburgh (Taylor et al., 1998; Black and Lenzo, 2003). TTS systems have been developed using the Festival framework for different languages, including English, Japanese, Welsh, Turkish, Hindi, and Telugu (Black and Lenzo, 2003). However, efforts are still continuing to develop a standard Sinhala speech synthesizer in Sri Lanka.

The Sinhala text-to-speech system is based on the diphone concatenation approach. Construction of a diphone database and implementation of the natural language processing modules were key research areas explored in this project. In this exercise, 1413 diphones were identified. The diphones were extracted from nonsensical words, and recordings were carried out in a professional studio. Moreover, language-specific scripts (phone, lexicon, tokenization) and speaker-specific scripts (duration and intonation) were defined for Sinhala. The development of a context-sensitive letter-to-sound conversion rule set for Sinhala is worth mentioning. The incorporation of a high-accuracy native syllabification routine (Weerasinghe et al., 2005) and the implementation of comprehensive text analysis facilities (capable of producing the accurate pronunciation of elements such as numbers, currency symbols, ratios, percentages, abbreviations, Roman numerals, time expressions, number ranges, telephone numbers, email addresses, English letters and various other symbols) were found to be unique to the language (Weerasinghe et al., 2007). Despite Festival's incomplete support for UTF-8, the above rules were rewritten in UTF-8 multi-byte format, following the work done for the Telugu language (Kamisetty, 2006). The current Sinhala TTS engine accepts Sinhala Unicode text and converts it into speech. A male voice has been incorporated. Moreover, the system has been engineered to be used on different platforms and operating systems (i.e. Linux and Windows) and by different software applications (Weerasinghe et al., 2007).

2.1 Applications of TTS Synthesis Engine

Sinhala text is made accessible by the TTS engine via two interfaces. A standalone application named “Katha Baha” primarily reads documents in Sinhala Unicode text format aloud. The same application can also be used to record the synthesized speech.

In this way, local language newspapers and textbooks can easily be transformed into audio materials such as CDs. This software provides a convenient way to disseminate up-to-date news and information to the print disabled. For example, a newspaper company may podcast its newspaper, enabling access for the print disabled and everyone else. Furthermore, the same application can be used to produce Sinhala digital talking books. To ensure easy access by the print disabled, keyboard shortcuts are provided.

Owing to the prevalent use of Windows among the visually impaired community in Sri Lanka, it is essential that a system be developed within the Windows environment which offers Sinhala speech synthesis to existing applications. The standard speech synthesis and recognition interface in Microsoft Windows is the Microsoft Speech Application Programming Interface (MS-SAPI) (Microsoft Corporation, n.d.). MS-SAPI enabled applications can make use of any MS-SAPI enabled voice that has been installed in Windows. Therefore, steps were taken to integrate the Sinhala voice into MS-SAPI. As a result, the MS-SAPI compliant Sinhala voice is accessible via any speech-enabled Windows application. The Sinhala voice has been shown to work well with “Thunder” (available from http://www.screenreader.net/), a freely available screen reader for Windows. Additionally, steps were taken to translate and integrate
common words related to the Thunder screen reader (e.g. link = “සබැඳිය”, list item = “ලැයිස්තු අයිතම”) (Weerasinghe et al., 2007).

Since most Linux distributions now come with Festival pre-installed, the integration of the Sinhala voice on such platforms is very convenient. Furthermore, the Sinhala voice developed here was made accessible to GNOME-Orca and Gnopernicus, powerful assistive screen reader software for people with visual impairments.

It is noteworthy that, for the first time in Sri Lankan history, the print disabled community will be able to use computers in their local languages by using the current Sinhala text-to-speech system.

2.2 Evaluation of the Text-to-Speech Synthesis Engine

Text-to-speech systems have been compared and evaluated with respect to intelligibility (understandability of speech), naturalness, and suitability for the intended application (Lemmetty, 1999). As the Sinhala TTS system is a general-purpose synthesizer, a decision was made to evaluate it under the intelligibility criterion. In particular, the TTS system is intended to be used with screen reader software by visually impaired people. Therefore, intelligibility is a more important feature than naturalness.

A Modified Rhyme Test (MRT) (Lemmetty, 1999) was designed to test the Sinhala TTS system. The test consists of 50 sets of 6 one- or two-syllable words, which makes a total of 300 words. The words are chosen to evaluate phonetic characteristics such as voicing, nasality, sibilation, and consonant gemination. Out of the 50 sets, 20 sets were selected for each listener. The 6 words of a set are played one at a time and the listener marks the synthesized word. The overall intelligibility of the system measured from 20 listeners was found to be 71.5% (Weerasinghe et al., 2007).

3 Optical Character Recognition System

Optical Character Recognition (OCR) technology is used to convert information available in printed form into machine-editable electronic text through a process of image capture, processing and recognition (Optical Character Recognition, 2007).

There are three essential elements to OCR technology. Scanning is the acquisition of printed documents as optical images using a device such as a flatbed scanner. Recognition involves converting these images to character streams representing the letters of recognized words. The final element involves accessing or storing the converted text.

Many OCR systems have been developed for recognizing Latin characters (Weerasinghe et al., 2006). Some OCR systems have been reported to have very high accuracy, and most such systems are commercial products. As a landmark, a Sinhala OCR system has been developed at UCSC (Weerasinghe et al., 2006).

Artificial Neural Networks (ANN) and Template Matching are two popular and widely used algorithms for optical character recognition. However, the application of these algorithms to a highly inflected language such as Sinhala is arduous due to the high number of input classes. Empirical estimation of the least number of input classes needed for training a neural net for Sinhala character recognition suggested about 400 classes (Weerasinghe et al., 2006). Therefore, the less complicated K-nearest neighbor (KNN) algorithm was employed for Sinhala character recognition.

The current OCR system is the first ever reported OCR system for Sinhala and is capable of recognizing printed Sinhala letters typed using fonts widely used in the publishing industry. The recognized content is presented as an editable Sinhala Unicode text file (Weerasinghe et al., 2006).

A large volume of information is available in printed form. The current OCR system will expedite the process of digitizing this information. Moreover, the information available via the printed medium is inaccessible to the print disabled, and the OCR system, especially when coupled with Sinhala TTS, will provide access to this information for the print disabled.

3.1 Evaluation of the Optical Character Recognition System

The performance of the Sinhala OCR system has been evaluated using 18000 sample Sinhala characters. These characters were extracted from various books and newspapers (Weerasinghe et al., 2006). Performance of the system has been evaluated with respect to the different fonts it supports best. The results are summarized in Table 1 (Weerasinghe et al., 2006).
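
The K-nearest neighbor classification and the recognition-rate measure described in this section can be illustrated with a minimal sketch. It assumes that each character image has already been reduced to a numeric feature vector; the toy two-dimensional features, the labels, the Euclidean distance and k = 3 are illustrative assumptions, since the feature set and parameters of the actual system are not given here. The last lines compute the unweighted mean of the per-font rates reported in Table 1, which is one plausible reading of the quoted average accuracy.

```python
from collections import Counter
from math import dist  # Euclidean distance, Python 3.8+


def knn_classify(sample, training_set, k=3):
    """Return the majority label among the k training vectors nearest to sample."""
    nearest = sorted(training_set, key=lambda item: dist(sample, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def recognition_rate(test_set, training_set, k=3):
    """Percentage of test samples whose predicted label matches the true label."""
    correct = sum(
        1
        for features, label in test_set
        if knn_classify(features, training_set, k) == label
    )
    return 100.0 * correct / len(test_set)


# Hypothetical training data: (feature vector, character label) pairs.
training = [
    ((0.0, 0.0), "ka"), ((0.1, 0.2), "ka"), ((0.2, 0.1), "ka"),
    ((1.0, 1.0), "ga"), ((0.9, 1.1), "ga"), ((1.1, 0.9), "ga"),
]

# Hypothetical test data; the third sample is deliberately mislabeled
# so that the recognition rate is below 100%.
test = [((0.05, 0.05), "ka"), ((1.0, 0.95), "ga"), ((0.9, 0.9), "ka")]
rate = recognition_rate(test, training)

# Per-font recognition rates as reported in Table 1 of this paper.
table1 = {"FM": 97.17, "DL": 96.26, "Lakbima": 89.89, "Letter": 95.81}
mean_rate = sum(table1.values()) / len(table1)  # unweighted mean (assumption)
```

With roughly 400 character classes, a sketch like this would need many labeled vectors per class; the sort-based neighbor search shown here is kept for clarity and is not meant to reflect the performance of the real system.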

Font*       FM      DL      Lakbima   Letter
% Recog.    97.17   96.26   89.89     95.81

* FM: “FM Abhaya”, DL: “DL Manel Bold”, Letter: “Letter Press”

Table 1. Experimental Results of Classification

From this evaluation it can be concluded that the current Sinhala OCR system has an average accuracy of about 95% (Weerasinghe et al., 2006).

4 Conclusion and Future Work

This paper brings together the development of a diphone voice for Sinhala based on the Festival speech synthesis system and an Optical Character Recognizer for Sinhala.

Future work on the Sinhala TTS engine will mainly focus on improving the prosody modules. A speech corpus containing 2 hours of speech has already been recorded. The material is currently being segmented and labeled. We are also planning to improve the duration model using the data obtained from the annotated speech corpus. A female voice is also expected to be developed in the near future. The current Sinhala OCR system is font dependent. Work is in progress to make the OCR system font independent and to improve its accuracy. The Sinhala OCR and TTS systems, which are currently two separate applications, will be integrated to make them more user friendly for the print disabled.

A number of other ongoing projects are aimed at developing resources and tools such as a POS tag set, a POS tagger and a tagged corpus for Sinhala, an on-the-fly web page translator, a translation memory application and several language teaching and learning resources for Sinhala, Tamil and English.

All resources developed under this project are made available (under the GNU General Public License) through the LTRL website.

Acknowledgement

This work was made possible through the PAN Localization Project (http://www.PANL10n.net), a grant from the International Development Research Center (IDRC), Ottawa, Canada, administered through the Center for Research in Urdu Language Processing, National University of Computer and Emerging Sciences, Pakistan.

References

Alan W. Black and Kevin A. Lenzo. 2003. Building Synthetic Voices, Language Technologies Institute, Carnegie Mellon University and Cepstral LLC. Retrieved from: http://festvox.org/bsv/

Microsoft Corporation. (n.d.). Microsoft Speech SDK Version 5.1. Retrieved from: http://msdn2.microsoft.com/en-/library/ms990097.aspx

T. Dutoit. 1997. An Introduction to Text-to-Speech Synthesis, Kluwer Academic Publishers, Dordrecht, Netherlands.

C. Kamisetty, S.M. Adapa. 2006. Telugu Festival Text-to-Speech System. Retrieved from: http://festival-te.sourceforge.net/wiki/Main_Page

W.S. Karunatillake. 2004. An Introduction to Spoken Sinhala, 3rd edn., M.D. Gunasena & Co. ltd., 217, Olcott Mawatha, Colombo 11.

Sami Lemmetty. 1999. Review of Speech Synthesis Technology, MSc. thesis, Helsinki University of Technology.

Screen Reader. 2007. Screen Reader. Retrieved from: http://en.wikipedia.org/wiki/Screen_reader

Optical Character Recognition. 2007. Optical Character Recognition. Retrieved from: http://en.wikipedia.org/wiki/Optical_character_recognition

P.A Taylor, A.W. Black, R.J. Caley. 1998. The Architecture of the Festival Speech Synthesis System, Third ESCA Workshop in Speech Synthesis, Jenolan Caves, Australia. 147-151.

Ruvan Weerasinghe, Asanka Wasala, Kumudu Gamage. 2005. A Rule Based Syllabification Algorithm for Sinhala, Proceedings of 2nd International Joint Conference on Natural Language Processing (IJCNLP-05), Jeju Island, Korea. 438-449.

Ruvan Weerasinghe, Dulip Lakmal Herath, N.P.K. Medagoda. 2006. A KNN based Algorithm for Printed Sinhala Character Recognition, Proceedings of 8th International Information Technology Conference, Colombo, Sri Lanka.

Ruvan Weerasinghe, Asanka Wasala, Viraj Welgama and Kumudu Gamage. 2007. Festival-si: A Sinhala Text-to-Speech System, Proceedings of 10th International Conference on Text, Speech and Dialogue (TSD 2007), Pilseň, Czech Republic, September 3-7, 2007. 472-479.
