BEP: a practical bilingual system for Japanese

T. Watanabe [a], M. Sakamoto, and M. Kiriake

[a] Department of Information Science, Shonan Institute of Technology, 1-1-25 Tsujido-NishiKaigan, Fujisawa, Kanagawa, 251-8511, JAPAN, Tel +81-466-30-0217, Fax +81-466-34-1096

All authors are members of ARGV (Accessibility Research Group for the Visually-Impaired)

The Bilingual Emacspeak Platform (BEP) is a bilingual text-to-speech system for Japanese visually impaired users that runs under both Windows and Linux. It is implemented as an extension of Raman's Emacspeak, a self-voicing GNU Emacs platform with a dedicated auditory user interface. Based on the experience of existing Japanese screen-readers, BEP is equipped with the functions necessary for a Japanese speech system. BEP for Windows uses various types of Japanese and English voices to represent bilingual information naturally and unambiguously. In this paper, the authors first review the features of Emacspeak and BEP and then present the requirements of a bilingual speech synthesis system for efficient and practical use.

Keywords: bilingual speech system, human-computer interaction, auditory user interface, text-to-speech, visually impaired, assistive technology, Japanese

1. Introduction

Personal computers enabled people to use a computer for their personal use. Personal computers also changed the lives of people with disabilities because this new technology was used as an assistive technology. With the aid of a personal computer, people with disabilities can work, study, live, or play independently. For example, visually impaired (low-vision and blind) people can use a computer to read information with a screen-reader, which is software that reads the text on a display through a speech synthesizer.

As technology grew more sophisticated, new technologies that made it easier for non-disabled people to use computers often created barriers for people with disabilities. For example, the Windows OS popularized personal computers among the general public; however, visually impaired people cannot make full use of its GUI (Graphical User Interface) because it is difficult for a screen-reader to transfer the rich contents of 2-dimensionally arranged information on the graphic display, e.g. icons and menus, to simple 1-dimensional speech output. Thus, we must develop a new, effective mechanism and user interface that convey the large amount of information in modern computers using synthesized voice. Such speech output also enhances mobile access to personal network appliances for general users and enables multi-modal user interfaces.

2. Emacspeak

In order to overcome the mismatch between Windows' GUI and screen-readers, T.V. Raman developed a new speech system called Emacspeak (distributed as open source at http://emacspeak.sourceforge.net/). Emacspeak is an Emacs subsystem that provides complete speech access to GNU Emacs, a structure-sensitive, fully customizable editor. Emacspeak knows context-specific information about what it is speaking and can efficiently transfer that information to the sense of hearing with the following techniques:

1) Auditory User Interface (Raman, 1997): Raman claims that, to adjust to the various requirements of users in various circumstances, the user interface must be separated from the system itself. Accordingly, Raman added an AUI, which works independently of the GUI, to Emacspeak. For example, it reads a calendar entry with additional information such as "Monday, April 1st, 2001". Without the AUI, the 2-dimensional information of a calendar is lost and the entry is spoken as "one", which makes no sense to visually impaired users.

2) Voice fonts: With the aid of functions of Emacs, Emacspeak can assign different voices to different kinds of information. For example, when reading a source code file, it assigns different voices to comment strings, keywords, and quoted strings. Emacspeak reads a Web page with Auditory Cascading Style Sheets [2]: it uses different pitches of the same voice for reading elements at different heading levels. DECtalk, a high-quality hardware text-to-speech synthesizer, enabled Emacspeak to use various kinds of voices with various characteristics (e.g., pitch, stress, intonation).

3) Auditory icons: Auditory icons are short sound cues that alert the user to significant events. A user can do another task while, for example, compiling a source program, until an auditory icon informs her/him of the end of compilation.

3. Bilingual Emacspeak Platform

3.1 Overview and structure

In 1999, we, the sighted and the visually impaired who love Emacs, launched the Bilingual Emacspeak Project (Watanabe, 2001), which develops the Bilingual Emacspeak Platform (BEP) to make Emacspeak bilingual (Japanese and English) and to make it run under Windows as well as Linux. The current system is bilingual because Japanese users need to read English as well as Japanese at the workplace and on the Internet. It runs under Windows because there are many users who use Windows. It also runs under Linux because the original Emacspeak works only under UNIX, and Linux is expected to be an alternative accessible OS for the handicapped (Watanabe, 2000b).

As shown in Figure 1, the current system consists of two parts: an Emacs Lisp program package that acts as the AUI, and a software speech server. The Emacs Lisp program represents the bilingual version of Emacspeak. The speech server controls the English and Japanese text-to-speech engines. It accepts commands in the DECtalk format and text in the Japanese Shift_JIS character-encoding scheme. The Windows speech server, which is a modification of G. Bishop's one [3], uses Microsoft's American English text-to-speech engine and a commercial Japanese text-to-speech engine that conform to Microsoft Speech API 4 (Watanabe, 2000a). The Linux speech server uses a Japanese text-to-speech engine developed by us based on a commercial software development kit (Watanabe, 2000b) and is not bilingual. It does not have bilingual capability yet, but it speaks English, with a Japanese accent, through the Japanese text-to-speech engine using the Japanese-English dictionary made by J. Ishikawa (Ishikawa).
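The division of labor described in section 3.1 -- an Emacs Lisp front end sending DECtalk-format commands and Shift_JIS-encoded text to a speech server that drives an English engine and a Japanese engine -- can be sketched roughly as follows. This is an illustrative Python sketch under simplifying assumptions, not the actual BEP protocol or code: the command syntax is reduced to `q {text}` (queue) and `d` (dispatch), and the engine calls are stubs.

```python
# Minimal sketch of a BEP-style speech server (illustrative only; the real
# protocol and engine APIs differ).  "q {text}" queues text, "d" dispatches
# the queue; each chunk is routed to a Japanese or an English engine stub.
# Text containing multi-byte (non-ASCII) bytes is assumed to be Japanese.

def speak_japanese(text: str) -> str:
    # Stand-in for the commercial Japanese TTS engine.
    return f"[ja] {text}"

def speak_english(text: str) -> str:
    # Stand-in for the English TTS engine.
    return f"[en] {text}"

def is_multibyte(text: str, encoding: str = "shift_jis") -> bool:
    # The server receives Shift_JIS text; any byte >= 0x80 signals a
    # multi-byte (Japanese) character.
    return any(b >= 0x80 for b in text.encode(encoding))

def handle_commands(lines):
    """Process a list of protocol lines and return the spoken output."""
    queue, spoken = [], []
    for line in lines:
        if line.startswith("q {") and line.endswith("}"):
            queue.append(line[3:-1])          # queue text for speaking
        elif line == "d":                     # dispatch the queue
            for chunk in queue:
                engine = speak_japanese if is_multibyte(chunk) else speak_english
                spoken.append(engine(chunk))
            queue.clear()
    return spoken

print(handle_commands(["q {hello world}", "q {こんにちは}", "d"]))
# → ['[en] hello world', '[ja] こんにちは']
```

The routing rule here mirrors the behavior described in section 3.2: the Windows server picks the Japanese engine whenever the transmitted text is encoded in multi-byte codes.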

[2] Auditory Cascading Style Sheets are defined in CSS2 (http://www.w3.org/TR/REC-CSS2).
[3] This program was written by G. Bishop for personal use and given to us as open source.

Figure 1. The structure of Bilingual Emacspeak. Bold boxes are modules developed by the current project. Meadow is a Windows port of Emacs 20.x, Mew is a Japanese Email package, and Wnn and Canna are Kana-Kanji conversion servers.
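The voice-font mechanism mentioned in section 2 can be pictured as a mapping from information categories to voice parameters, applied on the Emacs side before text reaches the speech server. The category table and the DECtalk-like control string below are invented for illustration; the real mapping is implemented in Emacspeak's Emacs Lisp layer and uses the actual DECtalk command syntax.

```python
# Illustrative sketch of "voice fonts": different kinds of information are
# spoken with different voice parameters.  The categories and parameter
# values here are hypothetical examples, not Emacspeak's real tables.

# Hypothetical voice-font table: category -> (voice name, relative pitch)
VOICE_FONTS = {
    "default":  ("paul", 1.0),
    "comment":  ("paul", 0.8),   # lower pitch for source-code comments
    "keyword":  ("harry", 1.0),  # a different voice for keywords
    "string":   ("paul", 1.2),   # higher pitch for quoted strings
    "katakana": ("betty", 1.0),  # distinguish Katakana from Hiragana by voice
}

def render(text: str, category: str = "default") -> str:
    """Return a DECtalk-like annotation selecting the voice for this text."""
    voice, pitch = VOICE_FONTS.get(category, VOICE_FONTS["default"])
    return f"[:voice {voice} :pitch {pitch:.1f}] {text}"

print(render("# initialize the buffer", "comment"))
# → [:voice paul :pitch 0.8] # initialize the buffer
```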

3.2 Language identification and assignment of the engine's language

The current version of BEP does not care about the language of the text at the Emacs Lisp level. It identifies the language in the speech server. BEP for Windows uses the Japanese text-to-speech engine if the text to be read, which is transmitted from Emacs, is encoded in multi-byte codes. As discussed later, language identification in a speech server is not good for an effective bilingual speech system; therefore, this language identification will be carried out at the Emacs Lisp level in the near future to make the discrimination of languages in multilingual information robust and flexible.

3.3 Auditory representations of Japanese

Usually, the pronunciation of each Japanese word is determined by the Japanese text-to-speech engine. In some cases, however, pronunciation should be specified by the speech system to reflect the user's demands or for syntactic and contextual reasons. Emacspeak has punctuation modes that determine how to pronounce characters such as punctuation marks. For example, every mark should be read when writing source code, whereas punctuation marks should not be read when reading a letter. As a text-to-speech engine does not know what it is speaking, i.e. contextual information, a speech server must switch the punctuation mode according to contextual information or the user's demands.

The Japanese language is written in Kanji (Chinese characters), Hiragana, Katakana, and other multi-byte characters. Multiple Kanji characters have the same pronunciation. Hiragana and Katakana correspond one-to-one and each pair has the same pronunciation. As a user cannot identify these characters only by hearing the pronunciation, he/she needs additional information to distinguish them. Japanese screen-readers have developed many methods to unambiguously represent Japanese to the ears (Saito, 1995). BEP uses two mechanisms to give this additional information to the user. One is audio formatting, or the use of various voice fonts [4]. For example, Hiragana and Katakana will be read with different voices or with different pitches of the same voice. The other is an explanatory reading of Kanji, which is the same function that ordinary Japanese screen-readers have. The current system has 3 levels of explanation for each Kanji: 1) a very brief explanation used for reading a character during cursor movement, 2) a short explanation for selecting the correct Kanji in a Japanese input method, and 3) a long explanation used for other cases. The dictionary of these explanatory readings is based on the one used in J. Ishikawa's screen-reader. The first-level explanation is used when a user wants to quickly examine the information character by character. In this case, the pronunciation of each character is defined in a dictionary, i.e. a raw and simple pronunciation is used to quickly identify each character. The second level is used when a user wants to examine a character in detail. In this case, the character is explained with some additional information to clearly distinguish it from other characters having a similar pronunciation. The third level is used to explain the detailed information of that character.

[4] This function has not been implemented yet.

3.4 BEP as a practical and efficient speech synthesis system

BEP is still under development, but the alpha version of the executables and source code has been distributed on the Web (http://www.argv.org/bep/). Support is carried out using a mailing list and the web site. A small number of users are using BEP in their everyday lives. The number of current users is not large because, to use BEP, the user must be familiar with Emacs, it takes some time to learn how to use it, and the current version is not robust. However, BEP has many functions that are not offered by the existing Japanese screen-readers (Watanabe, 2001).

4. Requirements of a speech synthesis system for practical and bilingual use

4.1 The characteristics of speech output

First, let us summarize the characteristics of speech output:

1) Speech output lacks a bird's-eye view. As speech output is a 1-dimensional sequential audio stream, a user cannot grasp the whole structure with speech output. A user cannot obtain the necessary information until the audio stream reaches that point.
2) A user cannot easily hear missed information again because it is difficult to rewind the speech stream.
3) It takes a longer time to hear information by speech than to read it by eye.

4.2 Requirements of an efficient speech synthesis system

Visually impaired users have proved that a human can use a computer with speech output alone. They use speech output because they cannot use sight. Sighted users can use sight, and for them speech output is a multi-modal and alternative interface. In this sense, sighted users are not well accustomed to speech output and they find it difficult to use a speech system.

The characteristics of speech output and our experiences in developing and supporting BEP clearly show the requirements of a practical speech system: the most important requirements are not to speak text naturally or expressively, but to speak essential information as fast as possible in a manner best suited to the sense of hearing. In other words,

1) To speak information two or three times faster than the normal speed. Speaking text at double speed cuts the time needed to hear the information in half.
2) To speak the necessary information only. Otherwise it takes a long time to obtain the necessary information.
3) To use an AUI instead of reading out the information displayed on a screen.
4) To use light text-to-speech engines with clear voices. Details of these requirements are discussed later in this section.
5) To use a capable speech server. Details of these requirements are discussed later in this section.

4.3 Requirements of text-to-speech engines

The requirements of text-to-speech engines, which are the basic components of a speech synthesis system, are as follows:

1) A quick engine: Good response is indispensable for a practical speech system.
2) A clear voice that acts as a code: The most important feature required of the voice is to correctly give the information to the user at a high speed. Thus a text-to-speech engine must have a voice that can clearly distinguish each character. Natural voices tend to lose this kind of clarity at faster speech rates.
3) Multilingual (American, European, Asian, etc.) engines: Text-to-speech engines for many languages are available under Windows; however, there is no practical Japanese engine under Linux.
4) A unified interface: Windows has an established speech interface; however, there is no such interface for Linux. The lack of a standard interface in Linux prevents us from simultaneously handling engines for different languages. A unified interface that considers the characteristics of various languages is needed.
5) One-letter speech synthesis: Current text-to-speech engines are good at speaking strings; however, they are not good at pronouncing a single letter because of their synthesis mechanism. One-letter pronunciation is needed when a user wants to examine what is written.
6) Phonemic speaking: In case a user wants to read characters one by one, a fixed pronunciation, which should be predefined in a dictionary [5] of the speech system, can be assigned to every character. In this case, phonemic speaking, i.e. pronunciation represented as a string of phonemes given directly to a speech server, reduces the analyzing time of a text-to-speech engine, especially for Japanese. Phonemic speaking reduces the task of a text-to-speech engine because the engine can create PCM data directly from the phonemic data.
7) Open-source engines: Text-to-speech engines are basic components of an AUI just as fonts are basic components of a GUI. Thus open-source, or at least free, text-to-speech engines are needed.
8) A highly capable engine: A text-to-speech engine must have the ability to read at least three times as fast as the ordinary speech rate. It must have the ability to change characteristics such as speed, pitch, stress, and intonation. To use voice fonts, several kinds of voices must be available for each sex in each language.

[5] This pronunciation is determined in the first level of our dictionary described in section 3.3.

4.4 Requirements of a bilingual speech system

The experiences of BEP show that a bilingual speech system should use the user's native language (for us, Japanese) in most cases because users are not bilingual, i.e. they cannot hear the second language (English) well. At the same time, users need the native English text-to-speech engine when they want to hear information as English. Generally speaking, system messages, directory and file names, and an isolated English word in a Japanese context should be read with Japanese pronunciation even if they are English. Thus, a bilingual speech system should have the following functions:

1) Context-based language selection: The selection of the language used to read information depends on the context and the user's demands; therefore, language selection should be carried out by Emacs, i.e. at the Emacs Lisp level, and not by the speech server. As the current implementation of BEP automatically switches between the Japanese and English engines according to the language of the text in the speech server, it should be modified in the near future.

2) Native pronunciation and Japanese-English pronunciation: Ordinary Japanese are more accustomed to hearing Japanese-English pronunciation, or so-called Katakana English, than native pronunciation. For them, English should be pronounced in this Japanese-English. As the current Japanese text-to-speech engines are not good at pronouncing native English [6], a bilingual speech system should have an 'English to Japanese-English' dictionary, which gives a Japanese-English pronunciation consisting of Japanese phonemes, to clearly present English to Japanese users. Actually, J. Ishikawa's Japanese screen-reader for DOS (Ishikawa) has developed such a dictionary and reads English in Japanese-English well. BEP for Linux uses this dictionary. It should be noted that Japanese-English is not appropriate for all cases. Users want to hear Japanese-English when they think the information is a part of Japanese. Even if such information is written according to English grammar and encoded in ISO-8859-1, it should be treated as a part of Japanese and pronounced in Japanese-English. However, when users want to hear English information, it should be pronounced in native English, or users cannot recognize the meaning well.

3) A variable ratio of speech rates: The speech rates for Japanese and English should be changed separately because hearing English is a difficult task for ordinary Japanese users.
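The 'English to Japanese-English' dictionary discussed in section 4.4 can be pictured as a simple lookup from English words to Katakana readings that the Japanese engine then pronounces. The entries and the switching function below are hypothetical illustrations, not the actual dictionary, which is based on J. Ishikawa's screen-reader and stores Japanese phoneme strings:

```python
# Illustrative sketch of an 'English to Japanese-English' dictionary
# (section 4.4).  The entries are common Katakana loanwords chosen as
# examples; the real dictionary is far larger.

KATAKANA_ENGLISH = {
    "file": "ファイル",   # fairu
    "open": "オープン",   # oopun
    "menu": "メニュー",   # menyuu
}

def to_japanese_english(word: str, native_mode: bool = False) -> str:
    """Return the pronunciation form for an English word.

    In Japanese-English mode, known words are replaced by their Katakana
    reading so the Japanese engine pronounces them; in native mode the
    word is passed through unchanged for the English engine.
    """
    if native_mode:
        return word
    return KATAKANA_ENGLISH.get(word.lower(), word)

print(to_japanese_english("file"))                    # Katakana for the Japanese engine
print(to_japanese_english("file", native_mode=True))  # unchanged for the English engine
```

The `native_mode` flag reflects the requirement above that the same word must sometimes be spoken as Katakana English and sometimes by the native English engine, depending on context and the user's demands.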

4.5 Importance of meta-data

Emacs, an intelligent text editor, can determine the characteristics and semantics of information to some degree; however, without meta-data supplied by some other means, it is difficult to represent information suitably to various users. For example, it is sometimes impossible to determine the language and the character-encoding scheme only by analyzing the character codes of the text. Emacspeak can use HTML and Auditory Cascading Style Sheets to represent data expressively. The combination of XML and style sheets, which act as some of the AUI functions, would be the best way to represent data in a way that matches the user's demands.

6. Concluding Remarks

We have developed a bilingual speech system based on Emacspeak and Japanese screen-readers. The alpha version of BEP is used by Japanese visually impaired users who want to use Emacs. In developing and supporting the system, we found important requirements of a practical speech synthesis system that have not been discussed well. They are:

(1) An auditory user interface that presents information effectively to the sense of hearing using various voice fonts and auditory icons.
(2) Highly capable text-to-speech engines that produce an easy-to-hear voice at double or triple speech rate.
(3) Free and open-source text-to-speech engines for many languages and many operating systems.
(4) A standard interface for international text-to-speech engines.
(5) A mechanism to select the language and pronunciation according to the user's demands.
(6) Meta-data.

Emacs itself can handle multilingual text; therefore, it would be nice to extend BEP to a multilingual speech system to meet the requirements of worldwide users, especially of Asian users [7].

The current system for Windows requires large CPU power: a Pentium II or higher and sufficient memory are required [8]. The required machine specification of the Linux version is not so demanding; it can run on an MMX Pentium computer. As the current system is still under development, both versions are not yet stable enough for practical use. We are now struggling to improve the system, especially the speech servers. Recently, some PDAs that run Linux have come on the market. When running BEP on such a PDA that does not have enough machine power, a hardware speech synthesizer would be a better choice to ensure good response.

[6] In general, Japanese text-to-speech engines are good at pronouncing English neither in native English nor in Japanese-English. These engines cannot correctly pronounce Japanese written in Roman letters. It seems that they do not have English pronunciation dictionaries and proper algorithms to determine the appropriate pronunciation in these cases.
[7] In fact, there are few useful screen-readers for Windows in Korea and China.
[8] BEP for Windows can somehow run on a computer with a 200MHz MMX Pentium and 64MB of memory, but it becomes more stable with more CPU power and more memory.

Acknowledgements

The authors are grateful to all the members, especially Mr. K. Inoue, of the Bilingual Emacspeak Project for their contributions to the current system [9]. This work was supported in part by the Research for the Future Program of JSPS under the Project "Integrated Network Architecture for Advanced Multimedia Application Systems" (JSPS-RFTF97R16301).

[9] The authors think the most important point of BEP is not its technical features but that the current system is developed through the cooperation of sighted and visually impaired Emacs users to make a new speech system that does not suffer from the mismatch with Windows' GUI.

References

Ishikawa, J. Details of his speech system are described at http://fuji.u-shizuoka-ken.ac.jp/~ishikawa/devpgm.htm (in Japanese). Some dictionaries used in BEP are taken from his speech system.

Raman, T.V. (1997). Auditory user interfaces -- toward the speaking computer --. Kluwer Academic Publishers.

Saito, M. (1995). Software Production for the Visually Impaired. Journal of IPSJ, 36(12), 1116-1121 (in Japanese).

Watanabe, T., Inoue, K., Sakamoto, M., Honda, H., & Kamae, T. (2000a). Bilingual Emacspeak for Windows -- a self-speaking bilingual Emacs --. Tech. Report of IEICE SP2000, 100(256), 29-36 (in Japanese).

Watanabe, T., Inoue, K., Sakamoto, M., Kiriake, M., Shirafuji, H., Honda, H., Nishimoto, T., & Kamae, T. (2000b). Bilingual Emacspeak and accessibility of Linux for the visually impaired. Proceedings of Linux Conference 2000 Fall, 246-255 (in Japanese).

Watanabe, T., Inoue, K., Sakamoto, M., Kiriake, H., Honda, H., Nishimoto, T., & Kamae, T. (2001). Bilingual Emacspeak Platform -- A universal speech interface with GNU Emacs --. Proceedings of the 1st International Conference on Universal Access in Human-Computer Interaction 2001.