Framework of Electronic Dictionary System for Chinese and Romance Languages

WONG Fai ⎯ MAO Yuhang

Speech and Language Processing Research Center, Tsinghua University, 100084 Beijing, China

ABSTRACT: Most Chinese-English electronic dictionaries are designed for Chinese speakers, and no electronic dictionary has been developed for Chinese and Romance languages. In this paper, we propose a dictionary framework for Chinese and Romance languages. By using the mouse-tracking technique, the dictionary interface has been designed for both Chinese and non-Chinese speakers. Even non-Chinese speakers can easily look up a Chinese word in the dictionary without any knowledge of the Chinese writing system. The dictionary system has been designed to run on different language platforms without relying on any Chinese support system. These elementary solutions have realized a dictionary system that can be used by Chinese and non-Chinese speakers. Furthermore, a morphological component is integrated, since Romance languages have a highly developed morphology. The proposed framework can be applied to other languages.

KEYWORDS: Chinese-Romance, electronic dictionary, mouse-tracking technology, morphological analysis.

Traitement Automatique des Langues (TAL), Volume 44(2): Les dictionnaires électroniques, pages 225-245. Hermès Sciences Publications.

1. Introduction

With the development of computer technology, electronic dictionaries are emerging rapidly everywhere. By taking advantage of new technologies, electronic dictionary systems gain functionality that computing devices make possible. In particular, extensive cross-referencing and rapid access to material relevant to the user's needs can now be obtained efficiently. The content of electronic dictionaries is extending from the standard information commonly available, such as meanings, grammar, usage and transliterated text, to multimedia presentations such as word pronunciation in the form of audio files, pictures and video clips. A large number of dictionaries are now available in electronic form, and some of them can even be consulted on the Internet. For example, Merriam-Webster online [Webster] is an online dictionary where users can look up words on the Internet. WordNet [Miller et al., 1993] is an English lexical reference system in which lexical items are organized into synonym sets, each representing a lexical concept. The synonym sets are linked to one another by different relations. WordNet has been widely used by many NLP applications [Priss 98][Hovy et al., 1999][Rebecca et al., 2001]. The EDR Electronic Dictionary [EDR 95] is composed of a collection of specialized dictionaries for words, concepts and bilingual lexical items. It aims at providing a foundation for linguistic databases and at explaining the relation of electronic dictionaries to very large knowledge bases. It adopts an interlingual architecture based on language-independent concepts that are linked to lexical entries. A detailed account of the research and development of electronic dictionaries in Europe can be found in [Sérasset 93].
In general, electronic dictionaries can be classified according to their intended use, for example human-use, specialized and multi-usage electronic dictionaries. Specialized electronic dictionaries are usually developed for specific applications such as machine translation (MT) systems, while multi-usage electronic dictionaries are developed without any particular application in mind and can be used by different language processing applications such as MT systems, language understanding systems, grammar checkers, etc. [Sérasset 93][Copestake 92][EDR 95]. A large number of bilingual and multilingual electronic dictionaries are available on the market. Many of the bilingual dictionaries map a native language to English or English to a native language [BABYLON][Copestake 90]. Multilingual dictionaries are usually developed for languages of the same family, e.g. Acquila and Collings On-Line. However, no practical commercial electronic dictionary has been developed specifically for Chinese and Romance languages, such as Chinese and Portuguese or Chinese and French, and in particular none that is intended for native speakers of both languages. Most Chinese-English dictionaries are designed for Chinese people to look up the meaning of English words; English speakers, on the other hand, cannot use them to look up Chinese words. This is because Chinese and English differ considerably in how they are handled computationally. Chinese is a non-alphabetic language and relies on a dedicated input system for writing. This restricts the operating system that can be used, i.e. to Chinese Windows. For a Chinese-related electronic dictionary to run under a non-Chinese Windows environment, a third-party Chinese support system is needed, which is impractical for non-Chinese users.
To our knowledge, no practical Chinese and Romance electronic dictionary has yet been developed for bilingual usage¹. This gap should be filled, as there is market demand. In this paper, a framework of an electronic dictionary for Chinese and Romance languages is proposed. By using mouse-tracking technology, its interface allows both Chinese and non-Chinese speakers to consult the dictionary easily without relying on any specific Chinese support system. This framework resolves the incompatibility caused by different language systems. The implemented dictionary system can run on any Windows platform, e.g. Chinese, English, French, Portuguese, etc. Considering that Romance languages such as French and Portuguese are very rich in paradigms, a morphological analyzer is integrated into the dictionary so that inflected and derived variants of lexical items are also covered. The framework can be applied to other languages, or even to multilingual use, by reconfiguring the system components to suit the requirements of the languages concerned. The remainder of this paper is organized as follows. Section 2 describes the problems of some existing dictionaries. Section 3 presents the features of our design framework for a Chinese and Romance electronic dictionary. The morphological component is presented in section 4. The principle of mouse-tracking technology and the mechanism for resolving the multi-language platform problem are described in sections 5 and 6. Finally, two applications, Chinese-French and Chinese-Portuguese electronic dictionaries, are demonstrated in section 7, and a conclusion ends the paper.

2. Description of problems

Chinese is a difficult language to learn. Most dictionaries between Chinese and other languages are designed for Chinese speakers who want to learn a foreign language. These dictionaries were initially implemented for use under the Chinese Windows environment. Some of them have been made compatible with the English version of Windows, and all such systems are limited to Chinese-English dictionaries. For other languages like French and Portuguese, there is a conflict when they are displayed together with Chinese text. This is because Chinese and Romance languages share a common code area of the character-encoding scheme in the computer system. In Chinese-English dictionaries there is no such situation, since all the alphabetic characters of English are standard in the coding scheme and coexist with Chinese. But characters like ‘á ê ì õ ú’ that appear in Romance languages belong to the extended character set, and they are not handled properly when Chinese text is also displayed.

¹ Here, our definition of bilingual usage is that the electronic dictionary can be used by both Chinese and non-Chinese speakers to look up words of these languages.

In the existing electronic Chinese-English dictionaries, we found that looking up unknown words is particularly difficult in Chinese because of the complicated writing system, especially for non-Chinese speakers. Since Chinese is a non-alphabetic language, it relies on input methods. The input systems most widely used by Chinese speakers are the Pinyin and the root radical methods. The Pinyin method is based on the Pinyin system, which transliterates Chinese ideograms into the Roman alphabet and is officially adopted in China. The root radical method, in contrast, is concerned with the smaller component units of a Chinese character, referred to as radicals. Each radical is associated with a key on the keyboard; to write a character, its root radical must be identified [Karel et al., 2002]. This depends on the user's ability to identify the root, which can be a problem for non-Chinese speakers who have no knowledge of the Chinese language. With an electronic dictionary, however, there are several ways to look up a word. Firstly, with editable texts, the user can simply copy the unknown word and paste it into the dictionary look-up window to obtain the desired result. Secondly, and more conventionally, the user inputs the word through the writing system [Bilac et al., 2002].
This requires the user to be trained in, and capable of using, one of the input methods for writing Chinese. Another alternative is to make use of computer technology to create a comprehensive dictionary interface [Agirre et al., 1993] that allows users to look up words efficiently, e.g. with mouse-tracking technology.

Furthermore, another bottleneck in consulting a dictionary is the high overhead involved in typing words. For people who frequently have to refer to documents in different languages, consulting dictionaries is indispensable and takes up much of their valuable time. Most importantly, this consultation interferes with the user's ongoing work: to look up an unknown word, users need to switch from their working environment to the electronic dictionary. Although [Agirre et al., 1993] have a similar idea of simplifying the user's involvement by adopting reasoning mechanisms analogous to those used by humans in consulting a dictionary, their study remains focused on data access methods. From our viewpoint, this is still not enough. Their system does help people to cross-reference related information rapidly, but to consult it, people still have to come to the dictionary system.

In contrast to Chinese, Romance languages are very rich in inflectional variation. The inflectional morphology of a verb defines the possible variations on a base form that reflect grammatical meaning. In addition to number and person, verbs are also conjugated according to mood and tense. In total, this gives more than 60 inflected forms for a single verb. Nouns and adjectives also have inflected forms: they can appear in the feminine and/or in the plural. Therefore, the morphological analysis of lexical items is very important for Romance languages. It captures the conventions by which the various grammatical morphemes are joined to a lemma, and hence reduces the size of a dictionary, which no longer needs to keep the inflectional paradigms in full [Elisabete et al., 1999][Simonetta et al., 1999].

3. Features of electronic Chinese/Romance dictionary

Based on the problems discussed, we take advantage of computer technologies to construct a framework that can be applied to realize a Chinese and Romance electronic dictionary, which benefits not only Chinese speakers but also non-Chinese speakers. For the purpose of this paper, we focus mainly on the dictionary interface, the multi-language platform and the morphological analysis component. The dictionary possesses the following vital features:

− The lexical entries of the dictionary are represented in SGML format. This representation keeps the lexical data independent of any specific structure, which allows the electronic dictionary to be transferred easily to other languages with different grammatical information. An SGML parser is embedded for internal processing and attribute interpretation. The explanation of a looked-up word can be exported by the user for further editing.

− The dictionary provides morphological processing in order to identify words that are not recognized because of inflection or derivation. This arrangement follows from an empirical study of the characteristics of Romance languages, which have a highly developed morphology. In our study, we found that the morphological analyzer greatly improves the performance of the dictionary. Most inflected words can be identified and their grammatical information derived. For inflected verbs, the derived grammatical information is presented to the user as a comprehensible explanation through a simple interpreter.

− For the front end, mouse-tracking technology is applied to implement a comprehensive dictionary interface. The system monitors the movement of the mouse cursor as well as the content underneath the mouse pointer on the screen.
The textual information is captured after the mouse pointer has stayed in place for a while, and is delivered to the dictionary look-up window for consultation. The explanation of the word is then returned and displayed in a small popup window. Therefore, once the electronic dictionary has been started, the user can simply move the mouse and point to an unknown word. The system then captures the word, consults the dictionary, and shows the translation of the word in a small window. This technique establishes a fast look-up interface and thereby resolves the look-up bottleneck. Moreover, non-Chinese speakers can easily use it to look up Chinese words. Another, and the most important, benefit is that users do not need to stop their ongoing tasks and can still consult the meaning of an unknown word while reading and editing.

− In addition to the mouse-pointing instant look-up, a simple Pinyin input method is integrated in the dictionary. This alternative lets the user manually input a Chinese word based on its transliteration, since the user may want to look up words that are not available in electronic form. For learners of Chinese, Pinyin is elementary, and almost all learners have studied Pinyin when learning Chinese. That is why the Pinyin input method is included in our dictionary; this input system can also help to reinforce the transliteration of Chinese words in the minds of learners.

− Instead of providing the pronunciation of words in the form of audio files, a text-to-speech engine is adopted in our dictionary. This lets the user listen to the exact pronunciation rather than interpret a linguistic representation. Currently, the speaking engine covers five languages: Mandarin, Cantonese, English, French and Portuguese.
Among these languages, the speaking engines for Mandarin, Cantonese and English were developed by the Speech and Language Processing Research Center of Tsinghua University [Mao et al., 2001]. For Portuguese and French, we have adopted the ELAN TTS speaking engine [Philippe et al., 2001][ELAN].

− As already mentioned, the electronic dictionary is intended for both Chinese and non-Chinese speakers. Therefore, the dictionary system has been designed for a multi-language platform: it can run in different Windows environments without relying on any third-party language support system. Both Chinese and Romance languages are properly handled and displayed. In non-Chinese Windows, the user is able to look up Chinese and non-Chinese words in the dictionary, where a Chinese word can be input either through the mouse-tracking method or through the Pinyin method of the dictionary system.

− To minimize user involvement in consultation, a language distinguishing mechanism is developed to determine the language of the input word automatically. The user does not need to tell the dictionary the source language and the target language. Thus, when the mouse-tracking method is used, the system identifies the language of the source and then asks the dictionary to look up the word in the corresponding database. This greatly simplifies dictionary consultation.

In summary, our study primarily focuses on the design of a comprehensive dictionary interface that allows both Chinese and non-Chinese speakers to consult the dictionary easily. Moreover, the characteristics of Romance languages, i.e. their morphotactics, are studied, and a morphological analyzer is designed to improve the performance of the electronic dictionary for Chinese and Romance languages. The system components and the flow of the word consultation process are illustrated in Figure 1.

[Figure 1 blocks: Input (mouse tracking, key in); Identify source language, process inflected word and look up; Chinese/Romance bilingual dictionary; Display lookup result; Speak out word/phrase; Multi-language platform]

Figure 1. Modular Diagram

4. Morphological Component

Portuguese² is a Romance language with a highly developed morphology and is relatively rich in inflectional variation. A lexical item is realized in various forms according to the grammatical categories of number, gender, person, tense, etc. The richer the morphology, the more demanding the requirements on the analyser [Wong et al., 2002]. The analysis component is divided into two major modules: an inflectional module and a derivational module. The inflectional module analyses the inflection and the grammatical information encoded by the ending morpheme. The derivational module analyses the prefix or suffix of a lexical item and extracts syntactic information (such as the lexical category) from this lexical morpheme. Usually, a lexical morpheme (prefix or suffix) is a marker of syntactic information. As in earlier efforts, the morphological component needs a dictionary that is used by both the inflectional and the derivational module. A basic lexical dictionary lists as many lexical items as possible, but every entry is identified by the canonical (uninflected) form of the word. This prevents explosive growth in the number of entries [Elisabete et al., 1999]. In addition to the basic lexical dictionary, the inflectional module requires three more dictionaries. First, the component treats closed-class verbal items, such as irregular verbs, by listing their inflectional paradigms in full. For the treatment of inflectable open-class lexical items, two relatively small specialized lexicons are used in place of the usual large lexicon: a lexicon of stems (roots) of words and a lexicon of ending morphemes together with the grammatical information they encode. The derivational module is much simpler: it relies only on the basic lexical dictionary and a lexicon of morphemes (prefixes and suffixes) in which the lexical categories of the canonical item and of the derived item are additionally encoded. The morphotactics, and the structure and content of the knowledge base of the morphological analysis component, have been designed 1) to help solve analysis problems (inflection, derivation and homography paradigms), and 2) to minimize the knowledge acquisition effort. As a result, the coverage of lexical items is greatly improved, as is the computational performance.

² Throughout this section, we use Portuguese as the example language in the description of the morphological component.

4.1. The Lexicons

The lexicon of inflectional endings (suffixes) consists of all the empirically derived inflections for regular verbs, nouns, adverbs, and adjectives. An entry in the lexicon of endings has the form:

\[
\text{Suffix} \rightarrow
\begin{bmatrix}
\{ f_{1,1}, f_{1,2}, \ldots, f_{1,n} \} \\
\vdots \\
\{ f_{m,1}, f_{m,2}, \ldots, f_{m,n} \}
\end{bmatrix}
\tag{1}
\]

where $f_{i,j}$ are values of the grammatical features associated with this ending, such as the grammatical categories of gender, number, person, tense, etc. This information is necessary for the analysis in the inflectional module. For example, the entry for the suffix -amos is shown in Table 1.

Suffix  Features      Description
-amos   V1, T1, P4    Verb, 1st inflection rule, Present (indicative mode), Person (1st, plural)
        V3, T7, P4    Verb, 3rd inflection rule, Present (subjunctive mode), Person (1st, plural)
        V37, T11, P4  Verb, 37th inflection rule, Imperative, Person (1st, plural)

Table 1. Lexical entry for the suffix -amos
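One way to picture the entry form [1] is as a mapping from an ending to a list of alternative feature sets. The following Python sketch is only an illustration under that assumption, not the storage format of the implemented system; the names `ENDING_LEXICON` and `candidates_for` are hypothetical, and only the -amos entry of Table 1 is included.

```python
# Hypothetical in-memory form of the ending lexicon: each suffix maps to
# the alternative feature sets {f_i1, ..., f_in} of its entry (Table 1).
ENDING_LEXICON = {
    "amos": [
        ("V1", "T1", "P4"),    # 1st inflection rule, present indicative, 1st person plural
        ("V3", "T7", "P4"),    # 3rd inflection rule, present subjunctive, 1st person plural
        ("V37", "T11", "P4"),  # 37th inflection rule, imperative, 1st person plural
    ],
}


def candidates_for(word):
    """Return (stem, feature set) pairs for every ending that matches the word."""
    candidates = []
    for ending, feature_sets in ENDING_LEXICON.items():
        # Require a non-empty stem in front of the ending.
        if word.endswith(ending) and len(word) > len(ending):
            stem = word[: -len(ending)]
            candidates.extend((stem, features) for features in feature_sets)
    return candidates


for stem, features in candidates_for("consultamos"):
    print(stem, features)
```

At this stage all three feature sets remain candidates; choosing among them needs the stem lexicon described next.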

The identifier “V1” indicates the verbal inflection paradigm, “T1” indicates the tense of the inflected verbal item and “P4” identifies the person and number. These features provide the syntactic information, while the lexical meaning can be retrieved from the basic lexical dictionary. This lexicon is used to define the initial set of morphological features of a word for further testing. Table 2 shows two further entries, -íssimo and -ibilíssimo, which are used to analyze adjectives. Some entries contain the string that has to be substituted for the ending to recover the base form.

Suffix       Substituted ending  Features     Description
-íssimo      -o, -e              ASP, G1, N1  Superlative adjective, masculine, singular, substituted ending -o or -e
-ibilíssimo  -ível               ASP, G1, N1  Superlative adjective, masculine, singular, substituted ending -ível

Table 2. Lexicon entries for suffixes -íssimo and -ibilíssimo

The lexicon of stems consists of a list of roots of verbal items, together with their inflection paradigm categories and infinitive forms. This lexicon is used to identify the infinitive of a verb and to verify that the stem output by the ending analyzer belongs to the same paradigm as the matched ending. In this way, the analyzer also avoids misinterpreting an inflected verb because of incorrect input. Table 3 shows some of the entries.

Stem      Infinitive               Paradigm  Description
consult-  consultar (to consult)   V1        Verb, 1st inflection rule
part-     partir (to leave)        V3        Verb, 3rd inflection rule

Table 3. Lexicon entries for stems consult- and part-

The lexicon of morphemes consists of a list of suffixes (or prefixes) that are empirically able to join a lexical item to produce a new lexical item. Such a lexical morpheme is also an information marker. In contrast to inflectional morphology, one feature, the part of speech, is obligatory. Usually, the new lexical item resulting from this operation has a different lexical category from the base item. Table 4 shows entries of this lexicon.

Suffix   Substituted ending  Categories  Description
-dor     -r                  N, V        Noun - new item, Verb - base item, substituted ending -r
-amente  -o                  A, D        Adjective - new item, Adverb - base item, substituted ending -o

Table 4. Lexicon entries for suffixes -dor and -amente
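To illustrate how a morpheme entry with a substituted ending might be applied, the Python sketch below strips the suffix, appends the substituted ending, and accepts the result only if the base form is listed with the expected category. It is a minimal sketch, not the system's implementation: the function `analyse_derived`, the dictionary layout, and the sample base entry `trabalhar` (to work) are assumptions added for the example; only the -dor entry comes from Table 4.

```python
# Hypothetical morpheme lexicon following the -dor row of Table 4:
# suffix -> (substituted ending, category of new item, category of base item)
MORPHEME_LEXICON = {
    "dor": ("r", "N", "V"),
}

# Hypothetical basic lexical dictionary: canonical form -> lexical category.
BASIC_LEXICON = {"trabalhar": "V"}


def analyse_derived(word):
    """Return (base form, category of the derived word), or None if no rule applies."""
    for suffix, (ending, new_cat, base_cat) in MORPHEME_LEXICON.items():
        if not word.endswith(suffix):
            continue
        # Replace the derivational suffix by the substituted ending.
        base = word[: -len(suffix)] + ending
        # Accept the analysis only if the base exists with the expected category.
        if BASIC_LEXICON.get(base) == base_cat:
            return base, new_cat
    return None


print(analyse_derived("trabalhador"))
```

Checking the base form against the basic lexical dictionary is what keeps a word that merely ends in -dor, but has no listed base, from being analysed as a derivation.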

All of these lexicons are required by the morphological component for analysing lexical morphemes and extracting syntactic information from the morpheme paradigms. The morphological component was initially developed for use in an MT system. Therefore, the detailed syntactic information derived from the morphemes of a word is not strictly necessary in the electronic dictionary.

4.2. Analysis Algorithm

The morphological analyzer first attempts to look up the input word in the basic lexicon. Whether or not the attempt succeeds, the input word is passed to the analyzer for further testing. This compulsory process explores the possible homography of a word with an inflected form, in order to maximize coverage. During analysis, when a word is not identified by the parser of one component, it is fed into the next component procedure. Figure 2 illustrates the procedure for analysing a word. The order of the calls to the component procedures in Figure 2 is chosen to minimize processing time and effort: because the morphological conventions for verbs are more complete than those of other categories, verbal items take less effort to analyse.

[Figure 2 flowchart blocks: Input word; Look up word from basic lexicon? (yes/no); Parse verbal item? (yes/no); Parse derived items? (yes/no); Identify as unknown word; Process terminated]

Figure 2. Parsing procedure

All of the parser components follow a similar strategy. For example, in the verbal analyzer, the parser goes through the ending lexicon for a match and then determines the stem of the word. The truncated stem is looked up in the stem lexicon to retrieve the grammatical information encoded in the entry. If an item cannot be parsed by any component, it is identified as an unknown word at the output of the system, and consequently no translation is shown for it.
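The verbal analysis just described can be sketched in a few lines of Python. This is an illustrative reading of the strategy, using only the entries of Tables 1 and 3; the function name `parse_verb` and the dictionary layout are assumptions, and the real analyzer may differ in how it iterates over endings.

```python
# Ending lexicon (Table 1): ending -> alternative (paradigm, tense, person) sets.
ENDINGS = {
    "amos": [("V1", "T1", "P4"), ("V3", "T7", "P4"), ("V37", "T11", "P4")],
}

# Stem lexicon (Table 3): stem -> (infinitive, paradigm).
STEMS = {
    "consult": ("consultar", "V1"),
    "part": ("partir", "V3"),
}


def parse_verb(word):
    """Return (infinitive, tense, person) analyses consistent with the stem's paradigm."""
    analyses = []
    for ending, feature_sets in ENDINGS.items():
        if not word.endswith(ending):
            continue
        stem = word[: -len(ending)]  # truncate the matched ending
        if stem not in STEMS:
            continue
        infinitive, paradigm = STEMS[stem]
        for rule, tense, person in feature_sets:
            # Keep only the feature sets whose paradigm matches the stem's.
            if rule == paradigm:
                analyses.append((infinitive, tense, person))
    return analyses


print(parse_verb("consultamos"))  # [('consultar', 'T1', 'P4')]
print(parse_verb("partamos"))     # [('partir', 'T7', 'P4')]
print(parse_verb("xyzamos"))      # [] -> would be reported as an unknown word
```

The paradigm check is what separates the present indicative reading of consultamos (V1) from the present subjunctive reading of partamos (V3), although both share the ending -amos.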

5. Mouse Tracking Technology

Compared with traditional keyboard input, mouse-tracking instant look-up has an obvious advantage. Users can easily get the meaning of an unknown word while reading an electronic document or browsing web pages in different languages. This input manner saves typing time and therefore increases the user's interest in reading and language learning. The principle of mouse-tracking technology is to monitor and capture every output text that the Windows painting device is going to draw on the screen. Once the textual information has been obtained, it can be further analyzed and processed. In MS Windows, all the output-related APIs (Application Programming Interfaces), including the text drawing functions, are integrated in GDI.dll (Graphics Device Interface dynamic link library). TextOut and ExtTextOut are the two elementary functions for displaying text on the screen [Jeffrey 96]. Therefore, if these two functions can be intercepted, we can freely monitor all the text being displayed. That is the key technique in realizing mouse tracking. Generally, there are three different ways to intercept a DLL function. We briefly describe each mechanism below:

− The first way is to use the Tool Help library, which Windows provides as a means of debugging other applications [Microsoft 96]. This method can intercept all the functions of a DLL that the application calls, but it requires the application to be under debugging. Evidently, this method cannot be applied to arbitrary applications: it works for the on-screen interface of one specified application, but not for others.

− The second way is to locate all calls to the function concerned in an application's executable code and modify them. Since this changes the application's code, we strongly recommend against this method. Likewise, this mechanism works for one specified application and cannot be applied to arbitrary applications.
− The third way is to modify the code of the DLL functions themselves [Matt 96]. Here, we add some code to the function being intercepted, so that all calls to this function can be traced and monitored by user-defined code. This fulfills both requirements: 1) it can monitor the content of the output text, and 2) it can be applied to any application running in the same environment. But this mechanism must be operated very carefully. It directly accesses the kernel functions of the Windows operating system, and any mistake may cause the system to halt. In our experiments, this technique has proved stable, and it has been widely applied in many of our research systems.
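The idea behind the third way, adding user-defined code in front of a drawing function so that every call is observed, can be shown by analogy in Python. This is only an analogy: it rebinds a Python name rather than patching code inside a Windows DLL, and `text_out` is a hypothetical stand-in, not the GDI TextOut function.

```python
import functools

captured = []  # (x, y, text) records observed by the interception code


def text_out(x, y, text):
    """Hypothetical stand-in for a text-drawing routine."""
    return f"draw {text!r} at ({x}, {y})"


def intercept(func):
    """Wrap a drawing routine so each call is recorded before it runs."""
    @functools.wraps(func)
    def wrapper(x, y, text):
        captured.append((x, y, text))  # user-defined monitoring code
        return func(x, y, text)        # then run the original routine unchanged
    return wrapper


# From here on, every caller of text_out goes through the wrapper.
text_out = intercept(text_out)

text_out(10, 20, "dicionário")
print(captured)
```

As in the DLL case, callers need no modification: they keep calling the same name, and the monitoring code sees the text and its coordinates on the way through.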

5.2. The Principle of Word Tracking

The objective of mouse tracking is to capture the textual information pointed to by the mouse cursor after the pointer has stayed over the text for a time slice. This word input method gives the user a fast and convenient way to consult the dictionary for an unknown word. The text capturing procedure of this mechanism can be divided into several steps, as shown in the shaded region in Figure 4.

[Figure 4 flowchart blocks: Monitor mouse movement and set up a timer; Time slice expires? (yes/no); Cause the text under the mouse pointer to be repainted; Capture the text while repainting with TextOut/ExtTextOut; Determine the language of input text; Translate into target language; Display the translated result in popup window]

Figure 4. Mouse tracking instant translation mechanism

The message hook technique is applied to monitor the mouse movement, since Windows works on a message mechanism [Microsoft 96]. The kernel of Windows receives messages from various kinds of input devices, such as the keyboard, the mouse, interrupt devices, etc. The kernel analyses the messages and forwards them to the corresponding message handlers. Therefore, upon receiving a mouse movement message, our tracking engine starts a timer and waits for the defined time slice to expire. Once the time runs out, text capturing starts by: 1) causing the text underneath the pointer to be repainted, 2) monitoring the text drawing functions and capturing the text when Windows repaints it with these functions, and 3) identifying the target text and filtering out unrelated content by comparing the coordinates of the text with the location of the mouse pointer. The wanted text is thus captured by the mouse-tracking engine for further processing. The rest of the look-up process is passed to the dictionary, which retrieves the translation, usage examples, grammatical attributes of the word, etc. The returned result is then either displayed in a popup window or presented in more detail in the display window of the electronic dictionary.
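The timer and the coordinate filter of step 3 can be sketched independently of Windows. The Python sketch below is a simulation under stated assumptions: `DwellTimer`, `TextRun` and `text_under_pointer` are hypothetical names, timestamps are passed in explicitly instead of coming from a message hook, and each captured string is assumed to arrive with a bounding rectangle.

```python
from dataclasses import dataclass


@dataclass
class TextRun:
    """A captured string with the rectangle it was drawn in (assumed available)."""
    text: str
    x: int
    y: int
    width: int
    height: int


class DwellTimer:
    """Reports expiry once the pointer has not moved for `delay` seconds."""

    def __init__(self, delay):
        self.delay = delay
        self.last_move = None

    def on_mouse_move(self, now):
        self.last_move = now  # every movement restarts the time slice

    def expired(self, now):
        return self.last_move is not None and now - self.last_move >= self.delay


def text_under_pointer(runs, px, py):
    """Filter out unrelated runs: keep the one whose rectangle contains the pointer."""
    for run in runs:
        if run.x <= px < run.x + run.width and run.y <= py < run.y + run.height:
            return run.text
    return None


timer = DwellTimer(delay=0.5)
timer.on_mouse_move(now=0.0)
runs = [TextRun("Les", 0, 0, 30, 12), TextRun("langues", 35, 0, 70, 12)]
if timer.expired(now=0.6):
    print(text_under_pointer(runs, 40, 5))  # langues
```

Passing the time in as an argument keeps the sketch deterministic; a real tracking engine would take it from the system clock when the hook delivers a mouse message.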

6. Multi-language platform resolution

Resolving the conflict caused by the overlap between the internal coding schemes of Chinese and Romance characters is one of the main concerns in developing a multi-language platform application. Some alphabetic characters (symbols) used in Romance languages share the same native codes as Chinese characters. The problem becomes obvious when Chinese and a Romance language such as French are displayed together, as demonstrated in Figure 5.

Correct content                 Wrongly displayed
Les langues indo-européennes    Les langues indo-europ閑nnes
印欧语系                         印欧语系

Figure 5. Problem in displaying Chinese, “印欧语系”, and French, “Les langues indo-européennes” (the Indo-European languages), together: the symbol ‘é’ is misinterpreted as a Chinese character by the Windows system
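The effect in Figure 5 can be reproduced outside Windows by decoding Latin-encoded bytes with a Chinese code page. The Python sketch below assumes `cp1252` for the Latin side and `gbk` for the Chinese side; the paper does not name the exact code pages, so both codec names are assumptions.

```python
text = "Les langues indo-européennes"

# Bytes as a Western European (Latin) system would store them (assumed cp1252).
raw = text.encode("cp1252")

# A system whose default character set is Chinese (assumed GBK) treats the
# byte of 'é' as a lead byte and consumes the following 'e' as its trail byte.
garbled = raw.decode("gbk")

print(garbled)
print(len(text), len(garbled))  # the garbled string is one character shorter
```

Two Latin characters collapse into a single Chinese character, which is why the remainder of the word (“nnes”) still displays correctly.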

This situation must be handled internally by the electronic dictionary system. The Windows system will not resolve this problem, since it interprets the displayed text according to the default character set, which is normally predefined by the version of Windows (Chinese, English, Portuguese, French, etc.). Under Chinese Windows, all textual data is treated as Chinese, and under French Windows, Chinese text is likewise processed as Latin characters. The solution to this problem requires two processing steps. Firstly, a language identification mechanism is needed to determine the language of the text, Chinese or Romance. Then, the display parameters are adjusted according to the identified language, and the relevant font is used to display the text. These are the basic steps in handling a multi-language application. Table 5 shows the code ranges used by the Chinese system and the standard Latin system. The codes in the range 0xc0 ~ 0xfe are shared by Chinese and Latin characters, which causes the confusion in text display.

          Code area of the first byte     Code area of the second byte
Chinese   0xa1-0xfe                       0x40-0x7e and 0xa1-0xfe
Latin     0x20-0x7f and 0xc0-0xff         -

Table 5. Code areas of Latin and Chinese characters
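The shared range can be checked directly from the byte ranges in Table 5. The following is a small sketch; the function names are illustrative and not part of the dictionary system.

```python
# Byte ranges copied from Table 5 (bounds are inclusive).
CHINESE_LEAD = [range(0xA1, 0xFE + 1)]
CHINESE_TRAIL = [range(0x40, 0x7E + 1), range(0xA1, 0xFE + 1)]
LATIN = [range(0x20, 0x7F + 1), range(0xC0, 0xFF + 1)]


def in_ranges(byte: int, ranges) -> bool:
    """True if the byte value falls inside any of the given ranges."""
    return any(byte in r for r in ranges)


def shared_lead_bytes() -> list:
    """Bytes valid both as a Latin character and as a Chinese first byte."""
    return [b for b in range(256)
            if in_ranges(b, LATIN) and in_ranges(b, CHINESE_LEAD)]


def could_be_chinese_pair(first: int, second: int) -> bool:
    """True if two consecutive bytes fit the Chinese two-byte pattern."""
    return in_ranges(first, CHINESE_LEAD) and in_ranges(second, CHINESE_TRAIL)


shared = shared_lead_bytes()
print(hex(shared[0]), hex(shared[-1]), len(shared))  # 0xc0 0xfe 63

# 'é' followed by 'e' in a single-byte Latin encoding is 0xe9 0x65,
# which also fits the two-byte Chinese pattern.
print(could_be_chinese_pair(0xE9, 0x65))  # True
```

The last check mirrors Figure 5: in "européennes", the byte for ‘é’ and the following ‘e’ together fit the pattern of one Chinese character.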

6.1. Language distinguishing mechanism

The language distinguishing algorithm exploits the differences between the coding systems of Chinese and the Romance languages. The mechanism is based simply on statistics of the frequency of the codes that appear in each language. A Chinese character is represented by two bytes, and its leading (high) byte lies in the range 0xa1 to 0xfe. These values belong to the extended part of the ASCII character code, that is, they are greater than 0x7f (127). Chinese text therefore shows a high proportion of extended characters, whereas in Latin text they occur only occasionally. Another cue is the word delimiter: unlike text in Latin script, Chinese text is ordinarily written without delimiters between words. Together, these cues allow Chinese to be distinguished from the Romance languages.
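A minimal sketch of such a frequency-based decision is given below. The threshold values, the borderline rule and the function names are assumptions made for illustration; the paper does not state the actual statistics used.

```python
def high_byte_ratio(data: bytes) -> float:
    """Fraction of bytes in the extended range (greater than 0x7f)."""
    if not data:
        return 0.0
    return sum(1 for b in data if b > 0x7F) / len(data)


def guess_language(data: bytes, threshold: float = 0.5) -> str:
    """Guess 'chinese' or 'romance' from byte statistics.

    threshold is an assumed cut-off, not a value taken from the paper.
    """
    if not data:
        return "unknown"
    ratio = high_byte_ratio(data)
    if ratio >= threshold:
        return "chinese"
    # Borderline case: some extended bytes and no spaces at all.
    # Chinese is normally written without word delimiters.
    if ratio >= threshold / 2 and b" " not in data:
        return "chinese"
    return "romance"


# The codec names are assumptions used only to build sample byte strings.
french = "Les langues indo-européennes".encode("latin-1")
chinese = "印欧语系".encode("gbk")
print(guess_language(french), guess_language(chinese))  # romance chinese
```

Very short inputs are inherently ambiguous for this kind of statistic: a single accented word can contain mostly extended bytes, so a practical system would need additional evidence for such cases.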

6.2. Mechanism in properly displaying multi-language content

The confusion in displaying Latin and Chinese text in the Windows environment is mainly caused by two factors: the character set and the associated font face. If an application uses a font with an unsupported character set, the system will not attempt to translate or interpret the text drawn with that font. It is therefore crucial to choose the correct character set for each language in the font mapping process. To ensure consistent results, the character set value should match one supported by the selected font face. This configuration is valid only under Chinese Windows. Under non-Chinese Windows, the solution relies on the Unicode encoding scheme: all characters must be processed in Unicode rather than in the conventional single-byte representation, that is, the text has to be converted into Unicode in advance. The remaining steps for properly displaying Chinese text under non-Chinese Windows are similar to those applied under Chinese Windows. In addition, the required font faces must be installed in the system. With this mechanism, the electronic dictionary system can run on any Windows version without relying on an extra language support system, and can thus really be used by both Chinese and non Chinese-speakers.
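The conversion to Unicode before display can be sketched as follows. The codec names (`gbk`, `latin-1`) are assumptions, since the paper does not name the code pages involved, and `to_unicode` is a hypothetical helper rather than part of the described system.

```python
# Assumed mapping from identified language to a legacy codec.
CODECS = {"chinese": "gbk", "romance": "latin-1"}


def to_unicode(data: bytes, language: str) -> str:
    """Decode legacy-encoded bytes with the codec matching the language.

    errors="replace" keeps the sketch from raising on byte pairs that the
    chosen codec does not define.
    """
    return data.decode(CODECS[language], errors="replace")


french_bytes = "Les langues indo-européennes".encode("latin-1")
chinese_bytes = "印欧语系".encode("gbk")

# With the right character set, both texts are recovered intact.
print(to_unicode(french_bytes, "romance"))
print(to_unicode(chinese_bytes, "chinese"))

# With the wrong character set, the accented letter is swallowed into a
# two-byte sequence, as in Figure 5.
print(to_unicode(french_bytes, "chinese"))
```

Once the text is held as Unicode, choosing a font that covers the required characters is a separate step and is not shown here.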

7. Applications

The sections above described the main technologies and components needed to build a Chinese-Romance electronic dictionary designed for both Chinese and non Chinese-speakers who want to learn a language. In this section, we describe two actual applications of the framework and explain, through examples, how the system benefits learners of Chinese as well as learners of Romance languages.

France [f&phon07;&phon02;s] nom propre féminin
la France: 法国:
Les habitants de France sont les Français. 法国居民是法国人。
Une amie roumaine fait ses études en France. 一位罗马尼亚朋友在法国学习。

Table 6. A lexical entry of French-Chinese dictionary represented in SGML

As mentioned, the lexical entries of the dictionary are encoded in SGML format. The constituent parts of an entry and its textual structure are identified and marked up with SGML elements. This representation ensures that lexical information encoded according to its provisions is transportable to other applications. Table 6 shows an encoded lexical entry of the French-Chinese dictionary. The constituents of the entry, such as grammatical attributes, phonetic notations, explanations and usage examples, are defined and encapsulated by SGML elements. This encapsulated knowledge is most often used to identify specific locations or reference information within the entry, so that lexical fragments of interest can easily be retrieved from the entry for special uses. Currently, the Chinese-French electronic dictionary contains around 21,000 French-to-Chinese lexical entries and 83,000 Chinese-to-French entries, while the Chinese-Portuguese electronic dictionary contains around 42,000 Portuguese-to-Chinese entries and around 37,000 Chinese-to-Portuguese entries.
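To illustrate how marked-up constituents make fragments easy to retrieve, the sketch below parses an entry modelled on Table 6. The element names (`entry`, `hw`, `gram`, `trans`, `eg`, `src`, `tgt`) are invented for this example, because the dictionary's actual SGML elements are not listed here. The fragment is written so that it is also well-formed XML and can be read with Python's standard XML parser. The phonetic notation is left out because its SGML entities would need their own declarations.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: these element names are illustrative only.
ENTRY = """
<entry>
  <hw>France</hw>
  <gram>nom propre féminin</gram>
  <trans>法国</trans>
  <eg><src>Les habitants de France sont les Français.</src><tgt>法国居民是法国人。</tgt></eg>
  <eg><src>Une amie roumaine fait ses études en France.</src><tgt>一位罗马尼亚朋友在法国学习。</tgt></eg>
</entry>
"""


def parse_entry(markup: str) -> dict:
    """Extract the constituents of one lexical entry into a dictionary."""
    root = ET.fromstring(markup.strip())
    return {
        "headword": root.findtext("hw"),
        "grammar": root.findtext("gram"),
        "translation": root.findtext("trans"),
        "examples": [(eg.findtext("src"), eg.findtext("tgt"))
                     for eg in root.findall("eg")],
    }


entry = parse_entry(ENTRY)
print(entry["headword"], entry["translation"], len(entry["examples"]))
```

With this structure, a popup window could show only the translation, while the full explanation window could also list the grammar and the examples.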

7.1. Chinese and French electronic dictionary

In our investigation, we found no practical electronic dictionary for Chinese and French. This motivated us to look for a way to establish an electronic dictionary framework that can be applied to Chinese and the Romance languages. The Chinese-French electronic dictionary is therefore the first system developed on the basis of our framework. The dictionary system can run in any Windows environment. Basically, all the features described in section 3 are included in the system. The main interface of the dictionary is shown in Figure 6. The dictionary provides a look-up window, a close-candidates window, a word explanation window and a set of function buttons. The function buttons allow the user to activate the mouse-tracking method and the Pinyin input method, and to disable or enable the speak function, which reads the content of the looked-up result aloud to the user.

Figure 6. Chinese-French electronic dictionary

7.1.1. Pinyin input method

In the conventional manner, the user can look up a word by typing it in the dictionary look-up window. The dictionary accepts words in both languages: when a French word is entered, it automatically determines the source language and returns the word explanation in the target language, and vice versa. Chinese users can easily look up the French meaning of a Chinese word. Non Chinese-speakers can use the Pinyin input method to achieve the same goal. An example is illustrated in Figure 6, where the Pinyin transliteration, fa guo, of the Chinese word 法国 (France) is entered and the meaning in French is shown in the explanation window with usage examples.

7.1.2. Mouse-tracking input method

A more convenient way to look up a word is the mouse-tracking method: the user consults an unknown word simply by placing the mouse pointer over the text and waiting for the dictionary to return the result in a popup window. Figure 7 demonstrates the situation, where the word 法国 (France) is pointed at and the explanation is returned in a small window. With this look-up method, users can easily get the meaning of a word while they are reading, writing or even browsing the Internet. Most importantly, users do not need to turn their attention away from their work to get help from the dictionary. Because the method incorporates the language distinguishing mechanism, users do not need to specify the source language. When the pointer is over a French word, the system shows the meaning in Chinese, as in Figure 8, and vice versa.

Figure 7. Mouse-tracking look up for Chinese word

Figure 8. Mouse-tracking look up for French word

7.2. Chinese and Portuguese electronic dictionary

To avoid repeating the description of the same functions, we use the Chinese-Portuguese electronic dictionary to demonstrate the remaining features of our framework. This application was developed because Macao S.A.R. has a special characteristic: Chinese and Portuguese are both official languages, and most official documents are in bilingual format. Before this electronic dictionary appeared [Wong et al., 2002], people used to look up words in conventional paper dictionaries. Now, many people have adopted our system in their work. Like the earlier system, the Chinese-Portuguese electronic dictionary has a similar interface, as shown in Figure 9.

Figure 9. Chinese-Portuguese electronic dictionary

7.2.1. Inflected word recognition and restoration

Another feature incorporated in the dictionary is morphological processing, since Romance words are very rich in variation: a single verb may produce more than sixty different forms depending on person, number, tense and mood. A morphological component greatly reduces the size of the dictionary, which no longer has to store every variant in full, and it also allows new words to be recognized through the set of inflectional and derivational paradigms. Figure 9 shows an example in which the explanation of an inflected word, partimos, is presented together with grammatical information derived from the morphological process.
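The idea of restoring a base form through an inflectional paradigm can be sketched with a toy example. The table below covers only the present indicative of regular Portuguese -ir verbs, and the one-word lexicon and the function name are assumptions for illustration; the actual morphological component is not specified at this level of detail.

```python
# Present indicative endings of regular Portuguese -ir verbs.
IR_PRESENT = {
    "o": "1st person singular",
    "es": "2nd person singular",
    "e": "3rd person singular",
    "imos": "1st person plural",
    "is": "2nd person plural",
    "em": "3rd person plural",
}

# Toy lexicon of base forms (infinitives); a real dictionary would be far larger.
LEXICON = {"partir"}


def restore(word: str) -> list:
    """Return (lemma, grammatical label) pairs for an inflected form."""
    analyses = []
    # Try longer endings first so that 'imos' is tested before shorter ones.
    for ending in sorted(IR_PRESENT, key=len, reverse=True):
        if word.endswith(ending) and len(word) > len(ending):
            lemma = word[: -len(ending)] + "ir"
            if lemma in LEXICON:
                analyses.append((lemma, IR_PRESENT[ending] + ", present indicative"))
    return analyses


print(restore("partimos"))  # [('partir', '1st person plural, present indicative')]
```

Only the base form needs to be stored in the dictionary; the inflected form is mapped back to it, and the matched ending supplies the grammatical information shown with the entry.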

7.2.2. Content spoken

As discussed, both electronic dictionaries are integrated with a text-to-speech engine, which can speak all the information displayed in the explanation window. Each dictionary supports three languages: Mandarin, Cantonese and Portuguese, or French in the case of the Chinese-French dictionary. In addition to finding the meaning of a word, language learners can also learn the pronunciation of words or phrases by listening to the system. The looked-up results can also be exported by the user for further processing and use.

8. Conclusion and future work

In this paper, we have proposed a design framework for a Chinese and Romance electronic dictionary; the framework can be directly transferred to other Romance languages. The current work focuses mainly on designing the dictionary interface and resolving the multi-language platform problem. Using mouse-tracking technology, we have established a convenient look-up interface. This interface removes the limitation that kept non Chinese-speakers from consulting Chinese words, so that both Chinese and non Chinese-speakers can use the dictionary. The dictionary can run on any language version of Windows, such as English, Portuguese or French, without relying on any Chinese support system for displaying Chinese characters. In addition, a morphological component and a text-to-speech component are integrated in the dictionary, the former to improve its performance in analysing words and the latter to speak the content of the looked-up result to users rather than only presenting the linguistic information as plain text. In the future, with the support of mouse-tracking technology, we plan to incorporate a parser to help select the best meaning of the looked-up word according to the context that the user is reading. We believe that integrating a sense disambiguation module can help users quickly find the right interpretation of a word and filter out irrelevant information.

References

Agirre E., Arregi X., Artola X., Diaz de Ilarraza A., Evrard F., Sarasola K., Intelligent Dictionary Help System, Proc. 9th Symposium on Languages for Special Purposes, Bergen (Norway), 1993.
BABYLON, bilingual dictionary available at http://www.babylon.com.
Breen J. W., Building an Electronic Japanese-English Dictionary, Japanese Studies Association of Australia Conference, July 1995.
Breen J. W., A WWW , Japanese Studies Centre Symposium, July 1999.
Copestake A., An approach to building the hierarchical element of a lexical knowledge base from a machine readable dictionary, in Proceedings of the First International Workshop on Inheritance in Natural Language Processing, Tilburg, pp. 19-29, 1990.
Copestake A., The ACQUILEX LKB: Representation Issues in the Semi-automatic Acquisition of Large Lexicons, in Proceedings of the 3rd Conference on Applied Natural Language Processing, Trento, Italy, April 1992.
EDR, EDR Electronic Dictionary Technical Guide, Japan Electronic Dictionary Research Institute, Ltd., Japan, 1995.
Elisabete R., Cristina M. and Jorge B., A Computational Lexicon of Portuguese for Automatic Text Parsing, in Proceedings of SIGLEX99: Standardizing Lexical Resources, 37th Annual Meeting of the ACL, pp. 74-80, College Park, Maryland, USA, 1999.
ELAN, http://www.elan.fr/.
Hovy E. and Lin C. Y., Automated Text Summarization in SUMMARIST, Advances in Automatic Text Summarization, pp. 81-94, MIT Press, 1999.
Hu H. W., He Y. M., Wang Q. X., Chen J. J. and Yong D. S., Realization of Japanese to Chinese Net-translating Browser, in Proceedings of the 1998 International Conference on Chinese Information Processing (Tsinghua Univ., Beijing, China, Nov. 18-20, 1998), pp. 482-489.
Jeffrey R., Advanced Windows NT, Microsoft Press, 1996.
Karel S. and Robert W., Using Affordances in Electronic Chinese/English Dictionaries for Non Chinese-Speakers, Technical Reports, Cognitive Science, Carleton University, 2002.
Mao Y. H. and Zhang G. Z., Design and Implementation of a Cantonese Text-to-Speech System, in Proceedings of Natural Language Understanding and Machine Translation, China, pp. 443-447, 2001.
Matt P., Windows 95 System Programming Secrets, IDG Books World Wide, Inc., 1996.
Microsoft, Programmer’s Guide to Microsoft Windows, Microsoft Press, 1996.
Miller G. A., Beckwith R., Fellbaum C., Gross D. and Miller K., Five Papers on WordNet, Princeton Univ., NJ, 1993.
Philippe B. M. and Benoît S., Input/Output Normalisation and Linguistic Analysis for a Multilingual Text-To- System, in Proceedings of the 4th ISCA Tutorial and Research Workshop on Speech Synthesis, 2001.
Priss U. E., The formalization of WordNet by Methods of Relational Concept Analysis, in Fellbaum C. (ed.): WordNet: An Electronic Lexical Database and some of its applications, Cambridge, MA: The MIT Press, pp. 176-196, 1998.
Rebecca G., Lisa P., Bonnie J. Dorr and Philip R., Mapping Lexical Entries in a Verbs Database to WordNet Senses, ACL, pp. 244-251, 2001.
Bilac S., Baldwin T. and Tanaka H., Bringing the Dictionary to the User: the FOKS system, in Proc. of the 19th International Conference on Computational Linguistics (COLING2002), pp. 89-95, 2002.
Simonetta V. and Annibale E., Electronic Dictionaries and Linguistic Analysis of Italian Large Corpora, in Proceedings of Workshop: SIGLEX99, University of Maryland, USA, pp. 91-97, 1999.
Sérasset G., Recent Trends of Electronic Dictionary Research and Development in Europe, Technical Memorandum, Electronic Dictionary Research (EDR), Tokyo, Japan, 1993.
Webster, http://www.m-w.com.
Wong F., Dong M. C. and Mao Y. H., The Research in Semantic & Morphological Analysis and Mouse Tracking Technology: Solving the difficulties in Chinese/Portuguese machine translation, in Proceedings of The Translation Workshop of New Century, IPM, March 2002.
Wong F., Mao Y. H., Dong M. C. and Li Y. P., Design and Implementation of Bi-directional Portuguese-Chinese Word-by-Word Machine Translation Tool, in Proceedings of Symposium on Technological Innovation in Macau, Macau, pp. 141-148, 2002.