Framework of Electronic Dictionary System for Chinese and Romance Languages

WONG Fai, MAO Yuhang
Speech and Language Processing Research Center, Tsinghua University, 100084 Beijing, China
[email protected] [email protected]

ABSTRACT: Most Chinese-English electronic dictionaries are designed for Chinese speakers, and no electronic dictionary has been developed for Chinese and Romance languages. In this paper, we propose a dictionary framework for Chinese and Romance languages. Using a mouse-tracking technique, the dictionary interface is designed for both Chinese and non-Chinese speakers. Even non-Chinese speakers can easily look up a Chinese word in the dictionary without any knowledge of the Chinese writing system. The dictionary system is designed to run on different language platforms without relying on any Chinese support system. These basic solutions yield a dictionary system that can be used by both Chinese and non-Chinese speakers. Furthermore, a morphological component is integrated, since Romance languages have a highly developed morphology. The proposed framework can be applied to other languages.

KEY WORDS: Chinese-Romance, electronic dictionary, mouse-tracking technology, morphological analysis.

Traitement Automatique des Langues (TAL), Volume 44(2): Les dictionnaires électroniques, pages 225-245. Hermès Sciences Publications.

1. Introduction

As computer technology has developed, dictionaries in electronic form have been emerging rapidly. By taking advantage of new technologies, electronic dictionary systems gain functionality that computer devices make possible. In particular, extensive cross-referencing and rapid access to material relevant to the user's needs can now be provided efficiently.
The content of electronic dictionaries is extending from commonly available standard information, such as meanings, grammar, usage and transliterated text, to multimedia presentations such as word pronunciations in audio files, pictures and video clips. A large number of dictionaries are now available in electronic form, and some can even be consulted on the Internet. For example, Merriam-Webster Online [Webster] is an online dictionary where users can look up words on the Internet. WordNet [Miller et al., 1993] is an English lexical reference system in which lexical items are organized into synonym sets, each representing a distinct lexical concept. The synonym sets are linked to one another by different relations. WordNet has been widely used in many NLP applications [Priss 98][Hovy et al., 1999][Rebecca et al., 2001]. The EDR Electronic Dictionary [EDR 95] is composed of a collection of specialized dictionaries for words, concepts and bilingual lexical items. It aims to provide a foundation for linguistic databases and to explain the relation of electronic dictionaries to very large knowledge bases. It adopts an interlingual architecture based on language-independent concepts that are linked to lexical entries. A detailed account of the research and development of electronic dictionaries in Europe can be found in [Sérasset 93].

In general, electronic dictionaries can be classified according to their intended use, for example dictionaries for human use, specialized dictionaries and multi-usage dictionaries. Specialized electronic dictionaries are usually developed for specific applications such as machine translation systems, whereas multi-usage electronic dictionaries are developed without any particular application in mind and can be used by different language processing applications such as MT systems, language understanding systems, grammar checkers, etc.
[Sérasset 93][Copestake 92][EDR 95].

A large number of bilingual and multilingual electronic dictionaries are available on the market. Many of the bilingual dictionaries translate between a native language and English in one direction or the other [BABYLON][Copestake 90]. Multilingual dictionaries are usually developed for related languages that share cognates, e.g. Acquila and Collings On-Line. However, no practical commercial electronic dictionary has been developed specifically for Chinese and Romance languages, such as Chinese and Portuguese or Chinese and French, and in particular none that is intended for native speakers of both languages. Most Chinese-English dictionaries are designed for Chinese speakers to look up the meaning of English words; English speakers, on the other hand, cannot use them to look up Chinese words. This is because Chinese and English are quite different from a computational point of view. Chinese is a non-alphabetic language, and writing it relies on a dedicated input system. This restricts the operating system that can be used, e.g. to Chinese Windows. For a Chinese-related electronic dictionary to run in a non-Chinese Windows environment, a third-party Chinese support system is needed, which is impractical for non-Chinese users. To our knowledge, no practical Chinese and Romance electronic dictionary has yet been developed for bilingual usage¹. This gap should be filled, as there is market demand.

In this paper, a framework of electronic dictionary for Chinese and Romance languages is proposed. Using mouse-tracking technology, its interface allows both Chinese and non-Chinese speakers to consult the dictionary easily without relying on any specific Chinese support system. This framework resolves the incompatibility problem caused by different language systems.
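
To make the incompatibility between language systems more concrete, the following minimal sketch shows how accented Romance characters can be misread on a Chinese platform. It rests on assumptions not stated in the paper: that the Western platform stores text in the Windows code page `cp1252` and the Chinese platform interprets bytes as `gbk`. The function name `misread_as_chinese` is hypothetical and used only for illustration.

```python
# Sketch only: the code pages (cp1252, gbk) are assumed for illustration,
# not taken from the paper.

def misread_as_chinese(text: str) -> str:
    """Encode text as a Western platform would, then decode it as a
    Chinese platform would."""
    raw = text.encode("cp1252")                  # one byte per character
    return raw.decode("gbk", errors="replace")   # bytes >= 0x80 pair up


# Plain English letters are ASCII and survive unchanged.
print(repr(misread_as_chinese("word")))

# Accented letters use bytes >= 0x80, which the Chinese code page
# treats as the two halves of a single double-byte character.
print(repr(misread_as_chinese("áê")))
```

Under these assumptions, the English word is preserved, while the two accented letters collapse into one unrelated Chinese character. This is one plausible reading of why a Chinese-English dictionary can avoid the conflict while a Chinese-Romance dictionary cannot.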
The implemented dictionary system can run on any Windows platform, e.g. Chinese, English, French, Portuguese, etc. Because Romance languages such as French and Portuguese are very rich in paradigms, a morphological analyzer is integrated into the dictionary so that inflected variants of a lexical item are also covered. The framework can easily be applied to other languages, or even used for multilingual purposes, by reconfiguring the system components to suit the requirements of the languages concerned.

The remainder of this paper is organized as follows. Section 2 describes the problems of some existing dictionaries. Section 3 presents the features of our design framework for a Chinese and Romance electronic dictionary. The morphological component is presented in Section 4. The principle of mouse-tracking technology and the mechanism for resolving the multi-language platform problem are described in Sections 5 and 6. Finally, two applications, a Chinese-French and a Chinese-Portuguese electronic dictionary, are demonstrated in Section 7, followed by a conclusion.

¹ Here, our definition of bilingual usage is that the electronic dictionary can be used by both Chinese and non-Chinese speakers to look up words of these languages.

2. Description of problems

Chinese is a difficult language to learn. Most dictionaries between Chinese and other languages are designed for Chinese speakers who want to learn a foreign language. These dictionaries were initially implemented for use in a Chinese Windows environment. Some of them have been made compatible with English versions of Windows, but all such systems are limited to Chinese-English dictionaries. For other languages such as French and Portuguese, there is a conflict when their text is displayed together with Chinese text. This is because Chinese and Romance languages share a common code area of the character-encoding scheme in the computer system. Chinese-English dictionaries do not face this situation, since all the alphabetic characters of English are standard in the coding scheme and coexist with Chinese. But characters such as ‘á ê ì õ ú’, which appear in Romance languages, belong to the extended character set and are not handled properly when Chinese text is displayed alongside them.

In existing Chinese-English electronic dictionaries, we found that looking up unknown words is particularly difficult for Chinese because of its complicated writing system, especially for non-Chinese speakers. Since Chinese is a non-alphabetic language, it relies on input methods. The input methods most widely used by Chinese speakers are the Pinyin method and the root radical method. The Pinyin method is based on the Pinyin system, which transliterates Chinese ideograms into the Roman alphabet and is officially adopted in China. The root radical method works with the smaller component units of a Chinese character, referred to as radicals. Each radical is associated with a key on the keyboard; to write a character, the user identifies its root radical [Karel et al., 2002]. This depends on the user’s ability to identify the root, which can be a problem for non-Chinese speakers who have no knowledge of the Chinese language. With an electronic dictionary, however, there are several ways to look up a word. First, with editable text, the user can simply copy the unknown word and paste it into the dictionary look-up window to obtain the desired result. Second, and more conventionally, the user tries to input the word through the writing system [Bilac et al., 2002]. This requires
