An Intelligent Multi-Dictionary Environment

Gábor Prószéky
MorphoLogic
Késmárki u. 8., H-1118 Budapest, Hungary
proszeky@morphologic.hu

Abstract

An open, extendible multi-dictionary system is introduced in the paper. It supports the translator in accessing adequate entries of various bi- and monolingual dictionaries and translation examples from parallel corpora. An unlimited number of dictionaries can be held open simultaneously, so in a single interrogation step all the dictionaries (translations, explanations, synonyms, etc.) can be surveyed. The implemented system (called MoBiDic) knows the morphological rules of the dictionaries' languages. Thus, never the actual (inflected) words, but always their lemmas - that is, the right dictionary entries - are looked up. MoBiDic has an open, multimedia architecture, so it is suitable for handling not only textual but also speaking or picture dictionaries. The same system is also able to find words and expressions in corpora, dynamically providing the translators with examples from their earlier translations or other translators' works. MoBiDic has been designed for translator workgroups, where the translators' own glossaries (also built with the help of the system) may be disseminated among the members of the group, with different access rights if needed. The system has a TCP/IP-based client-server implementation for various platforms and is available with a gradually increasing number of dictionaries for numerous language pairs.

Introduction

"The whole world of translation is opening up, to new possibilities, and to technological and methodological change" (Kingscott 1993). Some years after the above claim, we see that software tools for translators, even the most recent ones, do not yet guarantee perfect solutions to automatic translation. More and more systems, however, introduce new facilities to the translator working in a computational environment. As Hutchins says, "the best use must be made of those systems that are available, and the producers and developers must be encouraged to improve and introduce new facilities to meet user needs." (Hutchins 1996)

It is almost a commonplace that texts - books, newspapers, letters, official memos, brochures, any type of publications, reports, etc. - in the nineties are written, sent, read and translated with the help of the electronic media. Consequently, traditional information sources, like paper-based dictionaries and lexicons, are no longer as much a part of the translation environment.

For most developers, however, electronic dictionaries just mean making the well-known paper dictionary image appear on the computer screen. It is easy to understand why we say that dictionary computerization does not mean producing machine-readable versions of traditional printed dictionaries, but combining the existing lexical resources with up-to-date language technology.

On the other hand, there is a question whether we have to continue in the traditional way of developing new - and different - lexicons for any new application/system, starting from scratch every time and therefore consuming time, money and manpower, or whether it is timely to think of the possibility of making the effort to converge, trying to avoid unnecessary duplications and - where possible - building on what already exists (Calzolari 1994). Consequently, in the near future we have to combine the two above needs: making existing lexical resources computationally accessible and showing the strategy for developing new lexicons. In what follows, we try to argue for changes in development strategies of electronic translation dictionaries.
Today's lingware technology can - and must - use dynamic actions, like morpho-syntactic analysis, lemmatization, spell checking, and so on. On the other hand, dictionaries can never be complete in any sense, therefore we have to make parallel multi-dictionary access possible. This means that a single dictionary look-up should use an unlimited number of lexical resources that are available to the translator.

1 The MoBiDic Look-up System

The most natural activity concerning dictionaries is searching them for a single word. There is no problem if it can be found among the headwords of the dictionary, that is, when the input string matches a headword. But sometimes the translator starts the look-up process by clicking an inflected word-form in an open document that cannot be found among the headwords. For the user it is a boring and time-consuming task to type the lexical form, that is, the one accepted letter-by-letter by the dictionary. To make the system able to find the stem of the input word-form automatically, MoBiDic uses a lemmatizer that provides the dictionary look-up module with the stem(s) to be found (Figure 1).

Figure 1: Look-up of a morphologically complex inflected form: 'ausgegangen' in a German-Hungarian dictionary.

Translators frequently want to find a word as part of multi-word expressions or idioms. If the user does not know whether the actual word is part of some phrasal compound or idiom, the traditional paper dictionaries are very difficult to use. Namely, if the word in question is the so-called headword of a multi-word expression, it can be found easily. If it is not the headword, one has to know the phrasal compound the word is a part of, but this is a typical "Catch 22" situation: if the expression is known, why search the dictionary for it? MoBiDic helps the user to find all the multi-word expressions containing the actual word's stem, independently of whether it is a headword or not. E.g. not only 'lead' but both 'dog' and 'life' provide us (among others) with the multi-word expression 'lead a dog's life', which can be found only under 'lead' in a paper dictionary. In other words, users of the traditional dictionaries are supposed to know the expression (what's more: the keyword of the expression) to find it in the lexicon. Searching for 'lead a dog's life' through its components gives the following result in MoBiDic:

lead {lead, leads, leading, led}
    27 occurrences in expressions of the basic dictionary,
dog {dog, dogs, dog's, dogs'}
    21 occurrences in expressions of the basic dictionary,
life {life, lives, life's, lives'}
    77 occurrences in expressions of the basic dictionary,
lead AND life
    5 occurrences in expressions of the basic dictionary,
dog AND life
    2 occurrences in expressions of the basic dictionary,
lead AND dog
    1 occurrence in expressions of the basic dictionary,
lead a dog's life
    1 occurrence as an expression in the basic dictionary.

'Bi' is somewhat misleading in the name MoBiDic. Bilingual in this sense means that the source and the target language are not the same types of object for the program. For MoBiDic, the source language is the language whose morphology has to be known in order to provide the user with adequate output. The output is expected to be in the target language, whose characters, alphabetic order, etc. have to be known to make the hits appear on the screen in an adequate format. Of course, the source and target languages can be the same, e.g. in explanatory or etymological dictionaries (Figure 2).

Figure 2: Hungarian explanation of 'acceptable quality level' in the English-Hungarian Economical Explanatory Dictionary.

There is another sort of monolingual dictionary, the synonym dictionary. The translator frequently wants to use a synonym (antonym, hypernym, hyponym) of the actual word. An intelligent software tool, like MorphoLogic's Helyette¹, is the combination of a thesaurus (synonym dictionary), a morphological analyzer and a generator, because the output is re-inflected according to the morphological information contained in the input word-form. The so-called inflectional thesaurus works as follows:

INPUT: came
ANALYSIS: came = come + Past
STEM: come
SYNONYM: go
SYNTHESIS: go + Past = went
OUTPUT: went

[...] dictionaries occurrences of the word in texts of other authors, or wants to see bilingual texts with their aligned translations: monolingual or aligned bilingual corpus, a free text search module and a lemmatizer.

2 Dictionaries in MoBiDic

The lexicographic basis for MoBiDic is supplied by various publishing houses. More precisely, MorphoLogic has licenses to almost 50 dictionaries already published in paper format, of miscellaneous topics, diverse sizes and many language pairs. The user can choose which dictionaries to use in general, and which of them to actually open. Currently, if all the available dictionaries are open, MoBiDic handles approximately 1 million lexical entries.

Some of the dictionaries, mainly the terminological ones, usually have a very simple list-based structure. The dictionaries shown in Figure 1 and Figure 2, however, appear on the screen with the traditional paper dictionary image. This is done by using SGML representations and an on-line SGML-RTF conversion. MoBiDic can do exact structural search that is not influenced by the layout at all.

Generally, the original lexical resource - even if it was available in electronic format - did not use SGML.