Machine Translation of Very Close Languages

Jan HAJIČ
Computer Science Dept., Johns Hopkins University
3400 N. Charles St., Baltimore, MD 21218, USA
[email protected]

Jan HRIC
KTI MFF UK
Malostranské nám. 25, Praha 1, Czech Republic, 11800
[email protected]

Vladislav KUBOŇ
ÚFAL MFF UK
Malostranské nám. 25, Praha 1, Czech Republic, 11800
[email protected]

Abstract

Using examples of the transfer-based MT system between Czech and Russian, RUSLAN, and the word-for-word MT system with morphological disambiguation between Czech and Slovak, ČESÍLKO, we argue that for really close languages it is possible to obtain better translation quality by means of simpler methods. The problem of translation to a group of typologically similar languages using a pivot language is also discussed here.

Introduction

Although the field of machine translation has a very long history, the number of really successful systems is not very impressive. Most of the funds invested into the development of various MT systems have been wasted and have not stimulated the development of techniques that would make it possible to translate at least technical texts from a certain limited domain. There were, of course, exceptions, which demonstrated that under certain conditions it is possible to develop a system that saves the money and effort invested into human translation.

The main reason why the field of MT has met neither the expectations of sci-fi literature nor those of the scientific community is the complexity of the task itself. A successful automatic translation system requires the application of techniques from several areas of computational linguistics (morphology, syntax, semantics, discourse analysis etc.) as a necessary, but not a sufficient condition. The general opinion is that it is easier to create an MT system for a pair of related languages. In our contribution we would like to demonstrate that this assumption holds only for really very closely related languages.

1. Czech-to-Russian MT system RUSLAN

1.1 History

The first attempt to verify the hypothesis that related languages are easier to translate started in the mid-1980s at Charles University in Prague. The project was called RUSLAN and aimed at the translation of documentation in the domain of operating systems for mainframe computers. It was developed in cooperation with the Research Institute of Mathematical Machines in Prague. At that time, in the former COMECON countries, it was obligatory to translate any kind of documentation for such systems into Russian. The work on the Czech-to-Russian MT system RUSLAN (cf. Oliva (1989)) started in 1985. It was terminated in 1990 (with COMECON gone) for lack of funding.

1.2 System description

The system was rule-based, implemented in Colmerauer's Q-systems. It contained a full-fledged morphological and syntactic analysis of Czech, a transfer, and a syntactic and morphological generation of Russian. There was almost no transfer at the beginning of the project, due to the assumption that the two languages are so similar that no transfer phase is required at all. This assumption turned out to be wrong, and several phenomena were covered by the transfer in the later stage of the project (for example the translation of the Czech verb "být" [to be] into one of the three possible Russian equivalents: empty form, the form "byt'" in future
tense and the verb "javljat'sja"; or the translation of verbal negation).

At the time when the work was terminated in 1990, the system had a main translation dictionary of about 8000 words, accompanied by a so-called transducing dictionary covering another 2000 words. The transducing dictionary was based on the original idea described in Kirschner (1987). It aimed at exploiting the fact that technical terms are based (in a majority of European languages) on Greek or Latin stems, adopted according to the particular derivational rules of the given languages. This fact allows for the "translation" of technical terms by means of a direct transcription of productive endings and a slight (regular) adjustment of the spelling of the stem. For example, the English words localization and discrimination can be transcribed into Czech as "lokalizace" and "diskriminace", with the productive ending -ation being transcribed to -ace. It was generally assumed that for the pair Czech/Russian the transducing dictionary would be able to profit from a substantially greater number of productive rules. This hypothesis proved to be wrong, too (see Bémová, Kuboň (1990)). The sets of productive endings for both pairs (English/Czech, as developed for an earlier MT system from English to Czech, and Czech/Russian) were very similar.

The evaluation of the results of RUSLAN showed that roughly 40% of input sentences were translated correctly, about 40% with minor errors correctable by a human post-editor, and about 20% of the input required substantial editing or re-translation. There were two main factors that caused a deterioration of the translation. The first factor was the incompleteness of the main dictionary of the system. Even though the system contained a set of so-called fail-soft rules, whose task was to handle such situations, an unknown word typically caused a failure of the module of syntactic analysis, because the dictionary entries contained - besides the translation equivalents and morphological information - very important syntactic information.

The second factor was the module of syntactic analysis of Czech. There were several reasons for parsing failures. Apart from the common inability of most rule-based formal grammars to cover a particular natural language to the finest detail of its syntax, there were other problems. One of them was the existence of non-projective constructions, which are quite common in Czech even in relatively short sentences. Even though they account for only 1.7% of syntactic dependencies, every third Czech sentence contains at least one, and in a news corpus we discovered as many as 15 non-projective dependencies; see also Hajič et al. (1998). An example of a non-projective construction is "Soubor se nepodařilo otevřít." [lit.: File Refl. was_not_possible to_open. - It was not possible to open the file]. The formalism used for the implementation (Q-systems) was not meant to handle non-projective constructions.

Another source of trouble was the use of so-called semantic features. These features were based on the lexical semantics of individual words. Their main task was to support a semantically plausible analysis and to block the implausible ones. It turned out that the question of implausible combinations of semantic features is also more complex than it was supposed to be. The practical outcome of the use of semantic features was a higher ratio of parsing failures - semantic features often blocked a plausible analysis. For example, human lexicographers assigned the verb 'to run' a semantic feature stating that only a noun with the semantic features of a human or other living being may be assigned the role of subject of this verb. The input text was, however, full of sentences with 'programs' or 'systems' running etc. It was of course very easy to correct the semantic feature in the dictionary, but the problem was that far too many corrections were required.

On the other hand, the fact that both languages allow a high degree of word-order freedom accounted for a certain simplification of the translation process. The grammar relied on the fact that there are only minor word-order differences between Czech and Russian.

1.3 Lessons learned from RUSLAN

We have learned several lessons regarding the MT of closely related languages:

• The transfer-based approach provides a similar quality of translation both for closely related and typologically different languages
• Two main bottlenecks of full-fledged transfer-based systems are:
  - complexity of the syntactic dictionary
  - relative unreliability of the syntactic analysis of the source language
• Even a relatively simple component (transducing dictionary) was equally complex for English-to-Czech and Czech-to-Russian translation
• Limited text domains do not exist in real life; it is necessary to work with a high coverage dictionary at least for the source language.

2. Translation and localization

2.1 A pivot language

Localization of products and their documentation is a great problem for any company that wants to strengthen its position on foreign language markets, especially for companies producing various kinds of software. The amounts of text being localized are huge and the localization costs are huge as well.

It is quite clear that the localization from one source language to several target languages, which are typologically similar but different from the source language, is a waste of money and effort. It is of course much easier to translate texts from Czech to Polish or from Russian to Bulgarian than from English or German to any of these languages. There are several reasons why localization and translation are not being performed through some pivot language representing a certain group of closely related languages. Apart from political reasons, the translation through a pivot language has several drawbacks. The most important one is the problem of the loss of translation quality. Each translation may to a certain extent shift the meaning of the translated text, and thus each subsequent translation provides results more and more different from the original. The second most important reason is the lack of translators from the pivot to the target language, while this is usually no problem for the translation from the source directly to the target language.

2.2 Translation memory is the key

The main goal of this paper is to suggest how to overcome these obstacles by means of a combination of an MT system with commercial MAHT (Machine-aided human translation) systems. We have chosen the TRADOS Translator's Workbench as a representative system of a class of these products, which can be characterized as example-based translation tools. IBM's Translation Manager and other products also belong to this class. Such systems use a so-called translation memory, which contains pairs of previously translated sentences from a source to a target language. When a human translator starts translating a new sentence, the system tries to match the source with sentences already stored in the translation memory. If it is successful, it suggests the translation and the human translator decides whether to use it, to modify it or to reject it.

The segmentation of a translation memory is a key feature for our system. The translation memory may be exported into a text file and thus allows easy manipulation of its content. Let us suppose that we have at our disposal two translation memories - one human-made for the source/pivot language pair and the other created by an MT system for the pivot/target language pair. The substitution of segments of the pivot language by the segments of the target language is then only a routine procedure. The human translator translating from the source language to the target language then gets a translation memory for the required pair (source/target). The system of penalties applied in TRADOS Translator's Workbench (or a similar system) guarantees that if there is already a human-made translation present, then it gets higher priority than the translation obtained as a result of the automatic MT. This system solves both problems mentioned above - the human translators from the pivot to the target language are not needed at all, and the machine-made translation memory serves only as a resource supporting the direct human translation from the source to the target language.

3. Machine translation of (very) closely related Slavic languages

In the group of Slavic languages, there are more closely related languages than Czech and Russian. Apart from the pair of Serbian and Croatian languages, which are almost identical and were
considered one language just a few years ago, the most closely related languages in this group are Czech and Slovak.

This fact has led us to an experiment with automatic translation between Czech and Slovak. It was clear that the application of a method similar to the one used in the system RUSLAN would lead to similar results. Due to the closeness of both languages we have decided to apply a simpler method. Our new system, ČESÍLKO, aims at a maximal exploitation of the similarity of both languages. The system uses the method of direct word-for-word translation, justified by the similarity of syntactic constructions of both languages.

Although the system is currently being tested on texts from the domain of documentation for corporate information systems, it is not limited to any specific domain. Its primary task is, however, to provide support for translation and localization of various technical texts.

3.1 System ČESÍLKO

The greatest problem of the word-for-word translation approach (for languages with very similar syntax and word order, but different morphological systems) is the problem of morphological ambiguity of individual word forms. The type of ambiguity is slightly different in languages with a rich inflection (the majority of Slavic languages) and in languages which do not have such a wide variety of forms derived from a single lemma. For example, in Czech there are only rare cases of part-of-speech ambiguities (stát [to stay/the state], žena [woman/chasing] or tři [three/rub(imperative)]); much more frequent is the ambiguity of gender, number and case (for example, the form of the adjective jarní [spring] is 27-times ambiguous). The main problem is that even though several Slavic languages have the same property as Czech, the ambiguity is not preserved. It is distributed in a different manner and the "form-for-form" translation is not applicable.

Without the analysis of at least nominal groups it is often very difficult to solve this problem, because, for example, the actual morphemic categories of adjectives are in Czech distinguishable only on the basis of gender, number and case agreement between an adjective and its governing noun. An alternative solution of this problem was the application of a stochastically based morphological disambiguator (morphological tagger) for Czech, whose success rate is close to 92%. Our system therefore consists of the following modules:

1. Import of the input from a so-called 'empty' translation memory
2. Morphological analysis of Czech
3. Morphological disambiguation
4. Domain-related bilingual glossaries (incl. single- and multiword terminology)
5. General bilingual dictionary
6. Morphological synthesis of Slovak
7. Export of the output to the original translation memory

Let us now look in more detail at the individual modules of the system:

ad 1. The input text is extracted out of a translation memory previously exported into an ASCII file. The exported translation memory (of TRADOS) has an SGML-like notation with a relatively simple structure (cf. the following example):

Example 1. - A sample of the exported translation memory

...
23051999
VK
Pomocí výkazu ad-hoc můžete rychle a jednoduše vytvářet rešerše.
n/a

Our system uses only two kinds of segments: the source-language segments, each of which contains one source language sentence, and the target-language segments, which are empty and will later contain the same sentence translated into the target language by ČESÍLKO.

ad 2. The morphological analysis of Czech is based on the morphological dictionary developed by Jan Hajič and Hana Skoumalová in 1988-99 (for the latest description, see Hajič (1998)). The dictionary contains over 700 000 dictionary entries and its typical coverage varies between
99% (novels) and 95% (technical texts). The morphological analysis uses a system of positional tags with 15 positions (each morphological category, such as Part-of-speech, Number, Gender, Case, etc. has a fixed, single-symbol place in the tag).

Example 2 - tags assigned to the word-form "pomocí" (help/by means of)

pomocí: NFP2------A----|NFS7------A----|R--2-----------

where:
N - noun; R - preposition
F - feminine gender
S - singular, P - plural
7, 2 - case (7 - instrumental, 2 - genitive)
A - affirmative (non negative)

ad 3. The module of morphological disambiguation is a key to the success of the translation. It gets an average of 3.58 tags per token (word form in text) as an input. The tagging system is purely statistical, and it uses a log-linear model of probability distribution - see Hajič, Hladká (1998). The learning is based on a manually tagged corpus of Czech texts (mostly from the general newspaper domain). The system learns contextual rules (features) automatically and also automatically determines feature weights. The average accuracy of tagging is between 91 and 93% and remains the same even for technical texts (if we disregard the unknown names and foreign-language terms that are not ambiguous anyway).

The lemmatization immediately follows tagging; it chooses the first lemma with a possible tag corresponding to the tag selected. Despite this simple lemmatization method, and also thanks to the fact that Czech words are rarely ambiguous in their Part-of-speech, it works with an accuracy exceeding 98%.

ad 4. The domain-related bilingual glossaries contain pairs of individual words and pairs of multiple-word terms. The glossaries are organized into a hierarchy specified by the user; typically, the glossaries for the most specific domain are applied first. There is one general matching rule for all levels of glossaries - the longest match wins.

The multiple-word terms are sequences of lemmas (not word forms). This structure has several advantages: among others, it minimizes the size of the dictionary and, due to the simplicity of the structure, it allows modifications of the glossaries by a linguistically naive user. The necessary morphological information is introduced into the domain-related glossary in an off-line preprocessing stage, which does not require user intervention. This makes a big difference when compared to the RUSLAN Czech-to-Russian MT system, where each multiword dictionary entry cost about 30 minutes of a linguistic expert's time on average.

ad 5. The main bilingual dictionary contains data necessary for the translation of both lemmas and tags. The translation of tags (from the Czech into the Slovak morphological system) is necessary because, due to the morphological differences, the two systems use close but slightly different tagsets. Currently the system handles the 1:1 translation of tags (and 2:2, 3:3, etc.). A different ratio of translation is very rare between Czech and Slovak, but nevertheless an advanced system of dictionary items is under construction (for the translation 1:2, 2:1 etc.). It is quite interesting that lexically homonymous words often preserve their homonymy even after the translation, so no special treatment of homonyms is deemed necessary.

ad 6. The morphological synthesis of Slovak is based on a monolingual dictionary of Slovak, developed by J. Hric (1991-99), covering more than 100,000 dictionary entries. The coverage of the dictionary is not as high as that of the Czech one, but it is still growing. It aims at a coverage of Slovak similar to the one we enjoy for Czech.

ad 7. The export of the output of the system ČESÍLKO into the translation memory (of TRADOS Translator's Workbench) amounts mainly to cleaning out all irrelevant SGML markers. The whole resulting Slovak sentence is inserted into the appropriate location in the original translation memory file. Example 3 also shows that one of the markers (with the value MT!) records that the target language sentence was created by an MT system.

Example 3. - A sample of the translation memory containing the results of MT

...
23051999
MT!
Pomocí výkazu ad-hoc můžete rychle a jednoduše vytvářet rešerše.
Pomoci výkazov ad-hoc môžete rýchlo a jednoducho vytvárať rešerše.

3.2 Evaluation of results

The problem of how to evaluate the results of automatic translation is very difficult. For the evaluation of our system we have exploited the close connection between our system and the TRADOS Translator's Workbench. The method is simple - the human translator receives the translation memory created by our system and translates the text using this memory. The translator is free to make any changes to the text proposed by the translation memory. The target text created by the human translator is then compared with the text created by the mechanical application of the translation memory to the source text. TRADOS then evaluates the percentage of matching in the same manner as it normally evaluates the percentage of matching of a source text with sentences in the translation memory. Our system achieved about 90% match (as defined by the TRADOS match module) with the results of human translation, based on a relatively large (more than 10,000 words) test sample.

4. Conclusions

The accuracy of the translation achieved by our system justifies the hypothesis that word-for-word translation might be a solution for MT of really closely related languages. The remaining problems to be solved are those of one-to-many or many-to-many translation, where the lack of information in glossaries and dictionaries sometimes causes an unnecessary translation error.

The success of the system ČESÍLKO has encouraged the investigation of the possibility to use the same method for other pairs of Slavic languages, namely for Czech-to-Polish translation. Although these languages are not as similar as Czech and Slovak, we hope that the addition of a simple partial noun phrase parsing might provide results with a quality comparable to the full-fledged syntactic analysis based system RUSLAN (this is of course true also for the Czech-to-Slovak translation). The first results of Czech-to-Polish translation are quite encouraging in this respect, even though we could not perform testing as rigorous as we did for Slovak.

Acknowledgements

This project was supported by the grant GAČR 405/96/K214 and partially by the grant GAČR 201/99/0236 and by the project of the Ministry of Education No. VS96151.

References

Bémová, Alevtina and Kuboň, Vladislav (1990). Czech-to-Russian Transducing Dictionary. In: Proceedings of the XIIIth COLING conference, Helsinki 1990.

Hajič, Jan (1998). Building and Using a Syntactically Annotated Corpus: The Prague Dependency Treebank. In: Festschrift for Jarmila Panevová, Karolinum Press, Charles University, Prague. pp. 106-132.

Hajič, Jan and Barbora Hladká (1998). Tagging Inflective Languages. Prediction of Morphological Categories for a Rich, Structured Tagset. ACL-Coling'98, Montreal, Canada, August 1998, pp. 483-490.

Hajič, Jan; Brill, Eric; Collins, Michael; Hladká, Barbora; Jones, Douglas; Kuo, Cynthia; Ramshaw, Lance; Schwartz, Oren; Tillman, Christoph; and Zeman, Daniel (1998). Core Natural Language Processing Technology Applicable to Multiple Languages. The Workshop'98 Final Report. CLSP JHU. Also at: http://www.clsp.jhu.edu/ws98/projects/nlp/report.

Kirschner, Zdeněk (1987). APAC3-2: An English-to-Czech Machine Translation System; Explizite Beschreibung der Sprache und automatische Textbearbeitung XIII, MFF UK Prague.

Oliva, Karel (1989). A Parser for Czech Implemented in Systems Q; Explizite Beschreibung der Sprache und automatische Textbearbeitung XVI, MFF UK Prague.