RUSLAN - an MT System Between Closely Related Languages
Jan Hajič
Výzkumný ústav matematických strojů
Loretánské nám. 3
118 55 Praha 1, Czechoslovakia

ABSTRACT

A project of machine translation of Czech computer manuals into Russian is described. We first describe the overall system structure and then concentrate mainly on input text preparation and on a parsing algorithm based on a bottom-up parser programmed in Colmerauer's Q-systems.

INTRODUCTION

In mid-1985, a project of machine translation of Czech computer manuals into Russian was started, thus constituting a second MT project of the group of mathematical linguistics at Charles University (for a full description of the first project, see (Kirschner, 1982) and (Kirschner, in press)).

Our goals are both practical (translation or re-translation of new or re-edited manuals for export purposes within the COMECON countries, of an estimated amount of 500 to 1000 pages a year) and theoretical (we wish to verify our approach to the analysis of Czech and to develop a theoretical background for translation between closely related languages such as Czech and Russian).

The project is carried out by VÚMS, Prague (Research Institute for Computing Machinery) at the Department of Software in cooperation with the Department of Mathematical Linguistics, Faculty of Mathematics and Physics, Charles University, Prague.

Input texts

The texts our system should translate are software manuals for the VÚMS-developed DOS-4 operating system, which is an advanced extension of the common DOS. The texts are currently maintained on tapes under the editing and formatting system PES (Programmed Editing System). This system allows for preparation, editing and binding-ready printout using national printer chain(s). Texts are stored on tapes in an internal format containing upper/lowercase letters, editing & formatting commands, version number/identification, info on last-changed pages etc.; most of this can be used to improve the overall translation quality. On the other hand, part of it is somewhat confusing and must be handled carefully.

By now, we have access to 65 manuals on tapes, containing about 12,000 pages (approx. 1,500,000 running words - 53,000 different word forms). The complete documentation covers 78 manuals and is still growing.

The overall structure

RUSLAN is a unidirectional system dealing with one pair of languages (SL - Czech, TL - Russian). We adopt a transfer-like translation scheme (in the sense that we do not use any intermediate pilot language), but with many simplifications due to the close relationship between Czech and Russian, so that it belongs to the so-called direct method (in the sense of (Slocum, 1985)).

The translation process itself is to be carried out in batch (we have to respect the hardware available). This means that no human intervention is possible during the process. Nevertheless, our aim is to obtain high-quality results which would require usual post-editing only. No human pre-editing is contained in the system design.

The translation unit is constituted by a single sentence. Thus, the recognition of sentence boundaries is a part of the preprocessing.

For the time being, a treatment of ellipsis is not provided for, but a modification of the analysis is being prepared to account for cases (not very frequent in the translated manuals) where information necessary for an appropriate translation should be looked for in the previous sentence(s).

Translation steps

RUSLAN performs the following steps to obtain the translation of a given (part of a) manual:

(1) The text is "punched" from a tape, to "visualize" all embedded editing & formatting commands;
(2) Fully automatic preprocessing follows, which includes:
    - national & special characters conversion & coding
    - sentence boundaries recognition
(3) The Czech morphological analysis (MA) is performed, followed by
(4) the syntactico-semantic analysis (SSA) with respect to Russian sentence structure, for each input sentence separately.
(5) The representation obtained in the previous step is converted into a Russian surface word list in an appropriate order, simultaneously performing some TL-dependent changes.
(6) Then, morphological synthesis of Russian (MSR) is performed and at the same time synthesized words are decoded and put out along with preserved editing & formatting commands, and at last
(7) the output is saved onto a tape under the PES system again.

The resulting text can then be easily printed and corrected using PES editing facilities.

Some more details

Since the overall structure of RUSLAN does not differ considerably from the existing MT systems, we will concentrate in this paper on some interesting details.

ad (1): Getting a text out of the tape

This function is performed by means of the PES "punch" command only. Internally coded words and commands are converted to card-like character format, so they can be read easily by other programs. This step is processed separately because we want to achieve the maximal possible independence of hardware and operating system.

ad (2): Preprocessing

True words and punctuation are recognized and coded using alphanumeric characters only. Special characters (such as /, +, :, Greek characters, etc.) and PES commands are coded similarly, but they are handled as word attributes rather than as separate words.

The recognition of sentence boundaries proved to be the hardest problem of this stage. We have developed a special algorithm for sentence boundaries recognition, which takes editing commands and punctuation into consideration, as well as upper/lowercase letters in special positions. This algorithm is based on frames and features. Text is cut whenever the "End Of Sentence" condition is met. Such a condition is raised when one of the features of the next text element is found in the frame of the current text element.

Features assigned to each element are e. g. "beginning of sentence" - an unconditional sentence boundary assigned to some PES commands, or "capitalized" - this one is assigned to a word starting with exactly one uppercase letter. Among other features we use there are "common word", "uppercase only", "number" and some others classifying PES commands.

Frames contain "beginning of sentence" in most cases; a more complicated situation arises when evaluating punctuation frames. Frames for ".", ";", "?" are created using quite complicated algorithms. Clearly, it is not possible to obtain 100% correctness without a deeper analysis, so we prefer (isolated) missing cuts to incomplete sentences. Tests showed only one missing cut every 100 pages of continuous text (introductory manuals), and one every 30-50 pages in reference manuals; no incomplete sentences appeared anywhere in the sample. This looks promising, because missing cuts result in a slowdown of analysis only.

ad (3): Morphological analysis

Since Czech is a highly inflectional language, this part is a somewhat more complicated task than MA for English. However, at the stage of MA of Czech we obtain much more information useful for the syntactico-semantic analysis.

MA is based on pattern unification. During the MA, the main dictionary is searched through to find all possible stems; ambiguities are treated in parallel during the next phase of processing.

ad (4): Syntactico-semantic analysis

SSA is the most important part of RUSLAN. Using Sgall's FGD as the theoretical starting point (for the most recent formulation, see (Sgall et al., 1986)), the dependency approach and data-driven parsing are the cornerstones and valency frames are the tools of SSA. To control the combinatorial expansion, semantic features are used as additional constraints to the syntactic ones (for a more detailed account of SSA, see (Oliva, in prep.)).

The result of SSA is affected by the TL syntax - so there is no true separate transfer component in our system. In most cases, the need for changes can be resolved on the basis of the Czech sentence. A module is being prepared that carries out some minor restructuring (necessary e. g. for determining the word order and some instances of negation), which will be performed before the synthesis.

The close relationship between Czech and Russian helps us to leave many ambiguities unresolved and to allow the output to be as ambiguous as the input. We must resolve those ambiguities that would create multiple outputs in the TL, and select only one of them, but this is the case for only a limited number of sentences.

ad (5): Generation

For the time being, no true TL-restructuring is being performed. During the dependency tree decomposition, morphological information is transferred from the governor to its dependent modifications according to agreement. The original word order is slightly changed when needed. An ordered list of words with morphological information and editing/formatting attributes restored is the output of this phase.

ad (6): Morphological synthesis

[...] prepositions and some pronouns) forced by the adjacent word(s).

After MSR, each word is decoded (including its attributes) to the PES-acceptable format and "punched" out. This is an inverse operation to step (2).

ad (7): Cataloguing

Handled by PES solely, this is an inverse operation to step (1).

Implementation

All the testing is performed on the EC-1027 or IBM/370 systems at VÚMS (under DOS-4). The base of the system (steps 3, 4 and 5) can run under the OS operating system as well.

Steps 1 and 7 are handled by special software, which is a part of the DOS-4 operating system. Steps 2 and 6 are written in standard Pascal (including the MSR module). Steps 3 to 5 are programmed in the well-known Q-systems, implemented through Fortran IV (G or H level). We use the Q-language compiler with the kind permission of its original author, prof. B. Thouin; some marginal changes were made in the Q-language interpreter due to the practical needs of our system. The only noticeable change is that complete graphs formerly deleted due to the CUL + DE + SAC mechanism are now passed (unchanged) to the next Q-system for further processing.
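As an illustration of the frame-and-feature test used for sentence boundary recognition in step (2), the following minimal Python sketch cuts the text whenever a feature of the next element is found in the frame of the current one. It is not the Pascal implementation used in RUSLAN. The names Element, word_features, element and split_sentences, the "<PARAGRAPH>" placeholder standing in for a PES command, and the toy frame chosen for "." are all assumptions made for this example; the paper says only that the real punctuation frames are built by considerably more complicated algorithms.

```python
from dataclasses import dataclass

BOS = "beginning of sentence"


@dataclass
class Element:
    """One text element: a word, a punctuation mark or an editing command."""
    text: str
    features: frozenset = frozenset()  # what this element is
    frame: frozenset = frozenset()     # features of the NEXT element that force a cut


def word_features(token):
    """Toy feature assignment using the feature names quoted in the text."""
    if token.isdigit():
        return frozenset({"number"})
    if token.isalpha() and token.isupper() and len(token) > 1:
        return frozenset({"uppercase only"})
    if token[:1].isupper() and not token[1:2].isupper():
        return frozenset({"capitalized"})  # exactly one leading uppercase letter
    if token.isalpha():
        return frozenset({"common word"})
    return frozenset()                     # punctuation carries no feature here


def element(token):
    """Assumed frames: BOS everywhere, plus 'capitalized' after a full stop."""
    frame = {BOS}
    if token == ".":
        frame.add("capitalized")
    return Element(token, word_features(token), frozenset(frame))


def split_sentences(elements):
    """Cut after element i if a feature of element i+1 is in the frame of i."""
    sentences, current = [], []
    for i, elem in enumerate(elements):
        current.append(elem)
        is_last = i + 1 == len(elements)
        if is_last or (elem.frame & elements[i + 1].features):
            sentences.append(current)
            current = []
    return sentences


# Hypothetical stand-in for a PES command that unconditionally starts a sentence.
command = Element("<PARAGRAPH>", frozenset({BOS}), frozenset({BOS}))

elements = (
    [element(t) for t in "Program kontroluje data . Editor tiskne text .".split()]
    + [command]
    + [element(t) for t in "tabulka 4 . 2".split()]
)

for sentence in split_sentences(elements):
    print(" ".join(e.text for e in sentence))
```

With these toy frames the full stop in "4 . 2" does not trigger a cut, because "number" is not in the frame of ".". A full stop followed by an "uppercase only" word would likewise be left uncut, which errs in the direction the paper prefers: an isolated missing cut rather than an incomplete sentence.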