A Model of Competence for Corpus-Based Machine Translation

Michael Carl
Institut für Angewandte Informationsforschung,
Martin-Luther-Straße 14,
66111 Saarbrücken, Germany

Abstract

In this paper I elaborate a model of competence for corpus-based machine translation (CBMT) along the lines of the representations used in the translation system. Representations in CBMT systems can be rich or austere, molecular or holistic, and they can be fine-grained or coarse-grained. The paper shows that different CBMT architectures are required depending on whether a better translation quality or a broader coverage is preferred, according to Boitet (1999)'s formula: "Coverage * Quality = K".

1 Introduction

In the machine translation (MT) literature, it has often been argued that translations of natural language texts are valid if and only if the source language text and the target language text have the same meaning (cf. e.g. Nagao, 1989). If we assume that MT systems produce meaningful translations to a certain extent, we must assume that such systems have a notion of the source text meaning to a similar extent. Hence, the translation algorithm together with the data it uses encodes a formal model of meaning. Despite 50 years of intense research, there is no existing system that could map arbitrary input texts onto meaning-equivalent output texts. How is that possible?

According to Dummett (1975), a theory of meaning is a theory of understanding: having a theory of meaning means that one has a theory of understanding. In linguistic research, texts are described on a number of levels and dimensions, each contributing to their understanding and hence to their meaning. Traditionally, the main focus has been on semantic aspects. In this research it is assumed that knowing the propositional structure of a text means to understand it. Under the same premise, research in MT has focused on semantic aspects, assuming that texts have the same meaning if they are semantically equivalent.

Recent research in corpus-based MT has different premisses. Corpus-Based Machine Translation (CBMT) systems make use of a set of reference translations on which the translation of a new text is based. In CBMT systems, it is assumed that the reference translations given to the system in a training phase have equivalent meanings. According to their intelligence, these systems try to figure out what the meaning invariance in the reference text consists of, and they learn an appropriate source language/target language mapping mechanism. A translation can only be generated if an appropriate example translation is available in the reference text. An interesting question in CBMT systems is thus: what theory of meaning should the learning process implement in order to generate an appropriate understanding of the source text such that it can be mapped into a meaning-equivalent target text?

Dummett (1975) suggests a distinction of theories of meaning along the following lines:

In a rich theory of meaning, the knowledge of the concepts is achieved by knowing the features of these concepts. An austere theory merely relies upon simple recognition of the shape of the concepts. A rich theory can justify the use of a concept by means of the characteristic features of that concept, whereas an austere theory can justify the use of a concept merely by enumerating all occurrences of the use of that concept.

A molecular theory of meaning derives the understanding of an expression from a finite number of axioms. A holistic theory, in contrast, derives the understanding of an expression through its distinction from all other expressions in that language. A molecular theory, therefore, provides criteria to associate a certain meaning with a sentence and can explain the concepts used in the language. In a holistic theory nothing is specified about the knowledge of the language other than in global constraints related to the language as a whole.

In addition, the granularity of concepts seems crucial for CBMT implementations.

A fine-grained theory of meaning derives concepts from single morphemes or separable words of the language, whereas in a coarse-grained theory of meaning, concepts are obtained from morpheme clusters. In a fine-grained theory of meaning, complex concepts can be created by hierarchical composition of their components, whereas in a coarse-grained theory of meaning, complex meanings can only be achieved through a concatenation of concept sequences.

The next three sections discuss the dichotomies of theories of meaning (rich vs. austere, molecular vs. holistic, and coarse-grained vs. fine-grained), and a few CBMT systems are classified according to the terminology introduced. This leads to a model of competence for CBMT. It appears that translation systems can either be designed to have a broad coverage or a high quality.
2 Rich vs. Austere CBMT

A common characteristic of all CBMT systems is that the understanding of the translation task is derived from the understanding of the reference translations. The inferred translation knowledge is used in the translation phase to generate new translations.

Collins (1998) distinguishes between Memory-Based MT (i.e. memory heavy, linguistic light) and Example-Based MT (i.e. memory light and linguistic heavy). While the former systems implement an austere theory of meaning, the latter make use of rich representations.

The most superficial theory of understanding is implemented in purely memory-based MT approaches, where learning takes place only by extending the reference text. No abstraction or generalization of the reference examples takes place.

Translation Memories (TMs) are such purely memory-based MT systems. A TM, e.g. TRADOS's Translator's Workbench (Heyn, 1996) and STAR's TRANSIT, calculates the graphemic similarity of the input text and the source side of the reference translations and returns the target string of the most similar translation examples as output. TMs make use of a set of reference translation examples and a k-nn retrieval algorithm. They implement an austere theory of meaning because they cannot justify the use of a word other than by looking up all contexts in which the word occurs. They can, however, enumerate all occurrences of a word in the reference text.

The TM distributed by ZERES (Zer, 1997) follows a richer approach. The reference translations and the input sentence to be translated are lemmatized and part-of-speech tagged. The source language sentence is mapped against the reference translations on a surface string level, on a lemma level and on a part-of-speech level. Those example translations which show greatest similarity to the input sentence with respect to the three levels of description are returned as the best available translation.

Example-Based Machine Translation (EBMT) systems (Sato and Nagao, 1990; Collins, 1998; Güvenir and Cicekli, 1998; Carl, 1999; Brown, 1997) are richer systems. Translation examples are stored as feature and tree structures. Translation templates are generated which contain (sometimes weighted) connections in those positions where the source language and the target language equivalences are strong. In the translation phase, a multi-layered mapping from the source language into the target language takes place on the level of templates and on the level of fillers.

The ReVerb EBMT system (Collins, 1998) performs sub-sentential chunking and seeks to link constituents with the same function in the source and the target language. A source language subject is translated as a target language subject and a source language object as a target language object. In case there is no appropriate translation template available, single words can be replaced as well, at the expense of translation quality.

The EBMT approach described in Güvenir and Cicekli (1998) makes use of morphological knowledge and relies on word stems as a basis for translation. Translation templates are generalized from aligned sentences by substituting differences in sentence pairs with variables and leaving the identical substrings unsubstituted. An iterative application of this method generates translation examples and translation templates which serve as the basis for an example-based MT system. An understanding consists of the extraction of compositionally translatable substrings and the generation of translation templates.

A similar approach is followed in EDGAR (Carl, 1999). Sentences are morphologically analyzed and translation templates are decorated with features. Fillers in translation template slots are constrained to unify with these features. In addition to this, a shallow linguistic formalism is used to percolate features in derivation trees.

Sato and Nagao (1990) proposed still richer representations where syntactically analyzed phrases and sentences are stored in a database. In the translation phase, the most similar derivation trees are retrieved from the database and a target language derivation tree is composed from the translated parts. By means of a thesaurus, semantically similar lexical items may be exchanged in the derivation trees.
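The TM retrieval step described above can be sketched as a k-nearest-neighbour search under a graphemic (edit-distance) similarity. This is an illustrative sketch under simplifying assumptions, not the actual algorithm of any product named above; the example translation pairs are invented.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: a simple graphemic similarity measure."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def tm_retrieve(source: str, memory: list, k: int = 1):
    """Return the target side(s) of the k graphemically most similar examples."""
    ranked = sorted(memory, key=lambda pair: edit_distance(source, pair[0]))
    return [tgt for _src, tgt in ranked[:k]]

# Invented German-English example base:
memory = [("Der Vertrag tritt in Kraft.", "The contract comes into force."),
          ("Der Vertrag ist ungültig.", "The contract is void.")]
print(tm_retrieve("Der Vertrag tritt am Montag in Kraft.", memory, k=1))
```

Note that the metric is static: adding examples to `memory` changes neither the symbol set nor the distance function, which is exactly the molecular property discussed in Section 3.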
Statistics-based MT (SBMT) approaches implement austere theories of meaning. For instance, in Brown et al. (1990) a number of models are presented, starting with simple stochastic translation models and getting incrementally more complex and rich by introducing more random variables. No linguistic analyses are taken into account in these approaches. However, in further research the authors plan to integrate linguistic knowledge such as inflectional analysis of verbs, nouns and adjectives.

McLean (1992) has proposed an austere approach where he uses neural networks (NNs) to translate surface strings from English to French. His approach functions similarly to a TM, where the NN is used to classify the sequences of surface word forms according to the examples given in the reference translations. On a small set of examples he shows that NNs can successfully be applied to MT.

[Figure 1: Atomicity, Granularity and Richness of CBMT. Three panels plot the richness of representation (graphemic, morphological, syntactic, semantic), the granularity of representation (word, phrase, sentence) and the atomicity of representation (molecular, mixed, holistic) for: (1) Sato and Nagao (1990), (2) EDGAR (Carl, 1999), (3) GEBMT (Güvenir and Cicekli, 1998), (4) ZERES (Zer, 1997), (5) TRADOS (Heyn, 1996), (6) ReVerb (Collins, 1998), (7) Brown (1997), (8) Brown et al. (1990), (9) McLean (1992).]

3 Molecular vs. Holistic CBMT

As discussed in the previous section, all CBMT systems make use of some text dimensions in order to map a source language text into the target language. TMs, for instance, rely on the set of graphemic symbols, i.e. the ASCII set. Richer systems use lexical, morphological, syntactic and/or semantic descriptions. The degree to which the set of descriptions is independent from the reference translations determines the molecularity of the theory. The more the descriptions are learned from, and thus depend on, the reference translations, the more the system becomes holistic. Learning descriptions from reference translations makes the system more robust and easier to adjust to a new text domain.

TMs implement the molecular CBMT approach as they rely on a static distance metric which is independent of the size and content of the case base. TMs are molecular because they rely on a fixed and limited set of graphic symbols. Adding further example translations to the database neither increases the set of graphic symbols nor modifies the distance metric. Learning capacities in TMs are trivial, as their only way to learn is through extension of the example base.

The translation templates generated by Güvenir and Cicekli (1998), for instance, differ according to the similarities and dissimilarities found in the reference text. Translation templates in this system thus reflect holistic aspects of the example translations. The way in which morphological analysis is processed is, however, independent from the translation examples and is thus a molecular aspect of the system.

Similarly, the ReVerb EBMT system (Collins, 1998) makes use of holistic components. The reference text is part-of-speech tagged. The length of translation segments as well as their most likely initial and final words are calculated based on probabilities found in the reference text.

SBMT approaches (e.g. Brown et al., 1990) have a purely holistic view on languages. Every sentence of one language is considered to be a possible translation of any sentence in the other language. No account is given for the equivalence of the source language meaning and the target language meaning other than by means of global considerations concerning frequencies of occurrence in the reference text. In order to compute the most probable translations, each pair of items of the source language and the target language is associated with a certain probability. This prior probability is derived from the reference text. In the translation phase, several target language sequences are considered, and the one with the highest posterior probability is then taken to be the translation of the source language string.

Similarly, neural network based CBMT systems (McLean, 1992) are holistic approaches. The training of the weights and the minimization of the classification error rely on the reference text as a whole. Attempts to extract rules from the trained neural networks seek to isolate and make explicit aspects of how the net successfully classifies new sequences. The training process, however, remains holistic.
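The template generalization attributed to Güvenir and Cicekli (1998) above can be illustrated with a toy sketch: differing substrings of two aligned example sentences are replaced by a variable, while identical substrings are kept. The sentences below are invented, and the sketch handles only a single, monolingual difference region, which is enough to show the idea.

```python
def generalize(sent1: str, sent2: str):
    """Replace the (single) differing substring of two sentences by a variable X.

    Returns (template, fillers) if the sentences share a common prefix and
    suffix, otherwise None. A much simplified illustration of inducing
    translation templates from similarities and differences in examples."""
    w1, w2 = sent1.split(), sent2.split()
    i = 0                                   # longest common prefix
    while i < min(len(w1), len(w2)) and w1[i] == w2[i]:
        i += 1
    j = 0                                   # longest common suffix after the prefix
    while j < min(len(w1), len(w2)) - i and w1[-1 - j] == w2[-1 - j]:
        j += 1
    if i == 0 and j == 0:
        return None                         # nothing in common: no template
    suffix = w1[len(w1) - j:]
    return (" ".join(w1[:i] + ["X"] + suffix),
            (w1[i:len(w1) - j], w2[i:len(w2) - j]))

# Invented example pair differing in one position:
template, fillers = generalize("I gave the book to Mary", "I gave the pen to Mary")
print(template)
print(fillers)
```

In a real system this generalization is applied iteratively and bilingually, so that the variable on the source side is linked with the corresponding variable on the target side; the template set thus reflects whatever similarities the reference text happens to contain, which is its holistic aspect.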
[Figure 2: A Model of Competence for CBMT. Expected coverage of the system (vertical axis, low to high) is plotted against expected translation quality (horizontal axis, low to high) for: (1) Sato and Nagao (1990), (2) Carl (1999), (3) Güvenir and Cicekli (1998), (4) Zer (1997), (5) Heyn (1996), (6) Collins (1998), (7) Brown (1997), (8) Brown et al. (1990), (9) McLean (1992).]
4 Coarse vs. Fine Graining CBMT

One task that all MT systems perform is to segment the text to be translated into translation units which, to a certain extent, can be translated independently. The ways in which segmentation takes place and how the translated segments are joined together in the target language are different in each MT system.

In Collins (1998), segmentation takes place on a phrasal level. Due to the lack of a rich morphological representation, agreement cannot always be granted in the target language when translating single words from English to German. Reliable translation cannot be guaranteed when phrases in the target language, or parts of them, are moved from one position (e.g. the object position) into another one (e.g. a subject position).

In Güvenir and Cicekli (1998), this situation is even more problematic because there are no restrictions on possible fillers of translation template slots. Thus, a slot which has originally been filled with an object can, in the translation process, even accommodate an adverb or the subject.

SBMT approaches perform fine-grained segmentation. Brown et al. (1990) segment the input sentences into words, where for each source-target language word pair translation probabilities, fertility probabilities, alignment probabilities etc. are computed. Coarse-grained segmentation is unrealistic because sequences of 3 or more words (so-called n-grams) occur very rarely for n >= 3 even in huge learning corpora (Brown et al. (1990) use the Hansard French-English text containing several million words). Statistical and probabilistic systems rely on word frequencies found in texts and usually cannot extrapolate from a very small number of word occurrences. A statistical language model assigns to each n-gram a probability which enables the system to generate the most likely target language strings.
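The interplay of translation probabilities and a statistical language model can be sketched as a toy noisy-channel decoder: each candidate target string is scored by the product of word-for-word translation probabilities and bigram language model probabilities, and the candidate with the highest posterior score is returned. All probabilities below are invented for illustration; systems such as Brown et al. (1990) estimate them from millions of aligned sentences and use far richer models.

```python
import math

# Invented word translation probabilities P(source_word | target_word):
t_prob = {("maison", "house"): 0.8, ("maison", "home"): 0.2,
          ("la", "the"): 0.9, ("la", "it"): 0.1}

# Invented bigram language model P(word | previous_word); "<s>" = sentence start:
lm = {("<s>", "the"): 0.5, ("the", "house"): 0.4, ("the", "home"): 0.1,
      ("<s>", "it"): 0.3, ("it", "house"): 0.01, ("it", "home"): 0.02}

def score(source_words, target_words):
    """log P(target) + log P(source | target), aligned word-for-word."""
    logp = 0.0
    prev = "<s>"
    for tw in target_words:                     # language model term
        logp += math.log(lm.get((prev, tw), 1e-9))
        prev = tw
    for sw, tw in zip(source_words, target_words):  # translation model term
        logp += math.log(t_prob.get((sw, tw), 1e-9))
    return logp

def decode(source_words, candidates):
    """Return the candidate target string with the highest posterior score."""
    return max(candidates, key=lambda c: score(source_words, c))

candidates = [["the", "house"], ["the", "home"], ["it", "house"]]
print(decode(["la", "maison"], candidates))
```

The prior (language model) and the translation probabilities are both derived from the reference text, which is why the approach counts as holistic in the sense of Section 3.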
5 A Competence Model for CBMT

A competence model is presented as two independent parameters, i.e. Coverage and Quality (see Figure 2).

Coverage of the system refers to the extent to which a variety of source language texts can be translated. A system has a high coverage if a great variety of texts can be translated. A low-coverage system can translate only restricted texts of a certain domain with limited terminology and linguistic structures.

Quality refers to the degree to which an MT system produces successful translations. A system has a low quality if the produced translations are not even informative, in the sense that a user cannot understand what the source text is about. A high-quality MT system produces user-oriented and correct translations with respect to text type, terminological preferences, personal style, etc.

An MT system with low coverage and low quality is completely uninteresting. Such a system comes close to a random number generator, as it translates few texts in an unpredictable way.

An MT system with high coverage and "not-too-bad" quality can be useful in a Web application where a great variety of texts are to be translated for occasional users who want to grasp the basic ideas of a foreign text. On the other hand, a system with high quality and restricted coverage might be useful for in-house MT applications or a controlled language.

An MT system with high coverage and high quality would translate any type of text to everyone's satisfaction. However, as one can expect, such a system seems not to be feasible.

Boitet (1999) proposes "the tentative formula: Coverage * Quality = K", where K depends on the MT technology and the amount of work encoded in the system. The question, then, is when the maximum K is possible and how much work we want to invest for what purpose. Moreover, a given K can mean high coverage and low quality, or it can mean the reverse.

The expected quality of a CBMT system increases when the input text is segmented more coarsely. Consequently, a low coverage must be expected, due to the combinatorial explosion of the number of longer chunks. In order for a fine-graining system to generate at least informative translations, further knowledge resources need to be considered. These knowledge resources may be either pre-defined and molecular, or they can be derived from reference translations and holistic.

TMs focus on the quality of translations. Only large clusters of meaning entities are translated into the target language, in the hope that such clusters will not interfere with the context from which they are taken. Broader coverage can be achieved through finer-grained segmentation of the input into phrases or single terms. Systems which finely segment texts use rich representation languages in order to adapt the translation units to the target language context or, as in the case of SBMT systems, use holistically derived constraints.

What can be learned and what should be learned from the reference text, how to represent the inferred knowledge, how to combine it with pre-defined knowledge, and the impact of different settings on the constant K in the formula of Boitet (1999) are all still open questions for CBMT design.
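Boitet's tentative formula can be made concrete with a toy calculation: for a fixed K, raising coverage forces quality down proportionally, and vice versa. The value of K below is invented purely for illustration.

```python
K = 0.36  # invented constant for some hypothetical MT technology

def quality_for(coverage: float) -> float:
    """Given Coverage * Quality = K, the quality attainable at a coverage level."""
    return K / coverage

# The same K can mean high coverage and low quality, or the reverse:
for coverage in (0.9, 0.6, 0.4):
    print(f"coverage={coverage:.1f}  quality={quality_for(coverage):.2f}")
```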
6 Conclusion

Machine Translation (MT) is a meaning-preserving mapping from a source language text into a target language text. In order to enable a computer system to perform such a mapping, it is provided with a formalized theory of meaning.

Theories of meaning are characterized by three dichotomies: they can be holistic or molecular, austere or rich, and they can be fine-grained or coarse-grained.

A number of CBMT systems (translation memories, example-based and statistics-based machine translation systems) are examined with respect to these dichotomies. In a system that uses a rich theory of meaning, complex representations are computed, including morphological, syntactic and semantic representations, while with an austere theory the system relies on the mere graphemic surface form of the text. In a holistic implementation, meaning descriptions are derived from reference translations, while in a molecular approach the meaning descriptions are obtained from a finite set of pre-defined features. In a fine-grained theory, the minimal length of a translation unit is equivalent to a morpheme, while in a coarse-grained theory it amounts to a morpheme cluster, a phrase or a sentence.

According to the implemented theory of meaning, one can expect to obtain high quality translations or a good coverage of the CBMT system. The more the system makes use of coarse-grained translation units, the higher is the expected translation quality. The more the theory uses rich representations, the more the system may achieve broad coverage. CBMT systems can be tuned to achieve either of the two goals.

References

Christian Boitet. 1999. A research perspective on how to democratize machine translation and translation aides aiming at high quality final output. In MT-Summit '99.

Peter F. Brown, J. Cocke, Stephen A. Della Pietra, Vincent J. Della Pietra, F. Jelinek, Robert L. Mercer, and P. S. Roossin. 1990. A statistical approach to machine translation. Computational Linguistics, 16:79-85.

Ralf D. Brown. 1997. Automated Dictionary Extraction for "Knowledge-Free" Example-Based Translation. In TMI-97, pages 111-118.

Michael Carl. 1999. Inducing Translation Templates for Example-Based Machine Translation. In MT-Summit VII.

Bróna Collins. 1998. Example-Based Machine Translation: An Adaptation-Guided Retrieval Approach. Ph.D. thesis, Trinity College, Dublin.

Michael Dummett. 1975. What is a Theory of Meaning? In Mind and Language. Oxford University Press, Oxford.

Halil Altay Güvenir and Ilyas Cicekli. 1998. Learning Translation Templates from Examples. Information Systems, 23(6):353-363.

Matthias Heyn. 1996. Integrating machine translation into translation memory systems. In European Association for Machine Translation - Workshop Proceedings, pages 111-123, ISSCO, Geneva.

Ian J. McLean. 1992. Example-Based Machine Translation using Connectionist Matching. In TMI-92.

Makoto Nagao. 1989. Machine Translation: How Far Can It Go? Oxford University Press, Oxford.

S. Sato and M. Nagao. 1990. Towards memory-based translation. In COLING-90.

Zeres GmbH, Bochum, Germany. 1997. ZERESTRANS Benutzerhandbuch.