
Translating into Free Word Order Languages

Beryl Hoffman
Centre for Cognitive Science, University of Edinburgh
2 Buccleuch Place, Edinburgh EH8 9LW, U.K.
hoffman@cogsci.ed.ac.uk

Abstract

In this paper, I discuss machine translation of English text into a relatively "free" word order language, specifically Turkish. I present algorithms that use contextual information to determine what the topic and the focus of each sentence should be, in order to generate the contextually appropriate word orders in the target language.

1 Introduction

Languages such as Catalan, Czech, Finnish, German, Hungarian, Japanese, Polish, Russian, Turkish, etc. have much freer word order than English. For example, all six permutations of a transitive sentence are grammatical in Turkish (although SOV is the most common). When we translate an English text into a "free" word order language, we are faced with a choice between many different word orders that are all syntactically grammatical but are not all felicitous or contextually appropriate. In this paper, I discuss machine translation (MT) of English text into Turkish and concentrate on how to generate the appropriate word order in the target language based on contextual information.

The most comprehensive project of this type is presented in (Stys/Zemke, 1995) for MT into Polish. They use the referential form and repeated mention of items in the English text in order to predict the salience of discourse entities and order the Polish sentence according to this salience ranking. They also rely on statistical data, choosing the most frequently used word orders. I argue for a more generative approach: a particular information structure (IS) can be determined from the contextual information and then can be used to generate the felicitous word order. This paper concentrates on how to determine the IS from contextual information using centering, old vs. new information, and contrastiveness. (Hajičová et al., 1993; Steinberger, 1994) present approaches that determine the IS by using cues such as word order, definiteness, and complement semantic types (e.g. temporal adjuncts vs. arguments) in the source language, English. I believe that we cannot rely upon cues in the source language in order to determine the IS of the translated text. Instead, I use contextual information in the target language to determine the IS of sentences in the target language.

In section 2, I discuss the Information Structure, and specifically the topic and the focus, in naturally occurring Turkish data. Then, in section 3, I present algorithms for determining the topic and the focus, and show that we can generate contextually appropriate word orders in Turkish using these algorithms in a simple MT implementation.

2 Information Structure

In the Information Structure (IS) that I use for Turkish, a sentence is first divided into a topic and a comment. The topic is the main element that the sentence is about, and the comment is the information conveyed about this topic. Within the comment, we find the focus, the most information-bearing constituent in the sentence, and the ground, the rest of the sentence. The focus is the new or important information in the sentence and receives prosodic prominence in speech. In Turkish, the pragmatic function of topic is assigned to the sentence-initial position and the focus to the immediately preverbal position, following (Erguvanlı, 1984). The rest of the sentence forms the ground.

In (Hoffman, 1995; Hoffman, 1995b), I show that the information structure components of topic and focus can be successfully used in generating the context-appropriate answer to database queries. Determining the topic and focus is fairly easy in the context of a simple question; however, it is much more complicated in a text.

The Cb in SOV sentences:
  Cb = Subject             14 (47%)
  Cb = Object               6 (20%)
  Cb = Subj or Obj?         6 (20%)
  Cb = Subj or Other Obj?   0 (0%)
  No Cb                     4 (13%)
  TOTAL                    30

The Cb in OSV sentences:
  Cb = Subject              4 (13%)
  Cb = Object              16 (53%)
  Cb = Subj or Obj?         6 (20%)
  Cb = Subj or Other Obj?   2 (7%)
  No Cb                     2 (7%)
  TOTAL                    30

Figure 1: The Cb in SOV and OSV Sentences.

In the following sections, I will describe the characteristics of the topic, focus, and ground components of the IS in naturally occurring texts analyzed in (Hoffman, 1995b) and allude to possible algorithms for determining them. The algorithms will then be spelled out in section 3.

An example text from the corpus[1] is shown below. The noncanonical OSV word order in (1)b is contextually appropriate because the object pronoun is a discourse-old topic that links the sentence to the previous context, and the subject, "your father", is a discourse-new focus that is being contrasted with other relatives. Discourse-old entities are those that were previously mentioned in the discourse while discourse-new entities are those that were not (Prince, 1992).

(1) a. Bu defteri de çok sevdim ben.
       This notebook-Acc too much like-Past-1S I.
       'As for this notebook, I like it very much.'
    b. Bunu da baban mı verdi? (OSV)
       This-Acc too father-2S Quest give-Past?
       'Did your FATHER give this to you?'
       (CHILDES lba.cha)

Many people have suggested that "free" word order languages order information from old to new information. However, the Old-to-New ordering principle is a generalization to which exceptions can be found. I believe that the order in which speakers place old vs. new items in a sentence reflects the information structures that are available to the speakers. The ordering is actually the Topic followed by the Focus. The Topic tends to be discourse-old information and the Focus discourse-new. However, it is possible to have a discourse-NEW topic and a discourse-OLD focus, as we will see in the following sections, which explains the exceptions to the Old-to-New ordering principle.

2.1 Topic

Although humans can intuitively determine what the topic of a sentence is, the traditional definition (what the sentence is about) is too vague to be implemented in a computational system. I propose heuristics based on familiarity and salience to determine discourse-old sentence topics, and heuristics based on grammatical relations for discourse-new topics. Speakers can shift to a new topic at the start of a new discourse segment, as in (2)a. Or they can continue talking about the same discourse-old topic, as in (2)b.

(2) a. [Mary]T went to the bookstore.
    b. [She]T bought a new book on ...

A discourse-old topic often serves to link the sentence to the previous context by evoking a familiar and salient discourse entity. Centering Theory (Grosz et al., 1995) provides a measure of saliency based on the observations that salient discourse entities are often mentioned repeatedly within a discourse segment and are often realized as pronouns. (Turan, 1995) provides a comprehensive study of null and overt subjects in Turkish using Centering Theory, and I investigate the interaction between word order and Centering in Turkish in (Hoffman, 1996).

In the Centering Algorithm, each utterance in a discourse is associated with a ranked list of discourse entities called the forward-looking centers (Cf list) that contains every discourse entity that is realized in that utterance. The Cf list is usually ranked according to a hierarchy of grammatical relations, e.g. subjects are assumed to be more salient than objects. The backward-looking center (Cb) is the most salient member of the Cf list that links the current utterance to the previous utterance. The Cb of an utterance is defined as the highest ranked element of the previous utterance's Cf list that also occurs in the current utterance. If there is a pronoun in the sentence, it is likely to be the Cb. As we will see, the Cb has much in common with a sentence-topic.

[1] The data was collected from transcribed conversations, contemporary novels, and adult speech from the CHILDES corpus.
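The Cb definition above is simple to operationalize. The following sketch is my own code (the function and variable names are invented): it walks the previous utterance's ranked Cf list and returns the first entity that is also realized in the current utterance.

```python
def backward_looking_center(prev_cf, current_entities):
    """Cb: the highest-ranked element of the previous utterance's
    Cf list that is also realized in the current utterance."""
    for entity in prev_cf:            # prev_cf is ranked, most salient first
        if entity in current_entities:
            return entity
    return None                       # no Cb links this utterance to the last

# The Cf list is ranked by grammatical relation, subject first:
prev_cf = ["Pat", "Chris", "book"]
print(backward_looking_center(prev_cf, {"book", "Mary"}))  # -> book
```

If no entity from the previous Cf list recurs, there is no Cb, which is exactly the situation where a discourse-new topic is licensed (section 3.1).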

                        S-init (SOV, OSV)   IPV (SOV, OSV)   Post-V (OVS, SVO)
  Discourse-Old         55 (85%)            43 (67%)         56 (93%)
  Inferrable             8 (13%)            10 (16%)          4 (7%)
  D-New, Hearer-Old      1 (2%)              1 (2%)           0
  * D-New, Hearer-New    0                  10 (15%)          0
  TOTAL                 64                  64               60

Figure 2: Given/New Status in Different Sentence Positions

The Cb analyses of the canonical SOV and the noncanonical OSV word orders in Turkish are summarized in Figure 1 (forthcoming study in (Hoffman, 1996)). As expected, the subject is often the Cb in the SOV sentences. However, in the OSV sentences, the object, not the subject, is most often the Cb of the utterance. A comparison of the 20 discourses in the first two rows[2] of the tables in Figure 1 using the chi-square test shows that the association between sentence-position and Cb is statistically significant (chi-square = 10.10, p < 0.001).[3] Thus, the Cb, when it is not dropped, is often placed in the sentence-initial topic position in Turkish regardless of whether it is the subject or the object of the sentence. The intuitive reason for this is that speakers want to form a coherent discourse by immediately linking each sentence to the previous ones by placing the Cb and discourse-old topic in the sentence-initial position.

There are also situations where no Cb or discourse-old topic can be found. Then, a discourse-new topic can be placed in the sentence-initial position to start a new discourse segment. Discourse-new topics are often subjects or situation-setting adverbs (e.g. yesterday, in the morning, in the garden) in Turkish.

2.2 Focus

The term focus has been used with many different meanings. Focusing is often associated with new information, but it is well-known that old information, for example pronouns, can be focused as well. I think part of the confusion lies in the distinction between contrastive and presentational focus. Focusing discourse-new information is often called presentational or informational focus, as shown in (3)a. Broad/wide focus (focus projection) is also possible, where only the rightmost element is accented but the whole phrase is in focus. However, we can also use focusing in order to contrast one item with another, and in this case the focus can be discourse-old or discourse-new, e.g. (3)b.

(3) a. What did Mary do this summer?
       She [wandered around TURKEY]F.
    b. It wasn't [ME]F, - It was [HER]F.

(Vallduví, 1992) defines focus as the most information-bearing constituent, and this definition encompasses both contrastive and presentational focusing. I use this definition of focus as well. However, as we will see, we still need two different algorithms in order to determine which items are in focus in the target sentence in MT. We must check to see if they are discourse-new information as well as checking if they are being contrasted with another item in the discourse model.

In Turkish, items that are presentationally or contrastively focused are placed in the immediately preverbal (IPV) position and receive the primary accent of the phrase.[4] As seen in Figure 2, brand-new discourse entities are found in the IPV position, but never in other positions in the sentence in my Turkish corpus. The distribution of brand-new (the starred line of the table) versus discourse-old information (the rest of the table[5]) is statistically significant (chi-square = 10.847, p < .001). This supports the association of discourse-new focus with the IPV position.
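The chi-square value reported for Figure 1 can be reproduced from the first two rows of that figure (Cb = Subject vs. Cb = Object, in SOV vs. OSV sentences). This quick check is my own code, not part of the paper's implementation:

```python
def chi_square(table):
    """Pearson chi-square statistic for a 2x2 contingency table
    (no continuity correction)."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    n = sum(rows)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = rows[i] * cols[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat

observed = [[14, 6],   # SOV sentences: Cb = Subject 14, Cb = Object 6
            [4, 16]]   # OSV sentences: Cb = Subject 4,  Cb = Object 16
print(round(chi_square(observed), 2))  # -> 10.1
```

The statistic matches the 10.10 reported in the text, confirming that the two word orders distribute their Cbs differently.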

[2] The centering analysis is inconclusive in some cases because the subject and the object in the sentence are realized with the same referential form (e.g. both as overt pronouns or as full NPs).

[3] Alternatively, using the canonical SOV sentences as the expected frequencies, the observed frequencies for the noncanonical OSV sentences significantly diverge from the expected frequencies (chi-square = 8.8, p < 0.005).

[4] Some languages such as Greek and Russian treat presentational and contrastive focus differently in word order.

[5] Inferrables refer to entities that the hearer can easily accommodate based on entities already in the discourse model or the situation. Hearer-old entities are well-known to the speaker and hearer but not necessarily mentioned in the prior discourse (Prince, 1992). They both behave like discourse-old entities.

However, as can be seen in Figure 2, most of the focused subjects in the OSV sentences in my corpus were actually discourse-old information. Discourse-old entities that occur in the IPV position are contrastively focused. In (Rooth, 1985)'s alternative-set theory, a contrastively focused item is interpreted by constructing a set of alternatives from which the focused item must be distinguished. Generalizing from his work, we can determine whether an entity should be contrastively focused by seeing if we can construct an alternative set from the discourse model.

2.3 Ground

Those items that do not play a role in the IS of the sentence as the topic or the focus form the ground of the sentence. In Turkish, discourse-old information that is not the topic or the focus can be

(4) a. dropped,
    b. postposed to the right of the verb,
    c. or placed unstressed between the topic and the focus.

Postposing plays a backgrounding function in Turkish, and it is very common. Often, speakers will drop only those items that are very salient (e.g. mentioned just in the previous sentence) and postpose the rest of the discourse-old items. However, the conditions for dropping arguments can be very complex. (Turan, 1995) shows that there are semantic considerations; for instance, generic objects are often dropped, but specific objects are often realized as overt pronouns and fronted. Thus, the conditions governing dropping and postposing are areas that require more research.

3 The Implementation

In order to simplify the MT implementation, I concentrate on translating short and simple English texts into Turkish, using an interlingua representation where concepts in the semantic representation map onto at most one word in the English or Turkish lexicon. The translation proceeds sentence by sentence (leaving aside questions of aggregation, etc.), but contextual information is used during the incremental generation of the target text. These simplifications allow me to test out the algorithms for determining the topic and the focus presented in this section.

In the implementation, first, an English sentence is parsed with a Combinatory Categorial Grammar, CCG (Steedman, 1985). The semantic representation is then sent to the sentence planner for Turkish. The sentence planner uses the algorithms in the following subsections to determine the topic, focus, and ground from the given semantic representation and the discourse model. Then, the sentence planner sends the semantic representation and the information structure it has determined to the sentence realization component for Turkish. This component consists of a head-driven bottom-up generation algorithm that uses the semantic as well as the information structure features given by the planner to choose an appropriate head in the lexicon. The grammar used for the generation of Turkish is a lexicalist formalism called Multiset-CCG (Hoffman, 1995; Hoffman, 1995b), an extension of CCGs. Multiset-CCG was developed in order to capture formal and descriptive properties of "free" and restricted word order in simple and complex sentences (with discontinuous constituents and long distance dependencies). Multiset-CCG captures the context-dependent meaning of word order in Turkish by compositionally deriving the predicate-argument structure and the information structure of a sentence in parallel.

The following sections describe the algorithms used by the sentence planner to determine the IS of the Turkish sentence, given the semantic representation of a parsed English sentence.

3.1 The Topic Algorithm

As each sentence is translated, we update the discourse model and keep track of the forward-looking centers list (Cf list) of the last processed sentence. This is simply a list of all the discourse entities realized in that sentence, ranked according to the theta-role hierarchy found in the semantic representation. Thus, the Cf list for the representation give(Pat, Chris, book) is the ranked list [Pat, Chris, book], where the subject is assumed to be more salient than the objects.

Given the semantic representation for the sentence, the discourse model of the text processed so far, and the ranked Cf lists of the current and previous sentences in the discourse, the following algorithm determines the topic of the sentence. First, the algorithm tries to choose the most salient discourse-old entity as the sentence topic.[6] If there is no discourse-old entity realized in the sentence, then a situation-setting adverb or the subject is chosen as the discourse-new topic.

1. Compare the current Cf list with the previous sentence's Cf list and choose the first item that is a member of both of the ranked lists (the Cb).

[6] (Stys/Zemke, 1995) use the saliency ranking to order the whole sentence in Polish. However, I believe that there is a distinct notion of topic and focus in Turkish.

2. If 1 fails: Choose the first item in the current sentence's Cf list that is discourse-old (i.e. is already in the discourse model).

3. If 2 fails: If there is a situation-setting adverb in the semantic representation (i.e. a predicate modifying the main event in the representation), choose it as the discourse-new topic.

4. If 3 fails: Choose the first item in the Cf list (i.e. the subject) as the discourse-new topic.

Note that the determination of the sentence topic is distinct from the question of how to realize the salient Cb/topic (e.g. as a dropped or overt pronoun or full NP). In the MT domain, this can be determined by the referential form in the source text. This trick can also be used for accommodating inferrable or hearer-old entities that behave as if they are discourse-old even though they are literally discourse-new. If an item that is not in the discourse model is nonetheless realized as a definite NP in the source text, the speaker is treating the entity as discourse-old. This is very similar to (Stys/Zemke, 1995)'s MT system, which uses the referential form in the source text to predict the topicality of a phrase in the target text.

3.2 The Focus Algorithm

Given the rest of the semantic representation for the sentence and the discourse model of the text processed so far, the following algorithm determines the focus of the sentence. The first step is to determine presentational focusing of discourse-new information. Note that the focus, unlike the topic, can contain more than one element; this allows broad focus as well as narrow focusing. If there is no discourse-new information, the second step in the algorithm allows contrastive focusing of discourse-old information. In order to construct the alternative sets, a small knowledge base is used to determine the semantic type (agent, object, or event) of the entities in the discourse model.

1. If there are any discourse-new entities (i.e. not in the discourse model) in the sentence, put their semantic representations into focus.

2. Else, for each discourse entity realized in the sentence,
   (a) Look up its semantic type in the KB and construct an alternative set that consists of all objects of that type in the discourse model,
   (b) If the constructed alternative set is not empty, put the discourse entity's semantic representation into the focus.

Once the topic and focus are determined, the remainder of the semantic representation is assigned as the ground. For now, items in the ground are either generated in between the topic and the focus or postposed behind the verb as backgrounded information. Further research is needed to disambiguate the use of the two possible word orders. Further research is also needed on the exact role of verbs in the IS. Verbs can be in the focus or the ground in Turkish; this cannot be seen in the word order, but it is distinguished by sentential stress for narrow focus readings. The algorithm above works for verbs, since I place events that are realized as verbs in the sentence into the discourse model as well. However, verbs are usually not in focus unless they are surprising or contrastive or in a discourse-initial context. Thus, the algorithm needs to be extended to accommodate discourse-new verbs that are nonetheless expected in some way into the ground component. In addition, verbs often participate in broad focus readings, and further research is needed to account for the observation that broad focus readings are only available in canonical word orders.

3.3 Examples

The English text in (5) is translated using the word orders in (6) following the algorithms given above. In (6), the numbers following T and F indicate the step in the respective algorithm which determined the topic or focus for that sentence. Note that the inappropriate word orders (indicated by #) cannot be generated by the algorithm.

(5) a. Pat will meet Chris today.
    b. There is a talk at four.
    c. Chris is giving the talk.
    d. Pat cannot come.

(6) a. Bugün Pat Chris'le buluşacak. (AdvSOV)
       Today Pat Chris-with meet-Fut. (T:3, F:1)
    b. Dörtte bir konuşma var. (AdvSV, #SAdvV)
       Four-Loc one talk exist. (T:3, F:1)
    c. Konuşmayı Chris veriyor. (OSV, #SOV)
       Talk-Acc Chris give-Prog. (T:1, F:2)
    d. Pat gelemiyecek. (SV, #VS)
       Pat come-Neg-Fut. (T:2, F:1 for the verb)
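Sections 3.1 and 3.2 can be summarized in executable form. The sketch below is my own reading of the two algorithms; the function and variable names are invented, and the real system operates on semantic representations rather than strings:

```python
def choose_topic(cf_current, cf_previous, discourse_model, situation_adverb=None):
    """Topic algorithm, steps 1-4."""
    # 1. The Cb: highest-ranked item of the previous Cf list realized here.
    for entity in cf_previous:
        if entity in cf_current:
            return entity
    # 2. Otherwise the most salient discourse-old entity in the sentence.
    for entity in cf_current:
        if entity in discourse_model:
            return entity
    # 3. Otherwise a situation-setting adverb, if the representation has one.
    if situation_adverb is not None:
        return situation_adverb
    # 4. Otherwise the first item of the Cf list (the subject), discourse-new.
    return cf_current[0]

def choose_focus(entities, discourse_model, semantic_type):
    """Focus algorithm: presentational focus on discourse-new entities,
    else contrastive focus wherever an alternative set can be built."""
    new = [e for e in entities if e not in discourse_model]
    if new:                          # step 1: presentational (possibly broad) focus
        return new
    focus = []                       # step 2: contrastive focus via alternative sets
    for e in entities:
        alternatives = [d for d in discourse_model
                        if d != e and semantic_type(d) == semantic_type(e)]
        if alternatives:
            focus.append(e)
    return focus

# Sentence (5)c "Chris is giving the talk", after (5)a-b have been processed:
model = {"Pat": "agent", "Chris": "agent", "talk": "object"}
topic = choose_topic(["Chris", "talk"], ["talk"], model)
rest = [e for e in ["Chris", "talk"] if e != topic]
focus = choose_focus(rest, model, lambda e: model[e])
print(topic, focus)  # -> talk ['Chris']  (cf. (6)c: OSV order, T:1, F:2)
```

"talk" is chosen as topic because it is the Cb (step 1), and "Chris" is contrastively focused because the discourse model supplies an alternative agent, Pat (step 2), yielding the OSV order of example (6)c.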

The algorithms can also utilize long distance scrambling in Turkish, i.e. constructions where an element of an embedded clause has been extracted and scrambled into the matrix clause in order to play a role in the IS of the matrix clause. For example, the b sentence in the following text is translated using long distance scrambling because "the talk" is the Cb of the utterance and therefore the best sentence topic, even though it is the argument of an embedded clause.

(7) a. There is a talk at four.
    b. Pat thinks that Chris will give the talk.

(8) a. Dörtte bir konuşma var. (AdvSV)
       Four-Loc one talk exist.
    b. Konuşmayı_i Pat [Chris'in e_i vereceğini] sanıyor. (O2 S1 [S2 V2] V1)
       Talk-Acc_i Pat [Chris-Gen e_i give-Ger-3S-Acc] think-Prog. (T:1, F:1)

4 Conclusions

In the machine translation task from English into a "free" word order language, it is crucial to choose the contextually appropriate word order in the target language. In this paper, I discussed how to determine the appropriate word order using contextual information in translating into Turkish. I presented algorithms for determining the topic and the focus of the sentence. These algorithms are sensitive to whether the information is old or new in the discourse model (incrementally constructed from the translated text) and whether it is being contrasted with another item in the discourse model. Once the information structure for a semantic representation is constructed using these algorithms, the sentence with the contextually appropriate word order is generated in the target language using Multiset-CCG, a grammar which integrates syntax and information structure.

References

Eva Hajičová, Petr Sgall, and Hana Skoumalová. 1993. Identifying Topic and Focus by an Automatic Procedure. Proceedings of the Sixth Conference of the European Chapter of the Association for Computational Linguistics.

Beryl Hoffman. 1995. Integrating "Free" Word Order Syntax and Information Structure. Proceedings of the European Chapter of the Association for Computational Linguistics (EACL).

Beryl Hoffman. 1995b. The Computational Analysis of the Syntax and Interpretation of "Free" Word Order in Turkish. Ph.D. dissertation, IRCS Technical Report 95-17, Dept. of Computer and Information Science, University of Pennsylvania.

Beryl Hoffman. to appear, 1996. Word Order, Information Structure, and Centering in Turkish. In Centering in Discourse, eds. Ellen Prince, Aravind Joshi, and Marilyn Walker. Oxford University Press.

Ellen F. Prince. 1992. The ZPG Letter: Subjects, Definiteness and Information Status. In Discourse description: diverse analyses of a fund raising text, eds. Thompson, S. and Mann, W. Philadelphia: John Benjamins B.V., pp. 295-325.

Mats Rooth. 1985. Association with Focus. Ph.D. dissertation, University of Massachusetts, Amherst.

Mark Steedman. 1985. Dependencies and coordination in the grammar of Dutch and English. Language, 61:523-568.

Ralf Steinberger. 1994. Treating 'Free Word Order' in Machine Translation. Proceedings of COLING.

Małgorzata Styś and Stefan Zemke. 1995. Incorporating Discourse Aspects in English-Polish MT: Towards Robust Implementation. Recent Advances in NLP.

Ümit Deniz Turan. 1995. Null vs. Overt Subjects in Turkish Discourse: A Centering Analysis. Ph.D. dissertation, Linguistics, University of Pennsylvania.

Enric Vallduví. 1992. The Informational Component. New York: Garland.

Eser Emine Erguvanlı. 1984. The Function of Word Order in Turkish Grammar. University of California Press.

Barbara Grosz, Aravind K. Joshi, and Scott Weinstein. 1995. Centering: A Framework for Modeling the Local Coherence of Discourse. Computational Linguistics.
