
Translating into Free Word Order Languages

Beryl Hoffman
Centre for Cognitive Science, University of Edinburgh
2 Buccleuch Place, Edinburgh EH8 9LW, U.K.
hoffman@cogsci.ed.ac.uk

Abstract

In this paper, I discuss machine translation of English text into a relatively "free" word order language, specifically Turkish. I present algorithms that use contextual information to determine what the topic and the focus of each sentence should be, in order to generate the contextually appropriate word orders in the target language.

1 Introduction

Languages such as Catalan, Czech, Finnish, German, Hungarian, Japanese, Polish, Russian, Turkish, etc. have much freer word order than English. For example, all six permutations of a transitive sentence are grammatical in Turkish (although SOV is the most common). When we translate an English text into a "free" word order language, we are faced with a choice between many different word orders that are all syntactically grammatical but are not all felicitous or contextually appropriate. In this paper, I discuss machine translation (MT) of English text into Turkish and concentrate on how to generate the appropriate word order in the target language based on contextual information.

The most comprehensive project of this type is presented in (Stys/Zemke, 1995) for MT into Polish. They use the referential form and repeated mention of items in the English text in order to predict the salience of discourse entities and order the Polish sentence according to this salience ranking. They also rely on statistical data, choosing the most frequently used word orders. I argue for a more generative approach: a particular information structure (IS) can be determined from the contextual information and then can be used to generate the felicitous word order. This paper concentrates on how to determine the IS from contextual information using centering, old vs. new information, and contrastiveness. (Hajičová et al., 1993; Steinberger, 1994) present approaches that determine the IS by using cues such as word order, definiteness, and complement semantic types (e.g. temporal adjuncts vs. arguments) in the source language, English. I believe that we cannot rely upon cues in the source language in order to determine the IS of the translated text. Instead, I use contextual information in the target language to determine the IS of sentences in the target language.

In section 2, I discuss the Information Structure, and specifically the topic and the focus, in naturally occurring Turkish data. Then, in section 3, I present algorithms for determining the topic and the focus, and show that we can generate contextually appropriate word orders in Turkish using these algorithms in a simple MT implementation.

2 Information Structure

In the Information Structure (IS) that I use for Turkish, a sentence is first divided into a topic and a comment. The topic is the main element that the sentence is about, and the comment is the information conveyed about this topic. Within the comment, we find the focus, the most information-bearing constituent in the sentence, and the ground, the rest of the sentence. The focus is the new or important information in the sentence and receives prosodic prominence in speech. In Turkish, the pragmatic function of topic is assigned to the sentence-initial position and the focus to the immediately preverbal position, following (Erguvanlı, 1984). The rest of the sentence forms the ground.

In (Hoffman, 1995; Hoffman, 1995b), I show that the information structure components of topic and focus can be successfully used in generating the context-appropriate answer to database queries. Determining the topic and focus is fairly easy in the context of a simple question; however, it is much more complicated in a text.

The Cb in SOV sentences:
  Cb = Subject             14 (47%)
  Cb = Object               6 (20%)
  Cb = Subj or Obj?         6 (20%)
  Cb = Subj or Other Obj?   0 (0%)
  No Cb                     4 (13%)
  TOTAL                    30

The Cb in OSV sentences:
  Cb = Subject              4 (13%)
  Cb = Object              16 (53%)
  Cb = Subj or Obj?         6 (20%)
  Cb = Subj or Other Obj?   2 (7%)
  No Cb                     2 (7%)
  TOTAL                    30

Figure 1: The Cb in SOV and OSV Sentences.

In the following sections, I will describe the characteristics of the topic, focus, and ground components of the IS in naturally occurring texts analyzed in (Hoffman, 1995b) and allude to possible algorithms for determining them. The algorithms will then be spelled out in section 3.

An example text from the corpus[1] is shown below. The noncanonical OSV word order in (1)b is contextually appropriate because the object pronoun is a discourse-old topic that links the sentence to the previous context, and the subject, "your father", is a discourse-new focus that is being contrasted with other relatives. Discourse-old entities are those that were previously mentioned in the discourse while discourse-new entities are those that were not (Prince, 1992).

(1) a. Bu defteri de çok sevdim ben.
       This notebook-Acc too much like-Past-1S I.
       'As for this notebook, I like it very much.'
    b. Bunu da baban mı verdi? (OSV)
       This-Acc too father-2S Quest give-Past?
       'Did your FATHER give this to you?'
       (CHILDES lba.cha)

Many people have suggested that "free" word order languages order information from old to new information. However, the Old-to-New ordering principle is a generalization to which exceptions can be found. I believe that the order in which speakers place old vs. new items in a sentence reflects the information structures that are available to the speakers. The ordering is actually the Topic followed by the Focus. The Topic tends to be discourse-old information and the Focus discourse-new. However, it is possible to have a discourse-NEW topic and a discourse-OLD focus, as we will see in the following sections, which explains the exceptions to the Old-to-New ordering principle.

2.1 Topic

Although humans can intuitively determine what the topic of a sentence is, the traditional definition (what the sentence is about) is too vague to be implemented in a computational system. I propose heuristics based on familiarity and salience to determine discourse-old sentence topics, and heuristics based on grammatical relations for discourse-new topics. Speakers can shift to a new topic at the start of a new discourse segment, as in (2)a. Or they can continue talking about the same discourse-old topic, as in (2)b.

(2) a. [Mary]T went to the bookstore.
    b. [She]T bought a new book on ...

A discourse-old topic often serves to link the sentence to the previous context by evoking a familiar and salient discourse entity. Centering Theory (Grosz et al., 1995) provides a measure of saliency based on the observations that salient discourse entities are often mentioned repeatedly within a discourse segment and are often realized as pronouns. (Turan, 1995) provides a comprehensive study of null and overt subjects in Turkish using Centering Theory, and I investigate the interaction between word order and Centering in Turkish in (Hoffman, 1996).

In the Centering Algorithm, each utterance in a discourse is associated with a ranked list of discourse entities called the forward-looking centers (Cf list) that contains every discourse entity that is realized in that utterance. The Cf list is usually ranked according to a hierarchy of grammatical relations, e.g. subjects are assumed to be more salient than objects. The backward-looking center (Cb) is the most salient member of the Cf list that links the current utterance to the previous utterance. The Cb of an utterance is defined as the highest ranked element of the previous utterance's Cf list that also occurs in the current utterance. If there is a pronoun in the sentence, it is likely to be the Cb. As we will see, the Cb has much in common with a sentence-topic.

[1] The data was collected from transcribed conversations, contemporary novels, and adult speech from the CHILDES corpus.
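The Cb definition above is simple to operationalize. The following sketch is my own code (the function and variable names are invented): it walks the previous utterance's ranked Cf list and returns the first entity that is also realized in the current utterance.

```python
def backward_looking_center(prev_cf, current_entities):
    """Cb: the highest-ranked element of the previous utterance's
    Cf list that is also realized in the current utterance."""
    for entity in prev_cf:            # prev_cf is ranked, most salient first
        if entity in current_entities:
            return entity
    return None                       # no Cb links this utterance to the last

# The Cf list is ranked by grammatical relation, subject first:
prev_cf = ["Pat", "Chris", "book"]
print(backward_looking_center(prev_cf, {"book", "Mary"}))  # -> book
```

If no entity from the previous Cf list recurs, there is no Cb, which is exactly the situation where a discourse-new topic is licensed (section 3.1).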

                        S-init (SOV, OSV)   IPV (SOV, OSV)   Post-V (OVS, SVO)
  Discourse-Old         55 (85%)            43 (67%)         56 (93%)
  Inferrable             8 (13%)            10 (16%)          4 (7%)
  D-New, Hearer-Old      1 (2%)              1 (2%)           0
  * D-New, Hearer-New    0                  10 (15%)          0
  TOTAL                 64                  64               60

Figure 2: Given/New Status in Different Sentence Positions

The Cb analyses of the canonical SOV and the noncanonical OSV word orders in Turkish are summarized in Figure 1 (forthcoming study in (Hoffman, 1996)). As expected, the subject is often the Cb in the SOV sentences. However, in the OSV sentences, the object, not the subject, is most often the Cb of the utterance. A comparison of the 20 discourses in the first two rows[2] of the tables in Figure 1 using the chi-square test shows that the association between sentence-position and Cb is statistically significant (chi-square = 10.10, p < 0.001).[3] Thus, the Cb, when it is not dropped, is often placed in the sentence-initial topic position in Turkish regardless of whether it is the subject or the object of the sentence. The intuitive reason for this is that speakers want to form a coherent discourse by immediately linking each sentence to the previous ones by placing the Cb and discourse-old topic in the sentence-initial position.

There are also situations where no Cb or discourse-old topic can be found. Then, a discourse-new topic can be placed in the sentence-initial position to start a new discourse segment. Discourse-new topics are often subjects or situation-setting adverbs (e.g. yesterday, in the morning, in the garden) in Turkish.

2.2 Focus

The term focus has been used with many different meanings. Focusing is often associated with new information, but it is well-known that old information, for example pronouns, can be focused as well. I think part of the confusion lies in the distinction between contrastive and presentational focus. Focusing discourse-new information is often called presentational or informational focus, as shown in (3)a. Broad/wide focus (focus projection) is also possible, where only the rightmost element is accented but the whole phrase is in focus. However, we can also use focusing in order to contrast one item with another, and in this case the focus can be discourse-old or discourse-new, e.g. (3)b.

(3) a. What did Mary do this summer?
       She [wandered around TURKEY]F.
    b. It wasn't [ME]F, - It was [HER]F.

(Vallduví, 1992) defines focus as the most information-bearing constituent, and this definition encompasses both contrastive and presentational focusing. I use this definition of focus as well. However, as we will see, we still need two different algorithms in order to determine which items are in focus in the target sentence in MT. We must check to see if they are discourse-new information as well as checking if they are being contrasted with another item in the discourse model.

In Turkish, items that are presentationally or contrastively focused are placed in the immediately preverbal (IPV) position and receive the primary accent of the phrase.[4] As seen in Figure 2, brand-new discourse entities are found in the IPV position, but never in other positions in the sentence in my Turkish corpus. The distribution of brand-new (the starred line of the table) versus discourse-old information (the rest of the table[5]) is statistically significant (chi-square = 10.847, p < .001). This supports the association of discourse-new focus with the IPV position.
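The chi-square value reported for Figure 1 can be reproduced from the first two rows of that figure (Cb = Subject vs. Cb = Object, in SOV vs. OSV sentences). This quick check is my own code, not part of the paper's implementation:

```python
def chi_square(table):
    """Pearson chi-square statistic for a 2x2 contingency table
    (no continuity correction)."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    n = sum(rows)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = rows[i] * cols[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat

observed = [[14, 6],   # SOV sentences: Cb = Subject 14, Cb = Object 6
            [4, 16]]   # OSV sentences: Cb = Subject 4,  Cb = Object 16
print(round(chi_square(observed), 2))  # -> 10.1
```

The statistic matches the 10.10 reported in the text, confirming that the two word orders distribute their Cbs differently.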

[2] The centering analysis is inconclusive in some cases because the subject and the object in the sentence are realized with the same referential form (e.g. both as overt pronouns or as full NPs).

[3] Alternatively, using the canonical SOV sentences as the expected frequencies, the observed frequencies for the noncanonical OSV sentences significantly diverge from the expected frequencies (chi-square = 8.8, p < 0.005).

[4] Some languages such as Greek and Russian treat presentational and contrastive focus differently in word order.

[5] Inferrables refer to entities that the hearer can easily accommodate based on entities already in the discourse model or the situation. Hearer-old entities are well-known to the speaker and hearer but not necessarily mentioned in the prior discourse (Prince, 1992). They both behave like discourse-old entities.

However, as can be seen in Figure 2, most of the focused subjects in the OSV sentences in my corpus were actually discourse-old information. Discourse-old entities that occur in the IPV position are contrastively focused. In (Rooth, 1985)'s alternative-set theory, a contrastively focused item is interpreted by constructing a set of alternatives from which the focused item must be distinguished. Generalizing from his work, we can determine whether an entity should be contrastively focused by seeing if we can construct an alternative set from the discourse model.

2.3 Ground

Those items that do not play a role in the IS of the sentence as the topic or the focus form the ground of the sentence. In Turkish, discourse-old information that is not the topic or the focus can be

(4) a. dropped,
    b. postposed to the right of the verb,
    c. or placed unstressed between the topic and the focus.

Postposing plays a backgrounding function in Turkish, and it is very common. Often, speakers will drop only those items that are very salient (e.g. mentioned just in the previous sentence) and postpose the rest of the discourse-old items. However, the conditions for dropping arguments can be very complex. (Turan, 1995) shows that there are semantic considerations; for instance, generic objects are often dropped, but specific objects are often realized as overt pronouns and fronted. Thus, the conditions governing dropping and postposing are areas that require more research.

3 The Implementation

In order to simplify the MT implementation, I concentrate on translating short and simple English texts into Turkish, using an interlingua representation where concepts in the semantic representation map onto at most one word in the English or Turkish lexicon. The translation proceeds sentence by sentence (leaving aside questions of aggregation, etc.), but contextual information is used during the incremental generation of the target text. These simplifications allow me to test out the algorithms for determining the topic and the focus presented in this section.

In the implementation, first, an English sentence is parsed with a Combinatory Categorial Grammar, CCG (Steedman, 1985). The semantic representation is then sent to the sentence planner for Turkish. The sentence planner uses the algorithms in the following subsections to determine the topic, focus, and ground from the given semantic representation and the discourse model. Then, the sentence planner sends the semantic representation and the information structure it has determined to the sentence realization component for Turkish. This component consists of a head-driven bottom-up generation algorithm that uses the semantic as well as the information structure features given by the planner to choose an appropriate head in the lexicon. The grammar used for the generation of Turkish is a lexicalist formalism called Multiset-CCG (Hoffman, 1995; Hoffman, 1995b), an extension of CCGs. Multiset-CCG was developed in order to capture formal and descriptive properties of "free" and restricted word order in simple and complex sentences (with discontinuous constituents and long distance dependencies). Multiset-CCG captures the context-dependent meaning of word order in Turkish by compositionally deriving the predicate-argument structure and the information structure of a sentence in parallel.

The following sections describe the algorithms used by the sentence planner to determine the IS of the Turkish sentence, given the semantic representation of a parsed English sentence.

3.1 The Topic Algorithm

As each sentence is translated, we update the discourse model and keep track of the forward-looking centers list (Cf list) of the last processed sentence. This is simply a list of all the discourse entities realized in that sentence, ranked according to the theta-role hierarchy found in the semantic representation. Thus, the Cf list for the representation give(Pat, Chris, book) is the ranked list [Pat, Chris, book], where the subject is assumed to be more salient than the objects.

Given the semantic representation for the sentence, the discourse model of the text processed so far, and the ranked Cf lists of the current and previous sentences in the discourse, the following algorithm determines the topic of the sentence. First, the algorithm tries to choose the most salient discourse-old entity as the sentence topic.[6] If there is no discourse-old entity realized in the sentence, then a situation-setting adverb or the subject is chosen as the discourse-new topic.

1. Compare the current Cf list with the previous sentence's Cf list and choose the first item that is a member of both of the ranked lists (the Cb).

[6] (Stys/Zemke, 1995) use the saliency ranking to order the whole sentence in Polish. However, I believe that there is a distinct notion of topic and focus in Turkish.

2. If 1 fails: Choose the first item in the current sentence's Cf list that is discourse-old (i.e. is already in the discourse model).

3. If 2 fails: If there is a situation-setting adverb in the semantic representation (i.e. a predicate modifying the main event in the representation), choose it as the discourse-new topic.

4. If 3 fails: Choose the first item in the Cf list (i.e. the subject) as the discourse-new topic.

Note that the determination of the sentence topic is distinct from the question of how to realize the salient Cb/topic (e.g. as a dropped or overt pronoun or full NP). In the MT domain, this can be determined by the referential form in the source text. This trick can also be used for accommodating inferrable or hearer-old entities that behave as if they are discourse-old even though they are literally discourse-new. If an item that is not in the discourse model is nonetheless realized as a definite NP in the source text, the speaker is treating the entity as discourse-old. This is very similar to (Stys/Zemke, 1995)'s MT system, which uses the referential form in the source text to predict the topicality of a phrase in the target text.

3.2 The Focus Algorithm

Given the rest of the semantic representation for the sentence and the discourse model of the text processed so far, the following algorithm determines the focus of the sentence. The first step is to determine presentational focusing of discourse-new information. Note that the focus, unlike the topic, can contain more than one element; this allows broad focus as well as narrow focusing. If there is no discourse-new information, the second step in the algorithm allows contrastive focusing of discourse-old information. In order to construct the alternative sets, a small knowledge base is used to determine the semantic type (agent, object, or event) of the entities in the discourse model.

1. If there are any discourse-new entities (i.e. not in the discourse model) in the sentence, put their semantic representations into focus.

2. Else, for each discourse entity realized in the sentence,
   (a) Look up its semantic type in the KB and construct an alternative set that consists of all objects of that type in the discourse model,
   (b) If the constructed alternative set is not empty, put the discourse entity's semantic representation into the focus.

Once the topic and focus are determined, the remainder of the semantic representation is assigned as the ground. For now, items in the ground are either generated in between the topic and the focus or postposed behind the verb as backgrounded information. Further research is needed to disambiguate the use of the two possible word orders. Further research is also needed on the exact role of verbs in the IS. Verbs can be in the focus or the ground in Turkish; this cannot be seen in the word order, but it is distinguished by sentential stress for narrow focus readings. The algorithm above works for verbs, since I place events that are realized as verbs in the sentence into the discourse model as well. However, verbs are usually not in focus unless they are surprising or contrastive or in a discourse-initial context. Thus, the algorithm needs to be extended to accommodate discourse-new verbs that are nonetheless expected in some way into the ground component. In addition, verbs often participate in broad focus readings, and further research is needed to account for the observation that broad focus readings are only available in canonical word orders.

3.3 Examples

The English text in (5) is translated using the word orders in (6) following the algorithms given above. In (6), the numbers following T and F indicate the step in the respective algorithm which determined the topic or focus for that sentence. Note that the inappropriate word orders (indicated by #) cannot be generated by the algorithm.

(5) a. Pat will meet Chris today.
    b. There is a talk at four.
    c. Chris is giving the talk.
    d. Pat cannot come.

(6) a. Bugün Pat Chris'le buluşacak. (AdvSOV)
       Today Pat Chris-with meet-Fut. (T:3, F:1)
    b. Dörtte bir konuşma var. (AdvSV, #SAdvV)
       Four-Loc one talk exist. (T:3, F:1)
    c. Konuşmayı Chris veriyor. (OSV, #SOV)
       Talk-Acc Chris give-Prog. (T:1, F:2)
    d. Pat gelemiyecek. (SV, #VS)
       Pat come-Neg-Fut. (T:2, F:1 for the verb)
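Sections 3.1 and 3.2 can be summarized in executable form. The sketch below is my own reading of the two algorithms; the function and variable names are invented, and the real system operates on semantic representations rather than strings:

```python
def choose_topic(cf_current, cf_previous, discourse_model, situation_adverb=None):
    """Topic algorithm, steps 1-4."""
    # 1. The Cb: highest-ranked item of the previous Cf list realized here.
    for entity in cf_previous:
        if entity in cf_current:
            return entity
    # 2. Otherwise the most salient discourse-old entity in the sentence.
    for entity in cf_current:
        if entity in discourse_model:
            return entity
    # 3. Otherwise a situation-setting adverb, if the representation has one.
    if situation_adverb is not None:
        return situation_adverb
    # 4. Otherwise the first item of the Cf list (the subject), discourse-new.
    return cf_current[0]

def choose_focus(entities, discourse_model, semantic_type):
    """Focus algorithm: presentational focus on discourse-new entities,
    else contrastive focus wherever an alternative set can be built."""
    new = [e for e in entities if e not in discourse_model]
    if new:                          # step 1: presentational (possibly broad) focus
        return new
    focus = []                       # step 2: contrastive focus via alternative sets
    for e in entities:
        alternatives = [d for d in discourse_model
                        if d != e and semantic_type(d) == semantic_type(e)]
        if alternatives:
            focus.append(e)
    return focus

# Sentence (5)c "Chris is giving the talk", after (5)a-b have been processed:
model = {"Pat": "agent", "Chris": "agent", "talk": "object"}
topic = choose_topic(["Chris", "talk"], ["talk"], model)
rest = [e for e in ["Chris", "talk"] if e != topic]
focus = choose_focus(rest, model, lambda e: model[e])
print(topic, focus)  # -> talk ['Chris']  (cf. (6)c: OSV order, T:1, F:2)
```

"talk" is chosen as topic because it is the Cb (step 1), and "Chris" is contrastively focused because the discourse model supplies an alternative agent, Pat (step 2), yielding the OSV order of example (6)c.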

The algorithms can also utilize long distance scrambling in Turkish, i.e. constructions where an element of an embedded clause has been extracted and scrambled into the matrix clause in order to play a role in the IS of the matrix clause. For example, the b sentence in the following text is translated using long distance scrambling because "the talk" is the Cb of the utterance and therefore the best sentence topic, even though it is the argument of an embedded clause.

(7) a. There is a talk at four.
    b. Pat thinks that Chris will give the talk.

(8) a. Dörtte bir konuşma var. (AdvSV)
       Four-Loc one talk exist.
    b. Konuşmayı_i Pat [Chris'in e_i vereceğini] sanıyor. (O2 S1 [S2 V2] V1)
       Talk-Acc_i Pat [Chris-Gen e_i give-Ger-3S-Acc] think-Prog. (T:1, F:1)

4 Conclusions

In the machine translation task from English into a "free" word order language, it is crucial to choose the contextually appropriate word order in the target language. In this paper, I discussed how to determine the appropriate word order using contextual information in translating into Turkish. I presented algorithms for determining the topic and the focus of the sentence. These algorithms are sensitive to whether the information is old or new in the discourse model (incrementally constructed from the translated text) and whether it is being contrasted with another item in the discourse model. Once the information structure for a semantic representation is constructed using these algorithms, the sentence with the contextually appropriate word order is generated in the target language using Multiset-CCG, a grammar which integrates syntax and information structure.

References

Eva Hajičová, Petr Sgall, and Hana Skoumalová. 1993. Identifying Topic and Focus by an Automatic Procedure. Proceedings of the Sixth Conference of the European Chapter of the Association for Computational Linguistics.

Beryl Hoffman. 1995. Integrating "Free" Word Order Syntax and Information Structure. Proceedings of the European Chapter of the Association for Computational Linguistics (EACL).

Beryl Hoffman. 1995b. The Computational Analysis of the Syntax and Interpretation of "Free" Word Order in Turkish. Ph.D. dissertation, IRCS Technical Report 95-17, Dept. of Computer and Information Science, University of Pennsylvania.

Beryl Hoffman. to appear, 1996. Word Order, Information Structure, and Centering in Turkish. In Centering in Discourse, eds. Ellen Prince, Aravind Joshi, and Marilyn Walker. Oxford University Press.

Ellen F. Prince. 1992. The ZPG Letter: Subjects, Definiteness and Information Status. In Discourse description: diverse analyses of a fund raising text, eds. Thompson, S. and Mann, W. Philadelphia: John Benjamins B.V., pp. 295-325.

Mats Rooth. 1985. Association with Focus. Ph.D. dissertation, University of Massachusetts, Amherst.

Mark Steedman. 1985. Dependencies and coordination in the grammar of Dutch and English. Language, 61:523-568.

Ralf Steinberger. 1994. Treating 'Free Word Order' in Machine Translation. Proceedings of COLING.

Małgorzata Styś and Stefan Zemke. 1995. Incorporating Discourse Aspects in English-Polish MT: Towards Robust Implementation. Recent Advances in NLP.

Ümit Deniz Turan. 1995. Null vs. Overt Subjects in Turkish Discourse: A Centering Analysis. Ph.D. dissertation, Linguistics, University of Pennsylvania.

Enric Vallduví. 1992. The Informational Component. New York: Garland.

Eser Emine Erguvanlı. 1984. The Function of Word Order in Turkish Grammar. University of California Press.

Barbara Grosz, Aravind K. Joshi, and Scott Weinstein. 1995. Centering: A Framework for Modeling the Local Coherence of Discourse. Computational Linguistics.
