An Application of the Interlingua System ISS for Spanish-English Pronominal Anaphora Generation

An Application of the Interlingua System ISS for Spanish-English Pronominal Anaphora Generation,

Jesfis Peral and Antonio Ferrfindez Research Group on Language Processing and Information Systems. Department of Software and Computing Systems. University of Alicante. 03690 San Vicente del Raspeig. Aiicante, Spain. {jperal, antonio} @dlsi.ua.es

Abstract system (Leavitt et al. (1994)), that are designed for well-defined domains. Although full parsing In this ~paper, we present the Interlingua of these texts could be applied, we have used system ISS to generate the pronominal partial parsing of the texts due to the anaphora into the Spanish and English unavoidable incompleteness of the grammar. languages. We also describe the main This is a main difference with the majority of the problems in the pronoun generation into interlingua systems such as the DLT system both languages such as zero-subject based on a modification of Esperanto (Witkam constructions and number, gender and (1983)), the Rosetta system which is syntactic differences. Our system improves experimenting with Montague semantics as the other proposals presented so far due to the basis for an interlingua (Appelo and fact that we are able to solve and generate Landsbergen (1986)), the KANT system, etc. as intersentential anaphora, to detect they use full parsing of the text. coreference chains and to generate Spanish After the parsing and solving pronominal zero-pronouns into English, issues that are anaphora, an interlingua representation of the hardly considered by other systems. Finally, whole text is obtained. In the interlingua we provide outstanding results of our system representation no semantic information is used on unrestricted corpora. as input, unlike some approaches that have as input semantic information of the constituents Introduction (Miyoshi et al. (1997), Castell6n et al. (1998), One of the main problems of many commercial the DLT system, etc). Machine Translation (MT) and experimental From this interlingua representation, the systems is that they do not carry out a correct generation of anaphora (including intersentential pronominal anaphora generation. Solving the anaphora), the detection of coreference chains of anaphora and extracting the antecedent are key the whole text and the generation of Spanish issues in a correct generation into the target zero-pronouns into English have been carried language. Unfortunately, the majority of MT out, issues that are hardly considered by other systems do not deal with anaphora resolution systems. Furthermore, this approach can be used and their successful operation usually does not for other different applications, e.g. Information go beyond the sentence level. In this paper, we Retrieval, Summarization, etc. present a complete approach that allows pronoun The paper is organized as follows: In section 1, resolution and generation into the target the complete approach that includes Analysis, language. Interlingua and Generation modules will be Our approach works on unrestricted texts described. These modules will be explained in unlike other systems, like the KANT interlingua detail in the next three sections. In section 5, the Generation module has been evaluated in order

! This paper has been partly financed by the collaborative research project between Spain and The United Kingdom number HB 1998-0068.

42 to measure the efficiency of our proposal. To do on pronominal anaphora resolution and so, two experiments have been accomplished: generation into the target language. In the generation of Spanish zero-pronouns into pronominal anaphora resolution in both the English (syntactic generation module) and the Spanish and English languages, the system has generation of English pronouns into Spanish achieved an accuracy of 84% and 87% ones (morphological generation module). respectively. Finally, the conclusions of this work will be A grammar defined by means of the presented. grammatical formalism SUG (Slot Unification Grammar) is used as input of SUPAR. A 1 System Architecture translator that transforms SUG rules into Prolog clauses has been developed. This translator will The complete approach that solves and generates provide a Prolog program that will parse each the anaphor is based on the scheme of Figure 1. sentence. SUPAR allows to carry out either a full Translation is carried out in two stages: from the or a partial parsing of the text, with the same source language to the Interlingua, and from the parser and grammar. Here, partial parsing Interlingua into the target language. Modules for techniques have been used due to the analysis are independent from modules for unavoidable incompleteness of the grammar and generation..In this paper, although we have only the use of unrestricted texts (corpora) as inputs. studied the Spanish and English languages, our These unrestricted corpora used as input for approach is easily extended to other languages, the partial parser contain the words tagged with i.e. multilingual system, in the sense that any their grammatical categories obtained from the analysis module can be linked to any generation output of a part-of-speech (POS) tagger. The module. word, as it appears in the corpus, its lemma and 1 its POS tag (with morphological information) is supplied for each word in the corpus. The corpus is split into sentences before applying the I NLPproblem I NLPproblem I i parsing. The output of the parsing module will be the Slot Structure (SS) that stores the necessary information2 for Natural Language Processing (NLP) problem resolution. This SS will be the input for the following module in which NLP problems (anaphora, extraposition, ellipsis, etc.) [ Spanish8enefation [ [ Englishgeneration I will be treated and solved. In Fernindez et al. (1998), a partial parsing3 Figure 1. System architecture. strategy that provides all the necessary As can be observed in Figure 1, there are information for resolving anaphora is presented. three independent modules in the process of This partial parsing shows that only the generation: Analysis, Interlingua and Generation following constituents are necessary for modules. anaphora resolution: co-ordinated prepositional and noun phrases, pronouns, conjunctions and 2 Analysis module The analysis is carried out by means of SUPAR 2 The SS stores for each constituent the following (Slot Unification Parser for Anaphora information: constituent name (NP, PP, etc.), resolution) system, presented in Femindez et al. semantic and morphologic information, discourse (2000). SUPAR is a computational system marker (identifier of the entity or discourse object) focused on anaphora resolution. It can deal with and the SS of its subconstituents. several kinds of anaphora, such as pronominal 3 It is important to emphasize that the system allows anaphora, one-anaphora, surface-count anaphora to carry out a full parsing of the text. In this paper, and definite descriptions. In this paper, we focus partial parsing with no semantic information is used in the evaluation of our approach.

43 verbs, regardless of the order in which they ( Verb: be "] appear in the text. The free words consist of 1Number: plural 1 ACTION= .~ Person: third constituents that are not covered by this partial 1 Tense: past ] parsing (e.g. adverbs). L Type: impersonalJ After applying the anaphora resolution ("Cat: NP "~ module, a new Slot Structure (SS') is obtained. ]Identifier: X [ In this new structure the correct antecedent I Head: boy [ (chosen from the possible candidates) for each ) Number:plural AGENT= ~ Gender:masculine anaphoric expression will be stored together I Person:third [ with its morphological and semantic I MODIFIER:of the mountains [ information. SS' will be the input for the ~Sem_Ref: human J lnterlingua system. THEME={} 3 Interlingua system (ISS) As said before, the Interlingua system takes the f Cat: PP SS of the sentence after applying the anaphora Identifier:Y resolution module as input. This system, named Prep: in . ~Cat: NP lnterlingua Slot Structure (1SS), generates an [Identifier: Z interlingua representation from the SS of the MODIFIER= ~ I Head:garden ..~ Number: singular sentence. ENTITY= ~ (3ender:masculine SUPAR generates one SS for each sentence [Person: third from the whole text and it solves intrasentential ~, Sere Ref:location and intersententiai anaphora. Then, 1SS L generates the interlingua representation of the Figure 2. lnterlingua representation of a clause. whole text. This is one of the main advantages As can be observed in Figure 25, the of 1SS because it is possible to generate interlingua is a frame composed of semantic intersentential pronominal anaphora. roles and features extracted from the SS of the To begin with, 1SS splits sentences into clause. Semantic roles that have been used in clauses 4. To identify a new clause when partial this approach are the following: ACTION, parsing has been carried out, the following AGENT, THEME and MODIFIER that heuristic has been applied: correspond to verb, subject, object and H1 Let us assume that the beginning of a new prepositional phrases of the clause respectively. clause has been found when a verb is parsed The notation we have used is based on the and a free conjunction is subsequently parsed. representation used in KANT interlingua. To In this particular case, a free conjunction identify these semantic roles when partial does not imply conjunctions that join co- parsing has been carried out and no semantic ordinated noun and prepositional phrases. It knowledge is used, the following heuristic has refers, here, to conjunctions that are parsed in been applied: our partial parsing scheme. H2 Let us assume that the NP parsed before Once the text has been split into clauses, the verb is the agent of the clause. In the same the next stage is to generate the interlingua way, the NP parsed after the verb is the theme representation for clauses. We have used a of the clause. Finally., all the PP found in the clause are its modifiers. complex feature structure for each clause. In Figure 2 the information of the first clause of the example (1) is presented: 5 Only the relevant attributes of each semantic role (1) The boys of the mountains were in the appear in a simplified way in the picture. Additional garden. They were catching flowers. attributes are added to the semantic roles in order to complete all the necessary information for the 4 A clause could be defined as "a group of words interlingua representation. containing a verb".

44 In Figure 2 the following elements have representation of the whole text of the example been found: ACTION= 'were', AGENT = 'the (1) can be observed. boys of the mountains', THEME= ~ (it has not ENITTY I been found any NP after the verb) and Cat: NP MODIFIER = 'in the garden'. These elements Identifier: V Head: boy are represented by a simple feature structure. Numb,Gend, Pers: plural, mase, thlrd Features are represented as attributes with their MODIFIER: CLAUSE 1 corresponding values. Sem_Ret~human Sentence 1D: 1 ACTION~be The semantic role ACTION has the following attributes: Verb with the value of the MODIFIER: lemma of the verb; Number, Person and Tense ENI'IIT 2 Cat: PP, Id: Q, Prep: in (grammatical features) and Type with the type of Cat:. NP Identifier: X the verb: impersonal, transitive, etc. Head: mountain Numb, Gend, Pets: 'The semantic role AGENT has the pl, fern, third following fiitributes: that contains the Sern_Ref:location COREFERENCE Cat CHAIN syntactic category of the constituent; Identifier with the value of the discourse marker; Head Cl.d USE 2 that contains the lemma of the constituent's ENTITY 3 Sentence ID: 2 ACTIO/'~catch head; Number, Gender and Person contain .THEME:Z Cat:Identifier: NP Y t grammatical features of the constituent; Head: garden Numb, Oend,Pets: AGENT: MODIFIER that contains all the information sing, masc.,third about the modifiers (PP) of the NP, and Sem_Ref Sum Ref: location [Type, Num,Gcnd [ Person, Head that contains semantic information about the IENTITY:V constituent's head if this information is available. The semantic role THEME has the same attributes as the semantic role AGENT, i.e. ENIITY 4 Cat: NP the difference is that THEME is the object of the Identifier: Z clause and AGENT is the subject. Head: flower Numb, Gend, Pets: Finally, the semantic role MODIFIER has plural, fem, third the following attributes: Cat that contains the Sem_Ref: thing syntactic category of the constituent; Identifier Figure 3. Interlingua representation of with the value of the discourse marker; Prep example (1). with the preposition of the constituent and ENTITY, which is the object of the PP and On the left side of Figure 3 the new objects contains the same attributes as the THEME. or entities of the discourse are represented. One clause can have more than one MODIFIER These objects are named ENTITIES and depending on the number of PP that it has. It is contain the following attributes: Cat, Identifier, important to emphasize that all this information Head, Number, Gender, Person and Sem_Ref, is extracted from the SS of the constituents due to they can represent an AGENT, a parsed in the clause. THEME or an object in a MODIFIER. As said before, instead of representing the On the right side, the CLAUSES of the clauses independently, we are interested in the text are represented in a simplified way. They interlingua representation of the whole input contain the semantic role. ACTION with its text. With the global representation of the input attributes and the semantic roles AGENT, text we will be able to generate intrasentential THEME and MODIFIER that have appeared and intersentential anaphora. Furthermore, it will in the clause. These semantic roles are linked to be possible to solve and generate coreference the ENTITIES that they refer to. It also contains chains. Thereby, the scheme of Figure 2 is the identifier of the sentence in which the extended in order to represent all the discourse CLAUSE appears (Sentence lD) and the using the clauses as main units of this Conjunction that joints two or more CLAUSES representation. In Figure 3 the interlingua in a sentence.

45 In the picture, four ENTITIES and two differences between the Spanish and English CLAUSES can be distinguished. The languages in the generation of the pronoun. ENTITIES are as follows: ENTITY 1 ('boy'), These differences are what we have named ENTITY 2 ('mountain), ENTITY 3 ('garden') discrepancies (a study of Spanish-English- and ENTITY 4 ('flower'). Moreover, a relation Spanish discrepancies is showed in Peral et aL between two ENTITIES (number 1 and number (1999)). In syntactic generation the following 2) appears in the picture due to the ENTITY1 discrepancies can be found: syntactic (NP) contains a MODIFIER (PP). discrepancies and Spanish elliptical zero-subject The CLAUSE 1 contains: Sentence 1D constructions. (' 1'), ACTION ('be'), AGENT ('V', the link to 4.1.1 Syntactic discrepancies ENTITY 1) and MODIFIER (which is a PP and contains the link to ENTITY 3). The This discrepancy is due to the fact that the CLAUSE 2 contains: Sentence_lD ('2'), surface structures of the Spanish sentences are 'ACTION ('catch'), AGENT (which is a more flexible than the English ones. The PRONOUN and contains the link to ENTITY 1) constituents of the Spanish sentences can appear and THEME ('Z', the link to ENTITY 4). without a specific order in the sentence. In order The coreference chain can be identified to carry out a correct generation into English, we thanks to AGENTS of CLAUSE 1 and must firstly reorganize the Spanish sentence. CLAUSE 2 ('the boys' and 'they') have their Nevertheless, in the English-Spanish translation, links to the same ENTITY. As can be seen, in general, this reorganization is not necessary. these links can occur between constituents of Let us see an example with the Spanish different clauses or different sentences. Then, sentence the global system is able to generate (2) A Pedro Io vi ayer. intersentential anaphora and identify the (I saw Peter yesterday.) coreferenee chains of the text. In (2), the object of the verb, A Pedro (to 4 Generation module Peter), appears before the verb (in the position of the theoretically subject) and the subject is The Generation module takes the interlingua omitted (this phenomena is usual in Spanish and representation of the text as input and generates it will be explained in the next subsection). The it into the target language. In this paper, we are PP A Pedro (to Peter) functions as an indirect only describing the generation of pronouns. The object of the verb (because it has the preposition generation phase is split into two modules: A (to)). We can find out the subject since the syntactic generation and morphological verb is in first person and singular, so the subject generation. In the next two subsections they will would be the pronoun Yo (1). Moreover, there is be studied in detail. Although the approach a pronoun, lo (him) that functions as presented here is multilingual, we have focused complement of the verb vi (saw). This pronoun on the generation into the Spanish and English in Spanish refers to the object of the verb, Peter, languages. when it is moved from its theoretical place after 4.1 Syntactic generation the verb (as it occurs in this sentence). As explained before, it is possible to In syntactic generation the interlingua identify the semantic roles (AGENT, ACTION, representation is converted by 'transformational etc.) of the previous constituents in the rules' into an ordered surface-structure tree, with CLAUSE applying a series of heuristics. Once appropriate labeling of the leaves with target the semantic roles of the constituents have been language grammatical functions and features. established, they will be stored in the interlingua The basic task of syntactic generation is to order representation. The generation into English will constituents in the correct sequence for the target be a new clause in which the order of the language. However, the aim of this work is only constituents is the usual in English: AGENT, the generation of pronominal anaphora into the ACTION, THEME and MODIFIERS. target language, so we have only focused on the

46 4.1.2 Elliptical zero-subject constructions H3 After the sentence has been divided into (zero-pronouns) clauses, a noun phrase or a pronoun is sought, for each clause, through the clause constituents As commented before, the Spanish language on the left-hand side of the verb, unless it is allows to omit the pronominal subject of the imperative or impersonal. Such a noun phrase sentences. These omitted pronouns are usually or pronoun must agree in person and number named zero-pronouns. While in other languages, with the verb of the clause. zero-pronouns may appear in either the subject's Sometimes, gender information of the or the object's grammatical position, (e.g. pronoun can be obtained when the verb is Japanese), in Spanish texts, zero-pronouns only copulative. For example, in: appear in the position of the subject. In English (3) Pedroj vio a Anak en el parque. Ok Estaba texts, this sort of pronoun occurs far less muy guapa. frequently, as the use of them are generally (Peterj saw Annk in the park. Shek was very compulsory in the language. Nevertheless, some beautiful.) examples can be found: "Ross carefully folded In this example, the verb is his trousers and ~.climbed into bed". (The estaba (was) copulative, so that its subject must agree in symbol ~ shows the position of the omitted gender and number with its object whenever the pronoun).. Target languages with typical object can have either a masculine or a feminine elliptical (zero) constructions corresponding to linguistic form (guapo: masc, guapa: fem). We source English pronouns are Italian, Thai, can therefore get information about its gender Chinese or Japanese. from the object, guapa ("beautiful" in its In order to generate Spanish zero-pronouns feminine form) which automatically assigns it to into English, they must first be located in the the feminine gender so the omitted pronoun text (ellipsis detection), and then resolved would have to be she rather than he. (anaphora resolution). At the ellipsis detection After the zero-pronoun has been detected, stage, information about the zero-pronoun (e.g. SUPAR inserts the pronoun (with its information person, gender, and number) must first be of person, gender and number) in the position in obtained from the verb of the clause and then which it has been omitted. This pronoun will be used to identify the antecedent of the zero- detected and resolved in the following module of pronoun (resolution stage). The detection anaphora resolution. After that, ISS generates the process depends on the knowledge about the interlingua representation of the text. structure of the language itself, which gives us In the example (3), two CLAUSES are clues to the use of each type of zero-pronoun. identified. In the second CLAUSE the zero- The resolution of zero-pronouns has been pronoun is detected (third person, singular and implemented in SUPAR. As we may work on feminine -she-) and solved (third person, unrestricted texts to which partial parsing is singular and feminine -Ann-). So the AGENT of applied, zero-pronouns must also be detected this CLAUSE is the PRONOUN she and it has a when we do not dispose of full syntactic link to the ENTITY Ann (the chosen information. Once the input text has been split antecedent). into clauses after applying the heuristic H1, the Now, the generation of Spanish zero- next problem consists of the detection of the pronouns into English is easy because all the omission of the subject from each clause. information that it is needed is located in the If partial parsing techniques have been interlingua representatio.n. The English applied, we can establish the following heuristic pronoun's information is extracted in the to detect the omission of the subject from each following way: number and person information clause: are obtained from the PRONOUN and gender information is obtained from the Head of its antecedent.

47 4.2 Morphological generation 4.2.2 Gender discrepancies In the morphological generation we mainly have English has less morphological information than to treat and solve number and gender Spanish. With reference to plural personal discrepancies in the generation of pronouns. pronouns, the pronoun we can be translated into nosotros (masculine) or nosotras (feminine), you 4.2.1 Number discrepancies into ustedes (masculine/feminine), vosotros This problem is generated by the discrepancy (masculine) or vosotras (feminine) and they into between words of different languages that ellos or elias. Furthermore, the singular personal express the same concept. These words can be pronoun it can be translated into dl/dste referred to a singular pronoun in the source (masculine) or ella/dsta (feminine). For language and to a plural pronoun in the target example: language. For example, in English the concept (6) Women~ were in the shop. They~ were buying .people is plural, whereas in Spanish is singular. gifts for their husbands. (4) Tile stadium was full of peoplej. They~ were (7) Las mujeresl estaban en la tienda. Ellasi very angry withthe referee. estaban comprando regains pard sus maridos. (5) El estadio estaba /leno de gentei. ~stai In Spanish, the plural name mujeres estaba muy enfadada con el ~rbitro. (women) is feminine and is replaced by the In (4), it can be observed that the name personal pronoun elias (plural feminine) (7), people in English has been replaced with the whereas in English they is valid for masculine as plural pronoun they, whereas in Spanish (5) the well as for feminine (6). ", name gente has been replaced with the singular These discrepancies do not always mean pronoun dsta (it). Gender discrepancies also that Spanish anaphors bear more information exist in the translation of other languages such than English one. For example, Spanish as in the German-English translation. possessive adjectives ~ casa) do not carry In order to take into account number gender information whereas English possessive discrepancies in the generation of the pronoun adjectives do (his~her house). into the target language a set of morphological We can find similar discrepancies among (number) rules is constructed. other languages. For example, in the In the generation of the pronoun They into French-German translation, gender is assigned Spanish in the example (4), the interlingua arbitrarily in both languages (although in French representation has a PRONOUN ('they', third is not as arbitrarily as in German). The English- person and plural) that it is linked to the German translation, like English-Spanish, ENTITY ('police', plural). For the correct supposes a translation from a language with generation into Spanish the following neutral gender into a language that assigns morphological rule is constructed: gender grammatically. pronoun + third..person + plural + antecedent As commented, it is important to (~olice') ~ ~sta (pronoun, third person, emphasize that the omission of the pronominal feminine and singular) subject is very usual in Spanish. If we want to stress the subject of a clause or distinguish The left-hand side of the morphological between different possible subjects, we will have rule contains the interlingua representation of to write the pronominal subject. Otherwise, the pronoun and the right-hand side contains the pronoun in the target language. pronominal subject could be omitted. We are interested, however, in the correct generation of In the same way, a set of morphological rules is constructed in order to generate English pronouns, and therefore, they will never be omitted. pronouns. Next, an example of these rules is shown: Thanks to the fact that our system solves only personal pronouns in third person, we have pronoun + third_person + singular + antecedent only studied gender discrepancies in the ('policla~ ~ they (pronoun, third person and generation of the third person pronouns. The plural)

48 study has been divided into pronouns with In the pronoun generation into English, the subject role and pronouns with complement role. set of morphological rules of Figure 6 is a) Pronouns with subject role. This kind of applied: pronouns can be identified in the interlingua pron + third_parson + sing + antec (person with masculine gander) ~ him representation because they have the semantic pron + third_person + sing + antec (person with feminine gender) ~ her role of AGENT in a CLAUSE. Their pron + third_person + sing + antec (animal orthing) ...) it antecedents are established with the links to the pron ÷ thirdperson + plural + antec (person) ~ them ENTITIES. The main problem in the pronoun Figure 6. generation into English consists of the In the process of generating a pronoun with generation of pronoun it. If we have a pronoun the semantic role of THEME into Spanish, the with the following attributes: masculine, set of morphological rules of Figure 7 is singular and third person in the interlingua applied: representation, this can be generated into the Spanish pron6uns he or it. If the antecedent of the pronoun refers to a person, we will generate pron ÷ third_person ÷ plural ~ les it into he. If the antecedent of the pronoun is an Figure 7. animal or a thing we will generate it into it. On the other hand, if the pronoun is in a These characteristics of the antecedent can be MODIFIER, the rules of Spanish generation obtained from the semantic information stored in will be as shown in Figure 8: its attribute Sem_Ref. A similar strategy is used to generate the pronouns she or it. With pron + third_person + sing + antec (masculine gender) ~ ~1 reference to plural personal pronouns: pron + third_person + sing + antec (feminine gender) ...) ella masculine/feminine, plural and third person, pron + third.person + plural + antec (masculine gender) ~ ellos pron + third_person + plural + antec (feminine gender) ~ elias they are generated into the English pronoun they. Figure 8. In Figure 4, the set of morphological rules to treat gender discrepancies in English generation of pronouns is shown: 5 Evaluation

pron + third_person + masculine + sing + antec (person) ~ he The syntactic generation and morphological pmn + thirdperson + masculine + sing + antec (animal or thing) --) it generation modules of our approach have been pron + thirdperson + feminine + sing + antec (person) ..-) she evaluated. To do so, one experiment for each pron + third_person + feminine + sing + antec (animal orthing) ~ it pron + thitrlperson + feminine/masculine + plural ~ they module has been accomplished. In the first one, the generation of Spanish zero-pronouns into Figure 4. English, using the techniques described above in subsection 4.1.2, has been evaluated6. In the In Spanish generation, the main problem second one, the generation of English consists of the translation of pronoun it The set pronouns into Spanish ones has been evaluated. of morphological rules to treat this case is shown In this experiment number and gender in Figure 5: discrepancies and their resolution, described pron + third_person + sing + antec (animal with masculine gender) ~ ~1 above in section 4.2, have been taken into [ pron + third_person + sing + antec (thing with masculine gender) .-) dste account. ] pron + third_person + sing + antec (animal with feminine gender) ~ ella l pmn + third_person + sing + antec (thing with feminine gender) .-y 6sta With reference to the first experiment, our computational system has been trained with a Figure 5. b) Pronouns with complement role. This kind of pronouns can be identified in the interlingua 6 Syntactic discrepancies has not been evaluated due representation because they have the semantic to the aim of this work is only the pronominal anaphora generation into the target language, so the role of THEME or they are in a MODIFIER in evaluation of the generation of the whole sentence a CLAUSE. into the target language has been omitted.

49 handmade corpus7 that contains 106 zero- justified for several reasons. One important pronouns. With this training, we have extracted reason is the non-detection of impersonal the degree of importance of the preferences that verbs by the POS tagger. Two other are used in the anaphora resolution module of reasons are. the lack of semantic the system. Furthermore, we have been able to information and the inaccuracy of the check and correct the techniques used in the grammar used. It is important to note that detection and generation of zero-pronouns into 46% of the verbs in these corpora have English. After that, we have carried out a blind their subjects omitted. It shows quite evaluation on unrestricted texts using the set of clearly the importance of this phenomenon preferences and the generation techniques in Spanish. learned during the training phase. In this case, b) Evaluating anaphora resolution. In this partial parsing of the text with no semantic task, the evaluation of zero-pronoun information has been used. resolution is accomplished. Of the 1,599 With regard to unrestricted texts, our verbs classified in these two corpora, 734 system has been run on two different Spanish of them have zero-pronouns. Only 228 of corpora: a) a fragment of the Spanish version of them 9, however, are in third person and will The Blue Book corpus (15,571 words), which be anaphorically resolved. A success rate of contains the handbook of the International 75% was attained for the 228 zero- Telecommunications Union CCITT, and b) a pronouns. By "successful resolutions" we fragment of the Lexesp corpus (9,746 words), mean that the solutions offered by our which contains ten Spanish texts from different system agree with the solutions offered by genres and authors. These texts are taken mainly two human experts. from newspapers. These corpora have been c) Evaluating zero-pronoun generation. The POS-tagged. Having worked with different generation of the 228 Spanish zero- genres and disparate authors, we feel that the pronouns into English has been evaluated. applicability of our proposal to other sorts of The following results in the generation texts is assured. have been obtained: a success rate of 70% To evaluate the generation of Spanish zero- in Lexesp and a success rate of 899'o in The pronouns into English three tasks have been Blue Book. In general (both corpora) a accomplished: a) the evaluation of the detection success rate of 75% has been achieved. The of zero-pronouns, b) the evaluation of anaphora errors are mainly produced by fails in resolution and c) the evaluation of generation. anaphora resolution and fails in the a) Evaluating the detection of zero-pronouns. generation of pronouns he/she/it (some To do this, verbs have been classified into heuristics 10, which have failed sometimes, two categories: 1) verbs whose subjects have been applied due to the used corpora have been omitted, and 2) verbs whose do not include semantic information). subjects have not. We have obtained a In the second experiment, we have success rate s of 88% on 1,599 classified evaluated the generation of Spanish personal verbs, with no significant differences seen pronouns with subject role into the English ones. between the corpora. We should also A fragment of the English version of The Blue remark that a success rate of 98% has been Book corpus (70,319 words) containing 165 obtained in the detection of verbs whose subjects were omitted, whereas only 80% was achieved for verbs whose subjects 9 The remaining pronouns are not in third person or were not. This lower success rate is they are cataphoric (the antecedent appears after the anaphor) or exophoric (the antecedent does not appear, linguistically, in the text). This corpus contains sentences with zero-pronouns J0 For instance: "all the pronouns in third person and made by different researchers of our Research Group. singular whose antecedents are proper nouns have g By "success rate", we mean the number of verbs boon translated into he (antecedent with masculine successfully classified, divided by the total number of gender) or she (antecedent with feminine gender); verbs in the text. otherwise they have been translated into it".

50 pronouns with subject role has been used in Information. In Proceedings of the Second AMTA order to carry out a blind evaluation. A success SIG-1L Workshop on lnterlinguas (Philadelphia, rate of 85.41% has been achieved. The errors are USA, 1998). mainly produced by fails in anaphora resolution Ferrfindez, A.; Palomar, M. and Moreno, L. (1998) and in the correct choice of the gender of the Anaphora resolution in unrestricted texts with antecedent's Head in Spanish. With reference to partial parsing. In Proceedings of the 36th Annual Meeting of the Association for Computational the choice of the gender of the antecedent's Linguistics and 17th International Conference on Head, an electronic dictionary has been used in Computational Linguistics, COLING - ACL'98 order to translate the original English word into (Montreal, Canada, 1998). pp. 385-391. the Spanish one, and subsequently, the gender is Ferrfindez, A.; Palomar, M. and Moreno, L. (2000) extracted from the Spanish word. Several An empirical approach to Spanish anaphora problems have occurred when using this resolution. To appear in Machine Translation electronic dictionary: (Special Issue on anaphora resolution in Machine 1) ' the word to be translated does not appear in Translation). 2000. the dic-tionary, and therefore, a heuristic is Leavitt, J.R.R.; Lonsdale, D. and Franz, A. (1994) A applied to assign the gender Reasoned Interlingua for knowledge-Based 2) the correct sense of the English word is not Machine Translation. In Proceedings of CSCS1-94. chosen, and therefore, the gender could be Miyoshi, H.; Ogino, T. and Sugiyarna, K. (1997) assigned incorrectly. EDR's Concept Classification and Description for Interlingual Representation. In Proceedings of the Conclusion First Workshop on lnterlinguas (San Diego, USA, 1997). In this paper a complete approach to solve and Peral, J.; Palomar, M. and Ferffmdez, A. (1999) generate pronominal anaphora in the Spanish Coreference-oriented Interlingual Slot Structure and English languages is presented. The and Machine Translation. In Proceedings of ACL approach works on unrestricted texts to which Workshop on Coreference and its Applications partial parsing techniques have been applied. (College Park, Maryland, USA, 1999). pp. 69-76. After the parsing and solving pronominal Witkam, A.P.M. (1983) Distributed language anaphora, an interlingua representation (based translation: feasibility study of multi!ingual facility on semantic roles and features) of the whole text for videotex information networks. BSO, Utrecht. is obtained. The representation of the whole text is one of the main advantages of our system due to several problems, that are hardly solved by the majority of MT systems, can be treated and solved. These problems are the generation of intersentential anaphora, the detection of coreference chains and the generation of Spanish zero-pronouns into English. Generation of zero- pronouns and Spanish personal pronouns has been evaluated obtaining a success rate of 75% and 85.41% respectively.

References Appelo, L. and Landsbergen, J. (1986) The machine translation project Rose. In Proceedings of I. International Conference on the State of the Art in Machine Translation in America, Asia and Europe, 1A1-MT'86 (Saarbr0cken). pp. 34-51. Castell6n, I.; Fern~Sndez, A.; Mart|, M.A.; Morante, R. and V~zquez, G. (1998) An lnterlingua Representation Based on the Lexieo-Semantie