Improving Machine Translation of Null Subjects in Italian and Spanish
Total Page:16
File Type:pdf, Size:1020Kb
Improving Machine Translation of Null Subjects in Italian and Spanish Lorenza Russo, Sharid Loaiciga,´ Asheesh Gulati Language Technology Laboratory (LATL) Department of Linguistics – University of Geneva 2, rue de Candolle – CH-1211 Geneva 4 – Switzerland {lorenza.russo, sharid.loaiciga, asheesh.gulati}@unige.ch Abstract (2000) has shown that 46% of verbs in their test corpus had their subjects omitted. Continuation Null subjects are non overtly expressed of this work by Rello and Ilisei (2009) has found subject pronouns found in pro-drop lan- that in a corpus of 2,606 sentences, there were guages such as Italian and Spanish. In this study we quantify and compare the oc- 1,042 sentences without overtly expressed pro- currence of this phenomenon in these two nouns, which represents an average of 0.54 null languages. Next, we evaluate null sub- subjects per sentence. As for Italian, many anal- jects’ translation into French, a “non pro- yses are available from a descriptive and theoret- drop” language. We use the Europarl cor- ical perspective (Rizzi, 1986; Cardinaletti, 1994, pus to evaluate two MT systems on their among others), but to the best of our knowledge, performance regarding null subject trans- there are no corpus studies about the extent this lation: Its-2, a rule-based system devel- 2 oped at LATL, and a statistical system phenomenon has. built using the Moses toolkit. Then we Moreover, althought null elements have been add a rule-based preprocessor and a sta- largely treated within the context of Anaphora tistical post-editor to the Its-2 translation Resolution (AR) (Mitkov, 2002; Le Nagard and pipeline. A second evaluation of the im- Koehn, 2010), the problem of translating from proved Its-2 system shows an average in- and into a pro-drop language has only been dealt crease of 15.46% in correct pro-drop trans- with indirectly within the specific context of MT, lations for Italian-French and 12.80% for as Gojun (2010) points out. Spanish-French. We address three goals in this paper: i) to com- pare the occurrence of a same syntactic feature in 1 Introduction Italian and Spanish, ii) to evaluate the translation Romance languages are characterized by some of null subjects into French, that is, a “non pro- morphological and syntactical similarities. Ital- drop” language; and, iii) to improve the transla- ian and Spanish, the two languages we are inter- tion of null subjects in a rule-based MT system. ested in here, share the null subject parameter, Next sections follow the above scheme. also called the pro-drop parameter, among other 2 Null subjects in source corpora characteristics. The null subject parameter refers to whether the subject of a sentence is overtly ex- We worked with the Europarl corpus (Koehn, pressed or not (Haegeman, 1994). In other words, 2005) in order to have a parallel comparative cor- due to their rich morphology, Italian and Span- pus for Italian and Spanish. From this corpus, ish allow non lexically-realized subject pronouns we manually analyzed 1,000 sentences in both (also called null subjects, zero pronouns or pro- languages. From these 1,000 sentences (26,757 drop).1 words for Italian, 27,971 words for Spanish), we From a monolingual point of view, regarding identified 3,422 verbs for Italian and 3,184 for Spanish, previous work by Ferrandez´ and Peral 2Poesio et al. (2004) and Rodr´ıguez et al. (2010), for in- 1Henceforth, the terms will be used indiscriminately. stance, focused on anaphora and deixis. 81 Proceedings of the EACL 2012 Student Research Workshop, pages 81–89, Avignon, France, 26 April 2012. c 2012 Association for Computational Linguistics Spanish. We then counted the occurrences of We found a total number of non-expressed pro- verbs with pro-drop and classified them in two nouns in our corpus comparable to those obtained categories: personal pro-drop3 and impersonal by Rodr´ıguez et al. (2010) on the Live Memo- pro-drop4, obtaining a total amount of 1,041 pro- ries corpus for Italian and by Recasens and Mart´ı drop in Italian and 1,312 in Spanish. Table 1 (2010) on the Spanish AnCora-Co corpus (Table shows the results in percentage terms. 2). Note that in both of these studies, they were interested in co-reference links, hence they did Total Pers. Impers. Total not annotate impersonal pronouns, claiming they Verbs pro- pro- pro- are rare. On the other hand, we took all the pro- drop drop drop drop pronouns into account, including impersonal IT 3,422 18.41% 12.01% 30.42% ones. ES 3,182 23.33% 17.84% 41.17% Corpus Language Result Table 1: Results obtained in the detection of pro-drop. Our corpus IT 3.89% Live Memories IT 4.5% Results show a higher rate of pro-drop in Span- Our corpus ES 4.69% ish (10.75%). It has 4.92% more personal pro- AnCora-Co ES 6.36% drop and 5.83% more impersonal pro-drop than Italian. The contrast of personal pro-drop is due to Table 2: Null-subjects in our corpus compared to Live Memories and AnCora-Co corpora. Percentages are a syntactic difference between the two languages. calculated with respect to the total number of words. In Spanish, sentences like (1a.) make use of two pro-drop pronouns while the same syntactic struc- ture uses a pro-drop pronoun and an infinitive clause in Italian (1b.), hence, the presence of more 3 Baseline machine translation of null personal pro-drop in Spanish. subjects The 1,000 sentences of our corpus were trans- (1) a. ES pro le pido (1.sg) que pro inter- lated from both languages into French (IT→FR; venga (3.sg) con el prestigio de su ES→FR) in order to assess if personal pro-drop cargo. and impersonal pro-drop were correctly identi- b. IT pro le chiedo (1.sg) di intervenire fied and translated. We tested two systems: Its-2 (inf.) con il prestigio della sua carica. (Wehrli et al., 2009), a rule-based MT system de- I ask you to intervene with the pres- veloped at the LATL; and a statistical system built tige of your position. using the Moses toolkit out of the box (Koehn et al., 2007). The latter was trained on 55,000 sen- The difference of impersonal pro-drop, on the tence pairs from the Europarl corpus and tuned on other hand, is due to the Spanish use of an im- 2,000 additional sentence pairs, and includes a 3- personal construction (2a.) with the “se” parti- gram language model. cle. Spanish follows the schema “se + finite verb Tables 3 and 4 show percentages of correct, in- + non-finite verb”; Italian follows the schema “fi- correct and missing translations of personal and nite verb + essere (to be) + past participle” (2b.). impersonal null subjects calculated on the basis of We considered this construction formed by one the number of personal and impersonal pro-drop more verb in Italian than in Spanish as shown in found in the corpus. examples (2a.) and (2b.). This also explains the We considered the translation correct when the difference in the total amount of verbs (Table 1). null pronoun is translated by an overt pronoun with the correct gender, person and number fea- (2) a. ES Se podra´ corregir. tures in French; otherwise, we considered it in- b. IT Potra` essere modificato. correct. Missing translation refers to cases where It can be modified. the null pronoun is not generated at all in the tar- get language. 3Finite verbs with genuinely referential subjects (i.e. I, you, s/he, we, they). We chose these criteria because they allow us 4Finite verbs with non-referential subjects (i.e. it). to evaluate the single phenomenon of null subject 82 Its-2 Pair Pro-drop Correct Incorrect Missing personal 66.34% 3.49% 30.15% IT→FR impersonal 16.78% 18.97% 64.23% average 46.78% 9.6% 43.61% personal 55.79% 3.50% 40.70% ES→FR impersonal 29.29% 11.40% 59.29% average 44.28% 6.93% 48.78% Table 3: Percentages of correct, incorrect and missing translation of zero-pronouns obtained by Its-2. Average is calculated on the basis of total pro-drop in corpus. Moses Pair Pro-drop Correct Incorrect Missing personal 71.59% 1.11% 27.30% IT→FR impersonal 44.76% 11.43% 43.79% average 61% 5.18% 33.81% personal 72.64% 2.02% 25.34% ES→FR impersonal 54.56% 2.45% 42.98% average 64.78% 2.21% 33% Table 4: Percentages of correct, incorrect and missing translation of zero-pronouns obtained by Moses. Average is calculated on the basis of total pro-drop in corpus. translation. BLEU and similar MT metrics com- lem is closely related with that of AR: as the sys- pute scores over a text as a whole. For the same tem does not have any module for AR, it cannot reason, human evaluation metrics based on ade- detect the gender of the antecedent, rendering the quacy and fluency were not suitable either (Koehn correct translation infeasible. and Monz, 2006). The number of correct translation of imper- Moses generally outperforms Its-2 (Tables 3 sonal pro-drop is very low for both pairs. ES→FR and 4). Results for the two systems demon- reached 29.29%, while IT→FR only 16.78% (Ta- strate that instances of personal pro-drop are bet- ble 3). The reason for these percentages is a reg- ter translated than impersonal pro-drop for the ular mistranslation or a missing translation.