
EACL 2006 Workshop on Multilingual Question Answering - MLQA06

The Effect of Machine Translation on the Performance of an Arabic-English QA System

Azzah Al-Maskari
Dept. of Information Studies, University of Sheffield, Sheffield, S10 2TN, UK
[email protected]

Mark Sanderson
Dept. of Information Studies, University of Sheffield, Sheffield, S10 2TN, UK
[email protected]

Abstract

The aim of this paper is to investigate how much the effectiveness of a Question Answering (QA) system is affected by the performance of Machine Translation (MT) based question translation. Nearly 200 questions were selected from TREC QA tracks and run through a question answering system. It was able to answer 42.6% of the questions correctly in a monolingual run. These questions were then translated manually from English into Arabic and back into English using an MT system, and then re-applied to the QA system. The system was able to answer 10.2% of the translated questions. An analysis of which sorts of translation error affected which questions was conducted, concluding that factoid questions are less prone to translation error than others.

1 Introduction

Increased availability of on-line text in languages other than English and increased multi-national collaboration have motivated research in Cross-Language Information Retrieval (CLIR). The goal of CLIR is to help searchers find relevant documents when their query terms are chosen from a language different from the language in which the documents are written. Multilinguality has been recognized as an important issue for the future of QA (Burger et al., 2001). The multilingual QA task was introduced for the first time in the Cross-Language Evaluation Forum, CLEF-2003.

According to the Global Reach web site (2004), shown in Figure 1, it can be estimated that an English speaker has access to around 23 times more digital documents than an Arabic speaker. One can conclude from the information shown in the figure that cross-language retrieval is potentially very useful when the required information is not available in the users' language.

Figure 1: Online language population (March 2004). Shares: English 39.6%, Chinese 14.7%, Spanish 9.6%, Japanese 9%, German 7.4%, French 4.5%, Korean 4.2%, Italian 4.1%, Portuguese 3.3%, Dutch 1.9%, Arabic 1.7%.

The goal of a QA system is to find answers to questions in a large collection of documents. The overall accuracy of a QA system is directly affected by its ability to correctly analyze the questions it receives as input; a Cross-Language Question Answering (CLQA) system will therefore be sensitive to any errors introduced during question translation. Many researchers criticize the MT-based CLIR approach. The reasons for their criticism mostly stem from the fact that the current translation quality of MT is poor; in addition, MT systems are expensive to develop and their application degrades retrieval efficiency due to the cost of the linguistic analysis.
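The 23-times estimate above can be checked against the shares in Figure 1, assuming it is derived as a simple ratio of the English and Arabic percentages of online content:

\[ \frac{\text{English share}}{\text{Arabic share}} = \frac{39.6\%}{1.7\%} \approx 23.3 \]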


This paper investigates the extent to which MT error affects QA accuracy. It is organized as follows: section 2 describes relevant previous work on cross-language retrieval; section 3 explains the experimental approach, including the procedure and systems employed, and discusses the results obtained; section 4 draws conclusions and outlines future research on what improvements need to be made to MT systems.

2 Related Research

CLIR is an active area; extensive research has been conducted on CLIR and on the effects of MT on QA systems' retrieval effectiveness. Lin and Mitamura (2004) point out that the quality of translation is fully dependent upon the MT system employed.

Perret (2004) proposed a question answering system designed to search French documents in response to French queries. He used automatic translation resources to translate the original queries from Dutch, German, Italian, Portuguese, Spanish, English and Bulgarian, and reports that performance in the monolingual task was 24.5%, dropping to 17% in the bilingual task. A similar experiment was conducted by Plamondon and Foster (2003) on TREC questions; they measured a drop of 44%, and in another experiment using Babelfish, the performance dropped even more, by 53%. They believe that CLEF questions were easier to process because they did not include definition questions, which are harder to translate. Furthermore, Plamondon and Foster (2004) compare the cross-language version of their Quantum QA system with the monolingual English version on CLEF questions and note that the performance of the cross-language system (French questions and English documents) was 28% lower than the monolingual system using IBM1 translation.

Tanev et al. (2004) note that the DIOGENE system, which relies on MultiWordNet, performs 15% better in the monolingual (Italian-Italian) task than in the cross-language (Italian-English) task. In Magnini et al.'s (2004) report for the year 2003, the average performance of correct answers on monolingual tasks was 41% and 25% on bilingual tasks. In the year 2004, the average accuracy was 23.7% on monolingual tasks and 14.7% on bilingual tasks.

As elucidated above, much research has been conducted to evaluate the effectiveness of QA systems in a cross-language setting by employing MT systems to translate the queries from the source language to the target language. However, most of this work has focused on European language pairs. To our knowledge, only one past study has investigated the performance of a cross-language Arabic-English QA system, Rosso et al. (2005). The QA system used by Rosso et al. (2005) is based on a system reported in Del Castillo (2004). Their experiment was carried out using the question corpus of the CLEF-2003 competition. They used questions in English and compared the answers with those obtained after translation back into English from an Arabic question corpus which had been manually translated. For the Arabic-English translation process, an automatic machine translator, the TARJIM Arabic-English machine translation system, was used. Rosso et al. reported a decrease in QA accuracy of more than 30%, caused by the translation process.

Work in the Rosso paper was limited to a single QA and MT system and did not analyze the types of errors or how those errors affected different types of QA questions. Therefore, it was decided to conduct further research on MT systems and their effect on the performance of QA systems. This paper presents an extension of the previously mentioned study, but with a more diverse range of TREC data and using a different QA system and a different MT system.

3 Experimental Approach

To run this experiment, 199 questions were randomly compiled from the TREC QA track, namely from TREC-8, TREC-9, TREC-11, TREC-2003 and TREC-2004, to be run through AnswerFinder; the results are discussed in section 3.1. The selected 199 English TREC questions were translated into Arabic by one of the authors (who is an Arabic speaker), and then fed into Systran to translate them back into English. The analysis of the translation is discussed in detail in section 3.2.

3.1 Performance of AnswerFinder

The 199 questions were run over AnswerFinder, divided as follows: 92 factoid questions, 51 definition questions and 56 list questions. The answers were manually assessed following an assessment scheme similar to the answer categories in iCLEF 2004:

• Correct: if the answer string is valid and supported by the snippets.


• Non-exact: if the answer string is missing some information, but the full answer is found in the snippets.

• Wrong: if the answer string and the snippets are missing important information, or both the answer string and the snippets are wrong compared with the answer key.

• No answer: if the system does not return any answer at all.

Table 1 provides an overall view: the system answered 42.6% of these questions correctly and 25.8% wrongly, gave no answer for 23.9%, and answered 8.1% non-exactly. Table 2 illustrates AnswerFinder's ability to answer each type of question separately.

Answer Type    Percentage
Correct        42.6%
Non-exact      8.1%
Wrong          25.8%
No answer      23.9%
Table 1. Overall view of AnswerFinder's monolingual accuracy

             Factoid   Definition   List
Correct      63        6            15
Not exact    1         6            9
Wrong        22        15           13
No answer    6         23           18
Table 2. Detailed analysis of AnswerFinder's performance, monolingual run

To measure the performance of AnswerFinder, recall (the ratio of relevant items retrieved to all relevant items in a collection) and precision (the ratio of relevant items retrieved to all retrieved items) were calculated. Recall, precision and F-measure for AnswerFinder are 0.51, 0.76 and 0.6, respectively.
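Assuming the standard balanced F-measure (the paper does not state the formula used), the reported value follows directly from the precision and recall above:

\[ F = \frac{2PR}{P+R} = \frac{2 \times 0.76 \times 0.51}{0.76 + 0.51} \approx 0.61 \]

which rounds to the 0.6 quoted in the text.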

3.2 Systran Translation

Most of the errors noticed during the translation process were of the following types: wrong transliteration, wrong word sense, wrong word order, and wrong pronoun translation. Table 3 lists Systran's translation errors: wrong transliteration 45.7%, wrong word sense (key word) 31%, wrong word order 25%, and wrong translation of pronouns 13.5%. Below is a discussion of Systran's translation accuracy and the problems that occurred during translation of the TREC QA track questions.

Type of Translation Error   Percentage
Wrong transliteration       45.7%
Wrong sense                 31%
Wrong word order            25%
Wrong pronoun               13.5%
Table 3. Types of translation errors

Wrong Transliteration

Wrong transliteration is the most common error encountered during translation. Transliteration is the process of replacing words in the source language with their phonetic equivalent in the target language. Al-Onaizan and Knight (2002) state that transliterating names from Arabic into English is a non-trivial task due to the differences in their sound and writing systems. Also, there is no one-to-one correspondence between Arabic sounds and English sounds. For example, P and B are both mapped to the single Arabic letter "ب"; Arabic "ح" and "ه" are both mapped to English H.

Original text         Who is Aga Khan?
Arabic version        من يكون اجا خان؟
Translation (wrong)   From [EEjaa] betrayed?
Table 4. Incorrect use of translation when transliteration should have been used

Original text         Who is Hasan Rohani?
Arabic version        من يكون حسن روحاني؟
Translation (wrong)   From spiritual goodness is?
Table 5. Wrong translation of a person's name

Transliteration errors mainly concern proper names, when the MT system does not recognize them as proper names and translates them instead of transliterating them. Systran produced 91 questions (45.7%) with wrong transliteration. It translated some names literally, especially those with a descriptive meaning. Table 4 provides an example of such a case, where "Aga" was wrongly transliterated and "Khan" was translated as "betrayed" when it should have been transliterated. This can also be seen in Table 5, where "Hasan Rohani" was translated literally as "spiritual goodness".
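The many-to-one sound mapping can be made concrete with a small illustrative Python sketch (the letter table and helper below are assumptions for illustration, not part of the paper's systems): because several English letters collapse onto one Arabic letter, mapping a name back from Arabic admits several candidates.

# Illustrative sketch: English P and B both map to the Arabic letter "ب"
# (and, analogously, Arabic "ح" and "ه" both surface as English H), so the
# reverse transliteration direction is ambiguous.
english_to_arabic = {"p": "ب", "b": "ب"}

def back_transliteration_candidates(arabic_letter):
    """Return the English letters that could have produced this Arabic letter."""
    return [eng for eng, ara in english_to_arabic.items() if ara == arabic_letter]

print(back_transliteration_candidates("ب"))  # ['p', 'b'] -- two possible names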


Wrong Word Sense

Wrong translation of a word can occur when a single word has different senses according to the context in which it is used. Word sense problems are more common in Arabic than in a language like English because Arabic vowels (written as diacritics) are largely left unwritten in most texts. Systran translated 63 questions (31.2%) with at least one wrong word sense; 25% of these could have been resolved by adding diacritics. Table 6 illustrates an error resulting from Systran's failure to translate words correctly even after diacritics had been added: the term "عِلم النَّفس" (psychology) was wrongly translated as "flag of breath". The Arabic form is a compound phrase; however, Systran translated each word individually even after diacritics were added.

Original text         Who was the father of psychology?
Arabic version        مَن أب عِلم النَّفس؟
Translation (wrong)   From father flag of the breath?
Table 6. Example of incorrect translation due to an incorrect sense choice

These errors can occur when a single word has different senses according to the context in which it is used. They also occur because of diacritization in the Arabic language: Arabic writing involves diacritization (vowels), which is largely ignored in modern texts. Ali (2003) gives a good example that can make an English speaker grasp the complexity caused by dropping Arabic diacritics. Suppose that vowels are dropped from an English word and the result is 'sm'. The possibilities for the original word are: some, same, sum, and semi. Systran translated 63 of the 199 questions (31.2%) with a wrong word sense; 25% of them could be resolved by adding diacritization.

Wrong Word Order

Word order errors occurred when the translated words were in an order that made no sense; this produced grammatically ill-formed sentences that the QA system was unable to process. Systran translated 25% of the questions with wrong word order, which led to the construction of ungrammatical questions. Table 7 shows an example of wrong word order.

Original text         What are the names of the space shuttles?
Arabic version        ما أسماء المكوك الفضائي؟
Translation (wrong)   What space name of the shuttle?
Table 7. Wrong word order

Wrong Pronoun

Systran frequently translated the pronoun "هو" as "air" in place of "him" or "it", as shown in Table 8. Table 9 shows an example of a pronoun problem: the pronoun "ها" is translated as "her" instead of "it", which refers to "buildings" in this case.

Original text         Who is Colin Powell?
Arabic version        مَن هُو كولن باول؟
Translation (wrong)   From Collin Powell air?
Table 8. Wrong translation of "who"

Original text         What buildings had Frank designed?
Arabic version        ما المباني التي صممها فرانك؟
Translation (wrong)   What the buildings which designed her Frank?
Table 9. Wrong translation of the pronoun "it"

It has been observed that Systran's translation errors exhibited clear regularities for certain questions, as might be expected from a rule-based system. As shown in the example tables above, the Arabic term "مَن" was translated as "from" instead of "who". This problem is related to the recognition of diacritization.

Arabic Word   English Translation Returned   Expected
حجم           big                            size
أطول          tall, big                      long
الأرض          ground                         earth
يوجد          create                         locate
دول           nation                         country
أبعد          far                            distance
أكثر          many                           most, much
Table 10. List of wrong key words returned by Systran

The evaluator observed Systran's propensity to produce some common wrong-sense translations, which changed the meaning of the questions; Table 10 shows some of these common wrong-sense translations.
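The ambiguity created by unwritten vowels can be illustrated with a short Python sketch built around Ali's (2003) English example (the word list and vowel-stripping rule are illustrative assumptions, not part of the systems used in this study):

# Dropping vowels from English words, by analogy with undiacritized Arabic,
# collapses several distinct words onto the same written form.
from collections import defaultdict

VOWELS = set("aeiou")

def strip_vowels(word):
    """Remove vowels, mimicking Arabic text written without diacritics."""
    return "".join(ch for ch in word if ch.lower() not in VOWELS)

vocabulary = ["some", "same", "sum", "semi"]  # Ali (2003)'s example words

surface_to_words = defaultdict(list)
for word in vocabulary:
    surface_to_words[strip_vowels(word)].append(word)

# All four words collapse to the single surface form "sm", so a system seeing
# only "sm" cannot tell which sense was intended.
print(dict(surface_to_words))  # {'sm': ['some', 'same', 'sum', 'semi']}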


3.3 The Effectiveness of AnswerFinder Combined with Systran's Translation

After translating the 199 questions using Systran, they were passed to AnswerFinder. Figure 2 illustrates the system's ability to answer the original and the translated questions. AnswerFinder was initially able to answer 42.6% of the questions, while after translation its accuracy in returning correct answers dropped to 10.2%. Its failure to return any answer increased by 35.5% (from 23.9% to 59.4%); in addition, non-exact answers decreased by 6.6% while wrong answers increased by 3.6% (from 25.4% to 28.9%).

Figure 2. Answering original/translated questions (%): correct 42.6 vs. 10.2, non-exact 8.1 vs. 1.5, wrong 25.4 vs. 28.9, no answer 23.9 vs. 59.4.

AnswerFinder was able to answer 23 translated questions out of 199. Of these 23 questions, 12 were correctly translated and 11 exhibited some translation errors. Looking closely at the 12 correctly translated questions (shown in Table 11), 9 of them are of the factoid type, one of the definition type and two of the list type. Investigating the other 11 questions that exhibited some translation errors (shown in Table 12), it can be seen that 9 of them are factoid and 2 are list questions. Our explanation for Systran's ability to translate factoid questions better than definition and list questions is that they contained fewer pronouns, in contrast to definition and list questions, where many proper names were included.

In total, Systran significantly reduced AnswerFinder's ability to return correct answers, by 32.4%. Table 13 shows recall, precision and F-measure before and after translation: recall before translation is 0.51 and after translation drops to 0.12. Similarly, precision goes down from 0.76 to 0.41; accordingly, the F-measure also drops from 0.6 to 0.2. Altogether, in the multilingual retrieval task precision and recall are 40.6% and 30%, respectively, below the monolingual retrieval task.

Question Type   Correctly translated   Correctly answered after translation
Factoid         9/92                   19/92
Defn.           1/51                   0/51
List            2/56                   3/56
Total           12                     13
Table 11. Types of questions translated and answered correctly in the bilingual run

Translation Error Type   Question Type
Word sense               4 factoid
Word order               2 factoid, 2 list
Diacritics               2 factoid
Transliteration          1 factoid
Total                    11 questions
Table 12. Classification of wrongly translated but correctly answered questions

Effectiveness measure   Before translation   After translation
Recall                  0.51                 0.12
Precision               0.76                 0.41
F-measure               0.6                  0.2
Table 13. Effectiveness measures of AnswerFinder
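Read as absolute percentage-point differences (an interpretation assumed here, consistent with the figures reported at the start of this subsection), the headline changes work out as:

\[ 42.6\% - 10.2\% = 32.4\% \quad \text{(loss in correct answers)} \]
\[ 59.4\% - 23.9\% = 35.5\% \quad \text{(increase in unanswered questions)} \]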

4 Conclusions

Systran was used to translate 199 TREC questions from Arabic into English. We have scrutinized the quality of Systran's translation throughout this paper. Several types of translation error appeared during translation: wrong transliteration, wrong word sense, wrong word order and wrong pronoun translation. The translated questions were fed into AnswerFinder, and the translation had a large impact on its accuracy in returning correct answers. AnswerFinder was seriously affected by the relatively poor output of Systran; its effectiveness was degraded by 32.4%. This conclusion confirms the findings of Rosso et al. (2005) using a different QA system, different test sets and a different machine translation system. Our results validate their conclusion that translation of queries from Arabic into English reduced the accuracy of a QA system by more than 30%.

We recommend using multiple MT systems to give a wider range of translations to choose from; a correct translation is more likely to appear among the outputs of multiple MT systems than in the output of a single MT system. However, it is essential to note that in some cases the MT systems may all disagree with one another in providing a correct translation, or they may agree on a wrong translation.
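A minimal sketch of this recommendation (the system outputs and the selection rule below are illustrative assumptions, not systems evaluated in this paper) is to pick the candidate translation that the largest number of MT systems agree on:

from collections import Counter

def pick_translation(candidates):
    """Choose the candidate produced by the most MT systems (simple voting).

    Ties are broken arbitrarily by Counter ordering; a real system would need
    a better back-off, e.g. a language-model score over the candidates.
    """
    counts = Counter(c.strip().lower() for c in candidates)
    best, _ = counts.most_common(1)[0]
    return best

# Hypothetical outputs of three MT systems for one Arabic question.
outputs = [
    "Who is Aga Khan?",    # system A transliterates the name correctly
    "Who is Aga Khan?",    # system B agrees
    "From Aga betrayed?",  # system C translates the name literally
]
print(pick_translation(outputs))  # -> "who is aga khan?"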

It should also be borne in mind that some keywords are naturally more important than others, so in a question-answering setting it is more important to translate them correctly. Some keywords may not be as important, and some keywords, due to incorrect analysis of the English question sentence by the question analysis module, may even degrade translation and question-answering performance.

We believe there are ways to avoid the MT errors discussed previously (i.e. wrong transliteration, wrong word sense, wrong word order, and wrong translation of pronouns). Below are some suggestions to overcome such problems:

• One solution is to make adjustments (pre- or post-processing) to the question translation process to minimize the effects of translation errors by automatically correcting some regular errors using regular expressions (a sketch of such a rule is given at the end of this section).

• Another possible solution is to build an interactive MT system that offers users more than one translation from which to pick the most accurate one; we believe this would offer great help in resolving the word sense problem. This is more suitable for expert users of a language.

In this paper, we have presented the errors associated with machine translation, which indicates that the current state of MT is not very reliable for cross-language QA. Much work has been done in the area of machine translation for CLIR; however, the evaluation often focuses on retrieval effectiveness rather than translation correctness.
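As an illustration of the first suggestion, the sketch below encodes one post-processing rule based on the regular "مَن" → "from" error observed in section 3.2 (the rule itself is an assumption for illustration, not a rule implemented in this study):

import re

# Systran regularly rendered the question word "من" (who) as "From" at the
# start of a question, so rewrite that pattern. A naive rule like this can
# misfire on questions that genuinely begin with "From", and it leaves any
# other translation errors in the question untouched.
WHO_RULE = re.compile(r"^\s*From\b", re.IGNORECASE)

def post_correct(translated_question):
    """Apply a simple regular-expression fix to a machine-translated question."""
    return WHO_RULE.sub("Who", translated_question)

print(post_correct("From Collin Powell air?"))  # -> "Who Collin Powell air?"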

Acknowledgement

We would like to thank EU FP6 project BRICKS (IST-2002-2.3.1.12) and the Ministry of Manpower, Oman, for partly funding this study. Thanks are also due to Mark Greenwood for helping us with access to his AnswerFinder system.

References

Al-Onaizan, Y. and Knight, K. (2002) Translating Named Entities Using Monolingual and Bilingual Resources. In: Proceedings of the ACL Workshop on Computational Approaches to Semitic Languages.

Burger, J., Cardie, C., Chaudhri, V., Gaizauskas, R., Harabagiu, S., Israel, D., Jacquemin, C., Lin, C., Maiorano, S., Miller, G., Moldovan, D., Ogden, B., Prager, J., Riloff, E., Singhal, A., Srihari, R., Strzalkowski, T., Voorhees, E., Weischedel, R. (2001) Issues, Tasks and Program Structures to Roadmap Research in Question & Answering (Q&A). Technical report, National Institute of Standards and Technology (NIST).

Del Castillo, M., Montes y Gómez, M., and Villaseñor, L. (2004) QA on the web: A preliminary study. In: Proceedings of the 5th Mexican International Conference on Computer Science (ENC04), Colima, Mexico.

Hull, D. A., and Grefenstette, G. (1996) Querying across languages: a dictionary-based approach to multilingual information retrieval. In: Research and Development in Information Retrieval, pp. 46-57.

Lin, F., and Mitamura, T. (2004) Keyword Translation from English to Chinese for Multilingual QA. In: The 6th Conference of the Association for Machine Translation in the Americas (AMTA-2004).

Magnini, B., Vallin, A., Ayache, C., Erbach, G., Peñas, A., Rijke, M., Rocha, P., Simov, K., Sutcliffe, R. (2004) Overview of the CLEF 2004 Multilingual Question Answering Track. In: Proceedings of CLEF 2004.

Perret, L. (2004) Question Answering System for the French Language. In: Proceedings of CLEF 2004.

Plamondon, L. and Foster, G. (2003) Quantum, a French/English Cross-Language Question Answering System. In: Proceedings of CLEF 2003.

Tanev, H., Negri, M., Magnini, B., Kouylekov, M. (2004) The DIOGENE Question Answering System at CLEF-2004. In: Proceedings of CLEF 2004.
