EACL 2006 Workshop on Multilingual Question Answering - MLQA06

The Effect of Machine Translation on the Performance of an Arabic-English QA System

Azzah Al-Maskari and Mark Sanderson
Dept. of Information Studies, University of Sheffield, Sheffield, S10 2TN, UK

Abstract

The aim of this paper is to investigate how much the effectiveness of a Question Answering (QA) system is affected by the performance of Machine Translation (MT) based question translation. Nearly 200 questions were selected from TREC QA tracks and run through a question answering system, which was able to answer 42.6% of the questions correctly in a monolingual run. These questions were then translated manually from English into Arabic, translated back into English using an MT system, and re-applied to the QA system, which was then able to answer only 10.2% of the translated questions. An analysis of which kinds of translation error affected which questions was conducted, concluding that factoid questions are less prone to translation error than other question types.

1 Introduction

Increased availability of on-line text in languages other than English and increased multi-national collaboration have motivated research in Cross-Language Information Retrieval (CLIR). The goal of CLIR is to help searchers find relevant documents when their query terms are chosen from a language different from the language in which the documents are written. Multilinguality has been recognized as an important issue for the future of QA (Burger et al., 2001). The multilingual QA task was introduced for the first time in the Cross-Language Evaluation Forum (CLEF) in 2003.

According to the Global Reach web site (2004), shown in Figure 1, it can be estimated that an English speaker has access to around 23 times more digital documents than an Arabic speaker. One can conclude from this that cross-language retrieval is potentially very useful when the required information is not available in the user's language.

[Figure 1: Online language population (March 2004) - a pie chart of the share of online content by language; English holds the largest share at 39.6%, with Arabic among the smallest shares shown.]

The goal of a QA system is to find answers to questions in a large collection of documents. The overall accuracy of a QA system is directly affected by its ability to correctly analyze the questions it receives as input, so a Cross-Language Question Answering (CLQA) system will be sensitive to any errors introduced during question translation. Many researchers criticize the MT-based CLIR approach. Their criticism mostly stems from the fact that current MT translation quality is poor; in addition, MT systems are expensive to develop, and their application degrades retrieval efficiency due to the cost of the linguistic analysis.

This paper investigates the extent to which MT error affects QA accuracy. It is organized as follows: section 2 describes relevant previous work on cross-language retrieval; section 3 explains the experimental approach, including the procedure and systems employed, and discusses the results obtained; section 4 draws conclusions and outlines future research on what improvements need to be made to MT systems.
2 Related Research

CLIR is an active area, and extensive research has been conducted on CLIR and on the effect of MT on the retrieval effectiveness of QA systems. Lin and Mitamura (2004) point out that the quality of translation is fully dependent upon the MT system employed.

Perret (2004) proposed a question answering system designed to search French documents in response to French queries. He used automatic translation resources to translate the original queries from Dutch, German, Italian, Portuguese, Spanish, English and Bulgarian, and reports that performance in the monolingual task was 24.5%, dropping to 17% in the bilingual task. A similar experiment was conducted by Plamondon and Foster (2003) on TREC questions, measuring a drop of 44%; in another experiment using Babelfish, the performance dropped even further, by 53%. They believe that CLEF questions were easier to process because they did not include definition questions, which are harder to translate. Furthermore, Plamondon and Foster (2004) compare the cross-language version of their Quantum QA system with the monolingual English version on CLEF questions and note that the performance of the cross-language system (French questions and English documents) was 28% lower than that of the monolingual system using IBM1 translation.

Tanev et al. (2004) note that the DIOGENE system, which relies on MultiWordNet, performs 15% better in the monolingual task (Italian-Italian) than in the cross-language task (Italian-English). In Magnini et al.'s (2004) report, the average accuracy on the monolingual tasks in 2003 was 41%, against 25% on the bilingual tasks; in 2004, the average accuracy was 23.7% on the monolingual tasks and 14.7% on the bilingual tasks.

As elucidated above, much research has been conducted to evaluate the effectiveness of QA systems in a cross-language setting by employing MT systems to translate the queries from the source language to the target language. However, most of it has focused on European language pairs. To our knowledge, only one past example of research has investigated the performance of a cross-language Arabic-English QA system, Rosso et al. (2005). The QA system used by Rosso et al. (2005) is based on a system reported in Del Castillo (2004). Their experiment was carried out using the question corpus of the CLEF-2003 competition. They used questions in English and compared the answers with those obtained after translating back into English from an Arabic question corpus that had been manually translated. For the Arabic-English translation step, an automatic machine translator, the TARJIM Arabic-English machine translation system, was used. Rosso et al. reported a decrease in QA accuracy of more than 30% caused by the translation process.

Work in the Rosso paper was limited to a single QA and MT system, and it did not analyze types of errors or how those errors affected different types of QA questions. Therefore, it was decided to conduct further research on MT systems and their effect on the performance of QA systems. This paper presents an extension of the previously mentioned study, but with a more diverse range of TREC data, a different QA system and a different MT system.

3 Experimental Approach

To run this experiment, 199 questions were randomly compiled from the TREC QA track, namely from TREC-8, TREC-9, TREC-11, TREC-2003 and TREC-2004, to be run through AnswerFinder; the results are discussed in section 3.1. The selected 199 English TREC questions were translated into Arabic by one of the authors (who is an Arabic speaker) and then fed into Systran to translate them back into English. The analysis of the translation is discussed in detail in section 3.2.
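To make the procedure concrete, the following is a minimal Python sketch of the round-trip evaluation, under the assumption that the QA system, the MT step and the manual assessment can each be wrapped as a callable. The names qa_system, mt_arabic_to_english, manual_arabic and assess are illustrative placeholders, not part of AnswerFinder's or Systran's actual interfaces; the assessment labels follow the scheme described in section 3.1.

```python
# Minimal sketch of the round-trip evaluation: monolingual run vs. a run on
# questions that were manually translated into Arabic and machine-translated
# back into English. All external systems are passed in as callables.
from collections import Counter
from typing import Callable, Dict, List

LABELS = ("correct", "non-exact", "wrong", "no answer")  # assessment scheme of section 3.1

def evaluate(questions: List[str],
             manual_arabic: Dict[str, str],
             mt_arabic_to_english: Callable[[str], str],
             qa_system: Callable[[str], str],
             assess: Callable[[str, str], str]) -> Dict[str, Counter]:
    """Run each question monolingually and after round-trip translation,
    and tally the manual assessment labels for both runs."""
    tallies = {"monolingual": Counter(), "translated": Counter()}
    for q in questions:
        # Monolingual run: the original English question goes straight to the QA system.
        tallies["monolingual"][assess(q, qa_system(q))] += 1
        # Cross-language run: manual English->Arabic, then MT Arabic->English.
        back_translated = mt_arabic_to_english(manual_arabic[q])
        tallies["translated"][assess(q, qa_system(back_translated))] += 1
    return tallies

def report(tallies: Dict[str, Counter]) -> None:
    """Print the percentage of each assessment label for both runs."""
    for run, counts in tallies.items():
        total = sum(counts.values())
        breakdown = ", ".join(f"{lab}: {100 * counts[lab] / total:.1f}%" for lab in LABELS)
        print(f"{run}: {breakdown}")
```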
3.1 Performance of AnswerFinder

The 199 questions were run over AnswerFinder; they divided as follows: 92 factoid questions, 51 definition questions and 56 list questions. The answers were manually assessed following an assessment scheme similar to the answer categories used in iCLEF 2004:

- Correct: the answer string is valid and supported by the snippets.
- Non-exact: the answer string is missing some information, but the full answer is found in the snippets.
- Wrong: the answer string and/or the snippets are missing important information, or both the answer string and the snippets are wrong compared with the answer key.
- No answer: the system does not return any answer at all.

Table 1 provides an overall view: the system answered 42.6% of the questions correctly and 25.8% wrongly, returned no answer for 23.9%, and answered 8.1% non-exactly. Table 2 breaks down AnswerFinder's ability to answer each type of question separately.

Answer type   Percentage
Correct       42.6%
Non-exact     8.1%
Wrong         25.8%
No answer     23.9%

Table 1. Overall view of AnswerFinder monolingual accuracy

              Factoid   Definition   List
Correct       63        6            15
Non-exact     1         6            9
Wrong         22        15           13
No answer     6         23           18

Table 2. AnswerFinder accuracy by question type

Below is a discussion of Systran's translation accuracy and the problems that occurred during translation of the TREC QA track questions; Table 3 summarizes the types of error observed.

Type of translation error   Percentage
Wrong transliteration       45.7%
Wrong sense                 31%
Wrong word order            25%
Wrong pronoun               13.5%

Table 3. Types of translation errors

Wrong Transliteration

Wrong transliteration is the most common error encountered during translation. Transliteration is the process of replacing words in the source language with their phonetic equivalent in the target language. Al-Onaizan and Knight (2002) state that transliterating names from Arabic into English is a non-trivial task due to the differences in their sound and writing systems. Also, there is no one-to-one correspondence between Arabic sounds and English sounds. For example, English P and B are both mapped to the single Arabic letter "ب", and Arabic "ح" and "ه" are both mapped to English H.

Original text         Who is Aga Khan?
Arabic version        من يكون اجا خان؟
Translation (wrong)   From [EEjaa] betrayed?

Table 4. Incorrect use of translation when transliteration should have been used
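As a toy illustration of the transliteration ambiguity discussed above (not part of Systran or of the paper's analysis), the sketch below uses a small, invented Arabic-to-Latin letter table to enumerate candidate back-transliterations. A real transliterator must also recover short vowels, which are normally unwritten in Arabic, making names such as "Aga Khan" even harder to reconstruct.

```python
# Toy illustration of why Arabic-to-English back-transliteration is ambiguous:
# several English sounds collapse onto a single Arabic letter, so the reverse
# mapping is one-to-many. The table below is invented for this example only.
from itertools import product
from typing import Dict, List, Tuple

ARABIC_TO_LATIN: Dict[str, Tuple[str, ...]] = {
    "ب": ("b", "p"),   # both P and B map to this one Arabic letter
    "ح": ("h",),       # ...while an English H can come from either ح or ه
    "ه": ("h",),
    "خ": ("kh",),
    "ج": ("j", "g"),
    "ا": ("a", "aa"),
    "ن": ("n",),
}

def candidate_transliterations(word: str) -> List[str]:
    """Enumerate possible Latin renderings of an Arabic word, letter by letter."""
    options = [ARABIC_TO_LATIN.get(ch, (ch,)) for ch in word]
    return ["".join(p) for p in product(*options)]

if __name__ == "__main__":
    print(candidate_transliterations("خان"))  # ['khan', 'khaan']
    print(candidate_transliterations("اجا"))  # 8 candidates, e.g. 'aja', 'aga', 'aajaa', ...
```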