Comparison, Selection and Use of Sentence Alignment Algorithms for New Language Pairs

Anil Kumar Singh
LTRC, IIIT
Gachibowli, Hyderabad
India - 500019

Samar Husain
LTRC, IIIT
Gachibowli, Hyderabad
India - 500019

Abstract

Several algorithms are available for sentence alignment, but there is a lack of systematic evaluation and comparison of these algorithms under different conditions. In most cases, the factors which can significantly affect the performance of a sentence alignment algorithm have not been considered while evaluating. We have used a method for evaluation that can give a better estimate of a sentence alignment algorithm's performance, so that the best one can be selected. We have compared four approaches using this method. These have mostly been tried on European language pairs. We have evaluated them on manually checked and validated English-Hindi aligned parallel corpora under different conditions. We also suggest some guidelines on actual alignment.

1 Introduction

Aligned parallel corpora are collections of pairs of sentences where one sentence is a translation of the other. Sentence alignment means identifying which sentence in the target language (TL) is a translation of which one in the source language (SL). Such corpora are useful for statistical NLP, algorithms based on unsupervised learning, automatic creation of resources, and many other applications.

Over the last fifteen years, several algorithms have been proposed for sentence alignment. Their performance as reported is excellent (in most cases not less than 95%, and usually 98 to 99% and above). The evaluation is performed in terms of precision, and sometimes also recall. The figures are given for one or (less frequently) more corpus sizes. While this does give an indication of the performance of an algorithm, the variation in performance under varying conditions has not been considered in most cases. Very little information is given about the conditions under which evaluation was performed. This gives the impression that the algorithm will perform with the reported precision and recall under all conditions.

We have tested several algorithms under different conditions, and our results show that the performance of a sentence alignment algorithm varies significantly depending on the conditions of testing. Based on these results, we propose a method of evaluation that will give a better estimate of the performance of a sentence alignment algorithm and will allow a more meaningful comparison. Our view is that unless this is done, it will not be possible to pick the best algorithm for a given set of conditions. Those who want to align parallel corpora may end up choosing a less suitable algorithm for their purposes. We have used the proposed method for comparing four algorithms under different conditions. Finally, we also suggest some guidelines for using these algorithms for actual alignment.

2 Sentence Alignment Methods

Sentence alignment approaches can be categorized as based on sentence length, word correspondence, and composite (where more than one approach is combined), though other techniques, such as cognate matching (Simard et al., 1992), were also tried. Word correspondence was used by Kay (Kay, 1991; Kay and Roscheisen, 1993). It was based on the idea that words which are translations of each other will have similar distributions in the SL and TL texts. Sentence length methods were based on the intuition that the length of a translated sentence is likely to be similar to that of the source sentence. Brown, Lai and Mercer (Brown et al., 1991) used word count as the sentence length, whereas Gale and Church (Gale and Church, 1991) used character count. Brown, Lai and Mercer assumed prior alignment of paragraphs. Gale and Church relied on some previously aligned sentences as 'anchors'. Wu (Wu, 1994) also used lexical cues from a corpus-specific bilingual lexicon for better alignment.

Word correspondence was further developed in IBM Model-1 (Brown et al., 1993) for statistical machine translation. Melamed (Melamed, 1996) also used word correspondence in a different (geometric correspondence) way for sentence alignment. Simard and Plamondon (Simard and Plamondon, 1998) used a composite method in which the first pass does alignment at the level of characters as in (Church, 1993) (itself based on cognate matching) and the second pass uses IBM Model-1, following Chen (Chen, 1993). The method used by Moore (Moore, 2002) also had two passes, the first one being based on sentence length (word count) and the second on IBM Model-1. Composite methods are used so that different approaches can complement each other.
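To make the sentence length intuition concrete, the sketch below shows a minimal length-based aligner in the spirit of Gale and Church's character-count method: a dynamic program over a few bead types (1-1, 1-0, 0-1, 2-1, 1-2) with a penalty that grows as the character counts of the two sides diverge. The cost function, the bead priors and the constants C_RATIO and VARIANCE are simplified assumptions for illustration; they are not the parameters or the exact probability model of any of the algorithms evaluated here.

    import math

    # Simplified Gale-and-Church-style parameters (assumed, not tuned):
    # C_RATIO = expected TL characters per SL character, VARIANCE = spread of that ratio.
    C_RATIO, VARIANCE = 1.0, 6.8
    # Prior cost (roughly a negative log probability) per bead type (SL sents, TL sents).
    BEAD_COST = {(1, 1): 0.0, (1, 0): 4.0, (0, 1): 4.0, (2, 1): 2.0, (1, 2): 2.0}

    def length_cost(sl_len, tl_len):
        """Penalty for pairing sl_len SL characters with tl_len TL characters."""
        if sl_len == 0 and tl_len == 0:
            return 0.0
        mean = (sl_len + tl_len / C_RATIO) / 2.0
        delta = (tl_len - C_RATIO * sl_len) / math.sqrt(VARIANCE * mean)
        return delta * delta  # larger length mismatch -> higher cost

    def align(sl_sents, tl_sents):
        """Return a minimal-cost list of beads, each a pair (SL indices, TL indices)."""
        m, n = len(sl_sents), len(tl_sents)
        INF = float("inf")
        best = [[INF] * (n + 1) for _ in range(m + 1)]
        back = [[None] * (n + 1) for _ in range(m + 1)]
        best[0][0] = 0.0
        for i in range(m + 1):
            for j in range(n + 1):
                if best[i][j] == INF:
                    continue
                for (di, dj), prior in BEAD_COST.items():
                    ni, nj = i + di, j + dj
                    if ni > m or nj > n:
                        continue
                    sl_len = sum(len(s) for s in sl_sents[i:ni])
                    tl_len = sum(len(t) for t in tl_sents[j:nj])
                    cost = best[i][j] + prior + length_cost(sl_len, tl_len)
                    if cost < best[ni][nj]:
                        best[ni][nj] = cost
                        back[ni][nj] = (i, j)
        beads, i, j = [], m, n  # walk the backpointers to recover the bead sequence
        while (i, j) != (0, 0):
            pi, pj = back[i][j]
            beads.append((list(range(pi, i)), list(range(pj, j))))
            i, j = pi, pj
        return list(reversed(beads))

    src = ["This is a test.", "It has two sentences."]
    tgt = ["Yah ek parikshan hai.", "Ismen do vakya hain."]
    print(align(src, tgt))  # [([0], [0]), ([1], [1])]

A word-correspondence pass (for example, IBM Model-1 link scores) can be layered on top of such a length-based skeleton, which is essentially what the composite methods mentioned above do.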
3 Factors in Performance

As stated above, the performance of a sentence alignment algorithm depends on some identifiable factors. We can even make predictions about whether the performance will increase or decrease. However, as the results given later show, the algorithms don't always behave in a predictable way. For example, one of the algorithms did worse rather than better on an 'easier' corpus. This variation in performance is quite significant and it cannot be ignored for actual alignment (Table 1). Some of these factors have been indicated in earlier papers, but they were not taken into account while evaluating, nor were their effects studied.

Translation of a text can be fairly literal or it can be a recreation, with a whole range between these two extremes. Paragraphs and/or sentences can be dropped or added. In actual corpora, there can even be noise (sentences which are not translations at all and may not even be part of the actual text). This can happen due to the fact that the texts have been extracted from some other format, such as web pages. While translating, sentences can also be merged or split. Thus, the SL and TL corpora may differ in size.

All these factors affect the performance of an algorithm in terms of, say, precision, recall and F-measure. For example, we can expect the performance to worsen if there is an increase in additions, deletions, or noise. And if the texts were translated fairly literally, statistical algorithms are likely to perform better. However, our results show that this does not happen for all the algorithms.

The linguistic distance between SL and TL can also play a role in performance. The simplest measure of this distance is in terms of distance on the family tree model. Other measures could be the number of cognate words or some measure based on syntactic features. For our purposes, it may not be necessary to have a quantitative measure of linguistic distance. The important point is that for languages that are distant, some algorithms may not perform too well if they rely on some closeness between the languages. For example, an algorithm based on cognates is likely to work better for English-French or English-German than for English-Hindi, because there are fewer cognates for English-Hindi.

It is not without basis to say that Hindi is more distant from English than German is. English and German belong to the Indo-Germanic branch, whereas Hindi belongs to the Indo-Aryan branch. There are many more cognates between English and German than between English and Hindi. Similarly, as compared to French, Hindi is also distant from English in terms of morphology. The vibhaktis of Hindi can adversely affect the performance of sentence length based (especially word count based) as well as word correspondence based algorithms. From the syntactic point of view, Hindi is a comparatively free word order language, but with a preference for the SOV (subject-object-verb) order, whereas English is more of a fixed word order, SVO type language. For sentence length and IBM Model-1 based sentence alignment, this does not matter, since they do not take the word order into account. However, Melamed's algorithm (Melamed, 1996), though it allows 'non-monotonic chains' (thus taking care of some difference in word order), is somewhat sensitive to the word order. As Melamed states, how it will fare with languages with more word order variation than English and French is an open question.

Another aspect of the performance, which may not seem important from the NLP-research point of view, is speed. Someone who has to use these algorithms for actual alignment of large corpora (say, more than 1000 sentences) will have to realize the importance of speed. Any algorithm which does worse than O(n) is bound to create problems for large sizes. Obviously, an algorithm that can align 5000 sentences in 1 hour is preferable to one which takes three days, even if the latter is marginally more accurate. Similarly, one which takes 2 minutes for 100 sentences but 16 minutes for 200 sentences will be difficult to use for practical purposes.
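As a rough illustration of why super-linear growth matters (using the hypothetical timings just mentioned), two measurements are enough to estimate the implied growth exponent and extrapolate, assuming a simple power-law model time ≈ c · n^k:

    import math

    def power_law_fit(n1, t1, n2, t2):
        """Fit time = c * n**k through two (corpus size, time) measurements."""
        k = math.log(t2 / t1) / math.log(n2 / n1)
        c = t1 / n1 ** k
        return c, k

    # 2 minutes for 100 sentences, 16 minutes for 200 sentences.
    c, k = power_law_fit(100, 2.0, 200, 16.0)
    print(k)                      # ~3.0: roughly cubic growth
    print(c * 5000 ** k / 60.0)   # ~4167 hours for 5000 sentences

Doubling the corpus from 100 to 200 sentences while the time goes from 2 to 16 minutes implies k = 3, so under this model 5000 sentences would take several months rather than hours.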
corpus sizes. They also compared the performance of several (till then known) algorithms. In most of the other cases, evaluation was performed on only one corpus type and one corpus size. In some cases, certain other factors were considered, but not very systematically. In other words, there wasn't an attempt to study the effect of the various factors described earlier on the performance. In some cases, the size used for testing was too small. One other detail is that size was sometimes mentioned in terms of the number of words, not the number of sentences.

5 Evaluation Measures

We have used local (for each run) as well as global (over all the runs) measures of the performance of an algorithm. These measures are:

• Precision (local and global)
• Recall (local and global)
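As a generic illustration of how such measures can be computed (not necessarily the exact local and global definitions used here), the sketch below scores predicted against reference alignment links, where each link is a pair of SL and TL sentence-index tuples so that 1-2 or 2-1 alignments count as single links; the 'global' figures are obtained by pooling the raw counts of all runs, a micro-averaging choice that is an assumption of this sketch.

    def precision_recall(predicted, reference):
        """Local precision/recall for one run, computed over alignment links."""
        pred, ref = set(predicted), set(reference)
        correct = len(pred & ref)
        precision = correct / len(pred) if pred else 0.0
        recall = correct / len(ref) if ref else 0.0
        return precision, recall

    def global_precision_recall(runs):
        """Global figures over several runs: pool the counts, then divide."""
        correct = pred_total = ref_total = 0
        for predicted, reference in runs:
            pred, ref = set(predicted), set(reference)
            correct += len(pred & ref)
            pred_total += len(pred)
            ref_total += len(ref)
        return correct / pred_total, correct / ref_total

    # Links are ((SL sentence ids), (TL sentence ids)) tuples.
    pred = [((0,), (0,)), ((1,), (1, 2)), ((2,), (3,))]
    ref = [((0,), (0,)), ((1,), (1,)), ((2,), (2,)), ((3,), (3,))]
    print(precision_recall(pred, ref))  # -> (0.3333..., 0.25)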