DOMCAT: A Bilingual Concordancer for Domain-Specific Computer Assisted Translation

Ming-Hong Bai 1,2, Yu-Ming Hsieh 1,2, Keh-Jiann Chen 1, Jason S. Chang 2
1 Institute of Information Science, Academia Sinica, Taiwan
2 Department of Computer Science, National Tsing-Hua University, Taiwan

Abstract

In this paper, we propose a web-based bilingual concordancer, DOMCAT 1, for domain-specific computer assisted translation. Given a multi-word expression as a query, the system retrieves sentence pairs from a bilingual corpus, identifies translation equivalents of the query in the sentence pairs (translation spotting) and ranks the retrieved sentence pairs according to the relevance between the query and the translation equivalents. To provide high-precision translation spotting for domain-specific translation tasks, we exploit a normalized correlation method to spot the translation equivalents. To rank the retrieved sentence pairs, we propose a correlation function modified from the Dice coefficient for assessing the correlation between the query and the translation equivalents. The performance of the translation spotting module is evaluated in terms of precision-recall measures, and that of the ranking module in terms of coverage rate.

1 http://ckip.iis.sinica.edu.tw/DOMCAT/

Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, pages 55–60, Jeju, Republic of Korea, 8-14 July 2012. © 2012 Association for Computational Linguistics

1 Introduction

A bilingual concordancer is a tool that retrieves aligned sentence pairs from a parallel corpus such that the source sentences contain the query and the translation equivalents of the query are identified in the target sentences. It helps not only in finding translation equivalents of the query but also in presenting various contexts of occurrence. As a result, it is extremely useful for bilingual lexicographers, human translators and second language learners (Bowker and Barlow 2004; Bourdaillet et al., 2010; Gao 2011).

Identifying the translation equivalents, known as translation spotting, is the most challenging part of a bilingual concordancer. Most existing bilingual concordancers spot translation equivalents with word alignment-based methods (Jian et al., 2004; Callison-Burch et al., 2005; Bourdaillet et al., 2010). However, word alignment-based translation spotting has some drawbacks. First, aligning a rare (low-frequency) term may encounter the garbage collection effect (Moore, 2004; Liang et al., 2006), which causes the term to align to many unrelated words. Second, the statistical word alignment model is not good at many-to-many alignment because translation equivalents are not always correlated at the lexical level. Unfortunately, these effects are intensified in a domain-specific concordancer because the queries are usually domain-specific terms, which are mostly multi-word, low-frequency and semantically non-compositional.

Wu et al. (2003) employed a statistical association criterion to spot translation equivalents in their bilingual concordancer. The association-based criterion avoids the above-mentioned effects. However, it has other drawbacks in the translation spotting task. First, it encounters the contextual effect, which causes the system to incorrectly spot the translations of strongly collocated context. Second, association-based translation spotting tends to spot the common subsequence of a set of similar translations instead of the full translations. Figure 1 illustrates an example of the contextual effect, in which ‘Fan K'uan’ is incorrectly spotted as part of the translation of the query term ‘谿山行旅圖’ (Travelers Among Mountains and Streams), which is the name of the painting painted by ‘Fan K'uan/范寬’, since the painter’s name is strongly collocated with the name of the painting.

    Sung , Travelers Among Mountains and Streams , Fan K'uan
    宋谿山行旅圖范寬

Figure 1. ‘Fan K'uan’ may be incorrectly spotted as part of the translation of ‘谿山行旅圖’ if a pure association method is applied.

Figure 2 illustrates an example of the common subsequence effect, in which ‘清明上河圖’ (the River During the Qingming Festival/ Up the River During Qingming) has two similar translations as quoted, but the Dice coefficient tends to spot the common subsequences of the translations. (Function words are ignored in our translation spotting.)

    Expo 2010 Shanghai-Treasures of Chinese Art Along the River During the Qingming Festival
    2010 上海世博會華夏百寶篇清院本清明上河圖

    Oversized Hanging Scrolls and Handscrolls Up the River During Qingming
    巨幅名畫清沈源清明上河圖

Figure 2. The Dice coefficient tends to spot the common subsequences ‘River During Qingming’.

Bai et al. (2009) proposed a normalized frequency criterion to extract translation equivalents from a sentence-aligned parallel corpus. This criterion takes the lexical-level contextual effect into account, so it can effectively resolve the above-mentioned effect. But the goal of their method is to find the most common translations rather than to spot translations, so the normalized frequency criterion tends to ignore rare translations.

In this paper, we propose a bilingual concordancer, DOMCAT, for computer assisted domain-specific term translation. To remedy the above-mentioned effects, we extend the normalized frequency of Bai et al. (2009) to a normalized correlation criterion to spot translation equivalents. The normalized correlation inherits the characteristics of normalized frequency and is adjusted for spotting rare translations. These characteristics are especially important for a domain-specific bilingual concordancer, which must spot translation pairs of low-frequency and semantically non-compositional terms.

The remainder of this paper is organized as follows. Section 2 describes the DOMCAT system. In Section 3, we describe the evaluation of the DOMCAT system. Section 4 contains some concluding remarks.

2 The DOMCAT System

Given a query, the DOMCAT bilingual concordancer retrieves sentence pairs and spots translation equivalents by the following steps:

1. Retrieve the sentence pairs whose source sentences contain the query term.
2. Extract translation candidate words from the retrieved sentence pairs by the normalized correlation criterion.
3. Spot the candidate words for each target sentence and rank the sentences by the normalized Dice coefficient criterion.

In step 1, the query term can be a single word, a phrase, a gapped sequence or even a regular expression. The parallel corpus is indexed by a suffix array to retrieve the sentences efficiently. Steps 2 and 3 are more complicated and will be described from Section 2.1 to Section 2.3.

2.1 Extract Translation Candidate Words

After the queried sentence pairs are retrieved from the parallel corpus, we can extract translation candidate words from the sentence pairs. We compute the local normalized correlation with respect to the query term for each word e in each target sentence. The local normalized correlation is defined as follows:

$$\mathrm{lnc}(e; q, \mathbf{e}, \mathbf{f}) = \frac{\sum_{f_i \in q} p(e \mid f_i) + \alpha\,|q|}{\sum_{f_j \in \mathbf{f}} p(e \mid f_j) + \alpha\,|\mathbf{f}|} \qquad (1)$$

where q denotes the query term, $\mathbf{f}$ denotes the source sentence, $\mathbf{e}$ denotes the target sentence, and $\alpha$ is a small smoothing factor. The probability p(e|f) is the word translation probability derived from the entire parallel corpus by IBM Model 1 (Brown et al., 1993). The local normalized correlation of e can be interpreted as the probability of word e being part of the translation of the query term q given the sentence pair $(\mathbf{e}, \mathbf{f})$.

Once the local normalized correlation is computed for each word in the retrieved sentences, we compute the normalized correlation over the retrieved sentences. The normalized correlation is the average of all lnc values and is defined as follows:

$$\mathrm{nc}(e; q) = \frac{1}{n} \sum_{i=1}^{n} \mathrm{lnc}(e; q, \mathbf{e}^{(i)}, \mathbf{f}^{(i)}) \qquad (2)$$

where n is the number of retrieved sentence pairs. After the nc values for the words of the retrieved target sentences are computed, we can obtain a translation candidate list by filtering out the words with lower nc values.

To compare with the association-based method, we also sorted the word list by the Dice coefficient, defined as follows:

$$\mathrm{dice}(e, q) = \frac{2\,\mathrm{freq}(e, q)}{\mathrm{freq}(e) + \mathrm{freq}(q)} \qquad (3)$$

where freq is the frequency function, which computes frequencies from the parallel corpus.

    Candidate words    NC
    mountain           0.676
    stream             0.442
    traveler           0.374
    among              0.363
    sung               0.095
    k'uan              0.090

Figure 3(a). Candidate words sorted by nc values.

    Candidate words    Dice
    traveler           0.385
    reduced            0.176
    stream             0.128
    k'uan              0.121
    fan                0.082
    among              0.049
    mountain           0.035

Figure 3(b).

... from unrelated words much better than the Dice coefficient.

The rationale behind the normalized correlation is that the nc value measures the strength with which word e is generated by the query compared to the strength with which it is generated by the whole sentence. As a result, the normalized correlation can easily separate the words generated by the query term from the words generated by the context. In contrast, the Dice coefficient counts the frequency of a co-occurring word without considering that it could be generated by the strongly collocated context.

2.2 Translation Spotting

Once we have a translation candidate list and the respective nc values, we can spot the translation equivalents by the following spotting algorithm. For each target sentence, first spot the word with the highest nc value. Then extend the spotted sequence to the neighbors of that word by checking the nc values of the neighboring words, skipping function words. If the nc value of a neighbor is greater than a threshold θ, add the word to the spotted sequence. Repeat the extending process until no word can be added to the spotted sequence.

The following is the pseudo-code for the algorithm:

    S is the target sentence
    H is the spotted word sequence
    θ is the threshold of translation candidate words

    Initialize:
        H ← ∅
        e_max ← S[0]
    Foreach e_i in S:
        If nc(e_i) > nc(e_max):
            e_max ← e_i
    If nc(e_max) > θ:
        add e_max to H
        Repeat until no word is added to H
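The candidate extraction of Section 2.1 and the spotting procedure of Section 2.2 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the DOMCAT implementation: the translation table `t_prob` (a dictionary keyed by `(target_word, source_word)`) stands in for IBM Model 1 probabilities, the smoothing factor is assumed to enter Equation (1) additively and is named `alpha`, a word contributes nothing to the average for sentence pairs in which it does not occur, function words are passed over during extension rather than stopping it, and the threshold value 0.2 is hypothetical. The nc values in the usage example are those of Figure 3(a); the token list is a hand-normalized version of the Figure 1 sentence.

```python
from collections import defaultdict


def lnc(e, query, source, t_prob, alpha=1e-4):
    """Local normalized correlation of target word e (Eq. 1, additive smoothing assumed)."""
    num = sum(t_prob.get((e, f), 0.0) for f in query) + alpha * len(query)
    den = sum(t_prob.get((e, f), 0.0) for f in source) + alpha * len(source)
    return num / den  # alpha > 0 keeps the denominator non-zero


def nc_scores(query, pairs, t_prob, alpha=1e-4):
    """Normalized correlation (Eq. 2): lnc averaged over the n retrieved pairs.

    pairs is a list of (source_tokens, target_tokens). Assumption: a word
    adds 0 for pairs whose target sentence does not contain it.
    """
    totals = defaultdict(float)
    for source, target in pairs:
        for e in set(target):
            totals[e] += lnc(e, query, source, t_prob, alpha)
    n = len(pairs)
    return {e: total / n for e, total in totals.items()}


def spot(target, nc, theta, function_words=frozenset()):
    """Return the sorted indices of the spotted words in one target sentence."""
    def score(i):
        return nc.get(target[i], 0.0)

    content = [i for i in range(len(target)) if target[i] not in function_words]
    if not content:
        return []
    seed = max(content, key=score)          # word with the highest nc value
    if score(seed) <= theta:
        return []
    spotted = [seed]
    for step in (-1, 1):                    # extend to the left, then to the right
        i = seed + step
        while 0 <= i < len(target):
            if target[i] not in function_words:
                if score(i) <= theta:       # first weak content word ends the extension
                    break
                spotted.append(i)
            i += step                       # function words are skipped, not added
    return sorted(spotted)


# Usage with the nc values of Figure 3(a); theta = 0.2 is a hypothetical threshold.
nc_fig3a = {"mountain": 0.676, "stream": 0.442, "traveler": 0.374,
            "among": 0.363, "sung": 0.095, "k'uan": 0.090}
target = ["sung", "traveler", "among", "mountain", "and", "stream", "fan", "k'uan"]
indices = spot(target, nc_fig3a, theta=0.2, function_words={"and"})
words = [target[i] for i in indices]
# -> ['traveler', 'among', 'mountain', 'stream']
```

With these values the painter's name is left out of the spotted sequence, because 'fan' has no nc score and 'k'uan' falls below the threshold. Because each side is scanned until the first content word that fails the threshold, a single pass in each direction gives the same result as repeating the extension until no word can be added.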
