Measure Word Generation for English-Chinese SMT Systems
Dongdong Zhang¹, Mu Li¹, Nan Duan², Chi-Ho Li¹, Ming Zhou¹
¹Microsoft Research Asia, Beijing, China
²Tianjin University, Tianjin, China
{dozhang,muli,v-naduan,chl,mingzhou}@microsoft.com

Proceedings of ACL-08: HLT, pages 89–96, Columbus, Ohio, USA, June 2008. © 2008 Association for Computational Linguistics

Abstract

Measure words in Chinese are used to indicate the count of nouns. Conventional statistical machine translation (SMT) systems do not perform well on measure word generation due to data sparseness and the potential long distance dependency between measure words and their corresponding head words. In this paper, we propose a statistical model to generate appropriate measure words of nouns for an English-to-Chinese SMT system. We model the probability of measure word generation by utilizing lexical and syntactic knowledge from both source and target sentences. Our model works as a post-processing procedure over the output of statistical machine translation systems, and can work with any SMT system. Experimental results show our method can achieve high precision and recall in measure word generation.

1 Introduction

In linguistics, measure words (MW) are words or morphemes used in combination with numerals or demonstrative pronouns to indicate the count of nouns¹, which are often referred to as head words (HW).

Chinese measure words are grammatical units and occur quite often in real text. According to our survey on the measure word distribution in the Chinese Penn Treebank and the test datasets distributed by the Linguistic Data Consortium (LDC) for Chinese-to-English machine translation evaluation, the average occurrence is 0.505 and 0.319 measure words per sentence respectively. Unlike in Chinese, there is no special set of measure words in English. Measure words are usually used for mass nouns, and any semantically appropriate noun can function as a measure word. For example, in the phrase three bottles of water, the word bottles acts as a measure word. Countable nouns are almost never modified by measure words². Numerals and indefinite articles are directly followed by countable nouns to denote the quantity of objects.

Therefore, in the English-to-Chinese machine translation task we need to make an additional effort to generate the missing measure words in Chinese. For example, when translating the English phrase three books into the Chinese phrase “三本书”, where three corresponds to the numeral “三” and books corresponds to the noun “书”, the Chinese measure word “本” should be generated between the numeral and the noun.

In most statistical machine translation (SMT) models (Och et al., 2004; Koehn et al., 2003; Chiang, 2005), some measure words can be generated without modification or additional processing. For example, in the above translation, the phrase translation table may suggest that the word three be translated into “三”, “三本”, “三只”, etc., and the word books into “书”, “书本”, “名册” (scroll), etc. Then the SMT model selects the most likely combination “三本书” as the final translation result. In this example, a measure word candidate set consisting of “本” and “只” can be generated by bilingual phrases (or synchronous translation rules), and the best measure word “本” from the measure word candidate set can be selected by the SMT decoder. However, as we will show below, existing SMT systems do not deal well with measure word generation in general, due to data sparseness and long distance dependencies between measure words and their corresponding head words.

Due to the limited size of bilingual corpora, many measure words, as well as the collocations between a measure word and its head word, cannot be well covered by the phrase translation table in an SMT system. Moreover, Chinese measure words often have a long distance dependency to their head words, which makes the language model ineffective in selecting the correct measure words from the measure word candidate set. For example, in Figure 1 the distance between the measure word “项” and its head word “工程” (undertaking) is 15. In this case, an n-gram language model with n<15 cannot capture the MW-HW collocation. Table 1 shows the distribution of the relative positions of head words around measure words in the Chinese Penn Treebank, where a negative position indicates that the head word is to the left of the measure word and a positive position indicates that the head word is to the right of the measure word. Although many measure words are close to the head words they modify, more than sixteen percent of measure words are far away from their corresponding head words (the absolute distance is more than 5).

English: Pudong 's development and opening up is a century-spanning undertaking for vigorously promoting shanghai and constructing a modern economic , trade , and financial center .
Chinese: 浦东/开发/开放/是/一/项/振兴/上海/,/建设/现代化/经济/、/贸易/、/金融/中心/的/跨/世纪/工程/。
Figure 1. Example of long distance dependency between MW and its modified HW

Position   Occurrence   Position   Occurrence
1          39.5%        -1         0
2          15.7%        -2         0
3          4.7%         -3         8.7%
4          1.4%         -4         6.8%
5          2.1%         -5         4.3%
>5         8.8%         <-5        8.0%
Table 1. Position distribution of head words

To overcome the disadvantage of measure word generation in a general SMT system, this paper proposes a dedicated statistical model to generate measure words for English-to-Chinese translation. We model the probability of measure word generation by utilizing rich lexical and syntactic knowledge from both source and target sentences. Three steps are involved in our method to generate measure words: identifying the positions at which to generate measure words, collecting the measure word candidate set, and selecting the best measure word. Our method is performed as a post-processing procedure on the output of SMT systems. The advantage is that it can be easily integrated into any SMT system. Experimental results show our method can significantly improve the quality of measure word generation. We also compared the performance of our model based on different contextual information, and show that both large-scale monolingual data and parallel bilingual data can be helpful for generating correct measure words.

¹ The uncommon cases of verbs are not considered.
² There are some exceptional cases, such as “100 head of cattle”. But they are very uncommon.

2 Our Method

2.1 Measure word generation in Chinese

In Chinese, measure words are obligatory in certain contexts, and the choice of measure word usually depends on the head word’s semantics (e.g., shape or material). The set of Chinese measure words is a relatively closed set and can be classified into two categories based on whether they have a corresponding English translation. Those not having an English counterpart need to be generated during translation. For those having English translations, such as “米” (meter) and “吨” (ton), we just use the translation produced by the SMT system itself. According to our survey, about 70.4% of measure words in the Chinese Penn Treebank need to be explicitly generated during the translation process.

In Chinese, there are generally stable linguistic collocations between measure words and their head words. Once the head word is determined, the collocated measure word can usually be selected accordingly. However, there is no easy way to identify head words in target Chinese sentences, since most of the time an SMT output is not a well-formed sentence due to translation errors. Mistakes in head word identification may lower the quality of measure word generation. In addition, sometimes the head word itself is not enough to determine the measure word. For example, in the Chinese sentences “他家有 5 口人” (there are five people in his family) and “总共有 5 个人参加了会议” (a total of five people attended the meeting), where “人” (people) is the head word collocated with two different measure words “口” and “个”, we cannot determine the measure word just based on the head word “人”.

2.2 Framework

In our framework, a statistical model is used to generate measure words. The model is applied to SMT system outputs as a post-processing procedure. Given an English source sentence, an SMT decoder produces a target Chinese translation, in which positions for measure word generation are identified.

[...] words in the set are pronouns such as “该” (this), “那” (that) and “若干” (several). In the SMT output, the positions after these words are also identified as candidate positions to generate measure words.

2.4 Candidate measure word generation

To avoid high computation cost, the measure word candidate set only consists of those measure words which can form valid MW-HW collocations with their head words. We assume that all the surrounding words within a certain window size centered on the given position to generate a measure word are potential head words, and require that a measure word candidate must collocate with at least one of the surrounding words. Valid MW-HW collocations are mined from the training corpus and a separate lexicon resource.

There is a possibility that the real head word is outside the window of the given size. To address this problem, we also use a source window centered on the position p_s, which is aligned to the target measure word position p_t. The link between p_s and p_t can be inferred from the SMT decoding result. Thus, the chance of capturing the best measure word increases with the aid of words located in the source window. For example, given the window size of 10, although the target head word “工程” (undertaking) in Figure 1 is located outside the target window, its
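The candidate collection step of Section 2.4 can be illustrated with a minimal Python sketch built on the Figure 1 sentence. Everything named here is an assumption made for illustration and not part of the paper: the tiny `COLLOCATIONS` lexicon, the `EN_TO_ZH` bilingual lexicon used to link source words to Chinese head words (this excerpt does not say how source-window words are mapped to measure words), the reading of "window size 10" as five words on each side of the center, and the choice of source position 7 (the article "a") as the aligned position p_s.

```python
from typing import Dict, List, Set

# Hypothetical MW-HW collocation lexicon: head word -> measure words it takes.
COLLOCATIONS: Dict[str, Set[str]] = {
    "书": {"本"},
    "人": {"个", "口"},
    "工程": {"项"},
}

# Hypothetical bilingual lexicon linking English head words to Chinese ones.
EN_TO_ZH: Dict[str, str] = {"books": "书", "people": "人", "undertaking": "工程"}


def window(tokens: List[str], center: int, size: int) -> List[str]:
    """Tokens within size // 2 positions on either side of `center` (assumed split)."""
    half = size // 2
    lo = max(0, center - half)
    hi = min(len(tokens), center + half + 1)
    return tokens[lo:hi]


def candidate_measure_words(target: List[str], pt: int,
                            source: List[str], ps: int,
                            size: int = 10) -> Set[str]:
    """Collect measure words that collocate with any word in either window."""
    candidates: Set[str] = set()
    # Target window: every surrounding word is treated as a potential head word.
    for word in window(target, pt, size):
        candidates |= COLLOCATIONS.get(word, set())
    # Source window: map each English word to a Chinese head word, if known.
    for word in window(source, ps, size):
        head = EN_TO_ZH.get(word)
        if head is not None:
            candidates |= COLLOCATIONS.get(head, set())
    return candidates


# Figure 1 sentence with the measure word removed; index 5 is the slot after "一".
TARGET = ("浦东 开发 开放 是 一 振兴 上海 , 建设 现代化 经济 、 贸易 、 "
          "金融 中心 的 跨 世纪 工程 。").split()
SOURCE = ("Pudong 's development and opening up is a century-spanning "
          "undertaking for vigorously promoting shanghai and constructing "
          "a modern economic , trade , and financial center .").split()

# The target window misses "工程", but the source window reaches "undertaking".
print(candidate_measure_words(TARGET, 5, SOURCE, 7, 10))  # → {'项'}
```

With an empty source sentence the same call returns an empty set, which mirrors the case the source window is meant to address: the head word lies outside the target window, so no collocating measure word is found from the target side alone.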