A Unigram Orientation Model for Statistical Machine Translation

Christoph Tillmann
IBM T.J. Watson Research Center
Yorktown Heights, NY 10598
[email protected]

Abstract

In this paper, we present a unigram segmentation model for statistical machine translation where the segmentation units are blocks: pairs of phrases without internal structure. The segmentation model uses a novel orientation component to handle swapping of neighbor blocks. During training, we collect block unigram counts with orientation: we count how often a block occurs to the left or to the right of some predecessor block. The orientation model is shown to improve translation performance over two models: 1) no block re-ordering is used, and 2) the block swapping is controlled only by a word-based trigram language model. We show experimental results on a standard Arabic-English translation task.

1 Introduction

In recent years, phrase-based systems for statistical machine translation (Och et al., 1999; Koehn et al., 2003; Venugopal et al., 2003) have delivered state-of-the-art performance on standard translation tasks. In this paper, we present a phrase-based unigram system similar to the one in (Tillmann and Xia, 2003), which is extended by a unigram orientation model. The units of translation are blocks, pairs of phrases without internal structure. Fig. 1 shows an example block translation using five Arabic-English blocks $b_1, \ldots, b_5$. The unigram orientation model is trained from word-aligned training data. During decoding, we view translation as a block segmentation process, where the input sentence is segmented from left to right and the target sentence is generated from bottom to top, one block at a time. A monotone block sequence is generated except for the possibility to swap a pair of neighbor blocks. The novel orientation model is used to assist the block swapping: as shown in Section 3, block swapping where only a trigram language model is used to compute probabilities between neighbor blocks fails to improve translation performance. (Wu, 1996; Zens and Ney, 2003) present re-ordering models that make use of a straight/inverted orientation model that is related to our work. Here, we investigate in detail the effect of restricting the re-ordering to neighbor block swapping only.

[Figure 1: block alignment grid; romanized Arabic source on the horizontal axis: AlTA}rAt AlHrbyp AlAsrA}ylyp tnthk AlmjAl Aljwy AllbnAny; English target on the vertical axis, generated bottom to top: Israeli, warplanes, violate, Lebanese, airspace; the five blocks $b_1, \ldots, b_5$ cover 'Israeli', 'warplanes', 'violate', 'Lebanese', and 'airspace' respectively.]

Figure 1: An Arabic-English block translation example taken from the devtest set. The Arabic words are romanized.

In this paper, we assume a block generation process that generates block sequences from bottom to top, one block at a time. The score of a successor block $b_i$ depends on its predecessor block $b_{i-1}$ and on its orientation relative to the block $b_{i-1}$. In Fig. 1 for example, block $b_1$ is the predecessor of block $b_2$, and block $b_2$ is the predecessor of block $b_3$. The target clump of a predecessor block $b_{i-1}$ is adjacent to the target clump of a successor block $b_i$. A right adjacent predecessor block $b_{i-1}$ is a block where additionally the source clumps are adjacent and the source clump of $b_{i-1}$ occurs to the right of the source clump of $b_i$. A left adjacent predecessor block is defined accordingly.

During decoding, we compute the score of a block sequence with orientation $(b_1^n, o_1^n)$ as a product of block bigram scores:

$$Pr(b_1^n, o_1^n) \approx \prod_{i=1}^{n} p(b_i, o_i \mid b_{i-1}, o_{i-1}) \qquad (1)$$

where $b_i$ is a block and $o_i \in \{L, R, N\}$ is a three-valued orientation component linked to the block $b_i$ (the orientation $o_{i-1}$ of the predecessor block is ignored). A block $b_i$ has right orientation ($o_i = R$) if it has a left adjacent predecessor. Accordingly, a block $b_i$ has left orientation ($o_i = L$) if it has a right adjacent predecessor. If a block has neither a left nor a right adjacent predecessor, its orientation is neutral ($o_i = N$). The neutral orientation is not modeled explicitly in this paper; rather, it is handled as a default case as explained below. In Fig. 1, the orientation sequence is $(N, L, N, N, L)$, i.e. block $b_2$ and block $b_5$ are generated using left orientation. During decoding most blocks have right orientation ($o_i = R$), since the block translations are mostly monotone.

Our baseline model is the unigram monotone model described in (Tillmann and Xia, 2003). Here, we select blocks from word-aligned training data, and unigram block occurrence counts $N(b)$ are computed: all blocks for a training sentence pair are enumerated in some order and we count how often a given block occurs in the parallel training data.¹ The training algorithm yields a list of candidate blocks for each training sentence pair. In this paper, we make extended use of the baseline enumeration procedure: for each block $b$, we additionally enumerate all its left and right predecessors $b'$.

¹We keep all blocks for which the occurrence count $N(b)$ exceeds a small minimum and whose phrase length does not exceed a fixed maximum. No other selection criteria are applied.

We try to find a block sequence with orientation $(b_1^n, o_1^n)$ that maximizes $Pr(b_1^n, o_1^n)$. The following three types of parameters are used to model the block bigram score $p(b_i, o_i \mid b_{i-1}, o_{i-1})$ in Eq. 1:

• Two unigram count-based models: $p(b)$ and $p(o \mid b)$. We compute the unigram probability of a block based on its occurrence count $N(b)$. The blocks are counted from word-aligned training data. We also collect unigram counts with orientation: a left count $N_L(b)$ and a right count $N_R(b)$. These counts are defined via an enumeration process and are used to define the orientation model $p(o \mid b)$:

$$p(o = L \mid b) = \frac{N_L(b)}{N_L(b) + N_R(b)}, \qquad p(o = R \mid b) = \frac{N_R(b)}{N_L(b) + N_R(b)}$$

Note that in general the unigram count $N(b) \neq N_L(b) + N_R(b)$: during enumeration, a block $b$ might have both left and right adjacent predecessors, either a left or a right adjacent predecessor, or no adjacent predecessors at all.

• Trigram language model: the block language model score $p(b_i \mid b_{i-1})$ is computed as the probability of the first target word in the target clump of $b_i$ given the final two words of the target clump of $b_{i-1}$.

The three models are combined in a log-linear way, as shown in the following section.
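To make the three parameter types concrete, the following sketch shows how each component score could be computed from the counts and language model just described. The paper itself gives no code; the block representation, the function names, and the externally supplied trigram probability are our own illustrative assumptions.

```python
from collections import namedtuple

# Illustrative block representation: a pair of phrases without internal structure.
Block = namedtuple("Block", ["source", "target"])

def unigram_prob(block, unigram_counts, total_count):
    """p(b): relative frequency of the block among all enumerated blocks."""
    return unigram_counts[block] / total_count

def orientation_prob(orientation, block, n_left, n_right):
    """p(o | b) estimated from the orientation counts N_L(b) and N_R(b)."""
    total = n_left[block] + n_right[block]
    if total == 0:
        return 0.0
    return (n_left[block] if orientation == "L" else n_right[block]) / total

def block_lm_score(block, prev_block, trigram_prob):
    """p(b_i | b_{i-1}): probability of the first target word of b_i given the
    final two target words of b_{i-1}.  The word trigram model is supplied by
    the caller as trigram_prob(word, history); this interface is an assumption,
    not something specified in the paper."""
    history = tuple((["<s>", "<s>"] + prev_block.target.split())[-2:])
    first_word = block.target.split()[0]
    return trigram_prob(first_word, history)
```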

2 Orientation Unigram Model

The basic idea of the orientation model can be illustrated as follows. In the example translation in Fig. 1, the source clump of block $b_2$ occurs to the left of the source clump of block $b_1$, even though $b_2$ follows $b_1$ in the translation. Although the joint block consisting of the two smaller blocks $b_1$ and $b_2$ has not been seen in the training data, we can still profit from the fact that block $b_2$ occurs more frequently with left than with right orientation. In our Arabic-English training data, the block $b_2$ has been seen with left orientation many times but only rarely with right orientation, i.e. it is almost always involved in swapping. This intuition is formalized using unigram counts with orientation. The orientation model is related to the distortion model in (Brown et al., 1993), but we do not compute a block alignment during training. We rather enumerate all relevant blocks in some order. Enumeration does not allow us to capture position-dependent distortion probabilities, but we can compute statistics about adjacent block predecessors.

No optimal block segmentation is needed to compute the predecessors: for each block $b$, we check for adjacent predecessor blocks $b'$ that also occur in the enumeration list. We compute the left orientation counts as follows:

    N_L(b) += 1   for each b' in the enumeration list that is a right adjacent predecessor of b

Here, we enumerate all adjacent predecessors of block $b$ over all training sentence pairs. The identity of $b'$ is ignored: $N_L(b)$ is the number of times the block $b$ succeeds some right adjacent predecessor block $b'$. The 'right' orientation count $N_R(b)$ is defined accordingly. The orientation count collection is illustrated in Fig. 2: each time a block has a left or right adjacent predecessor in the parallel training data, the orientation counts are incremented accordingly.

[Figure 2: two panels, each showing a block $b$ with an adjacent predecessor $b'$; when $b'$ is a right adjacent predecessor the left count is updated (N_L(b) += 1), and when $b'$ is a left adjacent predecessor the right count is updated (N_R(b) += 1).]

Figure 2: During training, blocks are enumerated in some order: for each block $b$, we look for left and right adjacent predecessors $b'$.
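The counting procedure can be sketched as follows. This is our own illustration under the assumption that each enumerated block records the source and target word positions it covers (the paper does not specify a data structure); it follows the definition above: $N_L(b)$ is incremented whenever $b$ succeeds a right adjacent predecessor, and $N_R(b)$ whenever it succeeds a left adjacent one.

```python
from collections import Counter, namedtuple

# Hypothetical block record: inclusive word-index spans within one sentence pair.
AlignedBlock = namedtuple(
    "AlignedBlock",
    ["src_start", "src_end", "tgt_start", "tgt_end", "phrase_pair"])

def collect_orientation_counts(block_lists):
    """Collect N_L(b) and N_R(b) over all training sentence pairs.

    block_lists holds one enumeration list of AlignedBlock objects per
    word-aligned sentence pair.  A predecessor b' must be adjacent below b on
    the target side; its position on the source side decides the orientation.
    """
    n_left, n_right = Counter(), Counter()
    for blocks in block_lists:
        for b in blocks:
            for p in blocks:                        # candidate predecessors b'
                if p.tgt_end != b.tgt_start - 1:    # not target-adjacent below b
                    continue
                if p.src_start == b.src_end + 1:    # b' is a right adjacent predecessor
                    n_left[b.phrase_pair] += 1      # -> b occurs with left orientation
                elif p.src_end == b.src_start - 1:  # b' is a left adjacent predecessor
                    n_right[b.phrase_pair] += 1     # -> b occurs with right orientation
    return n_left, n_right
```

With these counters, the orientation model of Section 1 is simply $p(o = L \mid b) = N_L(b) / (N_L(b) + N_R(b))$.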

The decoding orientation restrictions are illustrated in Fig. 3: a mostly monotone block sequence with right ($o_i = R$) orientation is generated. If a block is skipped, e.g. block $b_1$ in Fig. 3 is skipped by first generating block $b_2$ and then block $b_1$, the block $b_1$ is generated using left orientation ($o = L$). Since the block translation is generated from bottom to top, the two remaining blocks in the right picture do not have adjacent predecessors below them: they are generated by a default model without orientation component. The orientation model is given in Eq. 2, the default model is given in Eq. 3.

[Figure 3: left panel, a monotone block sequence $b_1, \ldots, b_4$ generated with orientations $o_1 = \ldots = o_4 = R$; right panel, a sequence containing a swapped block, with orientations $R, N, L, N$ from bottom to top.]

Figure 3: During decoding, a mostly monotone block sequence with $o_i = R$ orientation is generated as shown in the left picture. In the right picture, block swapping generates a block to the left of its neighbor; the two blocks without an adjacent predecessor below them have neither left nor right orientation.

The block bigram model $p(b_i, o_i \in \{L, R\} \mid b_{i-1}, o_{i-1})$ in Eq. 1 is defined as:

$$p(b_i, o_i \in \{L, R\} \mid b_{i-1}, o_{i-1}) = p(o_i \mid b_i)^{\alpha_1} \cdot p(b_i)^{\alpha_2} \cdot p(b_i \mid b_{i-1})^{\alpha_3} \qquad (2)$$

where $\alpha_1 + \alpha_2 + \alpha_3 = 1.0$ and the orientation $o_{i-1}$ of the predecessor is ignored.

The $\alpha$ are chosen to be optimal on the devtest set (the optimal parameter setting is shown in Table 1). Only two parameters have to be optimized due to the constraint that the $\alpha$ have to sum to $1.0$. The default model $p(b_i, o_i = N \mid b_{i-1}, o_{i-1})$ is defined as:

$$p(b_i, o_i = N \mid b_{i-1}, o_{i-1}) = p(b_i)^{\beta_1} \cdot p(b_i \mid b_{i-1})^{\beta_2} \qquad (3)$$

where $\beta_1 + \beta_2 = 1.0$. The $\beta$ are not optimized separately; rather, they are tied to the $\alpha$ weights of Eq. 2.
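A minimal sketch of how Eqs. 2 and 3 could be evaluated, given the component models from Section 1. The weight values and the exact way the $\beta$ are tied to the $\alpha$ are assumptions made only for this illustration.

```python
def block_bigram_score(block, prev_block, orientation,
                       p_unigram, p_orient, p_lm,
                       alpha=(0.4, 0.3, 0.3)):
    """p(b_i, o_i | b_{i-1}, o_{i-1}) as a weighted product (Eqs. 2 and 3).

    p_unigram(b), p_orient(o, b) and p_lm(b, b_prev) are the three component
    models; alpha holds interpolation weights summing to 1.0 (placeholder
    values, not the tuned setting reported in Table 1).
    """
    a1, a2, a3 = alpha
    if orientation in ("L", "R"):        # orientation model, Eq. 2
        return (p_orient(orientation, block) ** a1
                * p_unigram(block) ** a2
                * p_lm(block, prev_block) ** a3)
    # Neutral orientation: default model, Eq. 3, without orientation component.
    # Folding the orientation weight into the unigram weight is our assumption;
    # the paper only states that the beta are derived from the alpha.
    b1, b2 = a1 + a2, a3
    return p_unigram(block) ** b1 * p_lm(block, prev_block) ** b2
```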

 

Straightforward normalization over all successor blocks in Eq. 2 and in Eq. 3 is not feasible: there are tens of millions of possible successor blocks $b_i$. In future work, normalization over a restricted successor set, e.g. for a given source input sentence, all blocks $b_i$ that match this sentence, might be useful for both training and decoding. The segmentation model in Eq. 1 naturally prefers translations that make use of a smaller number of blocks, which leads to a smaller number of factors in Eq. 1. Using fewer 'bigger' blocks to carry out the translation generally seems to improve translation performance. Since normalization does not influence the number of blocks used to carry out the translation, it might be less important for our segmentation model.

We use a DP-based beam-search procedure similar to the one presented in (Tillmann and Xia, 2003). We maximize over all block segmentations with orientation $(b_1^n, o_1^n)$ for which the source phrases yield a segmentation of the input sentence. Swapping involves only successor blocks $b_i$ for which the left orientation count $N_L(b_i)$ exceeds a threshold, e.g. the blocks $b_2$ and $b_5$ in Fig. 1. We tried several thresholds, and performance is reduced significantly only for extreme settings. No other parameters are used to control the block swapping. In particular, the orientation $o_{i-1}$ of the predecessor block is ignored: in future work, we might take into account that a certain predecessor block typically precedes other blocks.
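The swap restriction and the orientation assignment during decoding can be sketched as follows; again this is our own illustration, reusing the span-based block record from the earlier counting sketch, with a placeholder threshold value.

```python
def orientation_of(block, prev_block):
    """Three-valued orientation of a successor block relative to its predecessor."""
    if prev_block is None:
        return "N"
    if prev_block.src_end == block.src_start - 1:
        return "R"      # left adjacent predecessor -> right orientation (monotone)
    if prev_block.src_start == block.src_end + 1:
        return "L"      # right adjacent predecessor -> left orientation (swap)
    return "N"

def sequence_score(blocks, bigram_score, n_left, min_left_count=1):
    """Product of block bigram scores as in Eq. 1 for one candidate segmentation,
    generated bottom to top.  Swapping is allowed only for blocks with enough
    left-orientation evidence N_L(b); n_left is the Counter from training."""
    score, prev = 1.0, None
    for b in blocks:
        o = orientation_of(b, prev)
        if o == "L" and n_left[b.phrase_pair] < min_left_count:
            return 0.0  # this candidate violates the swap restriction
        # bigram_score must handle prev=None for the first block, e.g. by
        # backing off to the unigram score.
        score *= bigram_score(b, prev, o)
        prev = b
    return score
```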

3 Experimental Results

The translation system is tested on an Arabic-to-English translation task. The training data comes from UN news sources and is sentence-aligned. The Arabic data is romanized, and some punctuation tokenization and some number classing are carried out on the English and the Arabic training data. As devtest set, we use testing data provided by LDC with reference translations. As a blind test set, we use the MT03 Arabic-English DARPA evaluation test set consisting of 969 sentences.

Three systems are evaluated in our experiments. The baseline is the block unigram model without re-ordering: monotone block alignments are generated, i.e. the blocks $b_i$ have only left adjacent predecessors and no blocks are swapped. This is the model presented in (Tillmann and Xia, 2003). In the second system, the sentence is translated mostly monotonically, and only neighbor blocks are allowed to be swapped (at most one block is skipped). The third system allows for the same block swapping as the second, but additionally uses the orientation component described in Section 2: the block swapping is controlled by the unigram orientation counts. The first two systems use the block bigram model in Eq. 3: all blocks are generated with neutral orientation ($o = N$), and only two components, the block unigram model $p(b)$ and the block bigram score $p(b_i \mid b_{i-1})$, are used.

Experimental results are reported in Table 1: three BLEU results are presented for both the devtest set and the blind test set. Two scaling parameters are set on the devtest set and copied for use on the blind test set. The second column shows the model, the third column presents the optimal weighting as obtained from the devtest set by carrying out an exhaustive grid search, and the fourth column shows BLEU results together with confidence intervals (here, the word casing is ignored). The block swapping model with the orientation component obtains a statistically significant improvement over the monotone baseline. Interestingly, the swapping model without orientation performs worse than the baseline: the word-based trigram language model alone is too weak to control the block swapping; the model is too unrestrictive to handle the block swapping reliably.

Table 1: Effect of the orientation model on Arabic-English test data: LDC devtest set and DARPA MT 03 blind test set. (Columns: Test, Unigram Model, Setting $(\alpha_1, \alpha_2, \alpha_3)$, BLEUr4n4; for each test set, the monotone baseline, the swapping model without orientation, and the swapping model with orientation are listed.)

Additionally, Table 2 presents devtest set example blocks that have actually been swapped. The training data is unsegmented, as can be seen from the first two blocks. The block in the first line has been seen about three times more often with left than with right orientation. Blocks for which the left count $N_L(b)$ is large relative to the right count $N_R(b)$ are likely candidates for swapping in our Arabic-English experiments; the ratio itself is not currently used in the orientation model. The orientation model mostly affects blocks where the Arabic and English words are verbs or nouns. As shown in Fig. 1, the orientation model uses the orientation probability $p(o = L \mid b_2)$ for the noun block $b_2$, and only the default model for the adjective block $b_1$. Although the noun block might occur by itself without the adjective, the swapping is not controlled by the occurrence of the adjective block (which does not have adjacent predecessors). We rather model the fact that a noun block is typically preceded by some block $b'$. This situation seems typical for the block swapping that occurs on the evaluation test set.

Table 2: Arabic-English example blocks from the devtest set; the Arabic phrases are romanized. The example blocks were swapped in the development test set translations. The counts are obtained from the parallel training data.

Arabic-English blocks                      N_L(b)  N_R(b)
('exhibition', 'mErD')                         97      32
('added', 'wADAf')                            285      68
('said', 'wqAl')                              872     801
('suggested', 'AqtrH')                        356     729
('terrorist attacks', 'hjmAt ArhAbyp')         14      27

Acknowledgment

This work was partially supported by DARPA and monitored by SPAWAR under contract No. N66001-99-2-8916. The paper has greatly profited from discussion with Kishore Papineni and Fei Xia.

References

Peter F. Brown, Vincent J. Della Pietra, Stephen A. Della Pietra, and Robert L. Mercer. 1993. The Mathematics of Statistical Machine Translation: Parameter Estimation. Computational Linguistics, 19(2):263-311.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical Phrase-Based Translation. In Proc. of the HLT-NAACL 2003 Conference, pages 127-133, Edmonton, Canada, May.

Franz Josef Och, Christoph Tillmann, and Hermann Ney. 1999. Improved Alignment Models for Statistical Machine Translation. In Proc. of the Joint Conf. on Empirical Methods in Natural Language Processing and Very Large Corpora (EMNLP/VLC 99), pages 20-28, College Park, MD, June.

Christoph Tillmann and Fei Xia. 2003. A Phrase-based Unigram Model for Statistical Machine Translation. In Companion Vol. of the Joint HLT and NAACL Conference (HLT 03), pages 106-108, Edmonton, Canada, June.

Ashish Venugopal, Stephan Vogel, and Alex Waibel. 2003. Effective Phrase Translation Extraction from Alignment Models. In Proc. of the 41st Annual Conf. of the Association for Computational Linguistics (ACL 03), pages 319-326, Sapporo, Japan, July.

Dekai Wu. 1996. A Polynomial-Time Algorithm for Statistical Machine Translation. In Proc. of the 34th Annual Conf. of the Association for Computational Linguistics (ACL 96), pages 152-158, Santa Cruz, CA, June.

Richard Zens and Hermann Ney. 2003. A Comparative Study on Reordering Constraints in Statistical Machine Translation. In Proc. of the 41st Annual Conf. of the Association for Computational Linguistics (ACL 03), pages 144-151, Sapporo, Japan, July.