Combining Character and Word Information in Neural Machine Translation Using a Multi-Level Attention

Huadong Chen†, Shujian Huang†∗, David Chiang‡, Xinyu Dai†, Jiajun Chen†
†State Key Laboratory for Novel Software Technology, Nanjing University
{chenhd,huangsj,daixinyu,chenjj}@nlp.nju.edu.cn
‡Department of Computer Science and Engineering, University of Notre Dame
[email protected]
∗Corresponding author.

Proceedings of NAACL-HLT 2018, pages 1284–1293, New Orleans, Louisiana, June 1–6, 2018. © 2018 Association for Computational Linguistics.

Abstract

Natural language sentences, being hierarchical, can be represented at different levels of granularity, like words, subwords, or characters. But most neural machine translation systems require the sentence to be represented as a sequence at a single level of granularity. It can be difficult to determine which granularity is better for a particular translation task. In this paper, we improve the model by incorporating multiple levels of granularity. Specifically, we propose (1) an encoder with character attention which augments the (sub)word-level representation with character-level information; (2) a decoder with multiple attentions that enable the representations from different levels of granularity to control the translation cooperatively. Experiments on three translation tasks demonstrate that our proposed models outperform the standard word-based model, the subword-based model and a strong character-based model.

1 Introduction

Neural machine translation (NMT) models (Britz et al., 2017) learn to map from source language sentences to target language sentences via continuous-space intermediate representations. Since the word is usually thought of as the basic unit of language communication (Jackendoff, 1992), early NMT systems built these representations starting from the word level (Sutskever et al., 2014; Bahdanau et al., 2015; Cho et al., 2014; Weng et al., 2017). Later systems tried using smaller units such as subwords to address the problem of out-of-vocabulary (OOV) words (Sennrich et al., 2016; Wu et al., 2016).

Although they obtain reasonable results, these word or subword methods still have some potential weaknesses. First, the learned representations of (sub)words are based purely on their contexts, but the potentially rich information inside the unit itself is seldom explored. Taking the Chinese word 被打伤 (bei-da-shang) as an example, the three characters in this word are a passive voice marker, "hit", and "wound", respectively. The meaning of the whole word, "to be wounded", is fairly compositional. But this compositionality is ignored if the whole word is treated as a single unit.

Secondly, obtaining the word or subword boundaries can be non-trivial. For languages like Chinese and Japanese, a word segmentation step is needed, which must usually be trained on labeled data. For languages like English and German, word boundaries are easy to detect, but subword boundaries need to be learned by methods like BPE. In both cases, the segmentation model is trained only on monolingual data, which may result in units that are not suitable for translation.

On the other hand, there have been multiple efforts to build models operating purely at the character level (Ling et al., 2015a; Yang et al., 2016; Lee et al., 2017). But splitting this finely can increase potential ambiguities. For example, the Chinese word 红茶 (hong-cha) means "black tea," but the two characters mean "red" and "tea," respectively. This shows that modeling the character sequence alone may not be able to fully utilize the information at the word or subword level, which may also lead to an inaccurate representation. A further problem is that character sequences are longer, making them more costly to process with a recurrent neural network (RNN) model.

While both word-level and character-level information can be helpful for generating better representations, current research that tries to exploit both either composes the word-level representation from character embeddings using word boundary information (Ling et al., 2015b; Costa-jussà and Fonollosa, 2016) or replaces the word representation with its inside characters when encountering out-of-vocabulary words (Luong and Manning, 2016; Wu et al., 2016). In this paper, we propose a novel encoder-decoder model that makes use of both character and word information. More specifically, we augment the standard encoder to attend to individual characters to generate better source word representations (§3.1). We also augment the decoder with a second attention that attends to the source-side characters to generate better translations (§3.2).

To demonstrate the effectiveness of the proposed model, we carry out experiments on three translation tasks: Chinese-English, English-Chinese and English-German. Our experiments show that: (1) the encoder with character attention achieves significant improvements over the standard word-based attention-based NMT system and a strong character-based NMT system; (2) incorporating source character information into the decoder by our multi-scale attention mechanism yields a further improvement; and (3) our modifications also improve a subword-based NMT model. To the best of our knowledge, this is the first work that uses the source-side character information for all the (sub)words in the sentence to enhance a (sub)word-based NMT model in both the encoder and decoder.

2 Neural Machine Translation

Most NMT systems follow the encoder-decoder framework with attention mechanism proposed by Bahdanau et al. (2015). Given a source sentence $x = x_1 \cdots x_l \cdots x_L$ and a target sentence $y = y_1 \cdots y_j \cdots y_J$, we aim to directly model the translation probability:

$$P(y \mid x; \theta) = \prod_{j=1}^{J} P(y_j \mid y_{<j}, x; \theta),$$

where $\theta$ is a set of parameters and $y_{<j}$ is the sequence of previously generated target words. Here, we briefly describe the underlying framework of the encoder-decoder NMT system.

2.1 Encoder

Following Bahdanau et al. (2015), we use a bidirectional RNN with gated recurrent units (GRUs) (Cho et al., 2014) to encode the source sentence:

$$\overrightarrow{h}_l = \mathrm{GRU}(\overrightarrow{h}_{l-1}, s_l; \overrightarrow{\theta}), \qquad \overleftarrow{h}_l = \mathrm{GRU}(\overleftarrow{h}_{l+1}, s_l; \overleftarrow{\theta}) \qquad (1)$$

where $s_l$ is the $l$-th source word's embedding, GRU is a gated recurrent unit, and $\overrightarrow{\theta}$ and $\overleftarrow{\theta}$ are the parameters of the forward and backward GRU, respectively; see Cho et al. (2014) for a definition.

The annotation of each source word $x_l$ is obtained by concatenating the forward and backward hidden states:

$$\overleftrightarrow{h}_l = \begin{bmatrix} \overrightarrow{h}_l \\ \overleftarrow{h}_l \end{bmatrix}.$$

The whole sequence of these annotations is used by the decoder.
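To make the encoder concrete, the following PyTorch sketch builds a bidirectional GRU encoder and returns one annotation per source (sub)word, as in Eq. (1). This is a minimal illustrative sketch under our own assumptions; the names (`BiGRUEncoder`, `src_ids`, the toy dimensions) are hypothetical and not taken from the paper or its released code.

```python
import torch
import torch.nn as nn

class BiGRUEncoder(nn.Module):
    """Bidirectional GRU encoder: one annotation per source (sub)word (Eq. 1)."""

    def __init__(self, vocab_size, emb_dim, hidden_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        # bidirectional=True runs forward and backward GRUs with separate
        # parameter sets, playing the roles of the forward/backward theta in Eq. (1).
        self.gru = nn.GRU(emb_dim, hidden_dim, bidirectional=True, batch_first=True)

    def forward(self, src_ids):
        # src_ids: (batch, L) indices of the source (sub)words
        s = self.embedding(src_ids)        # (batch, L, emb_dim), the s_l
        h, _ = self.gru(s)                 # (batch, L, 2 * hidden_dim)
        # h[:, l, :] concatenates the forward and backward states,
        # i.e. the annotation of word x_l used by the decoder.
        return h

# Toy usage
encoder = BiGRUEncoder(vocab_size=1000, emb_dim=32, hidden_dim=64)
annotations = encoder(torch.randint(0, 1000, (2, 7)))   # shape (2, 7, 128)
```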
2.2 Decoder

The decoder is a forward RNN with GRUs predicting the translation $y$ word by word. The probability of generating the $j$-th word $y_j$ is:

$$P(y_j \mid y_{<j}, x; \theta) = \mathrm{softmax}\!\left(\begin{bmatrix} t_{j-1} \\ d_j \\ c_j \end{bmatrix}\right)$$

where $t_{j-1}$ is the word embedding of the $(j-1)$-th target word, $d_j$ is the decoder's hidden state at time $j$, and $c_j$ is the context vector at time $j$. The state $d_j$ is computed as

$$d_j = \mathrm{GRU}\!\left(d_{j-1}, \begin{bmatrix} t_{j-1} \\ c_j \end{bmatrix}; \theta_d\right).$$

The attention mechanism computes the context vector $c_j$ as a weighted sum of the source annotations:

$$c_j = \sum_{l=1}^{L} \alpha_{jl} \overleftrightarrow{h}_l \qquad (2)$$

where the attention weight $\alpha_{jl}$ is

$$\alpha_{jl} = \frac{\exp(e_{jl})}{\sum_{l'=1}^{L} \exp(e_{jl'})} \qquad (3)$$

and

$$e_{jl} = v_a^{\top} \tanh(W_a d_{j-1} + U_a \overleftrightarrow{h}_l) \qquad (4)$$

where $v_a$, $W_a$ and $U_a$ are the weight matrices of the attention model, and $e_{jl}$ is an attention score that measures how well $d_{j-1}$ and $\overleftrightarrow{h}_l$ match. With this strategy, the decoder can attend to the source annotations that are most relevant at a given time.
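The additive attention of Eqs. (2)–(4) can be sketched as follows. Again this is an illustrative sketch with hypothetical names, not the authors' implementation: given the previous decoder state $d_{j-1}$ and the source annotations, it returns the context vector $c_j$ and the attention weights.

```python
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    """Bahdanau-style additive attention, following Eqs. (2)-(4)."""

    def __init__(self, dec_dim, ann_dim, att_dim):
        super().__init__()
        self.W_a = nn.Linear(dec_dim, att_dim, bias=False)
        self.U_a = nn.Linear(ann_dim, att_dim, bias=False)
        self.v_a = nn.Linear(att_dim, 1, bias=False)

    def forward(self, d_prev, annotations):
        # d_prev:      (batch, dec_dim)     = d_{j-1}
        # annotations: (batch, L, ann_dim)  = the encoder annotations
        # e_{jl} = v_a^T tanh(W_a d_{j-1} + U_a h_l)               (Eq. 4)
        scores = self.v_a(torch.tanh(
            self.W_a(d_prev).unsqueeze(1) + self.U_a(annotations))).squeeze(-1)
        alpha = torch.softmax(scores, dim=-1)                      # (Eq. 3)
        # c_j = sum_l alpha_{jl} h_l                               (Eq. 2)
        context = torch.bmm(alpha.unsqueeze(1), annotations).squeeze(1)
        return context, alpha
```

At each decoding step the module is called once with the current $d_{j-1}$, so the weights are renormalized over the whole source sentence for every target word.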
[Figure 1: Forward encoder with character attention at time step $l$. The encoder alternates between reading word embeddings and character context vectors. $c^I_l$ and $c^O_l$ denote the inside and outside character-level context vectors of the $l$-th word, respectively.]

3 Character Enhanced Neural Machine Translation

In this section, we present models which make use of both character-level and word-level information in the encoder-decoder framework.

3.1 Encoder with Character Attention

Let $o_1 \cdots o_K$ be the embeddings of the characters of the source sentence, and let $p_l$ and $q_l$ be the starting and ending character position, respectively, of word $x_l$. Then $o_{p_l} \cdots o_{q_l}$ are the inside characters of word $x_l$; $o_1 \cdots o_{p_l-1}$ and $o_{q_l+1} \cdots o_K$ are the outside characters of word $x_l$.

The encoder is an RNN that alternates between reading (sub)word embeddings and character-level information. At each time step, we first read the word embedding:

$$\overrightarrow{h}'_l = \mathrm{GRU}(\overrightarrow{h}_{l-1}, s_l; \overrightarrow{\theta}') \qquad (5)$$

Then we use attention mechanisms to compute character context vectors. For the inside characters:

$$c^I_l = \sum_{m=p_l}^{q_l} \alpha^I_{lm} o_m$$

$$\alpha^I_{lm} = \frac{\exp(e_{lm})}{\sum_{m'=p_l}^{q_l} \exp(e_{lm'})}$$

$$e_{lm} = v_I \cdot \tanh(W_I \overrightarrow{h}'_l + U_I o_m).$$

The outside character context vector $c^O_l$ is calculated in a similar way, using a different set of parameters, i.e., $v_O$, $W_O$, and $U_O$.
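To make the character attention concrete, the sketch below (our own illustrative code; `CharAttention` and all variable names are hypothetical) computes an inside context vector $c^I_l$ for one word from its intermediate state $\overrightarrow{h}'_l$, the sentence's character embeddings, and the character span $[p_l, q_l]$. Under the assumption stated above, the outside vector $c^O_l$ reuses the same routine on the complementary positions with a separate set of parameters.

```python
import torch
import torch.nn as nn

class CharAttention(nn.Module):
    """Attention over a subset of character embeddings (inside or outside a word)."""

    def __init__(self, state_dim, char_dim, att_dim):
        super().__init__()
        self.W = nn.Linear(state_dim, att_dim, bias=False)   # plays the role of W_I (or W_O)
        self.U = nn.Linear(char_dim, att_dim, bias=False)    # plays the role of U_I (or U_O)
        self.v = nn.Linear(att_dim, 1, bias=False)           # plays the role of v_I (or v_O)

    def forward(self, h_prime, chars):
        # h_prime: (state_dim,)  intermediate word state h'_l
        # chars:   (n, char_dim) embeddings o_m of the selected characters
        e = self.v(torch.tanh(self.W(h_prime) + self.U(chars))).squeeze(-1)  # (n,)
        alpha = torch.softmax(e, dim=-1)     # weights over the selected characters
        return alpha @ chars                 # context vector c_l^I (or c_l^O)

# Toy example: the inside characters of word x_l occupy positions p_l..q_l
# (0-based and inclusive here, unlike the paper's 1-based notation).
char_embs = torch.randn(12, 16)              # o_1 .. o_K for the whole sentence
p_l, q_l = 4, 6
h_prime_l = torch.randn(64)                  # intermediate state from Eq. (5)

inside_attn = CharAttention(state_dim=64, char_dim=16, att_dim=32)
c_inside = inside_attn(h_prime_l, char_embs[p_l:q_l + 1])

# Outside characters: everything before p_l and after q_l, with separate parameters.
outside_chars = torch.cat([char_embs[:p_l], char_embs[q_l + 1:]], dim=0)
outside_attn = CharAttention(state_dim=64, char_dim=16, att_dim=32)
c_outside = outside_attn(h_prime_l, outside_chars)
```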
