Combining Character and Word Information in Neural Machine Translation Using a Multi-Level Attention
Huadong Chen†, Shujian Huang†∗, David Chiang‡, Xinyu Dai†, Jiajun Chen†
†State Key Laboratory for Novel Software Technology, Nanjing University
{chenhd,huangsj,daixinyu,chenjj}@nlp.nju.edu.cn
‡Department of Computer Science and Engineering, University of Notre Dame
[email protected]
∗Corresponding author.

Abstract

Natural language sentences, being hierarchical, can be represented at different levels of granularity, like words, subwords, or characters. But most neural machine translation systems require the sentence to be represented as a sequence at a single level of granularity. It can be difficult to determine which granularity is better for a particular translation task. In this paper, we improve the model by incorporating multiple levels of granularity. Specifically, we propose (1) an encoder with character attention which augments the (sub)word-level representation with character-level information; (2) a decoder with multiple attentions that enable the representations from different levels of granularity to control the translation cooperatively. Experiments on three translation tasks demonstrate that our proposed models outperform the standard word-based model, the subword-based model and a strong character-based model.

1 Introduction

Neural machine translation (NMT) models (Britz et al., 2017) learn to map from source language sentences to target language sentences via continuous-space intermediate representations. Since the word is usually thought of as the basic unit of language communication (Jackendoff, 1992), early NMT systems built these representations starting from the word level (Sutskever et al., 2014; Bahdanau et al., 2015; Cho et al., 2014; Weng et al., 2017). Later systems tried using smaller units such as subwords to address the problem of out-of-vocabulary (OOV) words (Sennrich et al., 2016; Wu et al., 2016).

Although they obtain reasonable results, these word or sub-word methods still have some potential weaknesses. First, the learned representations of (sub)words are based purely on their contexts, but the potentially rich information inside the unit itself is seldom explored. Taking the Chinese word 被打伤 (bei-da-shang) as an example, the three characters in this word are a passive voice marker, "hit", and "wound", respectively. The meaning of the whole word, "to be wounded", is fairly compositional. But this compositionality is ignored if the whole word is treated as a single unit.

Secondly, obtaining the word or sub-word boundaries can be non-trivial. For languages like Chinese and Japanese, a word segmentation step is needed, which must usually be trained on labeled data. For languages like English and German, word boundaries are easy to detect, but sub-word boundaries need to be learned by methods like BPE. In both cases, the segmentation model is trained only on monolingual data, which may result in units that are not suitable for translation.

On the other hand, there have been multiple efforts to build models operating purely at the character level (Ling et al., 2015a; Yang et al., 2016; Lee et al., 2017). But splitting this finely can increase potential ambiguities. For example, the Chinese word 红茶 (hong-cha) means "black tea," but the two characters mean "red" and "tea," respectively. This shows that modeling the character sequence alone may not fully utilize the information at the word or sub-word level, which may also lead to an inaccurate representation. A further problem is that character sequences are longer, making them more costly to process with a recurrent neural network (RNN) model.

While both word-level and character-level information can be helpful for generating better representations, current research that tries to exploit both only composes the word-level representation from character embeddings using the word boundary information (Ling et al., 2015b; Costa-jussà and Fonollosa, 2016) or replaces the word representation with its inside characters when encountering out-of-vocabulary words (Luong and Manning, 2016; Wu et al., 2016). In this paper, we propose a novel encoder-decoder model that makes use of both character and word information. More specifically, we augment the standard encoder to attend to individual characters to generate better source word representations (§3.1). We also augment the decoder with a second attention that attends to the source-side characters to generate better translations (§3.2).

To demonstrate the effectiveness of the proposed model, we carry out experiments on three translation tasks: Chinese-English, English-Chinese, and English-German. Our experiments show that (1) the encoder with character attention achieves significant improvements over the standard word-based attention-based NMT system and a strong character-based NMT system; (2) incorporating source character information into the decoder by our multi-scale attention mechanism yields a further improvement; and (3) our modifications also improve a subword-based NMT model. To the best of our knowledge, this is the first work that uses the source-side character information for all the (sub)words in the sentence to enhance a (sub)word-based NMT model in both the encoder and decoder.

2 Neural Machine Translation

Most NMT systems follow the encoder-decoder framework with the attention mechanism proposed by Bahdanau et al. (2015). Given a source sentence x = x_1 \cdots x_l \cdots x_L and a target sentence y = y_1 \cdots y_j \cdots y_J, we aim to directly model the translation probability:

P(y \mid x; \theta) = \prod_{j=1}^{J} P(y_j \mid y_{<j}, x; \theta),

where \theta is a set of parameters and y_{<j} is the sequence of previously generated target words. Here, we briefly describe the underlying framework of the encoder-decoder NMT system.

2.1 Encoder

Following Bahdanau et al. (2015), we use a bidirectional RNN with gated recurrent units (GRUs) (Cho et al., 2014) to encode the source sentence:

\overrightarrow{h}_l = \mathrm{GRU}(\overrightarrow{h}_{l-1}, s_l; \overrightarrow{\theta}), \qquad \overleftarrow{h}_l = \mathrm{GRU}(\overleftarrow{h}_{l+1}, s_l; \overleftarrow{\theta})    (1)

where s_l is the l-th source word's embedding, GRU is a gated recurrent unit, and \overrightarrow{\theta} and \overleftarrow{\theta} are the parameters of the forward and backward GRU, respectively; see Cho et al. (2014) for a definition.

The annotation of each source word x_l is obtained by concatenating the forward and backward hidden states:

\overleftrightarrow{h}_l = [\overrightarrow{h}_l; \overleftarrow{h}_l].

The whole sequence of these annotations is used by the decoder.
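To make Eq. (1) concrete, the following is a minimal NumPy sketch of the bidirectional GRU encoder. It is not the authors' implementation: the names GRUCell and encode are illustrative, the parameters are randomly initialized rather than trained, and the gate equations follow one standard GRU formulation, which may differ in minor details from Cho et al. (2014).

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GRUCell:
    """Minimal GRU cell: h_l = GRU(h_{l-1}, s_l; theta)."""
    def __init__(self, input_size, hidden_size, seed=0):
        rng = np.random.default_rng(seed)
        shape_in, shape_hid = (hidden_size, input_size), (hidden_size, hidden_size)
        # update gate, reset gate, and candidate-state parameters
        self.Wz, self.Uz = rng.normal(0, 0.1, shape_in), rng.normal(0, 0.1, shape_hid)
        self.Wr, self.Ur = rng.normal(0, 0.1, shape_in), rng.normal(0, 0.1, shape_hid)
        self.Wh, self.Uh = rng.normal(0, 0.1, shape_in), rng.normal(0, 0.1, shape_hid)
        self.hidden_size = hidden_size

    def step(self, h_prev, x):
        z = sigmoid(self.Wz @ x + self.Uz @ h_prev)            # update gate
        r = sigmoid(self.Wr @ x + self.Ur @ h_prev)            # reset gate
        h_tilde = np.tanh(self.Wh @ x + self.Uh @ (r * h_prev))  # candidate state
        return (1 - z) * h_prev + z * h_tilde

def encode(word_embeddings, fwd, bwd):
    """Bidirectional encoding (Eq. 1): annotation of word l is [fwd_l; bwd_l]."""
    L = len(word_embeddings)
    h_fwd = [np.zeros(fwd.hidden_size)]
    for s in word_embeddings:                 # read the sentence left to right
        h_fwd.append(fwd.step(h_fwd[-1], s))
    h_bwd = [np.zeros(bwd.hidden_size)]
    for s in reversed(word_embeddings):       # read the sentence right to left
        h_bwd.append(bwd.step(h_bwd[-1], s))
    h_bwd = list(reversed(h_bwd[1:]))         # align backward states with positions
    # concatenate forward and backward hidden states for each position
    return [np.concatenate([h_fwd[l + 1], h_bwd[l]]) for l in range(L)]

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    words = [rng.normal(size=4) for _ in range(3)]     # toy 3-word sentence
    annotations = encode(words, GRUCell(4, 8), GRUCell(4, 8, seed=1))
    print(len(annotations), annotations[0].shape)      # 3 (16,)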
2.2 Decoder

The decoder is a forward RNN with GRUs predicting the translation y word by word. The probability of generating the j-th word y_j is:

P(y_j \mid y_{<j}, x; \theta) = \mathrm{softmax}([t_{j-1}; d_j; c_j])

where t_{j-1} is the word embedding of the (j-1)-th target word, d_j is the decoder's hidden state at time j, and c_j is the context vector at time j. The state d_j is computed as

d_j = \mathrm{GRU}(d_{j-1}, [t_{j-1}; c_j]; \theta_d).

The attention mechanism computes the context vector c_j as a weighted sum of the source annotations:

c_j = \sum_{l=1}^{L} \alpha_{jl} \overleftrightarrow{h}_l    (2)

where the attention weight \alpha_{jl} is

\alpha_{jl} = \frac{\exp(e_{jl})}{\sum_{l'=1}^{L} \exp(e_{jl'})}    (3)

and

e_{jl} = v_a^\top \tanh(W_a d_{j-1} + U_a \overleftrightarrow{h}_l)    (4)

where v_a, W_a, and U_a are the weight matrices of the attention model, and e_{jl} scores how well d_{j-1} and \overleftrightarrow{h}_l match. With this strategy, the decoder can attend to the source annotations that are most relevant at a given time.
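As a concrete illustration of Eqs. (2)-(4), here is a minimal NumPy sketch of one attention step; it is not the authors' code, and the function name attention_context and the parameter shapes are assumptions made for the example. The same score, softmax, and weighted-sum pattern is reused, with separate parameters, for the character-level attentions of Section 3.

import numpy as np

def attention_context(d_prev, annotations, v_a, W_a, U_a):
    """Eqs. (2)-(4): score each source annotation against the previous decoder
    state, normalize the scores with a softmax, and return the context vector.

    d_prev      : previous decoder state d_{j-1}, shape (d,)
    annotations : list of source annotations h_l, each of shape (2d,)
    v_a, W_a, U_a : attention parameters, assumed shapes (a,), (a, d), (a, 2d)
    """
    H = np.stack(annotations)                           # (L, 2d)
    scores = np.tanh(W_a @ d_prev + H @ U_a.T) @ v_a    # e_{j,l}, shape (L,)  (Eq. 4)
    scores = scores - scores.max()                      # numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()       # attention weights    (Eq. 3)
    return alpha @ H, alpha                             # context vector c_j   (Eq. 2)

# Toy usage: 5 source annotations of size 16, previous decoder state of size 8.
rng = np.random.default_rng(0)
H = [rng.normal(size=16) for _ in range(5)]
c, alpha = attention_context(rng.normal(size=8), H,
                             v_a=rng.normal(size=10),
                             W_a=rng.normal(size=(10, 8)),
                             U_a=rng.normal(size=(10, 16)))
print(c.shape, alpha.sum())   # (16,) and weights summing to approximately 1.0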
3 Character Enhanced Neural Machine Translation

In this section, we present models which make use of both character-level and word-level information in the encoder-decoder framework.

3.1 Encoder with Character Attention

[Figure 1: Forward encoder with character attention at time step l. The encoder alternates between reading word embeddings and character context vectors. c^I_l and c^O_l denote the inside and outside character-level context vectors of the l-th word, respectively.]

Let o_1 \cdots o_K denote the character sequence of the source sentence, and let p_l and q_l denote the starting and ending character position, respectively, of word x_l. Then o_{p_l} \cdots o_{q_l} are the inside characters of word x_l; o_1 \cdots o_{p_l-1} and o_{q_l+1} \cdots o_K are the outside characters of word x_l.

The encoder is an RNN that alternates between reading (sub)word embeddings and character-level information. At each time step, we first read the word embedding:

\overrightarrow{h}'_l = \mathrm{GRU}(\overrightarrow{h}_{l-1}, s_l; \overrightarrow{\theta}')    (5)

Then we use the attention mechanism to compute the character context vector for the inside characters:

c^I_l = \sum_{m=p_l}^{q_l} \alpha^I_{lm} o_m
\alpha^I_{lm} = \frac{\exp(e_{lm})}{\sum_{m'=p_l}^{q_l} \exp(e_{lm'})}
e_{lm} = v^I \cdot \tanh(W^I \overrightarrow{h}'_l + U^I o_m).

The outside character context vector c^O_l is calculated in a similar way, using a different set of parameters, i.e.