CytonMT: an Efficient Neural Machine Translation Open-source Toolkit Implemented in C++

Xiaolin Wang, Masao Utiyama, Eiichiro Sumita
Advanced Translation Research and Development Promotion Center
National Institute of Information and Communications Technology, Japan
{xiaolin.wang,mutiyama,eiichiro.sumita}@nict.go.jp

Abstract

This paper presents an open-source neural machine translation toolkit named CytonMT (https://github.com/arthurxlw/cytonMt). The toolkit is built from scratch using only C++ and NVIDIA's GPU-accelerated libraries. The toolkit features training efficiency, code simplicity and translation quality. Benchmarks show that CytonMT accelerates training speed by 64.5% to 110.8% on neural networks of various sizes, and achieves competitive translation quality.

1 Introduction

Neural Machine Translation (NMT) has made remarkable progress over the past few years (Sutskever et al., 2014; Bahdanau et al., 2014; Wu et al., 2016). Just like Moses (Koehn et al., 2007) does for statistical machine translation (SMT), open-source NMT toolkits contribute greatly to this progress, including but not limited to,

• RNNsearch-LV (Jean et al., 2015): https://github.com/sebastien-j/LV_groundhog
• Luong-NMT (Luong et al., 2015a): https://github.com/lmthang/nmt.hybrid
• DL4MT by Kyunghyun Cho et al.: https://github.com/nyu-dl/dl4mt-tutorial
• BPE-char (Chung et al., 2016): https://github.com/nyu-dl/dl4mt-cdec
• Nematus (Sennrich et al., 2017): https://github.com/EdinburghNLP/nematus
• OpenNMT (Klein et al., 2017): https://github.com/OpenNMT/OpenNMT-py
• Seq2seq (Britz et al., 2017): https://github.com/google/seq2seq
• ByteNet (Kalchbrenner et al., 2016): https://github.com/paarthneekhara/byteNet-tensorflow (unofficial) and others
• ConvS2S (Gehring et al., 2017): https://github.com/facebookresearch/fairseq
• Tensor2Tensor (Vaswani et al., 2017): https://github.com/tensorflow/tensor2tensor
• Marian (Junczys-Dowmunt et al., 2018): https://github.com/marian-nmt/marian

These open-source NMT toolkits are undoubtedly excellent software. However, there is a common issue: they are all written in scripting languages with dependencies on third-party GPU platforms (see Table 1), except Marian, which was developed simultaneously with our toolkit.

Using scripting languages and third-party GPU platforms is a double-edged sword. On one hand, it greatly reduces the workload of coding neural networks. On the other hand, it also causes two problems,

• The running efficiency drops, and profiling and optimization become difficult, because direct access to the GPUs is blocked by the language interpreters or the platforms. NMT systems typically require days or weeks to train, so training efficiency is a paramount concern. Slightly faster training can make the difference between plausible and impossible experiments (Klein et al., 2017).

• The researchers using these toolkits may be constrained by the platforms. Unexplored computations or operations may become disallowed or unnecessarily inefficient on a third-party platform, which lowers the chances of developing novel neural network techniques.

Toolkit        Language  Platform
RNNsearch-LV   Python    GroundHog
Luong-NMT      Matlab    Matlab
DL4MT          Python    Theano
BPE-char       Python    Theano
Nematus        Python    Theano
OpenNMT        Lua       Torch
Seq2seq        Python    Tensorflow
ByteNet        Python    Tensorflow
ConvS2S        Lua       Torch
Tensor2Tensor  Python    Tensorflow
Marian         C++       –
CytonMT        C++       –

Table 1: Languages and platforms of open-source NMT toolkits.

[Figure 1: Architecture of CytonMT.]

CytonMT is developed to address this issue, in hopes of providing the community an attractive alternative. The toolkit is written in C++, the official language of NVIDIA, the manufacturer of the most widely-used GPU hardware. This gives the toolkit an advantage in efficiency when compared with other toolkits.

Implementing in C++ also gives CytonMT great flexibility and freedom in coding. Researchers who are interested in the real calculations inside neural networks can trace the source code down to kernel functions, matrix operations or NVIDIA's APIs, and then modify them freely to test their novel ideas.

The code simplicity of CytonMT is comparable to that of the NMT toolkits implemented in scripting languages. This owes to an open-source general-purpose neural network library in C++, named CytonLib, which is shipped as part of the source code. The library defines a simple and friendly pattern for users to build arbitrary network architectures at the cost of two lines of genuine C++ code per layer.

CytonMT achieves competitive translation quality, which is the main purpose of NMT toolkits. It implements the popular framework of the attention-based RNN encoder-decoder. Among the reported systems of the same architecture, it ranks at top positions on the benchmarks of both the WMT14 and WMT17 English-to-German tasks.

The remainder of this paper presents the details of CytonMT from the aspects of method, implementation, benchmarks, and future work.

2 Method

The toolkit approaches the problem of machine translation using the attention-based RNN encoder-decoder proposed by Bahdanau et al. (2014) and Luong et al. (2015a). Figure 1 illustrates the architecture. The conditional probability of a translation given a source sentence is formulated as,

    \log p(y \mid x) = \sum_{j=1}^{m} \log p(y_j \mid H_o^{<j>})
                     = \sum_{j=1}^{m} \log\big(\mathrm{softmax}_{y_j}(\tanh(W_o H_o^{<j>} + B_o))\big)    (1)

    H_o^{<j>} = F_{att}(H_s, H_t^{<j>}),    (2)

where x is a source sentence; y = (y_1, ..., y_m) is a translation; H_s is a source-side top-layer hidden state; H_t^{<j>} is a target-side top-layer hidden state; H_o^{<j>} is a state generated by an attention model F_{att}; and W_o and B_o are the weight and bias of an output embedding.

The toolkit adopts the multiplicative attention model proposed by Luong et al. (2015a), because it is slightly more efficient than the additive variant proposed by Bahdanau et al. (2014). This choice is discussed in Britz et al. (2017) and Vaswani et al. (2017). Figure 2 illustrates the model, formulated as,

    a_{st}^{<ij>} = \mathrm{softmax}\big(F_a(H_s^{<i>}, H_t^{<j>})\big)
                  = \frac{e^{F_a(H_s^{<i>}, H_t^{<j>})}}{\sum_{i=1}^{n} e^{F_a(H_s^{<i>}, H_t^{<j>})}},    (3)

    F_a(H_s^{<i>}, H_t^{<j>}) = H_s^{<i>\top} W_a H_t^{<j>},    (4)

    C_s^{<j>} = \sum_{i=1}^{n} a_{st}^{<ij>} H_s^{<i>},    (5)

    C_{st}^{<j>} = [C_s^{<j>}; H_t^{<j>}],    (6)

    H_o^{<j>} = \tanh(W_c C_{st}^{<j>}),    (7)

where F_a is a scoring function for alignment; W_a is a matrix that linearly maps target-side hidden states into a space comparable to the source side; a_{st}^{<ij>} is an alignment coefficient; C_s^{<j>} is a source-side context; and C_{st}^{<j>} is a context derived from both sides.

[Figure 2: Architecture of the attention model.]
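To make Equations 3 to 7 concrete, the following is a minimal reference sketch for a single target position in plain C++, using loops instead of GPU kernels. It is independent of CytonLib and of the actual toolkit code; all names (attentionStep, Hs, ht, Wa, Wc) are illustrative only.

#include <cmath>
#include <vector>

// Reference sketch of Equations 3-7 for one target position j.
// Hs: n x d source states; ht: d-dim target state H_t^<j>;
// Wa: d x d scoring matrix; Wc: d x 2d output matrix.
std::vector<double> attentionStep(
    const std::vector<std::vector<double>>& Hs,
    const std::vector<double>& ht,
    const std::vector<std::vector<double>>& Wa,
    const std::vector<std::vector<double>>& Wc)
{
    const size_t n = Hs.size(), d = ht.size();

    // Eq. 4: F_a(H_s^<i>, H_t^<j>) = H_s^<i>^T Wa H_t^<j>
    std::vector<double> waHt(d, 0.0);
    for (size_t r = 0; r < d; ++r)
        for (size_t c = 0; c < d; ++c)
            waHt[r] += Wa[r][c] * ht[c];
    std::vector<double> score(n, 0.0);
    for (size_t i = 0; i < n; ++i)
        for (size_t r = 0; r < d; ++r)
            score[i] += Hs[i][r] * waHt[r];

    // Eq. 3: alignment coefficients a_st^<ij> via softmax over i
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i) { score[i] = std::exp(score[i]); sum += score[i]; }
    for (size_t i = 0; i < n; ++i) score[i] /= sum;

    // Eq. 5: source-side context C_s^<j>
    std::vector<double> cs(d, 0.0);
    for (size_t i = 0; i < n; ++i)
        for (size_t r = 0; r < d; ++r)
            cs[r] += score[i] * Hs[i][r];

    // Eq. 6: C_st^<j> = [C_s^<j>; H_t^<j>]
    std::vector<double> cst(cs);
    cst.insert(cst.end(), ht.begin(), ht.end());

    // Eq. 7: H_o^<j> = tanh(Wc C_st^<j>)
    std::vector<double> ho(d, 0.0);
    for (size_t r = 0; r < d; ++r) {
        for (size_t c = 0; c < 2 * d; ++c)
            ho[r] += Wc[r][c] * cst[c];
        ho[r] = std::tanh(ho[r]);
    }
    return ho;
}

In the toolkit itself, each of these steps corresponds to one GPU component of the attention network shown in Figure 3.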

3 Implementation

The toolkit consists of a general-purpose neural network library, and a neural machine translation system built upon the library. The neural network library defines a class named Network to facilitate the construction of arbitrary neural networks. Users only need to inherit the class, declare components as data members, and write down two lines of code per component in an initialization function. For example, the complete code of the attention network formulated by Equations 3 to 7 is presented in Figure 3. This piece of code fulfills the task of building a neural network as follows,

• The class Variable stores numeric values and gradients. Through passing pointers to Variable around, components are connected together.

• The data member layers collects all the components. The base class Network calls the functions forward, backward and calculateGradient of each component to perform the actual computation.

class Attention: public Network
{
  DuplicateLayer dupHt;        // declare components
  LinearLayer linearHt;
  MultiplyHsHt multiplyHsHt;
  SoftmaxLayer softmax;
  WeightedHs weightedHs;
  Concatenate concateCsHt;
  LinearLayer linearCst;
  ActivationLayer actCst;

  Variable* init(LinearLayer* linHt, LinearLayer* linCst,
      Variable* hs, Variable* ht)
  {
    Variable* tx;
    tx=dupHt.init(ht);                           // make two copies
    layers.push_back(&dupHt);

    tx=linearHt.init(linHt, tx);                 // Wa * Ht
    layers.push_back(&linearHt);

    tx=multiplyHsHt.init(hs, tx);                // Fa
    layers.push_back(&multiplyHsHt);

    tx=softmax.init(tx);                         // a_st
    layers.push_back(&softmax);

    tx=weightedHs.init(hs, tx);                  // Cs
    layers.push_back(&weightedHs);

    tx=concateCsHt.init(tx, &dupHt.y1);          // Cst
    layers.push_back(&concateCsHt);

    tx=linearCst.init(linCst, tx);               // Wc * Cst
    layers.push_back(&linearCst);

    tx=actCst.init(tx, CUDNN_ACTIVATION_TANH);   // Ho
    layers.push_back(&actCst);

    return tx;  // pointer to result
  }
};

Figure 3: Complete code of the attention model formulated by Equations 3 to 7.

The code for the actual computation is organized in the functions forward, backward and calculateGradient of each type of component. Figure 4 presents some examples. Note that these codes have been slightly simplified for illustration.

void LinearLayer::forward()
{
  // y = W^T x, for a batch of num columns
  cublasXgemm(cublasH, CUBLAS_OP_T, CUBLAS_OP_N,
      dimOutput, num, dimInput,
      &one, w.data, w.ni, x.data, dimInput,
      &zero, y.data, dimOutput);
}

void LinearLayer::backward()
{
  // x.grad = W y.grad
  cublasXgemm(cublasH, CUBLAS_OP_N, CUBLAS_OP_N,
      dimInput, num, dimOutput,
      &one, w.data, w.ni, y.grad.data, dimOutput,
      &beta, x.grad.data, dimInput);
}

void LinearLayer::calculateGradient()
{
  // w.grad += x y.grad^T
  cublasXgemm(cublasH, CUBLAS_OP_N, CUBLAS_OP_T,
      dimInput, dimOutput, num,
      &one, x.data, dimInput, y.grad.data, dimOutput,
      &one, w.grad.data, w.grad.ni);
}

void EmbeddingLayer::forward()
{
  ...
  embedding_kernel<<<...>>>(words, firstOccurs,
      len, dim, stride, wholeData, y.data, true);
}

Figure 4: Code performing the actual computation.

4 Benchmarks

4.1 Settings

CytonMT is tested on the widely-used benchmarks of the WMT14 and WMT17 English-to-German tasks (Bojar et al., 2017) (Table 2). Both datasets are processed and converted using byte-pair encoding (Gage, 1994; Schuster and Nakajima, 2012) with a shared source-target vocabulary of about 37,000 tokens. The WMT14 corpora are processed by the scripts from Vaswani et al. (2017) (https://github.com/tensorflow/tensor2tensor). The WMT17 corpora are processed by the scripts from Junczys-Dowmunt et al. (2018) (https://github.com/marian-nmt/marian-examples/tree/master/wmt2017-uedin); the resulting training data includes 10 million back-translated sentence pairs.

Data Set               # Sent.      # Source Words   # Target Words
WMT14
  Train (standard)      4,500,966    113,548,249      107,259,529
  Dev. (tst2013)            3,000         64,807           63,412
  Test (tst2014)            3,003         67,617           63,078
WMT17
  Train (standard)      4,590,101    118,768,285      112,009,072
  Train (back trans.)  10,000,000    190,611,668      149,198,444
  Dev. (tst2016)            2,999         64,513           62,362
  Test (tst2017)            3,004         64,776           60,963

Table 2: WMT English-to-German corpora.

The benchmarks were run on an Intel Xeon CPU E5-2630 @ 2.4 GHz and a Quadro M4000 GPU (Maxwell) with 1664 CUDA cores @ 773 MHz and 2,573 GFLOPS. The software environment is CentOS 6.8, CUDA 9.1 (driver 387.26), CUDNN 7.0.5, Theano 1.0.1 and Tensorflow 1.5.0. Nematus, Torch and OpenNMT are the latest versions as of December 2017; Marian is the latest version as of May 2018.

CytonMT is run with the hyperparameter settings presented in Table 3 unless stated otherwise. The settings provide both fast training and competitive translation quality according to our experiments on a variety of translation tasks. Dropout is applied to the hidden states between non-top recurrent layers R_s, R_t and the output H_o, following Wang et al. (2017). Label smoothing estimates the marginalized effect of label dropout during training, which makes models learn to be more unsure (Szegedy et al., 2016); this improved BLEU scores (Vaswani et al., 2017). A length penalty is applied using the formula in Wu et al. (2016).

Hyperparameter         Value
Embedding Size         512
Hidden State Size      512
Encoder/Decoder Depth  2
Encoder                Bidirectional
RNN Type               LSTM
Dropout                0.2
Label Smoothing        0.1
Optimizer              SGD
Learning Rate          1.0
Learning Rate Decay    0.7
Beam Search Size       10
Length Penalty         0.6

Table 3: Hyperparameter settings.

4.2 Comparison on Training Speed

Four baseline toolkits and CytonMT train models using the hyperparameter settings in Table 3. The number of layers and the size of embeddings and hidden states vary, as large networks are often used in real-world applications to achieve higher accuracy at the cost of more running time.

Table 4 presents the training speed of the different toolkits measured in source tokens per second. The results show that the training speed of CytonMT is much higher than that of the baselines. OpenNMT is the fastest baseline, and CytonMT achieves a speedup over it of 64.5% to 110.8%. Moreover, CytonMT shows a consistent tendency to speed up more on larger networks.

Embed./State Size    512    512   1024   1024
Enc./Dec. Layers       2      4      2      4
Nematus             1875   1190    952    604
OpenNMT             2872   2038   1356    904
Seq2Seq             1618   1227    854    599
Marian              2630   1832   1120    688
CytonMT             4725   3751   2571   1906
Speedup            64.5%  84.1%  89.6%  110.8%

Table 4: Training speed measured in source tokens per second.
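Section 4.1 states that the length penalty follows the formula of Wu et al. (2016). As a reminder of what that setting does at decoding time, the following sketch shows the commonly cited form of that penalty with alpha = 0.6 from Table 3; it illustrates the formula only and is not CytonMT's actual decoder code.

#include <cmath>

// Length penalty of Wu et al. (2016): lp(Y) = ((5 + |Y|) / 6)^alpha.
// Beam search ranks hypotheses by their summed log-probability divided
// by this penalty, which counteracts the bias toward short translations.
double lengthPenalty(int length, double alpha = 0.6)
{
    return std::pow((5.0 + length) / 6.0, alpha);
}

double rescoreHypothesis(double sumLogProb, int length, double alpha = 0.6)
{
    return sumLogProb / lengthPenalty(length, alpha);
}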

4.3 Comparison on Translation Quality

Table 5 compares the BLEU of CytonMT with the reported results from systems of the same architecture (attention-based RNN encoder-decoder). BLEU is calculated on cased, tokenized text to be comparable to previous work (Sutskever et al., 2014; Luong et al., 2015b; Wu et al., 2016; Zhou et al., 2016).

The settings of CytonMT on WMT14 follow Table 3, while the settings on WMT17 adopt a depth of 3 and a hidden state size of 1024, as the training set is three times larger. The cross entropy of the development set is monitored every 1/12 epoch on WMT14 and every 1/36 epoch on WMT17, that is, approximately every 400K sentence pairs. If the entropy has not decreased by max(0.01 × learning rate, 0.001) over 12 checks, the learning rate decays by 0.7 and training restarts from the previous best model. The whole training procedure terminates when no improvement is made during two neighboring decays of the learning rate. The actual training took 28 epochs on WMT14 and 12 epochs on WMT17.

System                        Open Src.  BLEU
WMT14
  Nematus (Klein, 2017)       √          18.25
  OpenNMT (Klein, 2017)       √          19.34
  RNNsearch-LV (Jean, 2015)   √          19.4
  Deep-Att (Zhou, 2016)                  20.6
  Luong-NMT (Luong, 2015)     √          20.9
  BPE-Char (Chung, 2016)      √          21.5
  Seq2seq (Britz, 2017)       √          22.19
  CytonMT                     √          22.67
  GNMT (Wu, 2016)                        24.61
WMT17
  Nematus (Sennrich, 2017)    √          27.5
  CytonMT                     √          27.63
  Marian (Junczys, 2018)      √          27.7

Table 5: Comparing BLEU with public records.

Table 5 shows that CytonMT achieves competitive BLEU scores on both benchmarks. On WMT14, it is only outperformed by Google's production system (Wu et al., 2016), which is much larger in scale and much more demanding on hardware. On WMT17, it achieves the same level of performance as Marian, which ranks high among the WMT17 entries for a single system. Note that the state-of-the-art scores on these benchmarks have recently been pushed forward by novel network architectures such as those of Gehring et al. (2017), Vaswani et al. (2017) and Shazeer et al. (2017).

5 Conclusion

This paper introduces CytonMT, an open-source NMT toolkit built from scratch using only C++ and NVIDIA's GPU-accelerated libraries. CytonMT speeds up training by more than 64.5%, and achieves competitive BLEU scores on the WMT14 and WMT17 corpora. The source code of CytonMT is simple because of CytonLib, an open-source general-purpose neural network library contained in the toolkit. Therefore, CytonMT is an attractive alternative for the research community. We open-source this toolkit in hopes of benefiting the community and promoting the field, and we look forward to hearing feedback from the community.

The future work on CytonMT will continue in two directions. One direction is to further optimize the code for GPUs, such as supporting multi-GPU training. The problem we have faced is that GPUs have advanced very quickly in the last few years. For example, the microarchitectures of NVIDIA GPUs evolved twice during the development of CytonMT, from Maxwell to Pascal, and then to Volta. Therefore, we have not explored cutting-edge GPU techniques, as the coding effort might quickly become outdated. Multi-GPU machines are common now, so we plan to support them.

The other direction is to support the latest NMT architectures such as ConvS2S (Gehring et al., 2017) and the Transformer (Vaswani et al., 2017). In these architectures, recurrent structures are replaced by convolution or attention structures. Their high performance indicates that the new structures suit the translation task better, so we also plan to support them in the future.

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. In Proceedings of the 3rd International Conference on Learning Representations, pages 1–15.

Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Yvette Graham, Barry Haddow, Shujian Huang, Matthias Huck, Philipp Koehn, Qun Liu, Varvara Logacheva, Christof Monz, Matteo Negri, Matt Post, Raphael Rubino, Lucia Specia, and Marco Turchi. 2017. Findings of the 2017 conference on machine translation (WMT17). In Proceedings of the Second Conference on Machine Translation, Volume 2: Shared Task Papers, pages 169–214, Copenhagen, Denmark. Association for Computational Linguistics.

Denny Britz, Anna Goldie, Minh-Thang Luong, and Quoc Le. 2017. Massive exploration of neural machine translation architectures. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1442–1451, Copenhagen, Denmark. Association for Computational Linguistics.

Junyoung Chung, Kyunghyun Cho, and Yoshua Bengio. 2016. A character-level decoder without explicit segmentation for neural machine translation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1693–1703, Berlin, Germany. Association for Computational Linguistics.

Philip Gage. 1994. A new algorithm for data compression. The C Users Journal, 12(2):23–38.

Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. 2017. Convolutional sequence to sequence learning. ArXiv e-prints.

Sébastien Jean, Kyunghyun Cho, Roland Memisevic, and Yoshua Bengio. 2015. On using very large target vocabulary for neural machine translation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1–10, Beijing, China. Association for Computational Linguistics.

Marcin Junczys-Dowmunt, Roman Grundkiewicz, Tomasz Dwojak, Hieu Hoang, Kenneth Heafield, Tom Neckermann, Frank Seide, Ulrich Germann, Alham Fikri Aji, Nikolay Bogoychev, André F. T. Martins, and Alexandra Birch. 2018. Marian: Fast neural machine translation in C++. arXiv preprint arXiv:1804.00344.

Nal Kalchbrenner, Lasse Espeholt, Karen Simonyan, Aäron van den Oord, Alex Graves, and Koray Kavukcuoglu. 2016. Neural machine translation in linear time. CoRR, abs/1610.10099.

Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senellart, and Alexander Rush. 2017. OpenNMT: Open-source toolkit for neural machine translation. In Proceedings of ACL, System Demonstrations, pages 67–72, Vancouver, Canada. Association for Computational Linguistics.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of ACL, Interactive Poster and Demonstration Sessions, pages 177–180. Association for Computational Linguistics.

Thang Luong, Hieu Pham, and Christopher D. Manning. 2015a. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1412–1421, Lisbon, Portugal. Association for Computational Linguistics.

Thang Luong, Ilya Sutskever, Quoc Le, Oriol Vinyals, and Wojciech Zaremba. 2015b. Addressing the rare word problem in neural machine translation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 11–19, Beijing, China. Association for Computational Linguistics.

Mike Schuster and Kaisuke Nakajima. 2012. Japanese and Korean voice search. In Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on, pages 5149–5152. IEEE.

Rico Sennrich, Orhan Firat, Kyunghyun Cho, Alexandra Birch, Barry Haddow, Julian Hitschler, Marcin Junczys-Dowmunt, Samuel Läubli, Antonio Valerio Miceli Barone, Jozef Mokry, and Maria Nadejde. 2017. Nematus: a toolkit for neural machine translation. In Proceedings of the Software Demonstrations of the 15th Conference of the European Chapter of the Association for Computational Linguistics, pages 65–68, Valencia, Spain. Association for Computational Linguistics.

Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc V. Le, Geoffrey E. Hinton, and Jeff Dean. 2017. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. CoRR, abs/1701.06538.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112.

Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. 2016. Rethinking the inception architecture for computer vision. In Proceedings of the Conference on Computer Vision and Pattern Recognition, pages 2818–2826.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30, pages 6000–6010. Curran Associates, Inc.

Xiaolin Wang, Masao Utiyama, and Eiichiro Sumita. 2017. Empirical study of dropout scheme for neural machine translation. In Proceedings of the 16th Machine Translation Summit, pages 1–15.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.

Jie Zhou, Ying Cao, Xuguang Wang, Peng Li, and Wei Xu. 2016. Deep recurrent models with fast-forward connections for neural machine translation. Transactions of the Association for Computational Linguistics, 4:371–383.
