CytonMT: an Efficient Neural Machine Translation Open-source Toolkit Implemented in C++

Xiaolin Wang, Masao Utiyama, Eiichiro Sumita
Advanced Translation Research and Development Promotion Center
National Institute of Information and Communications Technology, Japan
{xiaolin.wang,mutiyama,eiichiro.sumita}@nict.go.jp

Abstract

This paper presents an open-source neural machine translation toolkit named CytonMT (https://github.com/arthurxlw/cytonMt). The toolkit is built from scratch using only C++ and NVIDIA's GPU-accelerated libraries. The toolkit features training efficiency, code simplicity and translation quality. Benchmarks show that CytonMT accelerates training speed by 64.5% to 110.8% on neural networks of various sizes, and achieves competitive translation quality.

1 Introduction

Neural Machine Translation (NMT) has made remarkable progress over the past few years (Sutskever et al., 2014; Bahdanau et al., 2014; Wu et al., 2016). Just like Moses (Koehn et al., 2007) does for statistical machine translation (SMT), open-source NMT toolkits contribute greatly to this progress, including but not limited to,

• RNNsearch-LV (Jean et al., 2015): https://github.com/sebastien-j/LV_groundhog
• Luong-NMT (Luong et al., 2015a): https://github.com/lmthang/nmt.hybrid
• DL4MT by Kyunghyun Cho et al.: https://github.com/nyu-dl/dl4mt-tutorial
• BPE-char (Chung et al., 2016): https://github.com/nyu-dl/dl4mt-cdec
• Nematus (Sennrich et al., 2017): https://github.com/EdinburghNLP/nematus
• OpenNMT (Klein et al., 2017): https://github.com/OpenNMT/OpenNMT-py
• Seq2seq (Britz et al., 2017): https://github.com/google/seq2seq
• ByteNet (Kalchbrenner et al., 2016): https://github.com/paarthneekhara/byteNet-tensorflow (unofficial) and others
• ConvS2S (Gehring et al., 2017): https://github.com/facebookresearch/fairseq
• Tensor2Tensor (Vaswani et al., 2017): https://github.com/tensorflow/tensor2tensor
• Marian (Junczys-Dowmunt et al., 2018): https://github.com/marian-nmt/marian

These open-source NMT toolkits are undoubtedly excellent software. However, there is a common issue: they are all written in scripting languages with dependencies on third-party GPU platforms (see Table 1), except Marian, which was developed simultaneously with our toolkit.

Using scripting languages and third-party GPU platforms is a double-edged sword. On one hand, it greatly reduces the workload of coding neural networks. On the other hand, it also causes two problems,

• The running efficiency drops, and profiling and optimization become difficult, because direct access to the GPUs is blocked by the language interpreters or the platforms. NMT systems typically require days or weeks to train, so training efficiency is a paramount concern. Slightly faster training can make the difference between plausible and impossible experiments (Klein et al., 2017).

• The researchers using these toolkits may be constrained by the platforms. Unexplored computations or operations may become disallowed or unnecessarily inefficient on a third-party platform, which lowers the chances of developing novel neural network techniques.

Toolkit        Language  Platform
RNNsearch-LV   Python    GroundHog
Luong-NMT      Matlab    Matlab
DL4MT          Python    Theano
BPE-char       Python    Theano
Nematus        Python    Theano
OpenNMT        Lua       Torch
Seq2seq        Python    Tensorflow
ByteNet        Python    Tensorflow
ConvS2S        Lua       Torch
Tensor2Tensor  Python    Tensorflow
Marian         C++       –
CytonMT        C++       –

Table 1: Languages and platforms of open-source NMT toolkits.

[Figure 1: Architecture of CytonMT.]

CytonMT is developed to address this issue, in hopes of providing the community an attractive alternative. The toolkit is written in C++, the official language of NVIDIA, the manufacturer of the most widely-used GPU hardware. This gives the toolkit an advantage in efficiency when compared with other toolkits.

Implementing in C++ also gives CytonMT great flexibility and freedom in coding. Researchers who are interested in the real calculations inside neural networks can trace the source code down to kernel functions, matrix operations or NVIDIA's APIs, and then modify them freely to test their novel ideas.

The code simplicity of CytonMT is comparable to that of the NMT toolkits implemented in scripting languages. This owes to an open-source general-purpose neural network library in C++, named CytonLib, which is shipped as part of the source code. The library defines a simple and friendly pattern for users to build arbitrary network architectures at the cost of two lines of genuine C++ code per layer.

CytonMT achieves competitive translation quality, which is the main purpose of NMT toolkits. It implements the popular framework of the attention-based RNN encoder-decoder. Among the reported systems of the same architecture, it ranks at top positions on the benchmarks of both the WMT14 and WMT17 English-to-German tasks.

The remainder of this paper presents the details of CytonMT from the aspects of method, implementation, benchmarks, and future work.

2 Method

The toolkit approaches the problem of machine translation using the attention-based RNN encoder-decoder proposed by Bahdanau et al. (2014) and Luong et al. (2015a). Figure 1 illustrates the architecture. The conditional probability of a translation given a source sentence is formulated as,

    \log p(y \mid x) = \sum_{j=1}^{m} \log p(y_j \mid H_o^{<j>})
                     = \sum_{j=1}^{m} \log\big(\mathrm{softmax}_{y_j}(\tanh(W_o H_o^{<j>} + B_o))\big)    (1)

    H_o^{<j>} = F_{att}(H_s, H_t^{<j>}),    (2)

where x is a source sentence; y = (y_1, ..., y_m) is a translation; H_s is a source-side top-layer hidden state; H_t^{<j>} is a target-side top-layer hidden state; H_o^{<j>} is a state generated by an attention model F_{att}; and W_o and B_o are the weight and bias of an output embedding.

The toolkit adopts the multiplicative attention model proposed by Luong et al. (2015a), because it is slightly more efficient than the additive variant proposed by Bahdanau et al. (2014). This choice is discussed in Britz et al. (2017) and Vaswani et al. (2017). Figure 2 illustrates the model, formulated as,

    a_{st}^{<ij>} = \mathrm{softmax}\big(F_a(H_s^{<i>}, H_t^{<j>})\big)
                  = \frac{e^{F_a(H_s^{<i>}, H_t^{<j>})}}{\sum_{i=1}^{n} e^{F_a(H_s^{<i>}, H_t^{<j>})}},    (3)

    F_a(H_s^{<i>}, H_t^{<j>}) = H_s^{<i>\top} W_a H_t^{<j>},    (4)

    C_s^{<j>} = \sum_{i=1}^{n} a_{st}^{<ij>} H_s^{<i>},    (5)

    C_{st}^{<j>} = [C_s^{<j>}; H_t^{<j>}],    (6)

    H_o^{<j>} = \tanh(W_c C_{st}^{<j>}),    (7)

where F_a is a scoring function for alignment; W_a is a matrix that linearly maps target-side hidden states into a space comparable to the source side; a_{st}^{<ij>} is an alignment coefficient; C_s^{<j>} is a source-side context; and C_{st}^{<j>} is a context derived from both sides.

[Figure 2: Architecture of the attention model.]
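To make Equations 3 to 7 concrete, the following is a minimal reference sketch for a single target position in plain C++, using loops instead of GPU kernels. It is independent of CytonLib and of the actual toolkit code; all names (attentionStep, Hs, ht, Wa, Wc) are illustrative only.

#include <cmath>
#include <vector>

// Reference sketch of Equations 3-7 for one target position j.
// Hs: n x d source states; ht: d-dim target state H_t^<j>;
// Wa: d x d scoring matrix; Wc: d x 2d output matrix.
std::vector<double> attentionStep(
    const std::vector<std::vector<double>>& Hs,
    const std::vector<double>& ht,
    const std::vector<std::vector<double>>& Wa,
    const std::vector<std::vector<double>>& Wc)
{
    const size_t n = Hs.size(), d = ht.size();

    // Eq. 4: F_a(H_s^<i>, H_t^<j>) = H_s^<i>^T Wa H_t^<j>
    std::vector<double> waHt(d, 0.0);
    for (size_t r = 0; r < d; ++r)
        for (size_t c = 0; c < d; ++c)
            waHt[r] += Wa[r][c] * ht[c];
    std::vector<double> score(n, 0.0);
    for (size_t i = 0; i < n; ++i)
        for (size_t r = 0; r < d; ++r)
            score[i] += Hs[i][r] * waHt[r];

    // Eq. 3: alignment coefficients a_st^<ij> via softmax over i
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i) { score[i] = std::exp(score[i]); sum += score[i]; }
    for (size_t i = 0; i < n; ++i) score[i] /= sum;

    // Eq. 5: source-side context C_s^<j>
    std::vector<double> cs(d, 0.0);
    for (size_t i = 0; i < n; ++i)
        for (size_t r = 0; r < d; ++r)
            cs[r] += score[i] * Hs[i][r];

    // Eq. 6: C_st^<j> = [C_s^<j>; H_t^<j>]
    std::vector<double> cst(cs);
    cst.insert(cst.end(), ht.begin(), ht.end());

    // Eq. 7: H_o^<j> = tanh(Wc C_st^<j>)
    std::vector<double> ho(d, 0.0);
    for (size_t r = 0; r < d; ++r) {
        for (size_t c = 0; c < 2 * d; ++c)
            ho[r] += Wc[r][c] * cst[c];
        ho[r] = std::tanh(ho[r]);
    }
    return ho;
}

In the toolkit itself, each of these steps corresponds to one GPU component of the attention network shown in Figure 3.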

3 Implementation

The toolkit consists of a general-purpose neural network library, and a neural machine translation system built upon the library. The neural network library defines a class named Network to facilitate the construction of arbitrary neural networks. Users only need to inherit the class, declare components as data members, and write down two lines of code per component in an initialization function. For example, the complete code of the attention network formulated by Equations 3 to 7 is presented in Figure 3. This piece of code fulfills the task of building a neural network as follows,

• The class Variable stores numeric values and gradients. Through passing pointers to Variable around, components are connected together.

• The data member layers collects all the components. The base class Network calls the functions forward, backward and calculateGradient of each component to perform the actual computation.

class Attention: public Network
{
  DuplicateLayer dupHt;        // declare components
  LinearLayer linearHt;
  MultiplyHsHt multiplyHsHt;
  SoftmaxLayer softmax;
  WeightedHs weightedHs;
  Concatenate concateCsHt;
  LinearLayer linearCst;
  ActivationLayer actCst;

  Variable* init(LinearLayer* linHt, LinearLayer* linCst,
      Variable* hs, Variable* ht)
  {
    Variable* tx;
    tx=dupHt.init(ht);                           // make two copies
    layers.push_back(&dupHt);

    tx=linearHt.init(linHt, tx);                 // Wa * Ht
    layers.push_back(&linearHt);

    tx=multiplyHsHt.init(hs, tx);                // Fa
    layers.push_back(&multiplyHsHt);

    tx=softmax.init(tx);                         // a_st
    layers.push_back(&softmax);

    tx=weightedHs.init(hs, tx);                  // Cs
    layers.push_back(&weightedHs);

    tx=concateCsHt.init(tx, &dupHt.y1);          // Cst
    layers.push_back(&concateCsHt);

    tx=linearCst.init(linCst, tx);               // Wc * Cst
    layers.push_back(&linearCst);

    tx=actCst.init(tx, CUDNN_ACTIVATION_TANH);   // Ho
    layers.push_back(&actCst);

    return tx;  // pointer to result
  }
};

Figure 3: Complete code of the attention model formulated by Equations 3 to 7.

The code for the actual computation is organized in the functions forward, backward and calculateGradient of each type of component. Figure 4 presents some examples. Note that these codes have been slightly simplified for illustration.

void LinearLayer::forward()
{
  // y = W^T x, for a batch of num columns
  cublasXgemm(cublasH, CUBLAS_OP_T, CUBLAS_OP_N,
      dimOutput, num, dimInput,
      &one, w.data, w.ni, x.data, dimInput,
      &zero, y.data, dimOutput);
}

void LinearLayer::backward()
{
  // x.grad = W y.grad
  cublasXgemm(cublasH, CUBLAS_OP_N, CUBLAS_OP_N,
      dimInput, num, dimOutput,
      &one, w.data, w.ni, y.grad.data, dimOutput,
      &beta, x.grad.data, dimInput);
}

void LinearLayer::calculateGradient()
{
  // w.grad += x y.grad^T
  cublasXgemm(cublasH, CUBLAS_OP_N, CUBLAS_OP_T,
      dimInput, dimOutput, num,
      &one, x.data, dimInput, y.grad.data, dimOutput,
      &one, w.grad.data, w.grad.ni);
}

void EmbeddingLayer::forward()
{
  ...
  embedding_kernel<<<...>>>(words, firstOccurs,
      len, dim, stride, wholeData, y.data, true);
}

Figure 4: Code performing the actual computation.

4 Benchmarks

4.1 Settings

CytonMT is tested on the widely-used benchmarks of the WMT14 and WMT17 English-to-German tasks (Bojar et al., 2017) (Table 2). Both datasets are processed and converted using byte-pair encoding (Gage, 1994; Schuster and Nakajima, 2012) with a shared source-target vocabulary of about 37,000 tokens. The WMT14 corpora are processed by the scripts from Vaswani et al. (2017) (https://github.com/tensorflow/tensor2tensor). The WMT17 corpora are processed by the scripts from Junczys-Dowmunt et al. (2018) (https://github.com/marian-nmt/marian-examples/tree/master/wmt2017-uedin); the resulting training data includes 10 million back-translated sentence pairs.

Data Set               # Sent.      # Source Words   # Target Words
WMT14
  Train (standard)      4,500,966    113,548,249      107,259,529
  Dev. (tst2013)            3,000         64,807           63,412
  Test (tst2014)            3,003         67,617           63,078
WMT17
  Train (standard)      4,590,101    118,768,285      112,009,072
  Train (back trans.)  10,000,000    190,611,668      149,198,444
  Dev. (tst2016)            2,999         64,513           62,362
  Test (tst2017)            3,004         64,776           60,963

Table 2: WMT English-to-German corpora.

The benchmarks were run on an Intel Xeon CPU E5-2630 @ 2.4 GHz and a Quadro M4000 GPU (Maxwell) with 1664 CUDA cores @ 773 MHz and 2,573 GFLOPS. The software environment is CentOS 6.8, CUDA 9.1 (driver 387.26), CUDNN 7.0.5, Theano 1.0.1 and Tensorflow 1.5.0. Nematus, Torch and OpenNMT are the latest versions as of December 2017; Marian is the latest version as of May 2018.

CytonMT is run with the hyperparameter settings presented in Table 3 unless stated otherwise. The settings provide both fast training and competitive translation quality according to our experiments on a variety of translation tasks. Dropout is applied to the hidden states between non-top recurrent layers R_s, R_t and the output H_o, following Wang et al. (2017). Label smoothing estimates the marginalized effect of label dropout during training, which makes models learn to be more unsure (Szegedy et al., 2016); this improved BLEU scores (Vaswani et al., 2017). A length penalty is applied using the formula in Wu et al. (2016).

Hyperparameter         Value
Embedding Size         512
Hidden State Size      512
Encoder/Decoder Depth  2
Encoder                Bidirectional
RNN Type               LSTM
Dropout                0.2
Label Smoothing        0.1
Optimizer              SGD
Learning Rate          1.0
Learning Rate Decay    0.7
Beam Search Size       10
Length Penalty         0.6

Table 3: Hyperparameter settings.

4.2 Comparison on Training Speed

Four baseline toolkits and CytonMT train models using the hyperparameter settings in Table 3. The number of layers and the size of embeddings and hidden states vary, as large networks are often used in real-world applications to achieve higher accuracy at the cost of more running time.

Table 4 presents the training speed of the different toolkits measured in source tokens per second. The results show that the training speed of CytonMT is much higher than that of the baselines. OpenNMT is the fastest baseline, and CytonMT achieves a speedup over it of 64.5% to 110.8%. Moreover, CytonMT shows a consistent tendency to speed up more on larger networks.

Embed./State Size    512    512   1024   1024
Enc./Dec. Layers       2      4      2      4
Nematus             1875   1190    952    604
OpenNMT             2872   2038   1356    904
Seq2Seq             1618   1227    854    599
Marian              2630   1832   1120    688
CytonMT             4725   3751   2571   1906
Speedup            64.5%  84.1%  89.6%  110.8%

Table 4: Training speed measured in source tokens per second.
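Section 4.1 states that the length penalty follows the formula of Wu et al. (2016). As a reminder of what that setting does at decoding time, the following sketch shows the commonly cited form of that penalty with alpha = 0.6 from Table 3; it illustrates the formula only and is not CytonMT's actual decoder code.

#include <cmath>

// Length penalty of Wu et al. (2016): lp(Y) = ((5 + |Y|) / 6)^alpha.
// Beam search ranks hypotheses by their summed log-probability divided
// by this penalty, which counteracts the bias toward short translations.
double lengthPenalty(int length, double alpha = 0.6)
{
    return std::pow((5.0 + length) / 6.0, alpha);
}

double rescoreHypothesis(double sumLogProb, int length, double alpha = 0.6)
{
    return sumLogProb / lengthPenalty(length, alpha);
}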

4.3 Comparison on Translation Quality

Table 5 compares the BLEU of CytonMT with the reported results from systems of the same architecture (attention-based RNN encoder-decoder). BLEU is calculated on cased, tokenized text to be comparable to previous work (Sutskever et al., 2014; Luong et al., 2015b; Wu et al., 2016; Zhou et al., 2016).

The settings of CytonMT on WMT14 follow Table 3, while the settings on WMT17 adopt a depth of 3 and a hidden state size of 1024, as the training set is three times larger. The cross entropy of the development set is monitored every 1/12 epoch on WMT14 and every 1/36 epoch on WMT17, that is, approximately every 400K sentence pairs. If the entropy has not decreased by max(0.01 × learning rate, 0.001) over 12 checks, the learning rate decays by 0.7 and training restarts from the previous best model. The whole training procedure terminates when no improvement is made during two neighboring decays of the learning rate. The actual training took 28 epochs on WMT14 and 12 epochs on WMT17.

System                        Open Src.  BLEU
WMT14
  Nematus (Klein, 2017)       √          18.25
  OpenNMT (Klein, 2017)       √          19.34
  RNNsearch-LV (Jean, 2015)   √          19.4
  Deep-Att (Zhou, 2016)                  20.6
  Luong-NMT (Luong, 2015)     √          20.9
  BPE-Char (Chung, 2016)      √          21.5
  Seq2seq (Britz, 2017)       √          22.19
  CytonMT                     √          22.67
  GNMT (Wu, 2016)                        24.61
WMT17
  Nematus (Sennrich, 2017)    √          27.5
  CytonMT                     √          27.63
  Marian (Junczys, 2018)      √          27.7

Table 5: Comparing BLEU with public records.

Table 5 shows that CytonMT achieves competitive BLEU scores on both benchmarks. On WMT14, it is only outperformed by Google's production system (Wu et al., 2016), which is much larger in scale and much more demanding on hardware. On WMT17, it achieves the same level of performance as Marian, which ranks high among the WMT17 entries for a single system. Note that the state-of-the-art scores on these benchmarks have recently been pushed forward by novel network architectures such as those of Gehring et al. (2017), Vaswani et al. (2017) and Shazeer et al. (2017).

5 Conclusion

This paper introduces CytonMT, an open-source NMT toolkit built from scratch using only C++ and NVIDIA's GPU-accelerated libraries. CytonMT speeds up training by more than 64.5%, and achieves competitive BLEU scores on the WMT14 and WMT17 corpora. The source code of CytonMT is simple because of CytonLib, an open-source general-purpose neural network library contained in the toolkit. Therefore, CytonMT is an attractive alternative for the research community. We open-source this toolkit in hopes of benefiting the community and promoting the field, and we look forward to hearing feedback from the community.

The future work on CytonMT will continue in two directions. One direction is to further optimize the code for GPUs, such as supporting multi-GPU training. The problem we have faced is that GPUs have advanced very quickly in the last few years. For example, the microarchitectures of NVIDIA GPUs evolved twice during the development of CytonMT, from Maxwell to Pascal, and then to Volta. Therefore, we have not explored cutting-edge GPU techniques, as the coding effort might quickly become outdated. Multi-GPU machines are common now, so we plan to support them.

The other direction is to support the latest NMT architectures such as ConvS2S (Gehring et al., 2017) and the Transformer (Vaswani et al., 2017). In these architectures, recurrent structures are replaced by convolution or attention structures. Their high performance indicates that the new structures suit the translation task better, so we also plan to support them in the future.

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. In Proceedings of the 3rd International Conference on Learning Representations, pages 1–15.

Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Yvette Graham, Barry Haddow, Shujian Huang, Matthias Huck, Philipp Koehn, Qun Liu, Varvara Logacheva, Christof Monz, Matteo Negri, Matt Post, Raphael Rubino, Lucia Specia, and Marco Turchi. 2017. Findings of the 2017 conference on machine translation (WMT17). In Proceedings of the Second Conference on Machine Translation, Volume 2: Shared Task Papers, pages 169–214, Copenhagen, Denmark. Association for Computational Linguistics.

Denny Britz, Anna Goldie, Minh-Thang Luong, and Quoc Le. 2017. Massive exploration of neural machine translation architectures. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1442–1451, Copenhagen, Denmark. Association for Computational Linguistics.

Junyoung Chung, Kyunghyun Cho, and Yoshua Bengio. 2016. A character-level decoder without explicit segmentation for neural machine translation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1693–1703, Berlin, Germany. Association for Computational Linguistics.

Philip Gage. 1994. A new algorithm for data compression. The C Users Journal, 12(2):23–38.

Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. 2017. Convolutional sequence to sequence learning. ArXiv e-prints.

Sébastien Jean, Kyunghyun Cho, Roland Memisevic, and Yoshua Bengio. 2015. On using very large target vocabulary for neural machine translation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1–10, Beijing, China. Association for Computational Linguistics.

Marcin Junczys-Dowmunt, Roman Grundkiewicz, Tomasz Dwojak, Hieu Hoang, Kenneth Heafield, Tom Neckermann, Frank Seide, Ulrich Germann, Alham Fikri Aji, Nikolay Bogoychev, André F. T. Martins, and Alexandra Birch. 2018. Marian: Fast neural machine translation in C++. arXiv preprint arXiv:1804.00344.

Nal Kalchbrenner, Lasse Espeholt, Karen Simonyan, Aäron van den Oord, Alex Graves, and Koray Kavukcuoglu. 2016. Neural machine translation in linear time. CoRR, abs/1610.10099.

Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senellart, and Alexander Rush. 2017. OpenNMT: Open-source toolkit for neural machine translation. In Proceedings of ACL, System Demonstrations, pages 67–72, Vancouver, Canada. Association for Computational Linguistics.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of ACL, Interactive Poster and Demonstration Sessions, pages 177–180. Association for Computational Linguistics.

Thang Luong, Hieu Pham, and Christopher D. Manning. 2015a. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1412–1421, Lisbon, Portugal. Association for Computational Linguistics.

Thang Luong, Ilya Sutskever, Quoc Le, Oriol Vinyals, and Wojciech Zaremba. 2015b. Addressing the rare word problem in neural machine translation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 11–19, Beijing, China. Association for Computational Linguistics.

Mike Schuster and Kaisuke Nakajima. 2012. Japanese and Korean voice search. In Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on, pages 5149–5152. IEEE.

Rico Sennrich, Orhan Firat, Kyunghyun Cho, Alexandra Birch, Barry Haddow, Julian Hitschler, Marcin Junczys-Dowmunt, Samuel Läubli, Antonio Valerio Miceli Barone, Jozef Mokry, and Maria Nadejde. 2017. Nematus: a toolkit for neural machine translation. In Proceedings of the Software Demonstrations of the 15th Conference of the European Chapter of the Association for Computational Linguistics, pages 65–68, Valencia, Spain. Association for Computational Linguistics.

Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc V. Le, Geoffrey E. Hinton, and Jeff Dean. 2017. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. CoRR, abs/1701.06538.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112.

Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. 2016. Rethinking the inception architecture for computer vision. In Proceedings of the Conference on Computer Vision and Pattern Recognition, pages 2818–2826.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30, pages 6000–6010. Curran Associates, Inc.

Xiaolin Wang, Masao Utiyama, and Eiichiro Sumita. 2017. Empirical study of dropout scheme for neural machine translation. In Proceedings of the 16th Machine Translation Summit, pages 1–15.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.

Jie Zhou, Ying Cao, Xuguang Wang, Peng Li, and Wei Xu. 2016. Deep recurrent models with fast-forward connections for neural machine translation. Transactions of the Association for Computational Linguistics, 4:371–383.
