
An alternative approach to training Sequence-to-Sequence model for Machine Translation

By: Vivek Sah
Advisor: Stephanie Taylor

Honors Thesis

Computer Science Department
Colby College
Waterville, ME
May 9, 2017

Contents

1 Abstract
2 Introduction
  2.1 Neural Networks
  2.2 Recurrent Neural Networks
  2.3 Machine Translation
3 Sequence to Sequence models for English-French Translation
  3.1 The Model
  3.2 Implementing Seq2Seq model
    3.2.1 Initialization
    3.2.2 Encoder
    3.2.3 Decoder
    3.2.4 Loss function
    3.2.5 Data preparation
  3.3 Performance
    3.3.1 Effect of optimizers
    3.3.2 Effect of Bidirectional Encoder
    3.3.3 Making it Deep
4 An alternative approach
  4.1 Performance
5 Conclusion
6 Acknowledgments
7 References

1 Abstract

Machine translation is a widely researched topic in the field of Natural Language Processing, and neural network models have recently been shown to be very effective at this task. The model, called the sequence-to-sequence model, learns to map an input sequence in one language to a vector of fixed dimensionality and then to map that vector to an output sequence in another language, without any human intervention, provided that there is enough training data. Focusing on English-French translation, in this paper I present a way to simplify the learning process by replacing the English input sentences with word-by-word translations of those sentences. I found that this approach improves the performance of a sequence-to-sequence model that is 3 layers deep and has a bidirectional LSTM encoder by more than 30% on the same dataset.

2 Introduction

Deep neural networks have been shown to be very effective at a wide variety of tasks, including computer vision, speech recognition, image captioning, and many others. Provided enough training data, these models learn to approximate functions that map input data to output data. We will briefly discuss how neural networks work in general and then talk about recurrent neural networks to develop an understanding of sequence to sequence models.

2.1 Neural Networks

The building block of a neural network is called a neuron. Each neuron takes in inputs and gives an output. Let's look at how a sigmoid neuron works.

Let’s say x1, x2, x3 are inputs to this neuron.

Figure 1: A simple neuron

Each input is associated with a weight value. To calculate the output for this neuron, we multiply the inputs by their weights and add them, then add a bias term. Let's call this quantity z, that is, z = Σ_i x_i w_i + b. Next, we squash the z value using a sigmoid gate (also called an activation gate). Overall, we get:

output = 1 / (1 + e^(−z))

A neural network consists of layers of such interconnected neurons. In general, it consists of an input layer, one or more hidden layers, and an output layer. A simplified diagram would look like:

Figure 2: A neural network

All the nodes in the input layer take in an input, calculate an output, and pass it as input to the nodes in the hidden layer. The hidden layer nodes do the same and pass their output as input to other hidden layer nodes or to one of the output nodes. The output of the output layer nodes is the output of the whole neural network. Each of the interconnections between the nodes corresponds to a weight variable. To make the network learn, we calculate the loss between the predicted output and the desired output using a loss function and minimize it using an optimization algorithm such as gradient descent together with backpropagation. The details of these algorithms are not discussed here; they essentially determine how much each variable, such as a weight, affects the loss and by how much it should be changed to drive the loss down.

Each training step involves calculating the loss and updating the weights and biases to decrease it. With enough data and training, these models learn to map inputs to outputs.
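To make the arithmetic concrete, here is a minimal sketch of a single sigmoid neuron in Python with NumPy; the input values, weights, and bias are made up purely for illustration.

    import numpy as np

    def sigmoid(z):
        # Squash z into the range (0, 1).
        return 1.0 / (1.0 + np.exp(-z))

    # Hypothetical inputs, weights, and bias for one neuron.
    x = np.array([0.5, -1.0, 2.0])   # x_1, x_2, x_3
    w = np.array([0.1, 0.4, -0.3])   # one weight per input
    b = 0.2                          # bias term

    z = np.dot(x, w) + b             # z = sum_i x_i w_i + b
    output = sigmoid(z)
    print(output)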

2.2 Recurrent Neural Networks

The simple neural network that we just discussed has one major limitation: it has a fixed-size input and is thus ill-suited to tasks where the length of the input sequence varies. This is especially true when we are dealing with languages. Consider the task of predicting the next word in a sentence given the previous words in the sentence. For this purpose, a neural network with a fixed input layer would not work. To address this limitation, we use a variant of the neural network called a recurrent neural network (RNN). It consists of a single cell, which takes in an input, calculates its cell state, and then gives an output. With each new input, the RNN cell updates its state, which can be thought of as updating its memory. In practice, we roll out the network according to the length of the input sequence and visualize it as:

Figure 3: A Recurrent Neural Network

With each input x_t, the RNN cell updates its cell state s_t with the following rule:

s_t = tanh(U x_t + W s_{t−1})

Here, s_{t−1} is the previous cell state and tanh squashes the value between −1 and 1. To calculate the output, we use the following rule:

o_t = softmax(V s_t)

At each time step t, these networks can be thought of as a regular neural network, except that the weights U connecting the input nodes to the hidden nodes (the RNN cell) are the same across all time steps. Similarly, the weights W connecting the hidden nodes are the same, and so are the weights V connecting each hidden node to the output. With this analogy, we can see that the usual optimization and backpropagation algorithms can be applied to train such networks. However, this version of the RNN does not do very well on longer sequences because of the vanishing gradient problem, which makes the network forget what it saw more than a few time steps ago. The solution is to replace the vanilla RNN cell with an LSTM (Long Short-Term Memory) cell. The details of how it works and how it solves the problem of remembering long-term dependencies can be found in [3].
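To make the update rule concrete, here is a minimal NumPy sketch of one vanilla RNN step; the dimensions and the random weight matrices are illustrative assumptions, not values from a trained model.

    import numpy as np

    def softmax(v):
        e = np.exp(v - np.max(v))
        return e / e.sum()

    # Hypothetical sizes: 3-dimensional inputs, 4-dimensional state, 5-word vocabulary.
    U = np.random.randn(4, 3)   # input-to-hidden weights
    W = np.random.randn(4, 4)   # hidden-to-hidden weights
    V = np.random.randn(5, 4)   # hidden-to-output weights

    def rnn_step(x_t, s_prev):
        # s_t = tanh(U x_t + W s_{t-1}); o_t = softmax(V s_t)
        s_t = np.tanh(U @ x_t + W @ s_prev)
        o_t = softmax(V @ s_t)
        return s_t, o_t

    # Unroll the same cell over a sequence of 6 random inputs.
    s = np.zeros(4)
    for x_t in np.random.randn(6, 3):
        s, o = rnn_step(x_t, s)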

2.3 Machine Translation

Translating documents is a very old problem. One could employ someone who knows both the source language and the target language, but this is very slow in practice, so scientists have been working for many years to come up with an automated machine translation system. Before looking at how they did it, let's think about how one would approach the problem of translation.

• Word-by-word translation: This is the simplest way to approach the problem. One can just replace each word in the source sentence with the corresponding word in the target language.

Figure 4: Example of word by word translation

• Use grammar rules of the source and target languages: Translating word by word ignores context and might not work well. Instead, one can try to incorporate grammar rules to divide the sentences into chunks of words, translate the chunks, and reorder them if necessary. For example:

Figure 5: Example of using grammar rules

In fact, the second approach was the one that the scientists building the early translation systems used. They brought in linguists who were experts in the languages of interest and programmed each and every language rule into the translation system. However, this approach was only effective for official documents written in standard grammar; for everyday communication, it was not very effective.

• Use statistics: Since rule-based systems did not work for normal human communication, new models were developed that used statistics. They were trained on parallel corpora, which contain the same text written in multiple languages. (In fact, ancient Egyptian hieroglyphics were decoded using the Rosetta Stone, which carried a parallel translation into Greek.) Instead of generating a single translation, the statistical model looks at the multiple translations of the same group of words that appear in the training data and scores them based on their probability of occurring there. It then generates all the possible sentences. For example (in Figure 6), it could generate "I want to go to more pretty the beach." It could also generate "I want to go to the prettiest beach." It then scores each possible sentence based on how frequently its chunks occur and how natural they sound given the training examples.

Figure 6: Example of using statistics

The second sentence definitely sounds more like English and contains chunks that are more likely to occur in an English sentence. Statistical machine translation worked really well and was incorporated into Google Translate in the 2000s. However, these models are hard to build and maintain; moreover, for every language pair, one has to build a whole new model. This is where Deep Learning made a significant impact on the field. With Deep Learning models, one can treat the model as a black box, feed in input and output training samples, and let it learn everything by itself without anyone needing to tweak anything. The same model, if fed different language pairs, would be trained to translate between those pairs. We will discuss this Deep Learning model next. [2]

3 Sequence to Sequence models for English-French Translation

As we have already seen, when we are working with sequences (sentences) whose length can vary, recurrent neural networks work best because of their flexibility. Hence, we will use RNNs to build a neural machine translation model. This model is called the Sequence to Sequence (Seq2Seq) model, and it was a major breakthrough for machine translation. The model we are going to discuss was proposed by Sutskever et al. [8], who were inspired by earlier work by Cho et al. [1] and Kalchbrenner and Blunsom [4].

3.1 The Model

The Seq2Seq model consists of an encoder and a decoder. The encoder takes in an input sequence and converts it to something called a thought vector (or an intermediate representation). The decoder then takes the thought vector and converts it to an output sequence. We use one recurrent neural network for the encoder and another recurrent neural network for the decoder. The complete setup looks like this:

Figure 7: Sequence to Sequence model

While training, we might encounter long-term dependencies, so we will use LSTM cells, which handle this issue as discussed earlier.

Assume we have an input sequence {x_1, x_2, ..., x_T} and an output sequence {y_1, y_2, ..., y_{T′}}, where T may not be equal to T′. Using the encoder-decoder system, we can calculate the conditional probability P(y_1, ..., y_{T′} | x_1, ..., x_T). As discussed, to calculate this, the encoder LSTM first obtains a fixed-size vector v for the input sequence {x_1, x_2, ..., x_T}. We then use the decoder LSTM as a regular RNN language model. In an RNN language model, we input a word/token at the first node and it outputs a probability distribution over the entire target vocabulary. We take the most probable word/token, feed it back as the next input, and continue. In this way, we obtain the conditional probability we were looking for:

P(y_1, ..., y_{T′} | x_1, ..., x_T) = ∏_{t=1}^{T′} P(y_t | v, y_1, ..., y_{t−1})

3.2 Implementing Seq2Seq model

There are various libraries available which let us implement these models relatively easily; PyTorch, Keras, and TensorFlow are some of the most popular ones. For the implementation, I chose TensorFlow, mainly because it makes it easy to run the code on multiple GPUs, has a strong community, and has a lot of tutorials, including one for machine translation. TensorFlow has a built-in Seq2Seq function which takes in an input sequence and a target sequence and returns outputs and losses. However, I chose to implement the model myself to understand it better, using other TensorFlow features to build an encoder and then a decoder. Moreover, TensorFlow has a Python API, which makes it easy to implement. Below, I describe my implementation of the model, which I trained to translate from English to French.

3.2.1 Initialization

I first define the parameters of the model, such as the RNN cell, the sizes of the encoder and decoder hidden states, and the number of layers for the encoder and decoder RNNs. I use LSTM cells for both the encoder and the decoder. I also define the vocabulary sizes of the source training file (encoder vocab size) and the target training file (decoder vocab size). Then, each item (an integer i) in the input sequence is converted to a vector representation. The most straightforward way to do this is to make a one-hot vector, that is, a vector of length encoder vocab size whose coordinates are all 0 except the ith position, which is 1. However, this results in very high dimensional vectors and matrices, which makes the computation slow and resource-intensive. Instead, we embed each input integer into a vector in an embedding space of lower dimension. We define a parameter embedding size which is less than the vocabulary size. For the input sequences, we define an encoder embeddings matrix, which has encoder vocab size rows and embedding size columns, and initialize it randomly. For integer i in the input sequence, we take the ith row of this matrix. We do the same for the decoder. During training, this matrix gets updated to give us better representations of the inputs in the embedding space.
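As an illustration, a minimal sketch of this embedding step in the TensorFlow 1.x style API might look as follows; the vocabulary size, the embedding size, and the variable names are hypothetical, not the exact values used in the experiments.

    import tensorflow as tf   # TensorFlow 1.x style API

    encoder_vocab_size = 10000   # hypothetical vocabulary size
    embedding_size = 128         # hypothetical embedding dimension

    # Randomly initialized embedding matrix; row i is the vector for token i.
    encoder_embeddings = tf.get_variable(
        "encoder_embeddings",
        shape=[encoder_vocab_size, embedding_size],
        initializer=tf.random_uniform_initializer(-1.0, 1.0))

    # encoder_inputs: a [batch_size, max_time] tensor of token ids.
    encoder_inputs = tf.placeholder(tf.int32, shape=[None, None])

    # Look up row i of the matrix for every token id i in the batch.
    encoder_inputs_embedded = tf.nn.embedding_lookup(encoder_embeddings, encoder_inputs)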

3.2.2 Encoder

For the encoder, I used the TensorFlow built-in RNN function, which takes in the embedded input sequence and the LSTM cell and returns the outputs and the final state of the encoder RNN. At this point, we are not interested in the outputs of the encoder RNN; we just need its final state vector v. This vector v is the thought vector that we were talking about earlier.
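A minimal sketch of such an encoder, again in the TensorFlow 1.x style API, is shown below; the hidden-state size and the placeholder shape are illustrative assumptions.

    import tensorflow as tf   # TensorFlow 1.x style API

    # Embedded inputs of shape [batch_size, max_time, embedding_size], as in the previous sketch.
    encoder_inputs_embedded = tf.placeholder(tf.float32, shape=[None, None, 128])

    encoder_cell = tf.nn.rnn_cell.LSTMCell(num_units=256)   # hypothetical hidden-state size

    # dynamic_rnn unrolls the LSTM cell over the time dimension of the batch.
    encoder_outputs, encoder_final_state = tf.nn.dynamic_rnn(
        encoder_cell, encoder_inputs_embedded, dtype=tf.float32)

    # encoder_final_state is the thought vector v; encoder_outputs are not used here.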

3.2.3 Decoder

For the decoder, I also used a TensorFlow built-in RNN function, but a slightly different version of it. The difference between the encoder and the decoder is that for the encoder RNN all the inputs are provided, but for the decoder we have to generate the outputs from the thought vector v provided by the encoder RNN. The way we do that is as follows (a minimal sketch is given after the list):

• Set the initial state of the decoder RNN to be the final state of the encoder RNN.

• Set the initial input of the decoder RNN to be the EOS token.

• Get the output vector, find the token with the highest probability, and feed that token in as the next input to the RNN.

• Make a list of all the outputs from the decoder RNN, compare it to the expected outputs, and compute the loss. The loss function used here is the cross-entropy loss function.
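Below is a minimal TensorFlow 1.x style sketch of this greedy decoding loop. The sizes, the EOS token id, and the use of a zero state as a stand-in for the encoder's final state are illustrative assumptions, not the exact implementation used for the thesis.

    import tensorflow as tf   # TensorFlow 1.x style API

    # Hypothetical sizes; in the real model these come from the encoder and the embeddings.
    batch_size, hidden_size = 32, 256
    decoder_vocab_size, embedding_size, max_decode_len = 8000, 128, 30

    decoder_embeddings = tf.get_variable("decoder_embeddings",
                                         [decoder_vocab_size, embedding_size])
    W_proj = tf.get_variable("W_proj", [hidden_size, decoder_vocab_size])
    b_proj = tf.get_variable("b_proj", [decoder_vocab_size])

    decoder_cell = tf.nn.rnn_cell.LSTMCell(hidden_size)
    # Stand-in for the encoder's final state (the thought vector v).
    state = decoder_cell.zero_state(batch_size, tf.float32)

    EOS_ID = 1   # hypothetical token id
    inputs = tf.nn.embedding_lookup(decoder_embeddings, tf.fill([batch_size], EOS_ID))

    logits_per_step = []
    for t in range(max_decode_len):
        with tf.variable_scope("decoder", reuse=(t > 0)):
            output, state = decoder_cell(inputs, state)          # one RNN step
        logits = tf.matmul(output, W_proj) + b_proj              # scores over the vocabulary
        logits_per_step.append(logits)
        next_ids = tf.argmax(logits, axis=1)                     # greedy choice
        inputs = tf.nn.embedding_lookup(decoder_embeddings, next_ids)  # feed back as next input

    # The logits at each step are later compared with the expected tokens using the
    # cross-entropy loss described in Section 3.2.4.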

After calculating the loss, I use one of the provided optimizers to minimize it. Note that this is a very general setup and, depending on the data provided, it will work for any sequence-to-sequence mapping task. Below, I describe the loss function and how I prepared the data to train this model for English-to-French translation.

3.2.4 Loss function

To evaluate performance, we need a loss function. For sequence to sequence models, the cross-entropy loss function is widely used.

The cross-entropy function is defined as:

H_{y′}(y) := − Σ_i y′_i log(y_i)

Then we calculate perplexity = e^loss.
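A minimal sketch of this computation in the TensorFlow 1.x style API, with hypothetical shapes, might look as follows.

    import tensorflow as tf   # TensorFlow 1.x style API

    # logits: [batch_size, max_time, vocab_size] scores from the decoder;
    # targets: [batch_size, max_time] expected token ids. The shapes are hypothetical.
    logits = tf.placeholder(tf.float32, shape=[None, None, 8000])
    targets = tf.placeholder(tf.int32, shape=[None, None])

    # Cross-entropy between the predicted distribution and the one-hot target at each position.
    crossent = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=targets, logits=logits)
    loss = tf.reduce_mean(crossent)

    # Perplexity is the exponential of the average cross-entropy loss.
    perplexity = tf.exp(loss)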

3.2.5 Data preparation

I scraped some online sources to get about 2000 short and simple English sentences and then used Google Translate to get the corresponding French translations. About 1600 sentences were used for training and 400 for testing. Since the sentences were simple, I assumed that the translations (from Google Translate) were accurate. I then divided the data into four files: train-source, train-target, test-source, and test-target. The model trains on the train-source (English) and train-target (French) files, and we test it using the test-source and test-target files. Since we cannot feed literal words into our model, we have to convert them to numbers. For the English-French translation data, I did the following (a minimal sketch is given after the list):

• Scan the training source and target files. For each new word, map it to an integer value and save the mapping. Reserve special values for the EOS (end of sentence), UNK (words not in the mapping), and PAD (empty space) tokens. To convert predicted tokens back to words, also save a reverse mapping.

• Using the mapping, write a new tokenized file for each of the four files: train-source, train-target, test-source, and test-target. In the tokenized version of these files, the words are replaced by the corresponding tokens. Since the mapping was constructed by scanning only the training files, there will be some words in the test files which do not have a corresponding token value. These words are replaced by the special UNK token.
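A minimal Python sketch of this tokenization step, assuming hypothetical file names and one sentence per line, is shown below.

    PAD, EOS, UNK = 0, 1, 2          # reserved token ids
    word_to_id = {"_PAD": PAD, "_EOS": EOS, "_UNK": UNK}

    def build_vocab(path):
        # Scan a training file and assign an integer to each new word.
        with open(path) as f:
            for line in f:
                for word in line.split():
                    if word not in word_to_id:
                        word_to_id[word] = len(word_to_id)

    def tokenize_file(in_path, out_path):
        # Replace every word by its id; words unseen during training become UNK.
        with open(in_path) as fin, open(out_path, "w") as fout:
            for line in fin:
                ids = [word_to_id.get(w, UNK) for w in line.split()]
                fout.write(" ".join(map(str, ids)) + "\n")

    build_vocab("train-source.txt")
    tokenize_file("train-source.txt", "train-source.tok")
    tokenize_file("test-source.txt", "test-source.tok")

    # Reverse mapping, used to turn predicted ids back into words.
    id_to_word = {i: w for w, i in word_to_id.items()}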

3.3 Performance

For training, we take a batch of tokenized sentences from the source data. Next, we find the length of the longest sentence and pad the rest of the sentences in the batch to that length. We do the same for the target batch. Now we can pass the batch of padded, tokenized input and output sentences into our Seq2Seq model. The Seq2Seq model predicts a sequence of output tokens, and we then use the reverse mapping to convert them back to words. The model was trained for 6000 steps, and every 25 steps its performance was checked by feeding in unseen sentences from the test data. Let's see how the model performed as various parameters were changed. Note that the graphs show the perplexity of our Seq2Seq model against the number of training steps. For better visualization, instead of plotting raw perplexity values, we plot a moving average; this smooths the zig-zag nature of the curve while preserving its shape. For comparing models, I used the area under the curve (AUC) of perplexity against training steps as a measure of performance: the lower the AUC, the better the performance.
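The padding step can be sketched in plain Python as follows; the token ids are made up for illustration.

    PAD = 0   # id reserved for the PAD token

    def pad_batch(batch):
        # Pad every tokenized sentence in the batch to the length of the longest one.
        max_len = max(len(sent) for sent in batch)
        return [sent + [PAD] * (max_len - len(sent)) for sent in batch]

    source_batch = pad_batch([[5, 8, 2], [7, 3, 9, 4, 1]])
    # -> [[5, 8, 2, 0, 0], [7, 3, 9, 4, 1]]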

3.3.1 Effect of optimizers

TensorFlow provides various built-in optimizers, such as GradientDescentOptimizer, AdamOptimizer, and AdagradOptimizer. Optimization in deep networks is a complex topic and will not be discussed here; [5] and [6] cover these concepts in detail. We will, however, look at the performance of our model with different optimizers.

Figure 8: Effect of optimizers on learning

As we can see, during training the Adam optimizer converges faster. Hence, the Adam optimizer is used for the remaining experiments.
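Switching optimizers in the TensorFlow 1.x style API only changes a single line; the sketch below uses a toy quadratic loss purely for illustration.

    import tensorflow as tf   # TensorFlow 1.x style API

    # A toy loss stands in for the Seq2Seq cross-entropy loss; the optimizer line is the point here.
    w = tf.get_variable("w", shape=[], initializer=tf.constant_initializer(5.0))
    loss = tf.square(w - 2.0)

    # Swapping optimizers means replacing this one line, e.g.
    # tf.train.GradientDescentOptimizer(0.5) or tf.train.AdagradOptimizer(0.5).
    train_op = tf.train.AdamOptimizer(learning_rate=0.001).minimize(loss)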

3.3.2 Effect of Bidirectional Encoder

Let's say we want to translate the sequence x_1, x_2, ..., x_T to y_1, y_2, ..., y_{T′}. If we have an incomplete sequence x_1, x_2, ..., x_t where t < T, its meaning might not be clear and might depend on future words in addition to past words. So, it would be nice if we could run our encoder RNN in both directions over the input sequence. Such an RNN is called a bidirectional RNN and was first proposed by Schuster and Paliwal [7].

Figure 9: Bidirectional RNN

In this RNN, the cell has two independent states, one for the forward direction and one for the backward direction. To get the thought vector v from the encoder, we concatenate the forward cell state and the backward cell state.
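A minimal TensorFlow 1.x style sketch of such a bidirectional encoder is given below; the hidden-state size and the placeholder shape are illustrative assumptions.

    import tensorflow as tf   # TensorFlow 1.x style API

    # Embedded inputs of shape [batch_size, max_time, embedding_size].
    encoder_inputs_embedded = tf.placeholder(tf.float32, shape=[None, None, 128])

    fw_cell = tf.nn.rnn_cell.LSTMCell(256)   # reads the sequence left to right
    bw_cell = tf.nn.rnn_cell.LSTMCell(256)   # reads the sequence right to left

    outputs, (fw_state, bw_state) = tf.nn.bidirectional_dynamic_rnn(
        fw_cell, bw_cell, encoder_inputs_embedded, dtype=tf.float32)

    # Concatenate the forward and backward LSTM states to form the thought vector;
    # the decoder cell must then use twice the hidden size (512 here).
    thought_vector = tf.nn.rnn_cell.LSTMStateTuple(
        c=tf.concat([fw_state.c, bw_state.c], axis=1),
        h=tf.concat([fw_state.h, bw_state.h], axis=1))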

Figure 10: Performance of the Bidirectional Encoder. (a) Training; (b) Testing.

For training, the AUC decreased by 25.56%, whereas for testing, it went down by 18.04%.

3.3.3 Making it Deep

We have seen that the bidirectional encoder works better than the vanilla encoder when building the Seq2Seq model. Now, we want to make the network deeper and see how it performs. The hope is that the deeper the model, the better the intermediate representation (the thought vector) will be.
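In the TensorFlow 1.x style API, stacking layers can be sketched with MultiRNNCell as below; the depth, sizes, and placeholder shape are illustrative assumptions.

    import tensorflow as tf   # TensorFlow 1.x style API

    num_layers, hidden_size = 3, 256   # hypothetical depth and hidden-state size

    # Stack several LSTM cells so the output of each layer feeds the layer above it.
    stacked_cell = tf.nn.rnn_cell.MultiRNNCell(
        [tf.nn.rnn_cell.LSTMCell(hidden_size) for _ in range(num_layers)])

    inputs_embedded = tf.placeholder(tf.float32, shape=[None, None, 128])
    outputs, final_state = tf.nn.dynamic_rnn(stacked_cell, inputs_embedded, dtype=tf.float32)
    # final_state is now a tuple with one LSTM state per layer.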

Figure 11: Training results for Deep Model

Compared to 1 layer, having 2 layers in the encoder and decoder brought down the area under the curve by 7.33% for training and 10.37% for testing. Having 3 layers brought it down by 9.36% for training and 12.36% for testing.

Figure 12: Testing results for Deep Model

4 An alternative approach

Translation from English to French is a tough challenge for anyone, especially someone who is not familiar with both languages. Our model certainly has no prior knowledge of English or French. So, I considered whether I could make the task easier. One of the most straightforward ways to translate is to replace each word in the English sentence with the corresponding French word using a dictionary. As we saw earlier, this approach does not always work, but it does get us somewhere: someone who knows some French might understand the result, or better yet, correct it. In other words, the problem of translating from English to French can be reduced to the problem of correcting a scrambled French sentence into a proper French sentence. So the question arises: what if we train our Seq2Seq model to tackle the latter problem instead of the former? The hypothesis is that it is easier to train the model to correct an incorrect French sentence than to translate an English sentence into a French sentence.
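A minimal Python sketch of this word-by-word preprocessing step is given below; the toy dictionary and its entries are purely illustrative and not the dictionary used for the experiments.

    # Hypothetical English-to-French dictionary loaded as a Python dict.
    en_to_fr = {"i": "je", "want": "veux", "to": "à", "go": "aller",
                "the": "la", "beach": "plage"}

    def word_by_word(english_sentence):
        # Replace each English word with its dictionary entry; keep unknown words as-is.
        return " ".join(en_to_fr.get(w, w) for w in english_sentence.lower().split())

    # The model is then trained to map this "scrambled French" source sentence
    # to the proper French target sentence.
    print(word_by_word("I want to go to the beach"))
    # -> "je veux à aller à la plage"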

4.1 Performance

To check the effectiveness of the new approach, the underlying model is left unchanged except for one aspect: the embedding matrix is shared between the source and the target, because the source and target language are now the same and therefore share the same vocabulary. I ran the new setup with 3 layers and a bidirectional encoder and compared it with the regular 3-layer, bidirectional-encoder English-French translation system. The results are given below:

Figure 13: Training results for alternative approach

During training, both models performed similarly; the word-by-word model slightly outperformed the regular model, decreasing the area under the curve by 4.41%. However, as predicted, the word-by-word model greatly outperformed the regular model during testing, decreasing the area under the curve by 34.01%. This is a remarkable result and shows that the word-by-word approach is very promising.

Figure 14: Testing results for alternative approach

5 Conclusion

In this paper, I explored the sequence to sequence model and implemented it using TensorFlow. I showed how using the Adam optimizer, using a bidirectional RNN for the encoder, and making the network deeper improve the performance of the Seq2Seq model over the vanilla Seq2Seq model. I then presented an alternative approach to machine translation: first translate word by word, then train a Seq2Seq model to convert the word-by-word translation into a proper translation. The results were very promising and have inspired me to investigate this idea further. I used short and simple sentences and rather limited data for this project; the big question now is whether this new approach would work on more complicated data.

6 Acknowledgments

I would really like to thank Professor Stephanie Taylor, my advisor, for her continued guidance and encouragement over the past year. I would also like to thank Randall Downer for helping me use the GPU clusters so that I could train my models and experiment with them. Without their support, this thesis would not have been possible. Furthermore, I would like to thank the Colby faculty members and my family and friends who have supported me over the past four years at Colby.

7 References

[1] K. Cho, B. van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014.

[2] A. Geitgey. Machine learning is fun part 5: Language translation with deep learning and the magic of sequences, Aug 2016.

[3] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Comput., 9(8):1735–1780, Nov. 1997.

[4] N. Kalchbrenner and P. Blunsom. Recurrent continuous translation models. In EMNLP, volume 3, page 413, 2013.

[5] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.

[6] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436– 444, 2015.

[7] M. Schuster and K. Paliwal. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11):2673–2681, 1997.

[8] I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. CoRR, abs/1409.3215, 2014.
