Recurrent Neural Network

Thanh Nguyen [email protected] series

• What we have covered so far
  • Neural nets and the back-propagation algorithm – Ch.6
  • Training a deep net – Ch.8
  • Convolutional neural nets (CNN) – Ch.9
• Today's topic – Ch.10
  • Recurrent neural nets (RNN) for modelling sequences
  • Long Short Term Memory (LSTM) for long-term dependencies
• What remains (and is probably interesting) in the book
  • Factor models and auto-encoders – Ch.13-14
  • Deep generative models, including the appealing GAN – Ch.20

Motivation
• So far we have mainly dealt with prediction/classification problems where the data are treated as independent feature vectors
• However, data often present themselves as time series/sequences
• E.g. email spam classification
  • Documents/sentences/phrases are represented through a "one-hot" feature vector (e.g. bag of words); a small sketch follows after this list
  • In fact, an English phrase/sentence is a sequence of words with order and with syntactic and semantic dependencies
  • "Alice went to the beach. There she built a …"
  • "You shall know a word by the company it keeps" (J. R. Firth, 1957)
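As a small aside (not from the slides), here is a minimal Python sketch of the bag-of-words view mentioned above; the vocabulary and variable names are purely illustrative. It shows that the representation throws word order away, which is exactly what sequence models aim to keep:

```python
# Minimal bag-of-words sketch (illustrative vocabulary, not from the slides).
vocab = ["alice", "went", "to", "the", "beach"]
index = {w: i for i, w in enumerate(vocab)}

def bag_of_words(tokens):
    """Count how often each vocabulary word occurs; order is discarded."""
    vec = [0] * len(vocab)
    for t in tokens:
        vec[index[t]] += 1
    return vec

# "alice went to the beach" and "the beach went to alice" get the same vector.
print(bag_of_words("alice went to the beach".split()))
print(bag_of_words("the beach went to alice".split()))
```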

Sequential data in nature and applications
• Natural language – a sentence is a sequence of words or characters
  • Language modelling (LM) for machine translation (e.g. Google Translate)
  • Question Answering (an application of recursive nets)
  • Image Captioning (in combination with a CNN)
  • Text to speech
• Speech – a sequence of sound waves (e.g. Siri, Google Assistant...)
• Programming language – a statement is a sequence of APIs or tokens (e.g. for, while, if-else…)
  • Program synthesis: English description to Java code
• Cursive handwriting – a sequence of coordinates of the tip of the pen
• Video – a sequence of frames over time

Memoryless models
• Autoregressive model
  • E.g. a second-order AR model to predict stock price, temperature… (a minimal sketch follows below)
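A minimal sketch of the second-order AR model mentioned above; the coefficients and series values are made up for illustration:

```python
def ar2_predict(series, w1, w2, c=0.0):
    """Second-order autoregressive prediction:
    x_t ≈ c + w1 * x_{t-1} + w2 * x_{t-2}  (no memory beyond two lags)."""
    return c + w1 * series[-1] + w2 * series[-2]

# e.g. predict the next temperature from the last two observations
print(ar2_predict([20.1, 20.5, 21.0], w1=0.7, w2=0.3))
```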

• Feed-forward neural network
  • Generalizes the autoregressive model with hidden units
  • E.g. Bengio's neural probabilistic language model (sketched below):
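A rough sketch of a Bengio-style neural probabilistic language model, using assumed parameter names (C is the word-embedding table, W_h/b_h the hidden layer, W_out/b_out the output layer); the original model also has direct embedding-to-output connections, which are omitted here:

```python
import numpy as np

def nplm_next_word_probs(context_ids, C, W_h, b_h, W_out, b_out):
    """Embed the previous n-1 words, pass them through a tanh hidden layer,
    and return a softmax distribution over the next word."""
    e = np.concatenate([C[i] for i in context_ids])  # look up and concatenate embeddings
    h = np.tanh(W_h @ e + b_h)                       # hidden layer (generalizes the linear AR model)
    s = W_out @ h + b_out
    s = s - s.max()                                  # for numerical stability
    return np.exp(s) / np.exp(s).sum()               # softmax over the vocabulary
```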

Linear Dynamical Systems
• E.g. the Kalman Filter and PHD Filter in real-time tracking systems
• Generative models with a real-valued hidden state
• The hidden state has linear dynamics (i.e. a motion equation)
• Predictions are made from the hidden state (one predict/update cycle is sketched below)
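For concreteness, a minimal sketch of one predict/update cycle of a Kalman filter under the linear-Gaussian assumptions above; the matrix names follow the usual textbook convention and are not taken from the slides:

```python
import numpy as np

def kalman_step(x, P, z, F, H, Q, R):
    """x, P: current state mean/covariance; z: new measurement;
    F: linear dynamics (motion equation); H: observation model;
    Q, R: process and measurement noise covariances."""
    # Predict: the hidden state evolves with linear dynamics.
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q
    # Update: correct the prediction with the new measurement.
    S = H @ P_pred @ H.T + R
    K = P_pred @ H.T @ np.linalg.inv(S)              # Kalman gain
    x_new = x_pred + K @ (z - H @ x_pred)
    P_new = (np.eye(len(x)) - K @ H) @ P_pred
    return x_new, P_new
```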

Hidden Markov Model (HMM)
• Developed in the mid-1970s as a model for speech recognition
• HMMs have a discrete state with transitions between states
• The outputs produced by a state are stochastic (the forward recursion is sketched below)
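A minimal sketch of the HMM forward recursion for a discrete-state, discrete-output model; the parameter names (pi, A, B) are the usual textbook ones, not from the slides:

```python
import numpy as np

def hmm_forward(obs, pi, A, B):
    """Probability of an observation sequence under an HMM.
    pi: initial state distribution, A[i, j]: P(next state j | state i),
    B[i, k]: P(observation k | state i)."""
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]   # propagate through transitions, weight by emission
    return alpha.sum()
```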

Recurrent network – architecture

Recurrent net (cont.)
• Powerful, with three properties (one forward pass is sketched below):
  • The hidden state allows the model to store information about the past
  • Parameters are shared over time
  • Non-linear dynamics
• An RNN is a universal approximator: it can compute any function computable by a computer (i.e. a Turing machine)!
• Limitations:
  • The power of RNNs makes them very hard to train
  • Typical RNNs have difficulty dealing with long-range dependencies
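A minimal sketch of a vanilla RNN forward pass, illustrating the three properties above (stored hidden state, shared parameters, non-linear dynamics); the parameter names are illustrative:

```python
import numpy as np

def rnn_forward(x_seq, W_xh, W_hh, W_hy, b_h, b_y):
    """h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h);  y_t = W_hy h_t + b_y.
    The same W_xh, W_hh, W_hy are reused at every step (parameter sharing)."""
    h = np.zeros(W_hh.shape[0])          # hidden state: stores information about the past
    outputs = []
    for x_t in x_seq:
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)   # non-linear dynamics
        outputs.append(W_hy @ h + b_y)
    return outputs, h
```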

RNN design patterns
• E.g. Video classification (frame level)
• E.g. Sentiment Analysis: sentence → sentiment (+/-)
• E.g. Image Captioning: image → sequence of words
• E.g. Machine Translation: English → French

RNN with output recurrence
• No hidden-to-hidden recurrent connections; the recurrence goes through the output instead
• Less powerful, unless the output is very high-dimensional and rich

Train RNN with output recurrence
• Advantage:
  • Training can be parallelized because the computations at different time steps are decoupled
• Teacher forcing training technique (sketched below):
  • At train time, feed the correct output as the input at the next time step
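A small sketch of teacher forcing versus free-running generation; rnn_step and predict are hypothetical helpers standing in for one recurrence step and one output readout:

```python
def teacher_forcing_inputs(targets, start_token):
    """At TRAIN time, the input at step t is the ground-truth output of step t-1:
    inputs are [START, y_1, ..., y_{T-1}] while losses compare against [y_1, ..., y_T].
    The time steps are therefore decoupled and can be processed in parallel."""
    return [start_token] + list(targets[:-1])

def generate(h0, start_token, rnn_step, predict, T):
    """At TEST time, the model must feed back its OWN previous prediction."""
    x, h, out = start_token, h0, []
    for _ in range(T):
        h = rnn_step(x, h)
        x = predict(h)          # prediction becomes the next input
        out.append(x)
    return out
```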

RNN for language modelling
• The unfolded RNN is a directed acyclic computation graph
• We can therefore train it as a regular feed-forward neural net

Train RNN with back-prop
• Back-propagation through time (BPTT) algorithm (a minimal sketch follows below)
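A minimal sketch of BPTT for a vanilla RNN, assuming (for brevity) a squared-error loss at the final time step only; note how the gradients of the shared weights are summed over all time steps:

```python
import numpy as np

def bptt_grads(x_seq, y_final, W_xh, W_hh, W_hy, b_h):
    """x_seq: list of input vectors; y_final: target for the last output."""
    T = len(x_seq)
    h = {-1: np.zeros(W_hh.shape[0])}
    # Forward pass: unroll the recurrence over all T time steps.
    for t in range(T):
        h[t] = np.tanh(W_xh @ x_seq[t] + W_hh @ h[t - 1] + b_h)
    y_hat = W_hy @ h[T - 1]
    loss = 0.5 * np.sum((y_hat - y_final) ** 2)

    # Backward pass: gradients w.r.t. the SHARED weights are summed over time.
    dW_xh, dW_hh = np.zeros_like(W_xh), np.zeros_like(W_hh)
    db_h = np.zeros_like(b_h)
    dy = y_hat - y_final
    dW_hy = np.outer(dy, h[T - 1])
    dh = W_hy.T @ dy                      # gradient flowing into h_{T-1}
    for t in reversed(range(T)):
        dpre = (1.0 - h[t] ** 2) * dh     # back through the tanh
        dW_xh += np.outer(dpre, x_seq[t])
        dW_hh += np.outer(dpre, h[t - 1])
        db_h += dpre
        dh = W_hh.T @ dpre                # pass the gradient to the previous step
    return loss, (dW_xh, dW_hh, dW_hy, db_h)
```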

Truncated BPTT
• Backward propagation is truncated (gradients only flow back a fixed number of steps), but the forward computation is carried through the whole sequence as normal (sketched below)
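A hedged PyTorch sketch of truncated BPTT (the slides do not prescribe a framework): the hidden state is carried forward across chunks, but detaching it cuts the backward pass at chunk boundaries. The sizes and chunk_len are illustrative:

```python
import torch
import torch.nn as nn

rnn = nn.RNN(input_size=8, hidden_size=16, batch_first=True)
readout = nn.Linear(16, 8)
opt = torch.optim.SGD(list(rnn.parameters()) + list(readout.parameters()), lr=0.01)

x = torch.randn(1, 100, 8)          # one long sequence of 100 steps (dummy data)
target = torch.randn(1, 100, 8)
chunk_len = 20
h = None
for start in range(0, 100, chunk_len):
    xc = x[:, start:start + chunk_len]
    yc = target[:, start:start + chunk_len]
    out, h = rnn(xc, h)             # forward continues from the carried-over state
    loss = ((readout(out) - yc) ** 2).mean()
    opt.zero_grad()
    loss.backward()                 # back-prop only within this chunk
    opt.step()
    h = h.detach()                  # keep the state, drop its history
```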

Difficulties in training an RNN
• Exploding or vanishing gradients (illustrated below):
  • If the weights are small, gradients shrink exponentially
  • If the weights are big, gradients grow exponentially
• Typical feed-forward nets can cope with these effects because they have only a few hidden layers
• An RNN trained on very long sequences (e.g. 100 time steps) easily suffers from exploding or vanishing gradients
• Training ~10 million parameters on a CPU takes months
  • I once trained an RNN with 10^8 parameters on a CPU for Java-to-C# migration; it took about a month!
  • With a GPU (e.g. a Titan Z) and a proper parallel scheme, the same network trains about 20x faster (i.e. 1-2 days)
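A small numerical illustration (not from the slides) of why this happens: the backward pass repeatedly multiplies the gradient by the recurrent weight matrix, so its norm shrinks or grows exponentially with the number of time steps:

```python
import numpy as np

rng = np.random.default_rng(0)
for scale in (0.5, 1.5):             # "small" vs "big" recurrent weights
    W = scale * np.eye(10)           # a deliberately simple recurrent matrix
    g = rng.normal(size=10)          # some gradient arriving at the last step
    for t in range(100):             # push it back through 100 time steps
        g = W.T @ g
    print(scale, np.linalg.norm(g))  # ~1e-30 (vanishes) vs ~1e+18 (explodes)
```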

Effective ways to train an RNN
• Long Short Term Memory
  • Designed to remember values for a long time
• Hessian-Free Optimization
  • Deals with the vanishing gradient problem by using a fancy second-order (curvature-aware) optimizer
  • Refer to the HF optimizer (Martens and Sutskever, 2011) for details
• Echo State Network
• Good initialization with momentum

Long Short Term Memory
• Solves the problem of getting an RNN to remember things for a long time (hundreds of time steps)
• Hochreiter and Schmidhuber (1997) designed a memory cell using logistic and linear units with multiplicative interactions

Long Short Term Memory (cont.)
• Architecture of LSTM (one step of the cell is sketched below)
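A minimal sketch of one LSTM step, showing the multiplicative (logistic) gates and the linear cell-state path mentioned above; the parameter layout (four stacked gate blocks) is one common convention, not necessarily the one in the slides' figure:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM step. W, U, b hold the parameters of the four gates,
    stacked as [input, forget, output, candidate]."""
    z = W @ x + U @ h_prev + b
    i, f, o, g = np.split(z, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)   # multiplicative logistic gates
    g = np.tanh(g)                                 # candidate cell update
    c = f * c_prev + i * g                         # linear cell-state path (long memory)
    h = o * np.tanh(c)                             # gated output
    return h, c
```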

A few facts about RNN
• Canadian Press uses cursive handwriting recognition
  • An RNN model with LSTM by Alex Graves
  • Coordinates (x, y, p) of the pen as a sequence
  • Demo website
• Google Translate: sequence-to-sequence learning with an RNN (LSTM)
  • Before 2016, it used phrase-based statistical machine translation, with alignment models such as IBM Models 1 through 5
  • https://arxiv.org/pdf/1609.08144.pdf

Reference
• Ian Goodfellow et al., Deep Learning, Chapter 10
• Lecture notes and videos, Oxford, Hilary term 2017: https://github.com/oxford-cs-deepnlp-2017/lectures
• CS224d: Deep Learning for NLP, Stanford: http://cs224d.stanford.edu/syllabus.html
• Neural Networks for Machine Learning, Coursera, Lectures 7 and 8
• Blogs:
  • Andrej Karpathy, The Unreasonable Effectiveness of Recurrent Neural Networks
  • Shakir Mohamed, A Statistical View of Deep Learning: RNN and Dynamic System

Thank you!