Recurrent Neural Network

Thanh Nguyen [email protected]
Deep Learning series

What we have covered so far
• Neural nets and the back-propagation algorithm – Ch. 6
• Training a deep net – Ch. 8
• Convolutional neural nets (CNN) – Ch. 9
Today's topic – Ch. 10
• Recurrent neural nets (RNN) for modelling sequences
• Long Short-Term Memory (LSTM) for long-term dependencies
What remains (and is probably interesting) in the book
• Factor models and auto-encoders – Ch. 13-14
• Deep generative models, with the appealing GAN – Ch. 20

Machine learning on sequences – Motivation
• So far we have mainly dealt with prediction/classification problems that treat data as independent feature vectors
• Data, however, can present themselves as time series or sequences
• E.g. email spam classification represents a document/sentence/phrase as a "one-hot" feature vector (e.g. bag of words)
• In fact, an English phrase or sentence is a sequence of words with order and with syntactic and semantic dependencies
  • "Alice went to the beach. There she built a …"
  • "You shall know a word by the company it keeps" (J. R. Firth, 1957)

Sequential data in nature and applications
• Natural language – a sentence is a sequence of words or characters
  • Language modelling (LM) for machine translation (e.g. Google Translate)
  • Question answering (an application of recursive nets)
  • Image captioning (in combination with a CNN)
  • Text to speech
• Speech – a sequence of sound waves
  • Speech recognition (e.g. Siri, Google Assistant…)
• Programming languages – a statement is a sequence of APIs or tokens (e.g. for, while, if-else…)
  • Program synthesis: English description to Java code
• Cursive handwriting – a sequence of coordinates of the pen tip
• Video – a sequence of frames over time

Memoryless models
• Autoregressive models
  • E.g. a second-order AR model to predict stock prices, temperatures… (see the AR(2) sketch in the appendix below)
• Feed-forward neural networks
  • Generalize autoregressive models with hidden units
  • E.g. Bengio's neural probabilistic language model

Linear dynamical systems
• E.g. the Kalman filter, or the PHD filter in real-time tracking systems
• Generative models with a real-valued hidden state
• The hidden state has linear dynamics (i.e. a motion equation)
• Predictions are made from the hidden state

Hidden Markov Model (HMM)
• Developed in the mid-1970s for speech recognition
• HMMs have a discrete state, with transitions between states
• The outputs produced by a state are stochastic

Recurrent network – architecture
• (Figure slide; a forward-pass sketch is given in the appendix below)

Recurrent net (cont.)
• Powerful thanks to three properties:
  • The hidden state lets the model store information about the past
  • Parameters are shared over time
  • Non-linear dynamics
• An RNN is a universal approximator: it can compute any function computable by a computer (i.e. a Turing machine)!
• Limitations:
  • The very power of RNNs makes them hard to train
  • Typical RNNs have difficulty dealing with long-range dependencies

RNN design patterns
• E.g. video classification: frame-level labels
• E.g. sentiment analysis: sentence → sentiment (+/−)
• E.g. image captioning: image → sequence of words
• E.g. machine translation: English → French

RNN with output recurrence
• Recurrent connections go from output to hidden, not hidden-to-hidden
• Less powerful, unless the output is very high-dimensional and rich

Training an RNN with output recurrence
• Advantage: training can be parallelized, because the computations at different time steps are decoupled
• Teacher-forcing training technique: at training time, feed the correct output as the input at the next time step (see the teacher-forcing sketch in the appendix below)

RNN for language modelling
• An unfolded RNN is a directed acyclic computation graph
• We can therefore train it like a regular feed-forward neural net

Training an RNN with back-prop
• The back-propagation-through-time (BPTT) algorithm

Truncated BPTT
• The backward pass is truncated partway through the sequence, while the forward computation proceeds as normal (see the truncated-BPTT sketch in the appendix below)

Difficulties in training an RNN
• Exploding or vanishing gradients:
  • If the weights are small, gradients shrink exponentially
  • If the weights are big, gradients grow exponentially
• Typical feed-forward nets can cope with these effects because they have only a few hidden layers
• An RNN trained on very long sequences (e.g. 100 time steps) easily suffers from exploding or vanishing gradients
• Training ~10 million parameters on a CPU takes months
  • I once trained an RNN with 10^8 parameters on a CPU for Java-to-C# migration; it took a month!
  • With a GPU (e.g. a Titan Z) and a proper parallelization scheme, the same network trains about 20x faster (i.e. 1-2 days)

Effective ways to train an RNN
• Long Short-Term Memory: designed to remember values for a long time
• Hessian-Free Optimization: deals with the vanishing-gradient problem by using a fancy optimizer; refer to the HF optimizer (Martens and Sutskever, 2011) for details
• Echo State Networks
• Good initialization with momentum

Long Short-Term Memory
• Solves the problem of getting an RNN to remember things for a long time (like hundreds of time steps)
• Hochreiter and Schmidhuber (1997) designed a memory cell using logistic and linear units with multiplicative interactions (see the LSTM-cell sketch in the appendix below)

Long Short-Term Memory (cont.)
• Architecture of the LSTM (figure slides)

A few facts about RNNs
• Canadian Press uses cursive handwriting recognition
  • An RNN model with LSTM, by Alex Graves
  • Pen coordinates (x, y, p) form the input sequence
  • Demo website
• Google Translate: sequence-to-sequence learning with RNNs (LSTM)
  • Before 2016 it used phrase-based statistical machine translation, with alignment models (e.g. IBM Models 1 through 5)
  • https://arxiv.org/pdf/1609.08144.pdf

References
• Ian Goodfellow et al., Deep Learning, Chapter 10
• Lecture notes and videos, Oxford, Hilary term 2017: https://github.com/oxford-cs-deepnlp-2017/lectures
• CS224d: Deep Learning for NLP, Stanford: http://cs224d.stanford.edu/syllabus.html
• Neural Networks for Machine Learning, Coursera, Lectures 7 and 8
• Blogs:
  • Andrej Karpathy, The Unreasonable Effectiveness of Recurrent Neural Networks
  • Shakir Mohamed, A Statistical View of Deep Learning: RNNs and Dynamical Systems

Thank you!
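Appendix: illustrative code sketches

The sketches below are not part of the original slides; they are minimal NumPy illustrations of techniques the slides name, and every function name, variable, and size in them is an assumption chosen for exposition.

First, the "memoryless" second-order autoregressive model from the Memoryless models slide: predict x_t from the two previous values x_{t-1} and x_{t-2}, fit here by ordinary least squares on a toy series.

```python
import numpy as np

def fit_ar2(x):
    """Fit x_t ~ a*x_{t-1} + b*x_{t-2} + c by ordinary least squares."""
    X = np.column_stack([x[1:-1], x[:-2], np.ones(len(x) - 2)])
    y = x[2:]
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef  # (a, b, c)

# Toy "temperature" series: a noisy damped oscillation.
rng = np.random.default_rng(0)
t = np.arange(200)
x = np.sin(0.3 * t) * np.exp(-0.005 * t) + 0.05 * rng.standard_normal(200)
a, b, c = fit_ar2(x)
pred = a * x[-1] + b * x[-2] + c  # one-step-ahead forecast
print(a, b, c, pred)
```

The model is memoryless in the slide's sense: only a fixed window of past observations is visible, with no hidden state carrying older information forward.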
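Next, a sketch of the vanilla RNN recurrence behind the architecture slide, under the standard textbook form h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h), y_t = W_hy h_t + b_y. Note the two properties the slides emphasize: the same weights are reused at every time step (parameter sharing), and the hidden state stores information about the past.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, n_out = 8, 16, 4
W_xh = rng.standard_normal((n_hid, n_in)) * 0.1
W_hh = rng.standard_normal((n_hid, n_hid)) * 0.1
W_hy = rng.standard_normal((n_out, n_hid)) * 0.1
b_h, b_y = np.zeros(n_hid), np.zeros(n_out)

def rnn_forward(xs, h):
    """Run the RNN over a sequence xs (a list of input vectors)."""
    ys, hs = [], [h]
    for x in xs:
        # Same W_xh, W_hh, b_h at every step: parameter sharing over time.
        h = np.tanh(W_xh @ x + W_hh @ h + b_h)  # hidden state summarizes the past
        ys.append(W_hy @ h + b_y)
        hs.append(h)
    return ys, hs

xs = [rng.standard_normal(n_in) for _ in range(5)]
ys, hs = rnn_forward(xs, np.zeros(n_hid))
```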
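The following sketch illustrates teacher forcing for the output-recurrent RNN described above. At training time the correct previous output is fed back as the next input, which decouples the time steps; at test time the model's own prediction is fed back instead. The one-step cell here is hypothetical, not an API from any library.

```python
import numpy as np

def step(prev_output, h, params):
    """One hypothetical RNN step: consume the previous output, update h, emit y."""
    W_in, W_hh, W_out = params
    h = np.tanh(W_in @ prev_output + W_hh @ h)
    return W_out @ h, h

def run(targets, params, teacher_forcing=True):
    h = np.zeros(params[1].shape[0])
    prev = np.zeros(params[2].shape[0])
    outputs = []
    for y_true in targets:
        y_pred, h = step(prev, h, params)
        outputs.append(y_pred)
        # Teacher forcing: condition the next step on the ground truth,
        # not on the (possibly wrong) prediction.
        prev = y_true if teacher_forcing else y_pred
    return outputs

rng = np.random.default_rng(0)
n_hid, n_out = 16, 4
params = (rng.standard_normal((n_hid, n_out)) * 0.1,
          rng.standard_normal((n_hid, n_hid)) * 0.1,
          rng.standard_normal((n_out, n_hid)) * 0.1)
targets = [rng.standard_normal(n_out) for _ in range(5)]
train_outs = run(targets, params, teacher_forcing=True)
```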
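This sketch illustrates truncated BPTT on the vanilla RNN, under the simplifying assumption of a toy loss placed on the final hidden state only. The forward pass runs over the whole sequence, while the backward pass walks the unfolded graph from the last step and simply stops after k steps, exactly the truncation described on the slide.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, T, k = 4, 8, 20, 5
W_xh = rng.standard_normal((n_hid, n_in)) * 0.1
W_hh = rng.standard_normal((n_hid, n_hid)) * 0.1
xs = [rng.standard_normal(n_in) for _ in range(T)]

# Full forward pass: keep every hidden state for the backward pass.
hs = [np.zeros(n_hid)]
for x in xs:
    hs.append(np.tanh(W_xh @ x + W_hh @ hs[-1]))

# Backward pass, truncated to the last k steps (toy loss: L = sum(h_T)).
dh = np.ones(n_hid)                      # dL/dh_T for this toy loss
dW_hh = np.zeros_like(W_hh)
for t in range(T - 1, T - 1 - k, -1):    # stop after k steps: truncation
    dpre = dh * (1.0 - hs[t + 1] ** 2)   # back through tanh at step t
    dW_hh += np.outer(dpre, hs[t])       # contribution of W_hh @ h_{t-1}
    dh = W_hh.T @ dpre                   # propagate to h_{t-1}; repeated
    # multiplication by W_hh.T is what makes gradients shrink (small
    # weights) or grow (big weights) exponentially over long spans
```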
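Finally, one LSTM memory cell in the spirit of Hochreiter and Schmidhuber's design: logistic (sigmoid) gates multiplicatively control what enters, stays in, and leaves a linear cell state c, which is what lets information survive hundreds of time steps. The gate ordering and weight layout here are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, b):
    """One LSTM step. W: (4*n_hid, n_in + n_hid), b: (4*n_hid,)."""
    z = W @ np.concatenate([x, h]) + b
    i, f, o, g = np.split(z, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)  # input/forget/output gates
    g = np.tanh(g)                                # candidate values
    c = f * c + i * g      # linear self-loop: gated memory of the past
    h = o * np.tanh(c)     # gated exposure of the memory
    return h, c

rng = np.random.default_rng(0)
n_in, n_hid = 8, 16
W = rng.standard_normal((4 * n_hid, n_in + n_hid)) * 0.1
b = np.zeros(4 * n_hid)
h, c = np.zeros(n_hid), np.zeros(n_hid)
for x in (rng.standard_normal(n_in) for _ in range(5)):
    h, c = lstm_step(x, h, c, W, b)
```

The crucial design choice is the additive update c = f*c + i*g: along this path the gradient is scaled by the forget gate rather than repeatedly multiplied by a weight matrix, which mitigates the vanishing-gradient problem discussed above.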
