LEARNING MULTI-SCALE TEMPORAL DYNAMICS WITH RECURRENT NEURAL NETWORKS
GRAHAM TAYLOR
SCHOOL OF ENGINEERING, UNIVERSITY OF GUELPH
Workshop on Modelling and Inference for Dynamics on Networks
NIPS 2015, Montreal, Canada

Collaborators
• Natalia Neverova and Christian Wolf (INSA-Lyon)
• Griffin Lacey (University of Guelph)
• Lex Fridman (MIT)
• Brandon Barbello and Deepak Chandra (Google)

Deep Learning Successes
• Vision
- Object recognition and detection
- Pose estimation and activity recognition
• Speech
- Speech recognition
- Speaker authentication
• Natural Language Processing
- Question answering
- Machine translation
Figure: first page of “DeepPose: Human Pose Estimation via Deep Neural Networks” (Toshev and Szegedy, arXiv:1312.4659)
Two “Hammers”
• Convolutional Neural Nets
• Recurrent Neural Nets
Figure: an LSTM memory cell - inputs x_t, h_{t-1}, c_{t-1}; gates i_t, g_t, f_t, o_t; elementwise (⨉) gating producing c_t and h_t

Recurrent Neural Networks
Figure: the RNN diagram, built up over successive slides - the input x_t enters through W, the previous hidden state h_{t-1} through U, the two are summed (+) to give h_t, and the output is o_t = \phi(V h_t); the final frame unrolls the recurrence over steps t-1, t, t+1

RNN (Update)
h_t = \phi(U h_{t-1} + W x_t)
o_t = \phi(V h_t)
(a code sketch of this update follows the “Solutions: Architectural” slide below)

RNNs: The Promise
• A possibly high-dimensional, distributed internal representation and nonlinear dynamics allow the model, in theory, to model complex time series - and capture long-range dependencies!
• Gradients can be computed exactly via Backpropagation Through Time

Generality
Figure: Andrej Karpathy
http://karpathy.github.io/2015/05/21/rnn-effectiveness/

RNNs: The Reality
• A powerful and interesting architecture - but:
- Training RNNs via gradient descent fails even on simple problems
- Attributed to “vanishing” or “exploding” gradients

Solutions: Architectural
• Gating
- Long Short-term Memory (Hochreiter and Schmidhuber 1997)
- Gated Recurrent Units (Chung et al. 2014)
• “Structured” depth (Pascanu et al. 2014)
• Rectifiers (Le, Jaitly and Hinton 2015)
• Units operating at different timescales (Hihi and Bengio 1996, Koutnik et al. 2014)
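To ground what these architectural fixes are modifying, here is a minimal NumPy sketch of the plain RNN update from the “RNN (Update)” slide. The sizes, the choice of tanh for the hidden nonlinearity, and softmax for the output nonlinearity \phi are illustrative assumptions, not taken from the slides.

import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, n_out = 8, 16, 4

W = 0.1 * rng.standard_normal((n_hid, n_in))   # input -> hidden
U = 0.1 * rng.standard_normal((n_hid, n_hid))  # hidden -> hidden (recurrence)
V = 0.1 * rng.standard_normal((n_out, n_hid))  # hidden -> output

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_forward(xs):
    """h_t = tanh(U h_{t-1} + W x_t); o_t = phi(V h_t)."""
    h = np.zeros(n_hid)                  # h_0
    outputs = []
    for x in xs:                         # one step per input vector x_t
        h = np.tanh(U @ h + W @ x)       # hidden-state update
        outputs.append(softmax(V @ h))   # output o_t
    return outputs

outs = rnn_forward([rng.standard_normal(n_in) for _ in range(5)])

The LSTM on the next slide replaces the single tanh update with a gated memory cell.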
LSTM
Figure: LSTM memory cell - sigmoid and tanh layers with elementwise (⨉) gating of the cell state c_t

i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)
g_t = \tanh(W_g [h_{t-1}, x_t] + b_g)
f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)
c_t = f_t \odot c_{t-1} + i_t \odot g_t
o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)
h_t = o_t \odot \tanh(c_t)

Recommended reading: Chris Olah: Understanding LSTMs (blog)
http://colah.github.io/posts/2015-08-Understanding-LSTMs/

Solutions: Optimization
• 2nd-order methods (Martens and Sutskever 2011)
• Reservoirs (Jaeger and Haas 2004)
• Gradient clipping/normalization (Pascanu et al. 2012, Mikolov 2012)
• Careful initialization (Sutskever et al. 2013)

Clockwork RNN (CW-RNN)
• LSTM is powerful but overly complex
• Clockwork RNN (Koutnik et al. 2014) proposes a simpler architectural change
- Partition the hidden units into separate modules which operate at different “clocks”
- Actually reduces the number of parameters compared to a vanilla RNN

CW-RNN
Figure: CW-RNN with the hidden state h_t partitioned into modules k = 0, 1, 2 between input x_t and output o_t

CW-RNN (Update)
h_t^{(k)} = \begin{cases} \phi\big(U^{(k)} h_{t-1}^{(k)} + W^{(k)} x_t\big) & \text{if } (t \bmod n^k) = 0 \\ h_{t-1}^{(k)} & \text{otherwise} \end{cases}

CW-RNN (Example)
For n = 2, t = 6: the module clock periods n^k are 1, 2, 4, 8, 16. Only the modules with periods 1 and 2 satisfy (t mod n^k) = 0, so only the corresponding blocks of U and W are used; all other modules simply carry their state forward (sketched in code below).
Figure: block structure of the recurrent matrix U and input matrix W across the five modules
Figure adapted from (Koutnik et al. 2014)
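A minimal sketch of the clockwork update rule above. The sizes, tanh for \phi, and the purely per-module recurrence written on the slide are assumptions; the full model of Koutnik et al. (2014) also lets slower modules feed faster ones through a block-upper-triangular recurrent matrix.

import numpy as np

rng = np.random.default_rng(0)
n_in, module_size, n_modules, n = 8, 4, 5, 2
periods = [n ** k for k in range(n_modules)]   # clock periods: 1, 2, 4, 8, 16

# Per-module recurrent and input weight blocks U^(k), W^(k).
U = [0.1 * rng.standard_normal((module_size, module_size)) for _ in range(n_modules)]
W = [0.1 * rng.standard_normal((module_size, n_in)) for _ in range(n_modules)]

def cw_rnn_step(h, x, t):
    """One CW-RNN step; h is a list of per-module state vectors h^(k)."""
    new_h = []
    for k in range(n_modules):
        if t % periods[k] == 0:   # module k fires only on its clock ticks
            new_h.append(np.tanh(U[k] @ h[k] + W[k] @ x))
        else:                     # otherwise its state is carried forward unchanged
            new_h.append(h[k])
    return new_h

h = [np.zeros(module_size) for _ in range(n_modules)]
for t in range(1, 7):             # at t = 6, only modules k = 0 and k = 1 update
    h = cw_rnn_step(h, rng.standard_normal(n_in), t)

Because slow modules update rarely, gradients reach far-back time steps through far fewer nonlinear transitions, which is how the CW-RNN targets multi-scale temporal structure with fewer parameters than a comparably sized vanilla RNN.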