LEARNING MULTI-SCALE TEMPORAL DYNAMICS WITH RECURRENT NEURAL NETWORKS
GRAHAM TAYLOR SCHOOL OF ENGINEERING UNIVERSITY OF GUELPH
Workshop on Modelling and Inference for Dynamics on Networks, NIPS 2015, Montreal, Canada
Collaborators
• Natalia Neverova and Christian Wolf (INSA-Lyon)
• Griffin Lacey (University of Guelph)
• Lex Fridman (MIT)
• Brandon Barbello and Deepak Chandra (Google)
11 Dec 2015 / 2 NIPS Workshop on Dynamics ・ Learning Multi-scale Temporal Dynamics / G Taylor
Deep Learning Successes
• Vision
- Object recognition and detection
- Pose estimation and activity recognition
• Speech
- Speech recognition
- Speaker authentication
• Natural Language Processing
- Question answering
- Machine translation
[Figure: first page of Toshev and Szegedy (Google), "DeepPose: Human Pose Estimation via Deep Neural Networks", arXiv:1312.4659v3 [cs.CV], 20 Aug 2014]
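Two “Hammers”
• Convolutional Neural Nets
• Recurrent Neural Nets
[Figure: diagram of a gated recurrent (LSTM-style) cell with inputs x_t, h_{t-1} and c_{t-1}, gates i_t, g_t, f_t and o_t, and outputs c_t and h_t]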
Recurrent Neural Networks
[Diagram, built up over successive slides: the input x_t is mapped to W x_t; the previous hidden state contributes U h_{t-1}; the two terms are summed (+) to give the hidden state h_t; the output o_t is added next]
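The recurrence in the diagram can be sketched in a few lines of plain Python. This is a minimal illustration, not the speaker's implementation: the slides show only the sum W x_t + U h_{t-1}, so the tanh nonlinearity, the helper names (`matvec`, `rnn_step`, `run_rnn`), and the toy weights are all assumptions made for the example.

```python
import math

def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

def rnn_step(W, U, x_t, h_prev, f=math.tanh):
    """One recurrent update: h_t = f(W x_t + U h_{t-1}).

    The nonlinearity f is an assumption; the slides show only the sum.
    """
    wx = matvec(W, x_t)      # input contribution, W x_t
    uh = matvec(U, h_prev)   # recurrent contribution, U h_{t-1}
    return [f(a + b) for a, b in zip(wx, uh)]

def run_rnn(W, U, xs, h0):
    """Unroll the recurrence over a sequence and return every hidden state."""
    hs = []
    h = h0
    for x_t in xs:
        h = rnn_step(W, U, x_t, h)
        hs.append(h)
    return hs

# Toy example with hypothetical weights: 2-d inputs, 2-d hidden state.
W = [[0.5, 0.0], [0.0, 0.5]]
U = [[0.0, 1.0], [1.0, 0.0]]
xs = [[1.0, 0.0], [0.0, 0.0]]
hs = run_rnn(W, U, xs, h0=[0.0, 0.0])
```

In this toy run the second input is all zeros, yet the second hidden state is still non-zero: the earlier input reaches it only through the U h_{t-1} term, which is the memory the diagram depicts.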