LEARNING MULTI-SCALE TEMPORAL DYNAMICS WITH RECURRENT NEURAL NETWORKS

GRAHAM TAYLOR, SCHOOL OF ENGINEERING, UNIVERSITY OF GUELPH

Workshop on Modelling and Inference for Dynamics on Networks, NIPS 2015, Montreal, Canada

Collaborators

• Natalia Neverova and Christian Wolf (INSA-Lyon)

• Griffin Lacey (University of Guelph)

• Lex Fridman (MIT)

• Brandon Barbello and Deepak Chandra (Google ATAP)

Deep Learning Successes

• Vision

- Object recognition and detection

- Pose estimation and activity recognition

• Speech

- Speech recognition

- Speaker authentication

• Natural Language Processing

- Question answering

- Machine translation

[Slide shows the first page of "DeepPose: Human Pose Estimation via Deep Neural Networks" by Alexander Toshev and Christian Szegedy (Google), arXiv:1312.4659.]

Two “Hammers”

Convolutional Neural Nets Recurrent Neural Nets

[Diagram: an LSTM-style recurrent cell with gates i_t, g_t, f_t, o_t acting on a memory cell c_t.]

Recurrent Neural Networks

[Slide builds up an unrolled RNN diagram: input x_t enters through weights W; the hidden state h_t combines W x_t with the recurrent term U h_{t-1}; the output o_t is read out as V h_t. The diagram unrolls over t-1, t, t+1.]

RNN (Update)

$h_t = \sigma(U h_{t-1} + W x_t)$
$o_t = \sigma(V h_t)$

[Unrolled diagram as on the previous slide.]
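To make the update concrete, here is a minimal NumPy sketch of an unrolled forward pass (tanh and softmax are illustrative choices; the slide leaves the nonlinearity σ unspecified):

```python
import numpy as np

def rnn_forward(x_seq, U, W, V, h0):
    """Unrolled vanilla RNN: h_t = tanh(U h_{t-1} + W x_t), o_t = softmax(V h_t).
    x_seq: (T, input_dim); h0: (hidden_dim,)."""
    h, outputs = h0, []
    for x_t in x_seq:
        h = np.tanh(U @ h + W @ x_t)                            # recurrent state update
        logits = V @ h                                          # linear readout
        outputs.append(np.exp(logits) / np.exp(logits).sum())   # softmax output
    return np.stack(outputs), h

# Example with random weights: 5 hidden units, 3-D inputs, 4 output classes
rng = np.random.default_rng(0)
U, W, V = rng.normal(size=(5, 5)), rng.normal(size=(5, 3)), rng.normal(size=(4, 5))
outs, h_T = rnn_forward(rng.normal(size=(10, 3)), U, W, V, np.zeros(5))
```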

RNNs: The Promise

• A possibly high-dimensional, distributed internal representation and nonlinear dynamics allow the model, in theory, to model complex time series

- and capture long-range dependencies!

• Gradients can be computed exactly via Backpropagation Through Time

Generality

Figure: Andrej Karpathy, http://karpathy.github.io/2015/05/21/rnn-effectiveness/

RNNs: The Reality

• A powerful and interesting architecture - but…

- Training RNNs via gradient descent fails on simple problems

- Attributed to “vanishing” or “exploding” gradients

Solutions: Architectural

• Gating

- Long Short-Term Memory (Hochreiter and Schmidhuber 1997)

- Gated Recurrent Units (Chung et al. 2014)

• “Structured” depth (Pascanu et al. 2014)

• Rectifiers (Le, Jaitly and Hinton 2015)

• Units operating at different timescales (Hihi and Bengio 1996, Koutnik et al. 2014)


LSTM

[Diagram: memory cell c_t with input gate i_t, candidate g_t, forget gate f_t, and output gate o_t; legend: sigmoid layer, tanh layer, element-wise operation, gating.]

$i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)$
$g_t = \tanh(W_g [h_{t-1}, x_t] + b_g)$
$f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)$
$o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)$
$c_t = f_t \odot c_{t-1} + i_t \odot g_t$
$h_t = o_t \odot \tanh(c_t)$

Recommended reading: Chris Olah, Understanding LSTMs (blog): http://colah.github.io/posts/2015-08-Understanding-LSTMs/
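As a companion to the equations, a minimal NumPy sketch of one LSTM step (weight shapes and the concatenation layout are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, Wi, bi, Wg, bg, Wf, bf, Wo, bo):
    """One LSTM update; each W* acts on the concatenation [h_{t-1}, x_t]."""
    z = np.concatenate([h_prev, x_t])
    i = sigmoid(Wi @ z + bi)        # input gate
    g = np.tanh(Wg @ z + bg)        # candidate cell update
    f = sigmoid(Wf @ z + bf)        # forget gate
    o = sigmoid(Wo @ z + bo)        # output gate
    c = f * c_prev + i * g          # c_t = f_t ⊙ c_{t-1} + i_t ⊙ g_t
    h = o * np.tanh(c)              # h_t = o_t ⊙ tanh(c_t)
    return h, c
```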


Solutions: Optimization

• 2nd-order methods (Martens and Sutskever 2011)

• Reservoirs (Jaeger and Haas 2004)

• Gradient clipping/normalization (Pascanu et al. 2012, Mikolov 2012)

• Careful initialization (Sutskever et al. 2013)

Clockwork RNN (CW-RNN)

• LSTM is powerful but overly complex

• Clockwork RNN (Koutnik et al. 2014) proposes a simpler architectural change

- Partition hidden units into separate modules which operate at different “clocks”

- Actually reduces the number of parameters compared to a vanilla RNN

CW-RNN

[Diagram: hidden units partitioned into modules k = 0, 1, 2, each running at its own clock, mapping input x_t through hidden state h_t to output o_t.]

CW-RNN (Update)

[Same module diagram as the previous slide.]

$h_t^{(k)} = \begin{cases} \sigma\!\left(U^{(k)} h_{t-1} + W^{(k)} x_t\right) & \text{if } t \bmod n^k = 0 \\ h_{t-1}^{(k)} & \text{otherwise} \end{cases}$
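A minimal NumPy sketch of this clockwork update (band sizes, base n = 2, and tanh are illustrative; the block-triangular structure of U from the CW-RNN paper is left dense here for brevity):

```python
import numpy as np

def cwrnn_step(x_t, h_prev, t, U, W, band_of_unit, n=2):
    """Clockwork update: band k refreshes only when t mod n**k == 0.
    band_of_unit[j] gives the band index k of hidden unit j."""
    h_new = np.tanh(U @ h_prev + W @ x_t)          # candidate update for all units
    active = (t % (n ** band_of_unit)) == 0        # which bands fire at time t
    return np.where(active, h_new, h_prev)         # inactive bands keep their state

# Example: 6 hidden units split across bands 0, 1, 2 (periods 1, 2, 4)
band_of_unit = np.array([0, 0, 1, 1, 2, 2])
rng = np.random.default_rng(0)
U, W = rng.normal(size=(6, 6)), rng.normal(size=(6, 3))
h = np.zeros(6)
for t in range(8):
    h = cwrnn_step(rng.normal(size=3), h, t, U, W, band_of_unit)
```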

CW-RNN (Example)

For n = 2, t = 6: modules have periods 1, 2, 4, 8, 16. Only modules whose period divides t (here, periods 1 and 2) are updated, activating the corresponding block rows of U and W; the remaining modules copy their state from h_{t-1}.

[Figure adapted from Koutnik et al. 2014]

CW-RNN: Problems

• “Slow” (low-frequency) units:

- Inactivity over long periods of time

- As a result, undertrained

- Barely contribute to prediction

• Shift-variant

- The network responds differently to the same input stimulus applied at different moments in time

RNNs: Shift-invariance

[Figure: hidden activity (y-axis) vs. time (x-axis) for the original sequence, a vanilla RNN, and a CW-RNN. HMOG dataset: reading/walking (top), writing/sitting (bottom).]

Dense CW-RNN

[Diagram: as in the CW-RNN, hidden units are partitioned into modules k = 0, 1, 2 between x_t and o_t, but every module is updated at every time step.]

Dense CW-RNN (Example)

Again n = 2, t = 6.

[Figure, adapted from Koutnik et al. 2014: the same module periods 1, 2, 4, 8, 16 as before, but instead of activating only some block rows of U against h_{t-1}, U is applied to a matrix H of past hidden states and the result is read off by extracting the diagonal and vectorizing, so every module is updated at every step.]
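The slides don't spell out the dense update, so here is a hedged NumPy sketch of one plausible reading of the "extract diagonal and vectorize" hint: every band updates at every step, with band k recurring over the hidden state from n^k steps earlier. Band layout and tanh are illustrative assumptions, not the authors' exact formulation:

```python
import numpy as np
from collections import deque

def dcwrnn_step(x_t, history, U, W, band_of_unit, n=2):
    """One reading of the dense clockwork update: every unit refreshes
    at every step, but a unit in band k recurs over the hidden state
    from n**k steps back. history[d] holds h_{t-1-d} (most recent first)."""
    lookback = n ** band_of_unit - 1                # history index per unit
    rec = np.array([U[j] @ history[min(int(d), len(history) - 1)]
                    for j, d in enumerate(lookback)])
    return np.tanh(rec + W @ x_t)

# Example: 6 units in bands 0, 1, 2; keep enough history for the slowest band
band_of_unit = np.array([0, 0, 1, 1, 2, 2])
rng = np.random.default_rng(0)
U, W = rng.normal(size=(6, 6)), rng.normal(size=(6, 3))
history = deque([np.zeros(6)], maxlen=int(2 ** band_of_unit.max()))
for t in range(8):
    h = dcwrnn_step(rng.normal(size=3), history, U, W, band_of_unit)
    history.appendleft(h)
```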

DCW-RNN: Shift-invariance

[Figure: hidden activity (y-axis) vs. time (x-axis) for the original sequence, a vanilla RNN, a CW-RNN, and a DCW-RNN.]

Application

• Entering PIN codes is painful

• Automatically and continuously authenticate users based on their interaction with the device

• Lock the phone when theft is detected

• E.g.: 13 sensors to exploit

Project Abacus (Google ATAP)

• 1500 volunteers, issued Nexus 5

• Several months of natural daily usage

• 27.6 TB of data

• Sensors include camera, touchscreen, GPS, Bluetooth, WiFi, cell antenna, accelerometer, gyroscope

• Our contribution restricted to:

- accelerometer (linear acceleration)

- gyroscope (angular velocity)

Constraint: Embedded Processing

• Cloud-based processing limited by privacy and latency

• Authentication must be performed on device

• Constrained by storage, memory, and processing power

• Adapting to a new user should be quick (limited data)

Possible Solutions

Discriminative

• Offline training of deep neural networks on a large amount of data, multiple classes (= devices/users)

• Training a new binary classifier (user vs. rest) on the device

Conclusion: resources too limited

Generative

• Train offline a general data distribution: a Universal Background Model (UBM)

• Deploy to phones

• For each user, use a very small amount of samples to perform on-device adaptation of the UBM to a client model

GMM-based Biometrics

Feature extractor (RNN, convnet, etc.): $H = \{h_t\}$

Train a GMM on features: $p(h \mid \Theta_{UBM}) = \sum_{i=1}^{M} \pi_i \, \mathcal{N}(h; \mu_i, \Sigma_i)$

MAP-adapt the GMM for a given user: $\Theta_{UBM} \rightarrow \Theta_{client}$

Scoring: $\Lambda(H) = \log p(H \mid \Theta_{client}) - \log p(H \mid \Theta_{UBM})$
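A sketch of this pipeline under stated assumptions: scikit-learn's GaussianMixture stands in for the UBM (diagonal covariances), and adaptation is Reynolds-style means-only relevance MAP. The talk doesn't specify which parameters are adapted, so treat this as illustrative:

```python
import copy
import numpy as np
from sklearn.mixture import GaussianMixture

def train_ubm(dev_features, M=256, seed=0):
    """Fit the Universal Background Model on pooled development features."""
    return GaussianMixture(n_components=M, covariance_type='diag',
                           max_iter=100, random_state=seed).fit(dev_features)

def map_adapt(ubm, client_features, r=16.0):
    """Means-only relevance MAP: mu_i <- a_i*E_i + (1-a_i)*mu_i,
    with a_i = n_i / (n_i + r); weights and covariances stay shared
    with the UBM."""
    gamma = ubm.predict_proba(client_features)          # responsibilities (T, M)
    n = gamma.sum(axis=0)                               # soft counts per component
    E = (gamma.T @ client_features) / np.maximum(n, 1e-8)[:, None]
    alpha = (n / (n + r))[:, None]
    client = copy.deepcopy(ubm)
    client.means_ = alpha * E + (1 - alpha) * ubm.means_
    return client

def llr_score(H, client, ubm):
    """Lambda(H) = log p(H | client) - log p(H | UBM), per frame on average."""
    return client.score(H) - ubm.score(H)
```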

[Pipeline diagram, built up over several slides: development data D with development labels train the feature extractor; extracted features train a UBM-GMM via EM; MAP adaptation yields per-user client models on the validation set V and the test set T; validation labels are used to set the decision threshold τ; the test client models produce the test predictions.]

Dynamic Data Representations

Implicit modeling of dynamics (static convnet aggregating temporal statistics) vs. explicit modeling of dynamics (various flavours of RNNs)

[Diagram, two columns: in each, a synchronized stream of accelerometer and gyroscope data (long sequences in one case, a set of short sequences in the other) passes through convolutional feature maps (1)-(3) with discriminative pretraining to a fully connected layer, whose output feeds the distribution of the extracted dynamic features.]

In either case, 1st layer is convolutional

Experimental Setup (1)

• 1500 users/devices total.

• Data recorded between unlock and lock of device (=“session”).

• Data resampled to 50Hz.

• 587 devices for discriminative feature learning.

• 150 devices as validation set for tuning.

• 150 devices as “clients” for testing.

Experimental Setup (2)

• RNNs trained on sequences of 20 blocks of 50 samples with 50% overlap ≈ 10 sec (see the windowing sketch after this list).

• Training with SGD on NLL loss, dropout on FC layers.

• Implemented in Theano.

• Computation on 8 Nvidia Tesla K80 GPUs.

• CWRNN & DCWRNN: 3 bands, exponential, base=2.

• PCA reduction of feature space to 100 (post representation learning).

• GMMs with 256 mixture components, EM with 100 iterations.

• All hyper-parameters optimized on validation set.
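A small sketch of the block windowing as I read it (50-sample blocks, 50% overlap, 20 blocks ≈ 10 s at 50 Hz; function and variable names are illustrative):

```python
import numpy as np

def make_blocks(stream, block=50, overlap=0.5, n_blocks=20):
    """Cut a (T, channels) sensor stream into overlapping blocks.
    With block=50, overlap=0.5, n_blocks=20 and a 50 Hz stream, one
    sequence of blocks spans roughly 10 seconds."""
    hop = int(block * (1 - overlap))                 # 25 samples between block starts
    starts = range(0, len(stream) - block + 1, hop)
    blocks = [stream[s:s + block] for s in starts]
    # Group consecutive blocks into fixed-length training sequences
    seqs = [np.stack(blocks[i:i + n_blocks])
            for i in range(0, len(blocks) - n_blocks + 1, n_blocks)]
    return seqs  # each element: (n_blocks, block, channels)

# Example: a fake 60 s, 6-channel (accelerometer + gyroscope) stream at 50 Hz
stream = np.random.default_rng(0).normal(size=(3000, 6))
seqs = make_blocks(stream)
```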

Results: Project Abacus Data

Feature extraction model | Accuracy (%) | # params
ST Convnet               | 37.13        | 6,102,137
LT Convnet               | 56.46        | 6,102,137
Conv-RNN                 | 64.57        | 1,960,295
Conv-CWRNN               | 68.83        | 1,964,254
Conv-LSTM                | 68.92        | 1,965,403
Conv-DCWRNN              | 69.41        | 1,964,254

Accuracy is defined as the percentage of cases where the correct user was within the top 5% of classes by predicted probability.
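The accuracy metric as a short sketch (a hedged reading of the definition above; shapes and names are illustrative):

```python
import numpy as np

def top_fraction_accuracy(probs, true_ids, fraction=0.05):
    """Fraction of cases where the true user is among the top
    `fraction` of classes ranked by predicted probability."""
    k = max(1, int(probs.shape[1] * fraction))       # e.g. 5% of the classes
    topk = np.argsort(probs, axis=1)[:, -k:]         # indices of the k most likely
    hits = [true_ids[i] in topk[i] for i in range(len(true_ids))]
    return float(np.mean(hits))
```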

Results: Project Abacus Data (continued)

[Feature extraction columns repeat the previous table.]

Biometric task model | EER (%) | HTER (%)
Raw features         | 36.21   | 42.17
ST Convnet           | 32.44   | 34.89
LT Convnet           | 28.15   | 29.01
Conv-RNN             | 22.32   | 22.49
Conv-CWRNN           | 21.52   | 21.92
Conv-LSTM            | 21.13   | 21.41
Conv-DCWRNN          | 20.01   | 20.52
Conv-DCWRNN          | 18.17   | 19.29
Conv-DCWRNN†         | 15.84   | 16.13
Conv-DCWRNN‡         |  8.82   |  9.37

Per-device (†) and per-session (‡) EERs optimize the threshold for each device/session separately, to indicate the upper bound of performance in the case of perfect score normalization.
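For reference, a brief sketch of how EER and HTER can be computed from client (genuine) and impostor score arrays (a standard construction, not the talk's exact evaluation code):

```python
import numpy as np

def far_frr(genuine, impostor, t):
    """False accept / false reject rates at threshold t
    (higher score = more likely the claimed client)."""
    return (impostor >= t).mean(), (genuine < t).mean()

def eer(genuine, impostor):
    """Equal error rate: sweep candidate thresholds, take the point
    where FAR and FRR cross."""
    ts = np.sort(np.concatenate([genuine, impostor]))
    rates = np.array([far_frr(genuine, impostor, t) for t in ts])
    i = int(np.argmin(np.abs(rates[:, 0] - rates[:, 1])))
    return rates[i].mean()

def hter(genuine, impostor, t):
    """Half total error rate at a threshold fixed in advance
    (e.g. tau tuned on the validation set, as in the pipeline above),
    which is why HTER differs from EER in the table."""
    fa, fr = far_frr(genuine, impostor, t)
    return (fa + fr) / 2
```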

Conclusion

• DCW-RNNs combine the advantages of CW-RNNs with shift-invariance and training efficiency

• The way smartphone users interact with their device IS highly correlated with their identity

• Future work will combine multiple modalities (inertial, touch etc.)

• Other applications of DCW-RNNs are currently under investigation (gesture recognition, activity recognition, etc.)

N. Neverova, C. Wolf, G. Lacey, L. Fridman, D. Chandra, B. Barbello and G.W. Taylor. Learning Human Identity from Motion Patterns. arXiv:1511.03908

Thank You!

[Map: Montreal, Guelph, Toronto, New York]