Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence (IJCAI-16)
Deep, Convolutional, and Recurrent Models for Human Activity Recognition Using Wearables
Nils Y. Hammerla1,2, Shane Halloran2, Thomas Plotz¨ 2 1babylon health, London, UK 2Open Lab, School of Computing Science, Newcastle University, UK [email protected]
Abstract these relatively simple means suffice to obtain impressive recognition accuracies. However, more elaborate behaviours Human activity recognition (HAR) in ubiquitous which are, for example, of interest in medical applications, computing is beginning to adopt deep learning to pose a significant challenge to this manually tuned approach substitute for well-established analysis techniques [Hammerla et al., 2015]. Furthermore, some work has sug- that rely on hand-crafted feature extraction and gested that the dominant technical approach in HAR benefits classification methods. However, from these iso- from biased evaluation settings [Hammerla and Plotz,¨ 2015], lated applications of custom deep architectures it is which may explain some of the apparent inertia in the field difficult to gain an overview of their suitability for toward the adoption of deep learning techniques. problems ranging from the recognition of manipu- Deep learning has the potential to have significant impact lative gestures to the segmentation and identifica- on HAR in ubicomp. It can substitute for manually designed tion of physical activities like running or ascending feature extraction procedures which lack the robust physio- stairs. In this paper we rigorously explore deep, logical basis that benefits other fields such as speech recog- convolutional, and recurrent approaches across nition. However, for the practitioner it is difficult to select three representative datasets that contain movement the most suitable deep learning method for their application. data captured with wearable sensors. We describe Work that promotes deep learning almost always provides how to train recurrent approaches in this setting, in- only the performance of the best system, and rarely includes troduce a novel regularisation approach, and illus- details on how its seemingly optimal parameters were dis- trate how they outperform the state-of-the-art on a covered. As only a single score is reported it remains unclear large benchmark dataset. We investigate the suit- how this peak performance compares with the average during ability of each model for HAR, across thousands parameter exploration, and how likely a practitioner would be of recognition experiments with randomly sampled to discover a system that performs similarly well. model configurations, explore the impact of hy- In this paper we provide the first unbiased and systematic perparameters using the fANOVA framework, and exploration of the performance of state-of-the-art deep learn- provide guidelines for the practitioner who wants ing approaches on three different recognition problems typi- to apply deep learning in their problem setting. cal for HAR in ubicomp. The training process for deep, con- volutional, and recurrent models is described in detail, and 1 Introduction we introduce a novel approach to regularisation for recur- Deep learning represents the biggest trend in machine learn- rent networks. In more than 4,000 experiments we investigate ing over the past decade. Since the inception of the umbrella the suitability of each model for different tasks in HAR, ex- term the diversity of methods it encompasses has increased plore the impact each model’s hyper-parameters have on per- rapidly, and will continue to do so driven by the resources of formance, and conclude guidelines for the practitioner who both academic and commercial interests. Deep learning has wants to apply deep learning to their application scenario. become accessible to everyone via machine learning frame- Over the course of these experiments we discover that re- works like Torch7 [Collobert et al., 2011], and has had sig- current networks outperform the state-of-the-art and that they nificant impact on a variety of application domains [LeCun et allow novel types of real-time application of HAR through al., 2015]. sample-by-sample prediction of physical activities. One field that has yet to benefit from deep learning is Hu- man Activity Recognition (HAR) in Ubiquitous Computing 2 Deep Learning in Ubiquitous Computing (ubicomp). The dominant technical approach in HAR in- Movement data collected with body-worn sensors are multi- cludes sliding window segmentation of time-series data cap- variate time-series data with relatively high spatial and tem- tured with body-worn sensors, manually designed feature ex- poral resolution (e.g. 20Hz - 200Hz). Analysis of this data traction procedures, and a wide variety of (supervised) clas- in ubicomp is typically follows a pipeline-based approach sification methods [Bulling et al., 2014]. In many cases, [Bulling et al., 2014]. The first step is to segment the time-
1533 Figure 1: Models used in this work. From left to right: (a) LSTM network hidden layers containing LSTM cells and a final softmax layer at the top. (b) bi-directional LSTM network with two parallel tracks in both future direction (green) and to the past (red). (c) Convolutional networks that contain layers of convolutions and max-pooling, followed by fully-connected layers and a softmax group. (d) Fully connected feed-forward network with hidden (ReLU) layers. series data into contiguous segments (or frames), either driven been explored in two settings. First, [Neverova et al., 2015] by some signal characteristics such as signal energy [Plotz¨ investigated a variety of recurrent approaches to identify in- et al., 2012], or through a sliding-window segmentation ap- dividuals based on movement data recorded from their mo- proach. From each of the frames a set of features is ex- bile phone in a large-scale dataset. Secondly, [Ordo´nez˜ and tracted, which most commonly include statistical features or Roggen, 2016] compared the performance of a recurrent ap- stem from the frequency domain. proach to CNNs on two HAR datasets, representing the cur- The first deep learning approach explored to substitute for rent state-of-the-art performance on Opportunity, one of the this manual feature selection corresponds to deep belief net- datasets utilised in this work. In both cases, the recurrent net- works as auto-encoders, trained generatively with Restricted work was paired with a convolutional network that encoded Boltzmann Machines (RBMs) [Plotz¨ et al., 2011]. Results the movement data, effectively employing the recurrent net- were mixed, as in most cases the deep model was outper- work to only model temporal dependencies on a more ab- formed by a combination of principal component analysis stract level. Recurrent networks have so far not been applied and a statistical feature extraction process. Subsequently to model movement data at the lowest possible level, which is a variety of projects have explored the use of pre-trained, the sequence of individual samples recorded by the sensor(s). fully-connected networks for automatic assessment in Parkin- son’s Disease [Hammerla et al., 2015], as emission model 3 Comparing deep learning for HAR in Hidden Markov Models (HMMs) [Zhang et al., 2015; While there has been some exploration of deep models for Alsheikh et al., 2015], and to model audio data for ubicomp a variety of application scenarios in HAR there is still a applications [Lane et al., 2015]. lack of a systematic exploration of deep learning capabili- The most popular approach so far in ubicomp relies on con- ties. Authors report to explore the parameter space in prelim- volutional networks. Their performance for a variety of ac- inary experiments, but usually omit the details. The overall tivity recognition tasks was explored by a number of authors process remains unclear and difficult to replicate. Instead, [Zeng et al., 2014; Ronao and Cho, 2015; Yang et al., 2015; single instantiations of e.g. CNNs are presented that show Ronaoo and Cho, 2015]. Furthermore, convolutional net- good performance in an application scenario. Solely report- works have been applied to specific problem domains, such as ing peak performance figures does, however, not reflect the the detection of stereotypical movements in Autism [Rad et overall suitability of a method for HAR in ubicomp, as it re- al., 2015], where they significantly improved upon the state- mains unclear how much effort went into tuning the proposed of-the-art. approach, and how much effort went into tuning other ap- Individual frames of movement data in ubicomp are usu- proaches it was compared to. How likely is a practitioner to ally treated as statistically independent, and applications of find a parameter configuration that works similarly well for sequential modelling techniques like HMMs are rare [Bulling their application? How representative is the reported perfor- et al., 2014]. However, approaches that are able to exploit mance across the models compared during parameter explo- the temporal dependencies in time-series data appear as the ration? Which parameters have the largest impact on perfor- natural choice for modelling human movement captured with mance? These questions are important for the practitioner, sensor data. Deep recurrent networks, most notably those that but so far remain unanswered in related work. rely on Long Short-Term Memory cells (LSTMs) [Hochreiter In this paper we provide the first unbiased comparison of and Schmidhuber, 1997], have recently achieved impressive the most popular deep learning approaches on three repre- performance across a variety of scenarios (e.g. [Gregor et al., sentative datasets for HAR in ubicomp. They include typ- 2015; Hermann et al., 2015]). Their application to HAR has ical application scenarios like manipulative gestures, repeti-
1534 tive physical activities, and a medical application of HAR in For regularisation we apply dropout after each max- Parkinson’s disease. We compare three types of models that pooling or fully-connected layer, where the dropout- i are described below. To explore the suitability of each method probability pdrop in layer i is fixed for all experiments we chose reasonable ranges for each of their hyperparameters 1 2 i>2 (pdrop =0.1, pdrop =0.25, pdrop =0.5). Similar to the and randomly sample model configurations. We report on DNN we also perform max-in norm regularisation as sug- their performance across thousands of experiments and anal- 1 gested in [Srivastava et al., 2014]. The input data fed into yse the impact of hyperparameters for each approach . the CNN corresponds to frames of movement data as in the DNN. However, instead of concatenating the different input 3.1 Deep feed-forward networks (DNN) s d dimensions the matrix-structure is retained (Ft R R ). We implemented deep feed-forward networks, which corre- The CNN is trained using stratified mini-batches2 (64 frames)⇥ spond to a neural network with up to five hidden layers fol- and stochastic gradient descent to minimise negative log like- lowed by a softmax-group. The DNN represents a sequence lihood. of non-linear transformations to the input data of the network. We follow convention and refer to a network with N hidden 3.3 Recurrent networks layers as N-layer network. Each hidden layer contains the In order to exploit the temporal dependencies within the same number of units, and corresponds to a linear transfor- movement data we implemented recurrent neural networks mation and a recitified-linear (ReLU) activation function. We based on LSTM cells in their vanilla variant which does not use two different regularisation techniques: i) Dropout: dur- contain peephole connections [Greff et al., 2015]. This archi- ing training each unit in each hidden layer is set to zero with tecture is recurrent as some of the connections within the net- a probability pdrop, and during inference the output of each work form a directed cycle, where the current timestep t con- unit is scaled by 1/pdrop [Srivastava et al., 2014] (dropout- siders the states of the network in the previous timestep t 1. rate is fixed to 0.5 for all experiments); ii) Max-in norm: Af- LSTM cells are designed to counter the effect of diminish- ter each mini-batch the incoming weights of each unit in the ing gradients if error derivatives are backpropagated through network are scaled to have a maximum euclidean length of many layers “through time” in recurrent networks [Hochre- din. To limit the number of hyperparameters of the approach iter et al., 2001]. Each LSTM cell (unit) keeps track of an we chose not to perform any generative pre-training and to internal state (the constant error carousel) that represents it’s solely rely on a supervised learning approach. The input “memory”. Over time the cells learn to output, overwrite, data fed into the network corresponds to frames of movement or reset their internal memory based on their current input data. Each frame consists of a varying number of s samples and the history of past internal states, leading to a system ca- from Rd, which are simply concatenated into a single vector pable of retaining information across hundreds of time-steps s d Ft R ⇤ . The model is illustrated in figure 1(d). [Hochreiter and Schmidhuber, 1997]. The2 DNN is trained in a mini-batch approach, where each We implement two flavours of LSTM recurrent networks: mini-batch contains 64 frames and is stratified with respect i) deep forward LSTMs contain multiple layers of recurrent to the class distribution in the training-set. We minimise the units and are connected “forward” in time (see figure 1(a)); negative log likelihood using stochastic gradient descent. and ii) bi-directional LSTMs which contain two parallel re- current layers that stretch both into the “future” and into the 3.2 Convolutional networks (CNN) “past” of the current time-step, followed by a layer that con- CNNs aim to introduce a degree of locality in the patterns catenates their internal states for timestep t (see figure 1(b)). matched in the input data and to enable translational invari- Practically these two flavours differ significantly in their ance with respect to the precise location (i.e. time of occur- application requirements. A forward LSTM contextualises rence) of each pattern within a frame of movement data. We the current time-step based on those it has seen previously, explore the performance of convolutional networks and fol- and is inherently suitable for real-time applications where, at low suggestions by [Srivastava et al., 2014] in architecture inference time, the “future” is not yet known. Bi-directional and regularisation techniques. The overall CNN architecture LSTMs on the other hand use both the future and past context is illustrated in figure 1(c). Each CNN contains at least one to interpret the input at timestep t, which makes them suitable temporal convolution layer, one pooling layer and at least one for offline analysis scenarios. fully connected layer prior to a top-level softmax-group. The In this work we apply recurrent networks in three different temporal convolution layer corresponds to a convolution of settings, each of which is trained to minimise the negative log likelihood using adagrad [Duchi et al., 2011] and subject to the input sequence with nf different kernels (feature maps) of max-in norm regularisation. width kw. Subsequent max-pooling is looking for the maxi- In the first case the input data fed into the network at any mum within a region of width mw and corresponds to a sub- sampling, introducing translational invariance to the system. given time t corresponds to the current frame of movement Width of the max-pooling was fixed to 2 throughout all exper- data, which stretches a certain length of time and whose di- iments. The output of each max-pooling layer is transformed mensions have been concatenated (as in the DNN above). We using a ReLU activation function. The subsequent fully con- denote this model as LSTM-F. The second application case nected part effectively corresponds to a DNN and follows the of forward LSTMs represents a real-time application, where same architecture outlined above. each sample of movement data is presented to the network in the sequence they were recorded, denoted LSTM-S. The fi- 1Code available at https://github.com/nhammerla/deepHAR nal scenario sees the application of bi-directional LSTMs to
1535 the same sample-by-sample prediction problem, denoted b- LSTM-S.
3.4 Training RNNs for HAR Common applications for RNNs include speech recognition and natural language processing. In these settings the context for an input (e.g. a word) is limited to it’s surrounding enti- ties (e.g. a sentence, paragraph). Training of RNNs usually treats these contextualised entities as a whole, for example by training an RNN on complete sentences. In HAR the context of an individual sample of movement data is not well defined, at least beyond immediate correla- Figure 2: Training procedure for RNNs. For batch-wise train- tions between neighbouring samples, and likely depends on ing on long sequences we randomly initialise k positions the type of movement and its wider behavioural context. This along the time-series data, where k equals the size of the is a problem well known in the field, and affects the choice of mini-batch (here k =4). For a mini-batch i we extract a window length for sliding window segmentation [Bulling et sequence of length L from each of the positions, and increase al., 2014]. the corresponding position by L. If we reach the end of the In order to construct b mini-batches that are used to train data we return to the start. After training on a mini-batch we the RNN we initialise a number of positions (pi)b between the retain the internal state of the RNN with probability pcarry. start and end of the training-set. To construct a mini-batch we extract the L samples that follow each position in (p ) , and i b 4.1 Datasets increase (pi)b by L steps, possibly wrapping around the end of the sequence. We found that it is important to initialise We select three datasets typical for HAR in ubicomp for the the positions randomly to avoid oscillations in the gradients. exploration in this work. Each dataset corresponds to a differ- While this approach retains the ordering of the samples pre- ent application of HAR. The first dataset, Opportunity, con- sented to the RNN it does not allow for stratification of each tains manipulative gestures like opening and closing doors, mini-batch w.r.t. class-distribution. Figure 2 illustrates the which are short in duration and non-repetitive. The second, approach. PAMAP2, contains prolonged and repetitive physical activi- Training on long sequences has a further issue that is ad- ties typical for systems aiming to characterise energy expen- dressed in this work. If we use the approach outlined above diture. The last, Daphnet Gait, corresponds to a medical ap- to train a sufficiently large RNN it may “memorise” the en- plication where participants exhibit a typical motor compli- tire input-output sequence implicitly, leading to poor gener- cation in Parkinson’s disease that is known to have a large alisation performance. In order to avoid this memorisation inter-subject variability. Below we detail each dataset: we need to introduce “breaks” where the internal states of the The Opportunity dataset (Opp) [Chavarriaga et al., RNN are reset to zero: after each mini-batch we decide to 2013] consists of annotated recordings from on-body sensors retain the internal state of the RNN with a carry-over prob- from 4 participants instructed to carry out common kitchen ability pcarry, and reset it to zero otherwise. This is a novel activities. Data is recorded at a frequency of 30Hz from 12 form of regularisation of RNNs, which should be useful for locations on the body, and annotated with 18 mid-level ges- similar applications of RNNs. ture annotations (e.g. Open Door / Close Door). For each subject, data from 5 different runs is recorded. We used the 4 Experiments subset of sensors that did not show any packet-loss, which included accelerometer recordings from the upper limbs, the The different hyper-parameters explored in this work are back, and complete IMU data from both feet. The resulting listed in table 1. The last column indicates the number of dataset had 79 dimensions. We use run 2 from subject 1 as parameter configurations sampled for each dataset, selected our validation set, and replicate the most popular recognition to represent an equal amount of computation time. We con- challenge by using runs 4 and 5 from subject 2 and 3 in our duct experiments on three benchmark datasets representative test set. The remaining data is used for training. For frame- of the problems tyical for HAR (described below). Experi- by-frame analysis, we created sliding windows of duration 1 ments were run on a machine with three GPUs (NVidia GTX second and 50% overlap. The resulting training-set contains 980 Ti), where two model configurations are run on each GPU approx. 650k samples (43k frames). except for the largest networks. The PAMAP2 dataset [Reiss and Stricker, 2012] consists After each epoch of training we evaluate the performance of recordings from 9 participants instructed to carry out 12 of the model on the validation set. Each model is trained for lifestyle activities, including household activities and a va- at least 30 epochs and for a maximum of 300 epochs. After riety of exercise activities (Nordic walking, playing soccer, 30 epochs, training stops if there is no increase in validation etc). Accelerometer, gyroscope, magnetometer, temperature performance for 10 subsequent epochs. We select the epoch and heart rate data are recorded from inertial measurement that showed the best validation-set performance and apply the units located on the hand, chest and ankle over 10 hours (in corresponding model to the test-set. total). The resulting dataset has 52 dimensions. We used runs
1536 sotie.fNV a enue rvosyt explore by to networks previously 2015 recurrent used in been performance hyper-parameters has network the fANOVA percentage of the obtained. variability which is overall from to hyper-parameters, func- contribution the interaction joint of and marginal tions into model func- decomposed non-linear a then hyper-parameters.This as is model’s performance the model of the tion of predictive forest) a builds (random It model performance. network’s a to contributes apply fANOVA we 2014 framework. on experiments hyper-parameter analysis all fANOVA each across the of observed impact performance the the estimate to order hyper-parameters In of Influence 4.2 approx. contains second training-set 1 The 470 of analysis, overlap. windows 50% our in sliding and In duration 2 created frame- subject we training. For from analysis, for 32Hz. to by-frame 2 rest data and the accelerometer the 1 used downsampled we runs and set, set, sub- validation test from The 1 our our run in trunk. used the We 9 on ject dimensions. and 9 knee has the dataset above resulting two- ankle, recorded a the was above represents data from This inform Accelerometer to problem. system. goal recognition the class prompting with situated incidents, future freezing a these detect objective individuals The to walking. affected is as such where movements PD, initiate com- to a in struggle is complication Freezing activities gait. motor with of out mon freezing affected induce carry to likely to participants are instructed which 10 (PD), from Disease Parkinson’s recordings of consists frames). stepping 2012 second overlap) one (78% windows with adjacent duration between seconds sliding 5.12 non-overlapping of with analy- windows work frame-by-frame previous For replicate we compa- dataset. resolution sis, Opportunity temporal for a the used have to 2 to is rable order and data in accelerometer 1 remaining 33.3Hz the runs The downsampled to data we and set. analysis, set our test validation In our our training. in in 6 5 subject subject for for 2 and 1 The k ] ] ] ape ( samples . approx. contains training-set The . eemnsteetn owihec hyper-parameter each which to extent the determines ahe atdtst(DG) dataset Gait Daphnet b-LSTM-S LSTM-S LSTM-F CNN DNN 30 al :Hprprmtr ftemdl n h agso ausepoe nexperiments. in explored values of ranges the and models the of Hyper-parameters 1: Table k o-nfr?xx------x x log-uniform? frames). Category max max max max max min min min min min 10 10 10 10 10 10 10 10 10 10