Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence (IJCAI-16)

Deep, Convolutional, and Recurrent Models for Human Activity Recognition Using Wearables

Nils Y. Hammerla 1,2, Shane Halloran 2, Thomas Plötz 2
1 babylon health, London, UK
2 Open Lab, School of Computing Science, Newcastle University, UK
[email protected]

Abstract

Human activity recognition (HAR) in ubiquitous computing is beginning to adopt deep learning to substitute for well-established analysis techniques that rely on hand-crafted feature extraction and classification methods. However, from these isolated applications of custom deep architectures it is difficult to gain an overview of their suitability for problems ranging from the recognition of manipulative gestures to the segmentation and identification of physical activities like running or ascending stairs. In this paper we rigorously explore deep, convolutional, and recurrent approaches across three representative datasets that contain movement data captured with wearable sensors. We describe how to train recurrent approaches in this setting, introduce a novel regularisation approach, and illustrate how they outperform the state-of-the-art on a large benchmark dataset. We investigate the suitability of each model for HAR, across thousands of recognition experiments with randomly sampled model configurations, explore the impact of hyperparameters using the fANOVA framework, and provide guidelines for the practitioner who wants to apply deep learning in their problem setting.

1 Introduction

Deep learning represents the biggest trend in machine learning over the past decade. Since the inception of the umbrella term the diversity of methods it encompasses has increased rapidly, and will continue to do so driven by the resources of both academic and commercial interests. Deep learning has become accessible to everyone via frameworks like Torch7 [Collobert et al., 2011], and has had significant impact on a variety of application domains [LeCun et al., 2015].

One field that has yet to benefit from deep learning is Human Activity Recognition (HAR) in ubiquitous computing (ubicomp). The dominant technical approach in HAR includes sliding window segmentation of time-series data captured with body-worn sensors, manually designed feature extraction procedures, and a wide variety of (supervised) classification methods [Bulling et al., 2014]. In many cases, these relatively simple means suffice to obtain impressive recognition accuracies. However, more elaborate behaviours which are, for example, of interest in medical applications, pose a significant challenge to this manually tuned approach [Hammerla et al., 2015]. Furthermore, some work has suggested that the dominant technical approach in HAR benefits from biased evaluation settings [Hammerla and Plötz, 2015], which may explain some of the apparent inertia in the field toward the adoption of deep learning techniques.

Deep learning has the potential to have significant impact on HAR in ubicomp. It can substitute for manually designed feature extraction procedures which lack the robust physiological basis that benefits other fields such as speech recognition. However, for the practitioner it is difficult to select the most suitable deep learning method for their application. Work that promotes deep learning almost always provides only the performance of the best system, and rarely includes details on how its seemingly optimal parameters were discovered. As only a single score is reported it remains unclear how this peak performance compares with the average during parameter exploration, and how likely a practitioner would be to discover a system that performs similarly well.

In this paper we provide the first unbiased and systematic exploration of the performance of state-of-the-art deep learning approaches on three different recognition problems typical for HAR in ubicomp. The training process for deep, convolutional, and recurrent models is described in detail, and we introduce a novel approach to regularisation for recurrent networks. In more than 4,000 experiments we investigate the suitability of each model for different tasks in HAR, explore the impact each model's hyper-parameters have on performance, and derive guidelines for the practitioner who wants to apply deep learning to their application scenario. Over the course of these experiments we discover that recurrent networks outperform the state-of-the-art and that they allow novel types of real-time application of HAR through sample-by-sample prediction of physical activities.
2 Deep Learning in Ubiquitous Computing

Movement data collected with body-worn sensors are multi-variate time-series data with relatively high spatial and temporal resolution (e.g. 20Hz to 200Hz). Analysis of this data in ubicomp typically follows a pipeline-based approach [Bulling et al., 2014]. The first step is to segment the time-series data into contiguous segments (or frames), either driven by some signal characteristics such as signal energy [Plötz et al., 2012], or through a sliding-window segmentation approach. From each of the frames a set of features is extracted, which most commonly include statistical features or stem from the frequency domain.
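As a concrete illustration of this pipeline, the following is a minimal numpy sketch of sliding-window segmentation followed by simple statistical feature extraction. The window length, overlap, and feature choices here are illustrative rather than taken from any particular system discussed below.

```python
import numpy as np

def sliding_windows(data, win_len, overlap=0.5):
    """Cut a (samples x channels) signal into overlapping frames."""
    step = max(1, int(win_len * (1.0 - overlap)))
    starts = range(0, data.shape[0] - win_len + 1, step)
    return np.stack([data[s:s + win_len] for s in starts])

def statistical_features(frames):
    """Classic hand-crafted features: per-frame, per-channel mean and std."""
    return np.concatenate([frames.mean(axis=1), frames.std(axis=1)], axis=1)

# e.g. one minute of a tri-axial accelerometer at 30Hz,
# framed into 1s windows with 50% overlap
signal = np.random.randn(1800, 3)
frames = sliding_windows(signal, win_len=30)    # -> (119, 30, 3)
features = statistical_features(frames)         # -> (119, 6)
```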

Figure 1: Models used in this work. From left to right: (a) LSTM network with hidden layers containing LSTM cells and a final softmax layer at the top. (b) Bi-directional LSTM network with two parallel tracks, both into the future (green) and into the past (red). (c) Convolutional network that contains layers of convolutions and max-pooling, followed by fully-connected layers and a softmax group. (d) Fully connected feed-forward network with hidden (ReLU) layers.

The first deep learning approach explored to substitute for this manual feature selection corresponds to deep belief networks as auto-encoders, trained generatively with Restricted Boltzmann Machines (RBMs) [Plötz et al., 2011]. Results were mixed, as in most cases the deep model was outperformed by a combination of principal component analysis and a statistical feature extraction process. Subsequently a variety of projects have explored the use of pre-trained, fully-connected networks for automatic assessment in Parkinson's Disease [Hammerla et al., 2015], as emission model in Hidden Markov Models (HMMs) [Zhang et al., 2015; Alsheikh et al., 2015], and to model audio data for ubicomp applications [Lane et al., 2015].

The most popular approach so far in ubicomp relies on convolutional networks. Their performance for a variety of activity recognition tasks was explored by a number of authors [Zeng et al., 2014; Ronao and Cho, 2015; Yang et al., 2015; Ronaoo and Cho, 2015]. Furthermore, convolutional networks have been applied to specific problem domains, such as the detection of stereotypical movements in Autism [Rad et al., 2015], where they significantly improved upon the state-of-the-art.

Individual frames of movement data in ubicomp are usually treated as statistically independent, and applications of sequential modelling techniques like HMMs are rare [Bulling et al., 2014]. However, approaches that are able to exploit the temporal dependencies in time-series data appear as the natural choice for modelling human movement captured with sensor data. Deep recurrent networks, most notably those that rely on Long Short-Term Memory cells (LSTMs) [Hochreiter and Schmidhuber, 1997], have recently achieved impressive performance across a variety of scenarios (e.g. [Gregor et al., 2015; Hermann et al., 2015]). Their application to HAR has been explored in two settings. First, [Neverova et al., 2015] investigated a variety of recurrent approaches to identify individuals based on movement data recorded from their mobile phone in a large-scale dataset. Secondly, [Ordóñez and Roggen, 2016] compared the performance of a recurrent approach to CNNs on two HAR datasets, representing the current state-of-the-art performance on Opportunity, one of the datasets utilised in this work. In both cases, the recurrent network was paired with a convolutional network that encoded the movement data, effectively employing the recurrent network to only model temporal dependencies on a more abstract level. Recurrent networks have so far not been applied to model movement data at the lowest possible level, which is the sequence of individual samples recorded by the sensor(s).
3 Comparing deep learning for HAR

While there has been some exploration of deep models for a variety of application scenarios in HAR there is still a lack of a systematic exploration of deep learning capabilities. Authors report exploring the parameter space in preliminary experiments, but usually omit the details. The overall process remains unclear and difficult to replicate. Instead, single instantiations of e.g. CNNs are presented that show good performance in an application scenario. Solely reporting peak performance figures does, however, not reflect the overall suitability of a method for HAR in ubicomp, as it remains unclear how much effort went into tuning the proposed approach, and how much effort went into tuning other approaches it was compared to. How likely is a practitioner to find a parameter configuration that works similarly well for their application? How representative is the reported performance across the models compared during parameter exploration? Which parameters have the largest impact on performance? These questions are important for the practitioner, but so far remain unanswered in related work.

In this paper we provide the first unbiased comparison of the most popular deep learning approaches on three representative datasets for HAR in ubicomp. They include typical application scenarios like manipulative gestures, repetitive physical activities, and a medical application of HAR in Parkinson's disease. We compare three types of models that are described below. To explore the suitability of each method we chose reasonable ranges for each of their hyperparameters and randomly sample model configurations, as sketched below. We report on their performance across thousands of experiments and analyse the impact of hyperparameters for each approach (code available at https://github.com/nhammerla/deepHAR).
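A random search of this kind only needs a sampler over each model's ranges (listed in table 1). The following sketch shows the idea for a DNN-style search space; the particular bounds here are illustrative examples, not the exact values from table 1.

```python
import random

def sample_dnn_config(rng=random):
    """Draw one model configuration from plausible ranges (cf. table 1)."""
    return {
        "learning_rate": 10 ** rng.uniform(-3, -1),  # log-uniform draw
        "max_in_norm":   rng.uniform(0.5, 4.0),
        "num_layers":    rng.randint(1, 5),
        "num_units":     rng.randint(64, 2048),
    }

configs = [sample_dnn_config() for _ in range(1000)]
```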

3.1 Deep feed-forward networks (DNN)

We implemented deep feed-forward networks, which correspond to a neural network with up to five hidden layers followed by a softmax-group. The DNN represents a sequence of non-linear transformations to the input data of the network. We follow convention and refer to a network with N hidden layers as an N-layer network. Each hidden layer contains the same number of units, and corresponds to a linear transformation and a rectified-linear (ReLU) activation function. We use two different regularisation techniques: i) Dropout: during training each unit in each hidden layer is set to zero with a probability p_drop, and during inference the output of each unit is scaled by 1/p_drop [Srivastava et al., 2014] (the dropout-rate is fixed to 0.5 for all experiments); ii) Max-in norm: after each mini-batch the incoming weights of each unit in the network are scaled to have a maximum euclidean length of d_in. To limit the number of hyperparameters of the approach we chose not to perform any generative pre-training and to solely rely on a supervised learning approach. The input data fed into the network corresponds to frames of movement data. Each frame consists of a varying number of s samples from R^d, which are simply concatenated into a single vector F_t ∈ R^{s·d}. The model is illustrated in figure 1(d).

The DNN is trained in a mini-batch approach, where each mini-batch contains 64 frames and is stratified with respect to the class distribution in the training-set. We minimise the negative log likelihood using stochastic gradient descent.
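A model of this shape is straightforward to express in a modern framework. The sketch below uses PyTorch rather than the Torch7 code released with the paper; max_in_norm is our reading of the constraint described above, and note that PyTorch's nn.Dropout applies the equivalent "inverted" scaling during training instead of rescaling activations at inference.

```python
import torch
import torch.nn as nn

class DNN(nn.Module):
    """N-layer ReLU network over a concatenated frame (figure 1(d))."""
    def __init__(self, frame_dim, num_classes, num_layers=3,
                 num_units=512, p_drop=0.5):
        super().__init__()
        layers, d = [], frame_dim
        for _ in range(num_layers):
            layers += [nn.Linear(d, num_units), nn.ReLU(), nn.Dropout(p_drop)]
            d = num_units
        layers.append(nn.Linear(d, num_classes))  # softmax lives in the loss
        self.net = nn.Sequential(*layers)

    def forward(self, x):                          # x: (batch, s*d)
        return self.net(x)

def max_in_norm(model, d_in=2.0):
    """After each mini-batch: rescale the incoming weights of every unit
    so that their euclidean length does not exceed d_in."""
    with torch.no_grad():
        for m in model.modules():
            if isinstance(m, nn.Linear):
                norms = m.weight.norm(dim=1, keepdim=True).clamp(min=d_in)
                m.weight.mul_(d_in / norms)
```

Training then follows the usual pattern: a cross-entropy loss (negative log likelihood over the softmax) minimised with SGD, with max_in_norm(model) called after every mini-batch update.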
3.2 Convolutional networks (CNN)

CNNs aim to introduce a degree of locality in the patterns matched in the input data and to enable translational invariance with respect to the precise location (i.e. time of occurrence) of each pattern within a frame of movement data. We explore the performance of convolutional networks and follow suggestions by [Srivastava et al., 2014] in architecture and regularisation techniques. The overall CNN architecture is illustrated in figure 1(c). Each CNN contains at least one temporal convolution layer, one pooling layer and at least one fully connected layer prior to a top-level softmax-group. The temporal convolution layer corresponds to a convolution of the input sequence with n_f different kernels (feature maps) of width k_w. Subsequent max-pooling looks for the maximum within a region of width m_w and corresponds to a sub-sampling, introducing translational invariance to the system. The width of the max-pooling was fixed to 2 throughout all experiments. The output of each max-pooling layer is transformed using a ReLU activation function. The subsequent fully connected part effectively corresponds to a DNN and follows the same architecture outlined above.

For regularisation we apply dropout after each max-pooling or fully-connected layer, where the dropout-probability p_drop^i in layer i is fixed for all experiments (p_drop^1 = 0.1, p_drop^2 = 0.25, p_drop^{i>2} = 0.5). Similar to the DNN we also perform max-in norm regularisation as suggested in [Srivastava et al., 2014]. The input data fed into the CNN corresponds to frames of movement data as in the DNN. However, instead of concatenating the different input dimensions the matrix-structure is retained (F_t ∈ R^{s×d}). The CNN is trained using stratified mini-batches (64 frames) and stochastic gradient descent to minimise the negative log likelihood.

3.3 Recurrent networks

In order to exploit the temporal dependencies within the movement data we implemented recurrent neural networks based on LSTM cells in their vanilla variant, which does not contain peephole connections [Greff et al., 2015]. This architecture is recurrent as some of the connections within the network form a directed cycle, where the current timestep t considers the states of the network in the previous timestep t-1. LSTM cells are designed to counter the effect of diminishing gradients if error derivatives are backpropagated through many layers "through time" in recurrent networks [Hochreiter et al., 2001]. Each LSTM cell (unit) keeps track of an internal state (the constant error carousel) that represents its "memory". Over time the cells learn to output, overwrite, or reset their internal memory based on their current input and the history of past internal states, leading to a system capable of retaining information across hundreds of time-steps [Hochreiter and Schmidhuber, 1997].

We implement two flavours of LSTM recurrent networks: i) deep forward LSTMs contain multiple layers of recurrent units and are connected "forward" in time (see figure 1(a)); and ii) bi-directional LSTMs, which contain two parallel recurrent layers that stretch both into the "future" and into the "past" of the current time-step, followed by a layer that concatenates their internal states for timestep t (see figure 1(b)). Practically these two flavours differ significantly in their application requirements. A forward LSTM contextualises the current time-step based on those it has seen previously, and is inherently suitable for real-time applications where, at inference time, the "future" is not yet known. Bi-directional LSTMs on the other hand use both the future and past context to interpret the input at timestep t, which makes them suitable for offline analysis scenarios.
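Both flavours reduce to a few lines in a modern framework. The following is a hedged PyTorch sketch (vanilla LSTM cells without peepholes, matching the variant used here); the class name and defaults are ours, not from the paper.

```python
import torch.nn as nn

class LSTMTagger(nn.Module):
    """Per-timestep classifier; bidirectional=True yields the b-LSTM flavour."""
    def __init__(self, in_dim, num_classes, hidden=128, layers=2,
                 bidirectional=False):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden, num_layers=layers,
                            batch_first=True, bidirectional=bidirectional)
        out_dim = hidden * (2 if bidirectional else 1)
        self.head = nn.Linear(out_dim, num_classes)

    def forward(self, x, state=None):
        # x: (batch, time, channels) -> class scores for every timestep
        h, state = self.lstm(x, state)
        return self.head(h), state

forward_lstm = LSTMTagger(in_dim=3, num_classes=18)                     # real-time capable
bi_lstm = LSTMTagger(in_dim=3, num_classes=18, bidirectional=True)     # offline analysis
```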

In this work we apply recurrent networks in three different settings, each of which is trained to minimise the negative log likelihood using adagrad [Duchi et al., 2011] and subject to max-in norm regularisation. In the first case the input data fed into the network at any given time t corresponds to the current frame of movement data, which stretches a certain length of time and whose dimensions have been concatenated (as in the DNN above). We denote this model as LSTM-F. The second application case of forward LSTMs represents a real-time application, where each sample of movement data is presented to the network in the sequence they were recorded, denoted LSTM-S. The final scenario sees the application of bi-directional LSTMs to the same sample-by-sample prediction problem, denoted b-LSTM-S.
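To make the difference in input granularity concrete, the following sketch (with made-up shapes) shows what a single timestep sees in each setting:

```python
import numpy as np

stream = np.random.randn(6000, 3)   # raw sensor samples, e.g. at 30Hz

# LSTM-F: one concatenated frame per timestep, as in the DNN input
win, step = 30, 15
frames = np.stack([stream[s:s + win].ravel()
                   for s in range(0, len(stream) - win + 1, step)])
# frames.shape == (399, 90): one 90-dimensional input per timestep

# LSTM-S / b-LSTM-S: the raw stream itself, one 3-d sample per timestep
samples = stream                    # shape (6000, 3)
```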

3.4 Training RNNs for HAR

Common applications for RNNs include speech recognition and natural language processing. In these settings the context for an input (e.g. a word) is limited to its surrounding entities (e.g. a sentence, paragraph). Training of RNNs usually treats these contextualised entities as a whole, for example by training an RNN on complete sentences.

In HAR the context of an individual sample of movement data is not well defined, at least beyond immediate correlations between neighbouring samples, and likely depends on the type of movement and its wider behavioural context. This is a problem well known in the field, and affects the choice of window length for sliding window segmentation [Bulling et al., 2014].

In order to construct mini-batches of size b that are used to train the RNN we initialise b positions (p_i)_b between the start and end of the training-set. To construct a mini-batch we extract the L samples that follow each position in (p_i)_b, and increase (p_i)_b by L steps, possibly wrapping around the end of the sequence. We found that it is important to initialise the positions randomly to avoid oscillations in the gradients. While this approach retains the ordering of the samples presented to the RNN it does not allow for stratification of each mini-batch w.r.t. class-distribution. Figure 2 illustrates the approach.

Training on long sequences has a further issue that is addressed in this work. If we use the approach outlined above to train a sufficiently large RNN it may "memorise" the entire input-output sequence implicitly, leading to poor generalisation performance. In order to avoid this memorisation we need to introduce "breaks" where the internal states of the RNN are reset to zero: after each mini-batch we decide to retain the internal state of the RNN with a carry-over probability p_carry, and reset it to zero otherwise. This is a novel form of regularisation of RNNs, which should be useful for similar applications of RNNs.

Figure 2: Training procedure for RNNs. For batch-wise training on long sequences we randomly initialise k positions along the time-series data, where k equals the size of the mini-batch (here k = 4). For a mini-batch i we extract a sequence of length L from each of the positions, and increase the corresponding position by L. If we reach the end of the data we return to the start. After training on a mini-batch we retain the internal state of the RNN with probability p_carry.
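The procedure translates directly into code. The following is a sketch of our reading of it: rnn_batches implements the wrapping cursors of figure 2, and carry_over implements the proposed state regularisation (function names are ours).

```python
import random
import numpy as np

def rnn_batches(data, labels, batch_size, seq_len):
    """Figure 2: one randomly initialised cursor per mini-batch lane,
    advanced by seq_len after every batch, wrapping at the end."""
    n = len(data)
    pos = [random.randrange(n) for _ in range(batch_size)]
    while True:
        idx = np.array([[(p + t) % n for t in range(seq_len)] for p in pos])
        pos = [(p + seq_len) % n for p in pos]
        yield data[idx], labels[idx]  # (batch, seq_len, ...), (batch, seq_len)

def carry_over(state, p_carry=0.5):
    """Keep the RNN state with probability p_carry, else reset it to zero
    (returning None makes e.g. an nn.LSTM start from a zero state)."""
    if state is None or random.random() >= p_carry:
        return None
    return tuple(s.detach() for s in state)  # carry over, truncate backprop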
4 Experiments

The different hyper-parameters explored in this work are listed in table 1. The last column indicates the number of parameter configurations sampled for each dataset, selected to represent an equal amount of computation time. We conduct experiments on three benchmark datasets representative of the problems typical for HAR (described below). Experiments were run on a machine with three GPUs (NVidia GTX 980 Ti), where two model configurations are run on each GPU except for the largest networks.

After each epoch of training we evaluate the performance of the model on the validation set. Each model is trained for at least 30 epochs and for a maximum of 300 epochs. After 30 epochs, training stops if there is no increase in validation performance for 10 subsequent epochs. We select the epoch that showed the best validation-set performance and apply the corresponding model to the test-set.
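This schedule amounts to early stopping with a floor and a ceiling. A minimal sketch, assuming train_epoch and validate callbacks supplied by the caller:

```python
def train_run(train_epoch, validate, min_epochs=30, max_epochs=300,
              patience=10):
    """Train for 30-300 epochs; after epoch 30, stop once the validation
    score has not improved for `patience` consecutive epochs."""
    best_score, best_epoch = float("-inf"), 0
    for epoch in range(1, max_epochs + 1):
        train_epoch(epoch)
        score = validate(epoch)
        if score > best_score:
            best_score, best_epoch = score, epoch
        elif epoch >= min_epochs and epoch - best_epoch >= patience:
            break
    return best_epoch   # parameters from this epoch go to the test-set
```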

4.1 Datasets

We select three datasets typical for HAR in ubicomp for the exploration in this work. Each dataset corresponds to a different application of HAR. The first dataset, Opportunity, contains manipulative gestures like opening and closing doors, which are short in duration and non-repetitive. The second, PAMAP2, contains prolonged and repetitive physical activities typical for systems aiming to characterise energy expenditure. The last, Daphnet Gait, corresponds to a medical application where participants exhibit a typical motor complication in Parkinson's disease that is known to have a large inter-subject variability. Below we detail each dataset.

The Opportunity dataset (Opp) [Chavarriaga et al., 2013] consists of annotated recordings from on-body sensors from 4 participants instructed to carry out common kitchen activities. Data is recorded at a frequency of 30Hz from 12 locations on the body, and annotated with 18 mid-level gesture annotations (e.g. Open Door / Close Door). For each subject, data from 5 different runs is recorded. We used the subset of sensors that did not show any packet-loss, which included accelerometer recordings from the upper limbs, the back, and complete IMU data from both feet. The resulting dataset had 79 dimensions. We use run 2 from subject 1 as our validation set, and replicate the most popular recognition challenge by using runs 4 and 5 from subjects 2 and 3 in our test set. The remaining data is used for training. For frame-by-frame analysis, we created sliding windows of 1 second duration and 50% overlap. The resulting training-set contains approx. 650k samples (43k frames).

The PAMAP2 dataset [Reiss and Stricker, 2012] consists of recordings from 9 participants instructed to carry out 12 lifestyle activities, including household activities and a variety of exercise activities (Nordic walking, playing soccer, etc). Accelerometer, gyroscope, magnetometer, temperature and heart rate data are recorded from inertial measurement units located on the hand, chest and ankle over 10 hours (in total). The resulting dataset has 52 dimensions. We used runs 1 and 2 for subject 5 in our validation set, and runs 1 and 2 for subject 6 in our test set. The remaining data is used for training. In our analysis, we downsampled the data to 33.3Hz in order to have a temporal resolution comparable to the Opportunity dataset. For frame-by-frame analysis, we replicate previous work with sliding windows of 5.12 seconds duration and one second stepping between adjacent windows (78% overlap) [Reiss and Stricker, 2012]. The training-set contains approx. 473k samples (14k frames).

The Daphnet Gait dataset (DG) [Bachlin et al., 2009] consists of recordings from 10 participants affected by Parkinson's Disease (PD), instructed to carry out activities which are likely to induce freezing of gait. Freezing is a common motor complication in PD, where affected individuals struggle to initiate movements such as walking. The objective is to detect these freezing incidents, with the goal to inform a future situated prompting system. This represents a two-class recognition problem. Accelerometer data was recorded from above the ankle, above the knee and on the trunk. The resulting dataset has 9 dimensions. We used run 1 from subject 2 in our validation set, run 1 from subject 9 in our test set, and the rest for training. In our analysis, we downsampled the accelerometer data to 32Hz. For frame-by-frame analysis, we created sliding windows of 1 second duration and 50% overlap. The training-set contains approx. 470k samples (30k frames).

4.2 Influence of hyperparameters

In order to estimate the impact each hyper-parameter has on the performance observed across all experiments we apply the fANOVA analysis framework [Hutter et al., 2014]. It builds a predictive model (random forest) of the network's performance as a function of the model's hyper-parameters. This non-linear model is then decomposed into marginal and joint interaction functions of the hyper-parameters, from which the percentage contribution of each hyper-parameter to the overall variability of network performance is obtained. fANOVA has been used previously to explore the impact of hyper-parameters on the performance of recurrent networks [Greff et al., 2015].

For the practitioner it is important to know which aspect of the model is the most crucial for performance. We grouped the hyper-parameters of each model into one of three categories (see table 1): i) learning, parameters that control the learning process; ii) regularisation, parameters that limit the modelling capabilities of the model to avoid overfitting; and iii) architecture, parameters that affect the structure of the model. Based on the variability observed for each hyper-parameter we estimate the variability that can be attributed to each category, and to higher order interactions between parameter categories.
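The sketch below is not fANOVA itself: fANOVA additionally decomposes the fitted forest into marginal and interaction terms, from which the variance contributions are read off. As a rough, self-contained proxy one can inspect the impurity-based importances of a random forest fitted to (hyper-parameter, performance) pairs; the data here is synthetic.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# One row per experiment: hyper-parameter values and the achieved f1-score.
names = ["learning rate", "max-in norm", "#layers", "#units"]
X = np.random.rand(500, len(names))   # stand-in for logged configurations
y = np.random.rand(500)               # stand-in for observed performance

forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
for name, importance in zip(names, forest.feature_importances_):
    print(f"{name}: {importance:.2f}")
```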

4.3 Performance metrics

As the datasets utilised in this work are highly biased we require performance metrics that are independent of the class distribution. We estimate the mean f1-score, which corresponds to the class-wise mean of the harmonic mean of precision and recall:

F_m = \frac{2}{|c|} \sum_c \frac{\mathrm{prec}_c \cdot \mathrm{recall}_c}{\mathrm{prec}_c + \mathrm{recall}_c}    (1)

Related work has previously used the weighted f1-score as primary performance metric (for Opportunity). In order to compare our results to the state-of-the-art we also estimate the weighted f1-score:

F_w = 2 \sum_c \frac{N_c}{N_{\mathrm{total}}} \cdot \frac{\mathrm{prec}_c \cdot \mathrm{recall}_c}{\mathrm{prec}_c + \mathrm{recall}_c}    (2)

where N_c is the number of samples in class c, and N_total is the total number of samples.
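Both metrics correspond to standard library calls, e.g. in scikit-learn, where "macro" averaging matches eq. (1) and "weighted" averaging matches eq. (2); the labels below are toy values.

```python
from sklearn.metrics import f1_score

y_true = [0, 0, 1, 2, 2, 2]
y_pred = [0, 1, 1, 2, 2, 0]

f_m = f1_score(y_true, y_pred, average="macro")     # eq. (1), mean f1
f_w = f1_score(y_true, y_pred, average="weighted")  # eq. (2), weighted f1
```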

Table 1: Hyper-parameters of the models and the ranges of values explored in experiments. LR and LR decay were sampled log-uniformly.

Model    | LR            | LR decay      | momentum | max-in norm | p_carry | #layers | #units  | #conv. layers | #feat. maps | kW  | mW | #experiments
DNN      | 10^-3 – 10^-1 | 10^-5 – 10^-3 | 0.0–0.99 | 0.5–4.0     | –       | 1–5     | 64–2048 | –             | –           | –   | –  | 1000
CNN      | 10^-3 – 10^-1 | 10^-5 – 10^-3 | 0.0–0.99 | 0.5–4.0     | –       | 1–3     | 64–2048 | 1–3           | 16–128      | 3–9 | 2  | 256
LSTM-F   | 10^-3 – 10^-1 | –             | –        | 0.5–4.0     | 0.0–1.0 | 1–3     | 64–384  | –             | –           | –   | –  | 128
LSTM-S   | 10^-4 – 10^-1 | –             | –        | 0.5–4.0     | 0.0–1.0 | 1–3     | 64–384  | –             | –           | –   | –  | 128
b-LSTM-S | 10^-4 – 10^-1 | –             | –        | 0.5–4.0     | 0.0–1.0 | 1–3     | 32–196  | –             | –           | –   | –  | 128

5 Results

Results across all experiments are illustrated in figure 3. Graphs (a) to (c) show the cumulative distribution of the f1-score (mean f1-score for OPP and PAMAP2, f1-score of the objective class for DG) across all experiments. Graph (d) illustrates the importance of each group of hyper-parameters for all models utilised in this work, estimated using fANOVA. Table 2 lists the best performing systems and the best scores of the other deep approaches and the baseline on Opportunity. Furthermore it lists the difference between the best performing model and the median model for each setting.

Figure 3: (a)-(c): Cumulative distribution of recognition performance for each dataset. (d): Results from fANOVA analysis, illustrating the impact of hyperparameter-categories on recognition performance (see table 1).

Table 2: Best results obtained for each model and dataset, along with some baselines for comparison. Delta from median (lower part of table) refers to the absolute difference between peak and median performance across all experiments.

Performance                             | PAMAP2 (Fm) | DG (F1) | OPP (Fm) | OPP (Fw)
DNN                                     | 0.904       | 0.633   | 0.575    | 0.888
CNN                                     | 0.937       | 0.684   | 0.591    | 0.894
LSTM-F                                  | 0.929       | 0.673   | 0.672    | 0.908
LSTM-S                                  | 0.882       | 0.760   | 0.698    | 0.912
b-LSTM-S                                | 0.868       | 0.741   | 0.745    | 0.927
CNN [Yang et al., 2015]                 | –           | –       | –        | 0.851
CNN [Ordóñez and Roggen, 2016]          | –           | –       | 0.535    | 0.883
DeepConvLSTM [Ordóñez and Roggen, 2016] | –           | –       | 0.704    | 0.917

Delta from median                       | PAMAP2 (Fm) | DG (F1) | OPP (Fm) | mean
DNN                                     | 0.129       | 0.149   | 0.357    | 0.221
CNN                                     | 0.071       | 0.122   | 0.120    | 0.104
LSTM-F                                  | 0.10        | 0.281   | 0.085    | 0.156
LSTM-S                                  | 0.128       | 0.297   | 0.079    | 0.168
b-LSTM-S                                | 0.087       | 0.221   | 0.205    | 0.172

Overall we observe a large spread of peak performances between models on OPP and DG, with more than 15% mean f1-score between the best performing approach (b-LSTM-S) and the worst (DNN) on OPP (12% on DG) (see table 2). On PAMAP2 this difference is smaller, but still considerable at 7%. The best performing approach on OPP (b-LSTM-S) outperforms the current state-of-the-art by a considerable margin of 4% mean f1-score (1% weighted f1-score). The best CNN discovered in this work further outperforms previous results reported in the literature for this type of model by more than 5% mean f1-score and weighted f1-score (see table 2). The good performance of recurrent approaches, which model movement at the sample level, holds the potential for novel (real-time) applications in HAR, as they alleviate the need for segmentation of the time-series data, which is discussed below.

The distributions of performance scores differ between the models investigated in this work. CNNs show the most characteristic behaviour: a fraction of model configurations do not work at all (e.g. 20% on PAMAP2), while the remaining configurations show little variance in their performance. On PAMAP2, for example, the difference between the peak and median performance is only 7% mean f1-score (see table 2). The DNNs show the largest spread between peak and median performance of all approaches, of up to 35.7% on OPP. Both forward RNNs (LSTM-F, LSTM-S) show similar behaviour across the different datasets. Practically all of their configurations explored on PAMAP2 and OPP have non-trivial recognition performance.
The effect of each category of hyperparameter on the recognition performance is illustrated in figure 3(d). Interestingly, we observe the most consistent effect of the parameters in the CNN. Contrary to our expectation it is the parameters surrounding the learning process (see table 1) that have the largest main effect on performance. We expected that for this model the rich choice of architectural variants would have a larger effect.

For DNNs we do not observe a systematic effect of any category of hyperparameter. On PAMAP2, the correct learning parameters appear to be the most crucial. On OPP it is the architecture of the model. Interestingly we observed that relatively shallow networks outperform deeper variants. There is a drop in performance for networks with more than 3 hidden layers. This may be related to our choice to solely rely on supervised training, where a generative pre-training may improve the performance of deeper networks.

The performance of the frame-based RNN (LSTM-F) on OPP depends critically on the carry-over probability introduced in this work. Both always retaining the internal state and always forgetting the internal state lead to low performance. We found that a p_carry of 0.5 works well for most settings. Our findings merit further investigation, for example into a carry-over schedule, which may further improve LSTM performance.

Results for sample-based forward LSTMs (LSTM-S) mostly confirm earlier findings for this type of model that found the learning-rate to be the most crucial parameter [Greff et al., 2015]. However, for bi-directional LSTMs (b-LSTM-S) we observe that the number of units in each layer has a surprisingly large effect on performance, which should motivate practitioners to first focus on tuning this parameter.

6 Discussion

In this work we explored the performance of state-of-the-art deep learning approaches for Human Activity Recognition using wearable sensors. We described how to train recurrent approaches in this setting and introduced a novel regularisation approach. In thousands of experiments we evaluated the performance of the models with randomly sampled hyper-parameters. We found that bi-directional LSTMs outperform the current state-of-the-art on Opportunity, a large benchmark dataset, by a considerable margin.

However, interesting from a practitioner's point of view is not the peak performance for each model, but the process of parameter exploration and insights into their suitability for different tasks in HAR. All experiments in this work were based on a random exploration of the parameter space for each model. We found that performing hundreds or thousands of recognition experiments is a suitable way to find configurations with state-of-the-art performance, and recommend this approach to other practitioners. Additionally, more sophisticated exploration methods (see e.g. [Snoek et al., 2012]) are likely to discover good configurations in fewer iterations.

In our experiments we found that RNNs outperform convolutional networks significantly on activities that are short in duration but have a natural ordering, where a recurrent approach benefits from the ability to contextualise observations across long periods of time. We see some indication that this ability and the sensitivity to long term dependencies between activities (such as opening / closing doors in Opportunity) introduce constraints on study design. In order for the RNN to learn long-term effects it needs to be trained with data collected in a naturalistic environment, where the sequence of activities is similar to that in the envisioned application scenario.
Movement data which was collected adhering to a fixed protocol will be less suitable to train RNNs, as long-term context brings little to no benefit over localised pattern matching during training and inference (and may even lead to overfitting). This would explain, at least to some extent, the poor performance of RNNs on PAMAP2, a data-set that contains only well segmented, scripted activities.

However, not all application scenarios of HAR in ubicomp require a model that can maintain long term dependencies between activities. The activities a user engages in may already be pre-determined, for example in applications like gait analysis, or skill assessment in sports [Khan et al., 2015], where the analysis focusses on (changes in) short term movement patterns. For these applications we recommend the use of CNNs, as they can reliably model local patterns, are straight-forward to train, and are supported by the majority of deep learning frameworks "out-of-the-box". As CNNs (and DNNs) only focus on a fixed context-length they allow for more liberty in study design, where naturalistic, as well as protocol-driven data can be useful for training. For CNNs we recommend to start exploring learning-rates before optimising the architecture of the network, as the learning-parameters had the largest effect on performance in our experiments.

Probably the most approachable models for the practitioner are simple feed-forward DNNs. Their peak performance across the data-sets in this work can approach both the performance of CNNs and RNNs. However, we found that DNNs are very sensitive to their hyper-parameters and require a significant investment into parameter exploration. We advise the practitioner not to discard this model even if DNNs do not work well in a preliminary exploration. Furthermore we found that results from other fields do not necessarily translate into the ubicomp domain. Deep networks with more than three hidden layers do not work well in the data-sets utilised in this work. To an extent this may be explained by the lack of a generative pre-training procedure (e.g. [Hinton et al., 2006]).

6.1 Real-time HAR through RNNs

So far the application of HAR in ubicomp relies almost exclusively on a pipeline-based architecture that involves a segmentation step which mostly corresponds to a sliding-window extraction procedure [Bulling et al., 2014]. Inherent to this approach is that it introduces a delay between the recording of movements using inertial sensors and the earliest time they can contribute significantly to the inference procedure. While for offline analysis scenarios this is of no concern, it effectively prohibits applications that require real-time processing or an immediate response to e.g. user input.

We believe that this work is the first to apply a model to movement data that allows immediate, real-time inference at the same rate as the data is collected by the sensors. While we do not explicitly explore this in this work we believe that there is significant potential in the use of sample-by-sample RNNs (LSTM-S in this work) to allow for novel application scenarios of HAR in ubicomp and in HCI in general.
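As an illustration of what such a sample-by-sample application could look like, here is a hedged sketch of streaming inference with a per-timestep recurrent classifier (such as the LSTMTagger sketched in section 3.3): one forward step per incoming sensor reading, reusing the recurrent state, so a label is available at the sampling rate of the device.

```python
import torch

def stream_predictions(model, sample_stream):
    """Sample-by-sample (LSTM-S style) inference over a live sensor feed."""
    model.eval()
    state = None
    for sample in sample_stream:            # sample: (channels,) tensor
        with torch.no_grad():
            x = sample.view(1, 1, -1)       # (batch=1, time=1, channels)
            scores, state = model(x, state)
        yield scores.argmax(dim=-1).item()  # predicted class for this sample
```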

Acknowledgements

Part of this work was supported by the Engineering and Physical Sciences Research Council, Centre for Doctoral Training in Cloud Computing for Big Data [grant number EP/L015358/1].

References

[Alsheikh et al., 2015] Mohammad Abu Alsheikh, Ahmed Selim, Dusit Niyato, Linda Doyle, Shaowei Lin, and Hwee-Pink Tan. Deep activity recognition models with triaxial accelerometers. arXiv:1511.04664, 2015.

[Bachlin et al., 2009] Marc Bachlin, Daniel Roggen, Gerhard Tröster, Meir Plotnik, Noit Inbar, Inbal Meidan, Talia Herman, Marina Brozgol, Eliya Shaviv, Nir Giladi, et al. Potentials of enhanced context awareness in wearable assistants for Parkinson's disease patients with the freezing of gait syndrome. In ISWC, 2009.

[Bulling et al., 2014] Andreas Bulling, Ulf Blanke, and Bernt Schiele. A tutorial on human activity recognition using body-worn inertial sensors. ACM Computing Surveys (CSUR), 46(3):33, 2014.

[Chavarriaga et al., 2013] Ricardo Chavarriaga, Hesam Sagha, Alberto Calatroni, Sundara Tejaswi Digumarti, Gerhard Tröster, José del R. Millán, and Daniel Roggen. The Opportunity challenge: A benchmark database for on-body sensor-based activity recognition. Pattern Recognition Letters, 2013.

[Collobert et al., 2011] Ronan Collobert, Koray Kavukcuoglu, and Clément Farabet. Torch7: A matlab-like environment for machine learning. In BigLearn, NIPS Workshop, 2011.

[Duchi et al., 2011] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 12:2121–2159, 2011.

[Greff et al., 2015] Klaus Greff, Rupesh Kumar Srivastava, Jan Koutník, Bas R. Steunebrink, and Jürgen Schmidhuber. LSTM: A search space odyssey. arXiv preprint arXiv:1503.04069, 2015.

[Gregor et al., 2015] Karol Gregor, Ivo Danihelka, Alex Graves, and Daan Wierstra. DRAW: A recurrent neural network for image generation. arXiv:1502.04623, 2015.

[Hammerla and Plötz, 2015] Nils Y. Hammerla and Thomas Plötz. Let's (not) stick together: pairwise similarity biases cross-validation in activity recognition. In Ubicomp, 2015.

[Hammerla et al., 2015] Nils Y. Hammerla, James M. Fisher, Peter Andras, Lynn Rochester, Richard Walker, and Thomas Plötz. PD disease state assessment in naturalistic environments using deep learning. In AAAI, 2015.

[Hermann et al., 2015] Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems, pages 1684–1692, 2015.

[Hinton et al., 2006] Geoffrey E. Hinton, Simon Osindero, and Yee-Whye Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–1554, 2006.

[Hochreiter and Schmidhuber, 1997] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

[Hochreiter et al., 2001] Sepp Hochreiter, Yoshua Bengio, Paolo Frasconi, and Jürgen Schmidhuber. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies, 2001.

[Hutter et al., 2014] Frank Hutter, Holger Hoos, and Kevin Leyton-Brown. An efficient approach for assessing hyperparameter importance. In ICML, pages 754–762, 2014.

[Khan et al., 2015] Aftab Khan, Sebastian Mellor, Eugen Berlin, Robin Thompson, Roisin McNaney, Patrick Olivier, and Thomas Plötz. Beyond activity recognition: skill assessment from accelerometer data. In Ubicomp, pages 1155–1166. ACM, 2015.

[Lane et al., 2015] Nicholas D. Lane, Petko Georgiev, and Lorena Qendro. DeepEar: robust smartphone audio sensing in unconstrained acoustic environments using deep learning. In Ubicomp, pages 283–294. ACM, 2015.

[LeCun et al., 2015] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 2015.

[Neverova et al., 2015] Natalia Neverova, Christian Wolf, Griffin Lacey, Lex Fridman, Deepak Chandra, Brandon Barbello, and Graham Taylor. Learning human identity from motion patterns. arXiv:1511.03908, 2015.

[Ordóñez and Roggen, 2016] Francisco Javier Ordóñez and Daniel Roggen. Deep convolutional and LSTM recurrent neural networks for multimodal wearable activity recognition. Sensors, 16(1):115, 2016.

[Plötz et al., 2011] Thomas Plötz, Nils Y. Hammerla, and Patrick Olivier. Feature learning for activity recognition in ubiquitous computing. In IJCAI, 2011.

[Plötz et al., 2012] Thomas Plötz, Nils Y. Hammerla, Agata Rozga, Andrea Reavis, Nathan Call, and Gregory D. Abowd. Automatic assessment of problem behavior in individuals with developmental disabilities. In Ubicomp, 2012.

[Rad et al., 2015] Nastaran Mohammadian Rad, Andrea Bizzego, Seyed Mostafa Kia, Giuseppe Jurman, Paola Venuti, and Cesare Furlanello. Convolutional neural network for stereotypical motor movement detection in autism. arXiv:1511.01865, 2015.

[Reiss and Stricker, 2012] Attila Reiss and Didier Stricker. Introducing a new benchmarked dataset for activity monitoring. In ISWC, 2012.

[Ronao and Cho, 2015] Charissa Ann Ronao and Sung-Bae Cho. Deep convolutional neural networks for human activity recognition with smartphone sensors. In Neural Information Processing, pages 46–53. Springer, 2015.

[Ronaoo and Cho, 2015] Charissa Ann Ronaoo and Sung-Bae Cho. Evaluation of deep convolutional neural network architectures for human activity recognition with smartphone sensors. In Proc. of the KIISE Korea Computer Congress, pages 858–860, 2015.

[Snoek et al., 2012] Jasper Snoek, Hugo Larochelle, and Ryan P. Adams. Practical bayesian optimization of machine learning algorithms. In NIPS, 2012.

[Srivastava et al., 2014] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. JMLR, 2014.

[Yang et al., 2015] Jian Bo Yang, Minh Nhut Nguyen, Phyo Phyo San, Xiao Li Li, and Shonali Krishnaswamy. Deep convolutional neural networks on multichannel time series for human activity recognition. In IJCAI, 2015.

[Zeng et al., 2014] Ming Zeng, Le T. Nguyen, Bo Yu, Ole J. Mengshoel, Jiang Zhu, Pang Wu, and Juyong Zhang. Convolutional neural networks for human activity recognition using mobile sensors. In MobiCASE, pages 197–205. IEEE, 2014.

[Zhang et al., 2015] Licheng Zhang, Xihong Wu, and Dingsheng Luo. Human activity recognition with HMM-DNN model. In ICCI, pages 192–197. IEEE, 2015.
