Running head: RECOGNISING GESTURES USING TIME-SERIES ANALYSIS

RECOGNISING GESTURES USING TIME-SERIES ANALYSIS

MICHAEL COOLEN

TILBURG UNIVERSITY

Student number: 2031591

Administration number: u741570

Email address:

Supervisor: dr. P.A. Vogt

Supervisor email address:

Second reader: dr. M. Alimardani

Course: 880502-M-18 (Master thesis/Data Science in Action)

Faculty: Tilburg School of Humanities and Digital Sciences

Department: Department of Cognitive Science and Artificial Intelligence

Program: Data Science & Society

Date: May 29th, 2019

Word count: 9,791

Preface

This thesis has been written for the master program Data Science & Society at Tilburg University. Many thanks to my supervisor dr. Paul Vogt and to Jan de Wit for providing me with this interesting research opportunity. Thanks also to my fellow students for all the positive moments throughout the year. Finally, this thesis could not have been written without the support of my family and friends.


Abstract

Human-Robot Interaction is becoming more common. For robots to communicate with humans in a natural way, they will need non-verbal communication, and the first step in this process is recognising human gestures. Gestures can be recorded using a motion-sensing device such as a Kinect. This research builds on a previous study in which a large gesture dataset was created through human-robot interaction (de Wit et al., 2019a). In that study, 35 different types of gestures were recorded using a Kinect, and one-shot learning classified them with an accuracy of 23%. The goal of the current research was to find a method that increases this gesture classification accuracy. Time-series analysis was used, which has not been done before in this research field. The dataset was transformed into a featureset with 4,960 features, of which 146 were found to be important through feature selection. In total, 23 algorithms were tested on these features, and ensemble-type algorithms were found to work best for these kinds of features. After hyperparameter tuning, a simple Random Forest classified gestures best, with an accuracy of 47%. To increase this accuracy further, three state-of-the-art ensemble algorithms were tested, with CatBoost reaching a classification accuracy of over 50%. For generalization purposes, a fast and simple model was created using the 15 most important time-series features; this model achieves a classification accuracy of 35%.

Keywords: Human-Robot Interaction, Gesture Recognition, Kinect, Time-series, Machine Learning


Contents

Preface
Abstract
Introduction
Related Work
    Recording Gestures
    One-shot Learning
    Supervised Learning
    Time-series
    Supervised Learning Algorithms
Method
    Setup
    Dataset Description
    Pre-processing
    Feature Extraction
    Featureset Pre-processing
    Baseline
    Feature Selection
    Hyperparameter Tuning
    Voting
Results
    Algorithm Comparison
    Feature Selection
    First Comparison
    Hyperparameter Tuning
    Voting
    Extra Algorithms
    Gestures
Discussion
    Features
    Algorithms
    Limitations and Future Research
Conclusion
References
Appendix A
Appendix B
Appendix C
Appendix D
Appendix E
Appendix F
Appendix G
Appendix H
Appendix I
Appendix J


Introduction

Have you ever seen WALL·E? This popular Disney-Pixar film featured a small, cute robot that was liked by many (Morris, 2008). Although the robot hardly spoke a word during the film, you could still make out what it was trying to communicate: it communicated almost entirely through body language. The film showed that, even without language, we can understand robots just by looking at their body language.

Although hard to quantify, many researchers agree that non-verbal communication is more important than verbal communication (Beattie, 2004). By using gestures, the receiver of a message can get a better understanding of what the sender is trying to communicate (Kendon, 1994). While there are cultural differences, gestures appear to be used everywhere in the world (Graham & Argyle, 1975).

Additionally, we now live in a world where humans communicate with robots more and more frequently. Verbal communication with robots has been partly solved: for instance, our current phones can recognise simple speech and provide us with information. Robots, however, can neither read our non-verbal language nor produce it.

Small applications of simple non-verbal social robots exist, but currently no robot exists that can automatically recognise gestures and communicate with humans in a natural way (Adăscălitei, Doroftei, Lefeber, & Vanderborght, 2014).

There would be many practical benefits if robots could communicate with humans in a natural way. This research field, called Human-Robot Interaction, has recently attracted much research attention. Some applications exist in the form of language tutoring for children (de Wit, Krahmer, & Vogt, 2019b), and research is currently being done on how robots can help children learn a second language (Vogt et al., 2019). Another field in which social robots could play a significant role is elderly support (Montemerlo, Pineau, Roy, Thrun, & Verma, 2002). Furthermore, many applications exist in the entertainment industry, for example a dancing partner robot (Montemayor et al., 2000). More applications of social robots are expected in the future as research continues.

The first step in this non-verbal communication problem is solving gesture recognition.

Robots do not have human eyes, so they need another way of perceiving humans and their gestures. Researchers often use motion-sensing input devices such as the Microsoft Kinect™ to record human movement (Pfister, West, & Noah, 2014). Such a device records extensive information about human movement, which can then be used for data analysis.

Multiple approaches can be taken to gesture recognition. A newly created robot does not know any gestures, and creating many training examples for each gesture would take a lot of time. Therefore, one-shot learning is used as a method to recognise gestures from very few examples (Escalante, Guyon, Athitsos, Jangyodsuk, & Wan, 2017). Once a lot of data has been collected, supervised learning can be used; with machine learning, gesture classification improves only when a lot of training data is available.

The current research was inspired by previous research in which a large gesture dataset was created (de Wit et al., 2019a). By letting a robot play a game of charades with a human, many gestures were recorded. In that study, one-shot learning using the gist of the gesture was applied to classify gestures (Cabrera & Wachs, 2017), which led to a classification accuracy of 23%. Now that this gesture dataset has been created, supervised learning can be applied to it to improve gesture classification accuracy. This study will use supervised machine learning methods to improve gesture recognition.

Supervised learning using Kinect data requires feature extraction. There is still no foolproof way to extract features from gesture data, as every study seems to use its own method (Biswas & Basu, 2011; Marin, Dominio, & Zanuttigh, 2014; Xia, Chen, & Aggarwal, 2011). Likewise, it is unknown which machine learning algorithms work well on this kind of data: most researchers use similar algorithms, although many more are available (Bhattacharya, Czejdo, & Perez, 2012; D'Orazio, Marani, Renó, & Cicirelli, 2016). This study will attempt a new method of feature extraction to support the existing literature on this subject: using time-series analysis, it will attempt to classify gestures.

Time-series features can be viewed as statistics that describe the data, for example the mean or standard deviation. Although time-series analysis has been applied successfully in other research fields (Pincus, 1991), this method has not been applied in gesture recognition research.

This leads to the following research question:

Can time-series analysis be used for gesture recognition?

The two aspects that this research focuses on can be converted into two sub-questions.

a) How well do time-series features work for gesture classification?

Since gesture data has a time dimension, simple statistics such as the mean and variance can be calculated over it. These statistics, also called time-series features, are simple to calculate but might work well for gesture recognition. No research has been found using this method of feature extraction in this field, so it is unknown how well it will work.

b) What machine learning algorithms work best on gesture time-series features?

This question will be answered by testing different machine learning algorithms to see which ones work best. Since 'best' is a relative term, the goal is to achieve as high an accuracy as possible. However, there might be a trade-off between speed and accuracy: if one algorithm achieves a slightly better accuracy than another, but the faster algorithm comes close, then the faster algorithm is considered more efficient and is preferred.

Related Work

Some research has been done on recognising gestures. It is a multi-step process wherein the choice made at each step affects the final result (Mitra & Acharya, 2007). First, gestures have to be recorded and turned into data that can be processed. Next, features have to be extracted from the data for analysis. Finally, an algorithm has to be used to perform classification.

Recording Gestures

Several methods exist for capturing human movement. One method is to recognise human movement from images or videos (Rahman & Afrin, 2013). Different algorithms can be used to process image data, and research shows that an acceptable classification accuracy is achievable, although often the same algorithms are used (Morency, Quattoni, & Darrell, 2007). Analysing videos of gestures is also possible, but research in this field is often limited to hand gestures or other simple gestures (Lee, Lee, Lee, & Hong, 2004).

In more recent years, motion-sensing input devices such as the Microsoft Kinect™ have become available. Although originally developed as an Xbox gaming accessory, researchers have shown interest in these devices because of their ability to recognise human movement (Ren, Yuan, Meng, & Zhang, 2013). The device records human movement in 3D over time using an infrared laser projector and an RGB (red, green, blue) camera (Zhang, 2012).

There are multiple ways to utilize the Kinect for research. Researchers can decide to use only the depth camera (Uddin, Thang, & Kim, 2010), and the use of only the RGB sensor data is also seen in research (Biswas & Basu, 2011). Lastly, the Kinect can provide skeleton data: the X, Y and Z coordinates of a subject's skeleton joints over time (Le & Nguyen, 2013; Raptis, Kirovski, & Hoppe, 2011).

More recently, the Leap Motion device has become available (Leap Motion Inc, 2012). While working similarly to a Kinect, this device specialises in recording hand and finger movement, and it can record hand movements more accurately than a Kinect (Weichert, Bachmann, Rudak, & Fisseler, 2013). Research shows basic gestures can be recognised with an acceptable accuracy (Marin, Dominio, & Zanuttigh, 2014).


One-shot Learning

Gesture classification can be approached in a few ways. While regular machine learning tasks often use hundreds of training examples, one-shot learning refers to classifying gestures based on just a few training samples (Escalante, Guyon, Athitsos, Jangyodsuk, & Wan, 2017). A newly developed robot has to start learning to discriminate between gestures, so researchers often provide a few training examples from which it can begin learning. Many methods for one-shot gesture classification have been developed.

One method of recognising gestures is Template Matching (Aggarwal & Cai, 1999). In this method, movement is identified by applying templates, or so-called meshes, to an image. These templates can be used to compare and match gestures. One study used parameters extracted from depth motion data to differentiate between gestures: first the background was removed from frames using a grayscale threshold, after which features were extracted from these frames (Mahbub, Imtiaz, Roy, Rahman, & Ahad, 2013). Another study took a similar approach and used both the RGB and depth sensor data from a Kinect; morphological denoising was applied to the depth images, and human silhouettes were segmented using temporal segmentation (Wu, Zhu, & Shao, 2012).

Another method for recognising gestures is Action Recognition (Ji & Liu, 2010). In this field, different approaches can be taken. One study used a language-motivated approach that works similarly to how the topics of documents can be detected from their contents: using a hierarchical Bayesian model, visual features of videos of action poses can be connected to classes of activities (Malgireddy & Nwogu, 2013). Other researchers created a Moving Pose framework, a descriptor that uses pose information as well as the speed and acceleration of human body joints within a short time window (Zanfir, Leordeanu, & Sminchisescu, 2013). Yet another method selects only relevant frames from a video and then uses motion descriptors for temporal segmentation in combination with Dynamic Time Warping (Konečný & Hagara, 2014).

Moreover, gestures can be recognised using Manifold Learning, a machine learning approach based on dimensionality reduction in which the datasets are treated as high dimensional (Seung & Lee, 2000). One study used the geometry of tensor space for action recognition: by representing videos as tensors, only useful information is extracted, which can then be used for gesture recognition (Lui, 2012).

Another method of gesture recognition uses Principal Motion Components. Using only a single training video, a 2D map of motion energy is extracted from the video (Escalante, Guyon, Athitsos, Jangyodsuk, & Wan, 2017). This 2D map can be used to categorise gestures. The benefit of this method is its performance and efficiency.

Recent research on this topic focuses on how humans produce gestures instead of on the gesture itself (Cabrera & Wachs, 2017). From a single gesture recording, biomechanical features are extracted, called the gist (Cabrera & Wachs, 2016). This is a natural approach, because the gist represents what humans remember after seeing a gesture, in combination with the cognitive processes used to replicate it. From a gist, many new realistic observations are created that are similar to the one provided; the goal is to generate a large dataset of similar observations by adding meaningful variability to these features. After this process, classifiers can be used for recognition. This approach was also used in recent research (de Wit et al., 2019a).

Supervised Learning

Another approach to gesture recognition is Supervised Learning. This method can only be applied successfully if many training examples are available (Graves, 2012). Since the previous research (de Wit et al., 2019a) produced a large gesture database, gesture classification accuracy can potentially be improved using this method. Supervised learning requires effective features to be extracted from the data.

The data collected in the previous research also contains the Skeleton Feature Representation, which is often used for gesture analysis (Raptis, Kirovski, & Hoppe, 2011). This representation contains the subject's joint positions as X, Y and Z coordinates over time (Shotton et al., 2011). The origin (x = 0, y = 0, z = 0) is located at the centre of the sensor. Looking from the sensor's point of view (see Figure 1), the X dimension grows to the left, the Y dimension grows upward, giving an indication of height, and the Z dimension grows outward in the direction the camera faces (Microsoft, 2014).

Figure 1. Kinect Coordinate System (Microsoft, 2014).

Many methods are used to extract features from these skeleton representations. One way is to calculate the joint angles with respect to the person's torso (Gu, Do, Ou, & Weihua, 2012). Other researchers were able to distinguish between a positive and a negative emotional dance by extracting features based on the upper body, velocity, acceleration and the angles between different joints (Saha, Shreya, Konar, & Nagar, 2013). Similarly, other researchers have tried to select the most important joints and then calculate the angles between them (Le & Nguyen, 2013); this worked well when a single frame was selected, but performed poorly when the full time-series was used.

Looking at more complicated ways of feature extraction, Principal Component Analysis was used by some researchers to divide the skeleton positions into three sections: torso, first-degree joints and second-degree joints (Raptis, Kirovski, & Hoppe, 2011). Another way of extracting features is to develop a Dynamic Time Warping template, wherein a similarity value is produced by warping time sequences of joint positions (Celebi, Aydin, Temiz, & Arici, 2013). It is hard to compare these individual feature extraction methods, but most researchers agree that structural information around the hand joints is among the most useful features for discriminating between gestures (Escalera et al., 2013).

Time-series

Time is an important component of human life, and humans have been analysing changing events for many centuries. One of the best-known time-series analyses is that of the weather (Chen & Hwang, 2000): the temperature changes through time, and patterns can be found by analysing that change. A more advanced example of time-series analysis is earthquake prediction, where the time and date of earthquakes are recorded in order to predict where the next one will occur (Moustra, Avraamides, & Christodoulou, 2011). What these analyses have in common is that they all use the component of time: multiple observations are made over time in order to analyse the dynamics. Correspondingly, gesture recognition can be viewed as a time-series problem, as human movement is observed for some period of time during recording.

Time-series can be characterized in multiple ways (Fulcher, 2018). For example, a time-series might follow a distribution such as the normal distribution, or the values recorded over time might be correlated with each other. Additionally, simple statistics such as the median, maximum or minimum can be used to describe characteristics of a time-series. In short, time-series are described, or characterized, using statistics. This leads to a feature-based representation of the time-series, which can be used for analysis.
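As a minimal illustration of this feature-based representation (a sketch with made-up data, not code from any of the cited studies), a time-series can be condensed into a fixed-length vector of summary statistics:

```python
import numpy as np
from scipy import stats

# A hypothetical time-series: the Y position of a hand over 100 frames.
np.random.seed(0)
y = np.cumsum(np.random.normal(size=100))

# Summary statistics turn a variable-length signal into a fixed-length
# feature vector that conventional classifiers can consume.
features = {
    "mean": np.mean(y),
    "median": np.median(y),
    "std": np.std(y),
    "minimum": np.min(y),
    "maximum": np.max(y),
    "skewness": stats.skew(y),
    "kurtosis": stats.kurtosis(y),
}
print(features)
```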

Time-series statistics have been used in many previous studies. For example, researchers have shown that entropy can be used successfully for prediction (Pincus, 1991). Others have shown that statistics such as the mean, standard deviation, skewness, and kurtosis can be used to analyse control chart pattern data (Nanopoulos, Alcock, & Manolopoulos, 2001). Yet other researchers used thirteen features, including skewness, kurtosis, chaos, and nonlinearity, to summarise time-series (Wang, Smith, & Hyndman, 2006).

There are multiple ways to analyse time-series. If all time-series are of equal length, similarity distances such as the Euclidean distance can be computed between them (Ding, Trajcevski, Scheuermann, Wang, & Keogh, 2008). If they are not of equal length, methods such as Dynamic Time Warping (DTW) can be used (Berndt & Clifford, 1994). DTW is often used in combination with an algorithm such as nearest neighbours, often leading to great results (Bagnall, Lines, Bostrom, Large, & Keogh, 2017). However, the literature argues that instead of creating new algorithms to analyse time-series, research should focus on transforming time-series into useful features and using proven algorithms to analyse them (Bagnall, Davis, Hills, & Lines, 2012; Fulcher & Jones, 2014). In other words, research should focus on computing different statistics over the series and analysing them with conventional algorithms.

Looking at gesture recognition, some research has been done using time-series analysis. For example, DTW has been used as a method for recognising sign language (Ten Holt, Reinders, & Hendriks, 2007), and the Euclidean distance has been used to recognise gestures in Kinect video recordings (Ren, Yuan, Meng, & Zhang, 2013). Another method of analysing time-series is Hidden Markov Models. This algorithm can easily be applied to time-series such as gestures: it is relatively easy to implement, and research shows that a good accuracy can be achieved in recognising gestures (Uddin, Thang, & Kim, 2010). What is lacking in the literature, however, is research on using time-series features in combination with conventional algorithms. According to previous research, this direction has a lot of potential (Bagnall, Davis, Hills, & Lines, 2012; Fulcher & Jones, 2014). The upside of using conventional algorithms is that they are tested and proven to be reliable.

Supervised Learning Algorithms

Many machine learning algorithms are available for supervised gesture classification. There is no single best algorithm for classification (Wolpert, 1996), so researchers must try different algorithms to see what works. Focusing on research that uses features extracted from the skeleton feature representation, a number of supervised machine learning algorithms are commonly used.

Firstly, one of the most basic classification methods is the Decision Tree, which has been used for classifying gestures (Patsadu, Nukoolkit, & Watanapa, 2012). The benefit of Decision Trees is that they are very fast and easy to comprehend (Safavian & Landgrebe, 1991). However, they often lack performance compared to more advanced algorithms (Bhattacharya, Czejdo, & Perez, 2012).

Secondly, ensemble algorithms are slightly more comprehensive but still considered simple (Dietterich, 2000). They can be divided into three categories: Bagging, Boosting and Random Forests. The most popular algorithm appears to be the Random Forest, which is also often used in gesture classification (Shotton et al., 2011); Random Forests are simple, easy to implement and use few processing resources (Segal, 2003). Boosting algorithms such as AdaBoost have been used in previous research with good results (Saha, Datta, Konar, & Janarthanan, 2014). Bagging algorithms, however, have not been found in previous research, although they should have some potential given their similarity to other ensemble algorithms (Dietterich, 2000). Although more advanced algorithms are available that are potentially better for gesture classification, ensemble algorithms seem a solid choice for this task (Shotton et al., 2011).

Furthermore, Nearest Neighbours is a simple supervised learning method that can classify data. The most common algorithm in this category, k-nearest neighbours, is often used in multi-class classification (Escalera et al., 2013). In gesture classification this algorithm is also popular because it is simple and fast (Lai, Konrad, & Ishwar, 2012), and some research shows that a good accuracy is achievable on simple gesture classification tasks (Saha, Datta, Konar, & Janarthanan, 2014). Related neighbour-based algorithms are the Nearest Centroid and Radius Neighbours classifiers. These have not been used in previous research on gestures, so it is not known whether they work as well as Nearest Neighbours.

Moreover, regression methods such as Logistic Regression appear to be sparsely used in gesture classification. One study shows that Logistic Regression can be used successfully in gesture recognition (Itauma & Kivrak, 2012). Why logistic regression is seldom used is unknown; it should have some potential, as one study found that it performed well compared to other basic machine learning algorithms (Rosa-Pujazón, Barbancho, Tardón, & Barbancho, 2016).

Additionally, Naïve Bayes has not been used as an algorithm for classifying gestures. It should have some potential, however, as similar research has shown that humans can be recognised with this algorithm using Kinect data (Preis, Kessel, Werner, & Linnhoff-Popien, 2012). According to previous research, Naïve Bayes should compare well with other types of algorithms (Rish, 2001). Two variants that can be used for classification are Gaussian (John & Langley, 1995) and Bernoulli Naïve Bayes (Narayanan, Arora, & Bhatia, 2013); it is not known how well they will perform in classifying gestures.

Next, a solid machine learning algorithm often used for classification is the Support Vector Machine. Although more demanding of processing power, SVMs often lead to better accuracy than simpler algorithms (Hsu & Lin, 2002). This algorithm has also been applied to gesture classification, often with good success (Orasa, Nukoolkit, & Watanapa, 2012).

In contrast to SVMs, Relevance Vector Machines are seldom applied in gesture recognition. One study shows that RVMs achieve results similar to SVMs (Nguyen & Hai-Son, 2015). It is unknown how well they will perform in time-series based gesture recognition.

Finally, Artificial Neural Networks are often used for gesture classification (Joshi, Ghosh, Betke, Sclaroff, & Pfister, 2017). The benefit of using neural networks is that raw data can be used, so no features need to be created (Chen & Koskela, 2015). On the other hand, they often take a long time to train, although they often lead to better accuracy (Cho & Xi, 2014). A basic form of an ANN is the Multi-Layer Perceptron, and it has been shown that an MLP can be used to recognise people (Sinha, Chakravarty, & Bhowmick, 2013). More complicated neural networks are available for gesture recognition, but these are beyond the scope of the current research.

Other algorithms that do not fit these categories are the Gaussian Process (Rasmussen, 2003) and Discriminant Analysis (Altman, 1968). These have not been used in previous research using a Kinect, and they will also be tested in the current research to see how well they perform.


Method

Setup

For the current research, the latest version of Python was used (3.7.3). Several basic Python packages were used for data management and calculations (see Table 1); these were the latest versions available at the time of writing. Programming and calculations were done using Jupyter notebooks (version 5.7.4). All computations were run on a Windows 10 system with a 3.40 GHz (4-core) processor and 16 GB of memory. For replication purposes, using a similar system is advised, as many computations took a long time to run and the system occasionally ran out of memory.

Table 1

Python packages and their versions used in this study

Package        Version
NumPy          1.16.2
Pandas         0.24.2
SciPy          1.2.1
Scikit-learn   0.20.3
IPython        7.4.0
Matplotlib     3.0.3

Dataset Description

The dataset used in the current research is the Lowlands/NEMO dataset, which was created in previous research (de Wit et al., 2019a). This dataset was created through human-robot interaction, specifically by letting a robot play a game of charades with a human: the human performed a gesture, which was recorded using a Kinect, and the robot then tried to recognise it. Although that study covered much more, this part of it resulted in a large gesture dataset.

The dataset consists of 3,760 files, each representing one participant performing a gesture. In total, gestures for 35 different concepts were recorded. Each recorded gesture contains the X, Y, and Z positions of body parts such as the head, spine and hands. Moreover, the files contain the orientations of these body parts in the X, Y, Z, and W dimensions. Lastly, they contain some other columns, such as the state of the hands (open or closed), the confidence of these states, and some information about the face (pitch, roll, and yaw). The data files have 155 columns in total (see Appendix A for a description). Some participants performed an additional attempt at expressing the gesture; whether the gesture data came from the first or second attempt can be retrieved from the filename. Likewise, data was gathered at either the NEMO or the Lowlands venue, and this information can also be retrieved from the filename.

Pre-processing

There was no missing data in the dataset. Before processing, all fields were converted to float, except the integer columns (rhandState, lhandState, facepitch, faceyaw and faceroll). The columns 'rhandconfidence' and 'lhandconfidence' contained the values 'low' and 'high' and were recoded to 0 and 1.

After an inspection of the data, idleness was detected in some cases: the Kinect had started recording, but almost no movement was detected yet, probably because the participant had not started moving. Similarly, in some cases idleness was recorded at the end, presumably because the participant had finished the gesture while the Kinect was still recording (see Figures 2 and 3).

In order to remove this idleness, two cut-off points were calculated based on the up and down (Y) movement of the right hand. First, a movement threshold was calculated as threshold = 0.10 × |max − min|. The first cut-off point was then found by starting from the beginning and selecting the first observation that moved more than the threshold; the second cut-off point was found by starting from the end and selecting the first observation that moved more than the threshold. Since the hands are presumed to be the body parts that give the most information (Escalera et al., 2013), using the up and down movement of the right hand seemed a rational choice. The right hand was chosen because most people are right-handed and achieve higher accuracy on movement tasks with this hand (Hanna et al., 1997).
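A sketch of this trimming procedure is shown below. The column name rhandposY and the exact definition of 'moved' (deviation from the resting position at the start or end) are assumptions for illustration, not the thesis code.

```python
import numpy as np

def trim_idleness(y, fraction=0.10):
    """Return the indices of the two cut-off points for a recording.

    y is the Y (up/down) position of the right hand over time. A frame
    counts as movement once it deviates from the resting position by
    more than 10% of the total movement range.
    """
    threshold = fraction * abs(np.max(y) - np.min(y))
    # First frame that has moved away from the initial resting position.
    moved_from_start = np.flatnonzero(np.abs(y - y[0]) > threshold)
    # Last frame that still differs from the final resting position.
    moved_from_end = np.flatnonzero(np.abs(y - y[-1]) > threshold)
    start = moved_from_start[0] if moved_from_start.size else 0
    end = moved_from_end[-1] if moved_from_end.size else len(y) - 1
    return start, end

# Usage: keep only the active part of one recording.
# y = gesture["rhandposY"].to_numpy()
# start, end = trim_idleness(y)
# gesture = gesture.iloc[start:end + 1]
```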

After additional inspection, some glitched cases were detected (the first case in Figure 2). These glitched cases are characterised by mixed-up rows: the time column was not ordered, which resulted in a scrambled order of the data. This could cause problems in some research, but it is not a problem here, since time-series statistics such as the mean and standard deviation calculated over a column do not change when the order changes. These cases were therefore not excluded from further analysis.

Figure 2. Detection of right hand movement (Y dimension) for 10 random cases. Data within the two cut-off points is used.

Figure 3. Using two cut-off points to detect hand movement.

Feature Extraction

Since the data consist of time-series, time-series statistics such as the mean, variance or standard deviation can be calculated on them. The Python package TSFRESH (version 0.11.2) was used to calculate these statistics in an easy manner (Christ, Braun, Neuffer, & Kempa-Liehr, 2018). TSFRESH can currently calculate 65 different features on a time-series; in the current research, 32 of these were chosen (see Appendix B for a full list) because they were the easiest to comprehend and much faster to calculate than the other features.

For all 3,760 gestures, multiple actions were performed. Firstly, the chosen time-series features were calculated on each column. Since the gesture files have 155 columns, this resulted in a featureset of 32 × 155 = 4,960 columns. Thereafter, the current gesture (the target), the origin of the data (Lowlands or NEMO, recoded to 0 and 1) and the attempt (first or second) were added as columns; these were not used in any model calculation, to prevent leakage. The result was a final featureset with 3,760 rows and 4,963 columns.
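A minimal sketch of such an extraction with TSFRESH is given below. The long-format layout with gesture_id and time columns is a hypothetical assumption, and MinimalFCParameters is used as a stand-in for the 32 hand-picked features of Appendix B.

```python
import pandas as pd
from tsfresh import extract_features
from tsfresh.feature_extraction import MinimalFCParameters

# Hypothetical long-format table: one row per frame, with a 'gesture_id'
# column identifying the recording, a 'time' column ordering the frames,
# and the 155 signal columns.
long_df = pd.read_csv("gestures_long.csv")  # assumed file layout

# extract_features computes every configured statistic on every signal
# column, yielding one row per gesture and one column per
# (signal, statistic) pair.
featureset = extract_features(
    long_df,
    column_id="gesture_id",
    column_sort="time",
    default_fc_parameters=MinimalFCParameters(),
)
```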

A copy of this featureset was saved in which outliers were removed. The current study theorizes that the time humans take to perform a gesture is normally distributed: if many people perform the same gesture, they are likely to take approximately the same time. One side note is that humans can take different approaches to the same gesture; for example, there are several ways to express an airplane. Nonetheless, removing outliers based on gesture duration also removes erroneous cases, such as those in which the recording went on for too long after the gesture ended.

For this reason, outliers were removed based on the time it took a participant to perform a gesture. Outlier detection was based on the z-score, using the SciPy Python package (Jones, Oliphant, & Peterson, 2001-). Cases were removed if their z-score was above +3 or below -3. This removed 83 outliers in total (2.21% of the data); the maximum number of removed cases for a single class was 5, for the gesture fish. This resulted in a featureset with 3,677 rows (see Appendix C for the full result).
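A sketch of this z-score filter, assuming a durations array holding one duration per gesture (e.g. the number of frames between the two cut-off points):

```python
import numpy as np
from scipy import stats

def remove_duration_outliers(featureset, durations, z_cutoff=3.0):
    """Drop gestures whose duration deviates more than 3 standard
    deviations from the mean duration."""
    z = stats.zscore(durations)
    keep = np.abs(z) < z_cutoff  # boolean mask over the rows
    return featureset[keep]
```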


Featureset Pre-processing

Correlated features were removed for two reasons. Firstly, the featureset had too many features (4,963): since the number of features is higher than the number of cases (3,760), the featureset is high dimensional, which causes problems in modelling (Donoho, 2000). One of these problems is overfitting, where the model learns too many characteristics of the training set and therefore cannot generalize well to the test set (or new data) (Xing, Jordan, & Karp, 2001). Another problem is that fitting the algorithms would take too long with this many features. Hence, features with a correlation of .90 or higher were removed. In total, 2,521 features were removed, resulting in a featureset of 2,440 columns.
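A common pandas/NumPy idiom for this correlation filter is sketched below (an illustrative implementation, not necessarily the exact code used):

```python
import numpy as np

def drop_correlated(X, threshold=0.90):
    """Drop one feature of every pair whose absolute Pearson correlation
    reaches the threshold. X is a pandas DataFrame of features."""
    corr = X.corr().abs()
    # Keep only the upper triangle so each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] >= threshold).any()]
    return X.drop(columns=to_drop)
```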

Baseline

In the original research, gesture classification was done using one-shot learning; using a k-nearest-neighbours approach, the researchers classified the gestures with an accuracy of 23%, which will be used as the baseline for the current study.

In the current study, 23 different machine learning algorithms in total were used to model the data. At least one algorithm was chosen from every available category of supervised classification algorithms. Some ensemble algorithms such as AdaBoost and Random Forest were used, as they are a popular choice in gesture recognition (Shotton et al., 2011). From the linear algorithms, classifiers such as Logistic Regression and Ridge Regression were used; linear algorithms were chosen because they are seldom used in previous research. Additionally, some Support Vector Machine classifiers were used, such as SVC, NuSVC and LinearSVC; these have been used in some previous research and have resulted in good accuracy (Hsu & Lin, 2002). A full list of the algorithms used in this research can be found in Appendix D. For each algorithm, a 10-fold cross validation was performed in order to obtain an unbiased accuracy estimate, and the time it took each algorithm to run was recorded, as sketched below.
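The comparison loop might look as follows; X and y stand for the featureset and the gesture labels, and only four of the 23 classifiers are shown:

```python
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.model_selection import cross_validate

# X: featureset (DataFrame), y: gesture labels.
classifiers = {
    "RandomForestClassifier": RandomForestClassifier(),
    "ExtraTreesClassifier": ExtraTreesClassifier(),
    "LogisticRegression": LogisticRegression(),
    "RidgeClassifier": RidgeClassifier(),
}

for name, clf in classifiers.items():
    scores = cross_validate(clf, X, y, cv=10, return_train_score=True)
    print(
        f"{name}: train={scores['train_score'].mean():.4f}, "
        f"test={scores['test_score'].mean():.4f}, "
        f"fit time={scores['fit_time'].mean():.2f}s"
    )
```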

Feature Selection

There are multiple ways to select features in classification. One method to consider is Principal Component Analysis. PCA is often used on large datasets to transform the variables into a set of independent components (Jolliffe, 2014): it reduces the number of variables by combining them into components that explain most of the variance in the data. A problem with this method, however, is that the new principal components can be difficult to interpret. Applied to the current featureset, the resulting components would be extremely hard to interpret, and a PCA would take a long time to run on a featureset of this size. PCA was therefore deemed not suitable for the current featureset.

Another way to select features is by using Feature Importance. Some ensemble machine learning algorithms, such as Random Forest, have a built-in method for ranking features (Scikit-learn, n.d.), which allows the importance of features to be compared. However, this method gives no information about how many features should be selected for further analysis.

A good method to select features is Recursive Feature Elimination (RFE). RFE ranks features by recursively running models with subsets of features in order to find the set of features that results in the model with the highest accuracy (Guyon, Weston, Barnhill, & Vapnik, 2002). It starts with the base model and all the features, then drops the feature with the least importance and runs the model without it, stopping when no more improvement in accuracy is found (Yan & Zhang, 2015). The only problem with using RFE on this featureset is that it takes a long time to run.

The base featureset contains 2,512 features, so many models must be run to find the most important ones. To increase processing speed, the featureset was split into four subsets, and RFE was applied to each subset to find its best features. This resulted in four sets of important features, which were combined into one set; RFE was then applied to this combined set to find the most important features overall (see the sketch below).
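A sketch of this two-stage procedure using scikit-learn's RFECV (which eliminates features recursively and keeps the subset with the best cross-validated accuracy) is shown below; the step sizes and the number of inner folds are illustrative assumptions:

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import RFECV

# Stage 1: split the columns into four subsets and run RFE on each.
selected = []
for cols in np.array_split(np.arange(X.shape[1]), 4):
    rfe = RFECV(ExtraTreesClassifier(), step=50, cv=3)
    rfe.fit(X.iloc[:, cols], y)
    selected.extend(cols[rfe.support_])  # surviving column indices

# Stage 2: run RFE again on the combined survivors.
final = RFECV(ExtraTreesClassifier(), step=1, cv=3)
final.fit(X.iloc[:, selected], y)
best_features = X.columns[selected][final.support_]
print(f"{len(best_features)} features selected")
```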

After feature selection, all the algorithms were tested again with the best selected features. This first comparison shows how effective feature selection is for each individual algorithm. Feature selection may be a step to avoid when processing time is large and the gains are small; it was done in this study in order to show the maximum potential of the algorithms. Research has shown that Decision Trees and SVMs often gain from feature selection (Navot, Gilad-Bachrach, Navot, & Tishby, 2005; Tirelli & Pessani, 2011), and feature selection can also prevent overfitting (Reunanen, 2003).

Hyperparameter Tuning

Some algorithms don’t perform very well until their hyperparameters are tuned. Using default hyperparameters results in not showing the algorithm’s full potential. Tuning hyperparameters can also prevent overfitting. To see which hyperparameters work best, different hyperparameters will be tuned. For algorithms that have a long fitting time, not many hyperparameters can be tested as processing time will be too long. To see a list of hyperparameters tested in this study see Appendix E.

Voting

To further improve the accuracy, a majority voting classifier will be tested. This type of classifier combines multiple algorithms into one 'super' algorithm for the prediction of classes (Kuncheva & Rodríguez, 2014); a benefit is that it can balance out the weaknesses of individual algorithms. Two methods of voting are available. Firstly, hard voting is based on the majority vote: if ten algorithms are used with a voting classifier and six of them predict that a gesture is a crocodile, then the voting classifier predicts a crocodile. Secondly, soft voting uses probabilities, and the class with the highest average probability is chosen (see Figure 4).

Hard Voting

Classifier    KNN    Decision Tree   Random Forest   Gradient Boost   Logistic Regression   SVC
Prediction    Bus    Motorcycle      Motorcycle      Bus              Motorcycle            Car

Final prediction: Motorcycle (three of the six votes)

Soft Voting

Classifier    KNN    Decision Tree   Random Forest   Gradient Boost   Logistic Regression   SVC
Prediction    Bus    Motorcycle      Motorcycle      Bus              Motorcycle            Car
Probability   67%    60%             17%             32%              20%                   51%

Average probability per class: Bus (67 + 32) / 2 = 49.5%; Motorcycle (60 + 17 + 20) / 3 = 32.33%; Car 51%
Final prediction: Car

Figure 4. Hard and soft voting algorithms.
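A minimal sketch of such an ensemble with scikit-learn's VotingClassifier, combining a hypothetical subset of the tuned estimators (X_train, y_train, X_test and y_test are assumed splits):

```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import (
    BaggingClassifier,
    ExtraTreesClassifier,
    RandomForestClassifier,
    VotingClassifier,
)

voting = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier()),
        ("bag", BaggingClassifier()),
        ("et", ExtraTreesClassifier()),
        ("lda", LinearDiscriminantAnalysis()),
    ],
    voting="hard",  # "soft" averages predict_proba outputs instead
)
voting.fit(X_train, y_train)
print(voting.score(X_test, y_test))
```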

Results

Algorithm Comparison

Two feature sets were tested to establish a baseline: one with outliers and one without. In total, 23 algorithms were run on the feature sets, all using 10-fold cross validation in order to acquire an unbiased test accuracy and all using their default hyperparameters. This resulted in the baseline shown in Table 2 and Figure 5.

Looking at the results, it is noticeable that many algorithms tend to overfit on the training data even though 10-fold cross validation was used; this can be an indication that too many features are being used. The algorithm that produces the highest accuracy is the RandomForestClassifier. Another goal of this comparison was to find an algorithm suitable for feature selection: a fast model is needed, because during Recursive Feature Elimination many models are tested.

Surprisingly, the Support Vector Machine algorithms did not perform very well. This can be explained by the fact that SVMs are often very sensitive to hyperparameter tuning, which was not done in this step (Duan, Keerthi, & Poo, 2003). Moreover, the top five methods are all ensemble algorithms; this also explains their overfitting, as ensemble algorithms have a tendency to overfit (Dietterich, 2000).

The average difference in accuracy between the featureset with outliers and the one without was 0.13%, which is negligible. As 2.21% of the data was removed from the no-outliers featureset during pre-processing, it seems no important information was lost. In the following analyses, only the featureset without outliers was used.

Two algorithms did not run successfully during the first baseline test. Firstly, GaussianProcessClassifier resulted in a memory error. An explanation is that the algorithm tries to create a matrix of shape (n, n), in this case (3677, 3677): a matrix with 13,520,329 elements, which is too large for the memory (Metzen, 2018). Secondly, RadiusNeighborsClassifier resulted in the error "No neighbours found" when using default hyperparameters.


Table 2

A baseline of the algorithms' performances

Classifier                       Train Acc.   Test Acc.   Test Acc.   Execution    Diff. Outliers
                                 Mean         Mean        3×STD       Time (s)     Test Acc.
RandomForestClassifier           1.0000       0.3617      0.0440      10.63        -0.0013
ExtraTreesClassifier             1.0000       0.3506      0.0583      4.71         +0.0026
GradientBoostingClassifier       1.0000       0.2677      0.0419      1327.45      -0.0153
RidgeClassifier                  1.0000       0.2617      0.0330      1.51         +0.0024
NuSVC                            0.9686       0.2591      0.0229      189.82       +0.0071
BaggingClassifier                0.9965       0.2526      0.0273      65.18        -0.0075
LogisticRegression               1.0000       0.2418      0.0298      9.37         +0.0118
BernoulliNB                      0.5535       0.2364      0.0450      0.49         -0.0022
PassiveAggressiveClassifier      0.9975       0.2363      0.0425      23.38        +0.0087
LinearSVC                        1.0000       0.2302      0.0309      183.65       +0.0072
SGDClassifier                    0.9003       0.2095      0.0863      8.76         +0.0140
NearestCentroid                  0.4037       0.1918      0.0198      0.32         +0.0022
Perceptron                       0.7930       0.1908      0.0574      7.83         +0.0104
SVC                              0.5111       0.1898      0.0326      197.00       +0.0050
DecisionTreeClassifier           1.0000       0.1737      0.0275      9.78         +0.0007
MLPClassifier                    0.2797       0.1284      0.1509      160.11       -0.0445
KNeighborsClassifier             0.3708       0.1145      0.0188      1.14         +0.0039
GaussianNB                       0.4597       0.0918      0.0133      0.52         -0.0012
AdaBoostClassifier               0.1078       0.0884      0.0657      32.43        -0.0088
LinearDiscriminantAnalysis       1.0000       0.0637      0.0190      19.56        -0.0259
QuadraticDiscriminantAnalysis    1.0000       0.0341      0.0181      0.98         +0.0039
Average difference                                                                 -0.0013

Note. Execution time is the average time for fitting the estimator on the train set. Since the accuracies are normally distributed, 3×STD statistically captures 99.73% of the scores, so accuracy − 3×STD gives an indication of the worst possible scenario.


This can be explained by the fact that this classifier selects neighbours within a fixed radius, set by the radius hyperparameter (Scikit-learn, 2019). The default radius of 1.0 is too small to find any neighbours in some cases, which results in this error. Both algorithms were excluded from further analysis.

Figure 5. Comparing baseline accuracy of the algorithms.

Feature Selection

Recursive Feature Elimination was used to extract the best subset of features. This method recursively tests models until it finds the subset of features that results in the highest accuracy. Since many models are tested, it is advisable to use a very fast algorithm to minimize processing time. The algorithm selected for RFE was the ExtraTreesClassifier, as it is one of the fastest algorithms: it performed only slightly worse than the most accurate algorithm, but its execution time is much shorter.

Before Recursive Feature Elimination was used, the featureset was split into four subsets to speed up processing. RFE was run on each individual subset to find the best features of that subset, and the best features of each subset were then combined back into one featureset. This resulted in a featureset with 216 features that were considered most important. Finally, RFE was run again on this combined featureset; it found 146 features that were considered the most important (see Table 3).

Table 3

The results of Recursive Feature Elimination

# Features    Featureset 1   Featureset 2   Featureset 3   Featureset 4
Before RFE    1398           1832           1397           777
After RFE     80             50             59             27

# Features    Final featureset
Before RFE    216
After RFE     146

Note. The featureset was split into four subsets to speed up processing. RFE was then run on the combined (final) featureset to find the most important features.

This study started with 32 different time-series features calculated on the columns. After RFE, only 18 of those time-series features remained in use (see Figure 6). The most important ones, with the number of columns on which each statistic is used, are: mean absolute change (27), standard deviation (23), mean (21), maximum (16) and sum of values (8). It is noteworthy that the time-series features that were selected are all simple statistics; more complicated time-series features such as 'mean second derivative central' or 'first location of maximum' were not selected.

Figure 6. Number of time-series features per type selected by RFE. E.g. the final featureset contained the statistic 'Mean absolute change' of 26 different columns.

Looking at the columns of the data, some columns appear to be more important than others. The most important columns, with the number of time-series features used on each, are: lelbowori (15), rhandpos (10), lhandpos (10) and relbowori (9) (see Figure 7). Noticeably, the most important columns are the orientations of the elbows and the positions of the hands; the positions of the shoulders, which do not change much, and the positions of the feet are rarely selected. In the RFE selection, orientation-based features appear more often than position-based features: 68 versus 39, respectively (see Appendix F for a full list). Furthermore, other columns of the data are used as well, such as rhandstate (7) and rhandconfidence (6). These columns may often be overlooked by researchers, but they seem to provide important information.

Figure 7. The occurrence of columns in the final featureset. E.g. the final featureset contained 15 different time-series statistics about ‘lelbowori’.

After running an ensemble algorithm, the importance of the features can be retrieved. According to the ExtraTreesClassifier, the most important features are those listed in Table 4. As expected, the top eleven features all concern the hands or elbows (see Appendix G for a full list). Surprisingly, the standard deviation of the state of the right hand (open or closed) is the most important feature. Overall, the mean and standard deviation, which are considered the most basic time-series features, seem to give the most information in classifying gestures.

Table 4

The 8 most important features of the featureset

Feature                          Importance   3×STD
rhandState_standard_deviation    0.011884     0.003778
rhandconfidence_mean             0.011073     0.003478
lhandState_standard_deviation    0.010692     0.003753
rhandposY_mean_abs_change        0.010060     0.003034
relboworiW_maximum               0.009948     0.003927
relboworiX_mean                  0.009646     0.003253
lelboworiX_mean                  0.009517     0.003887
lelboworiW_mean                  0.009510     0.003606

Note. According to the feature importance method of the Extra Trees classifier; the importance is the average over a 10-fold cross validation.

First Comparison

After feature selection, the first comparison was made by running each algorithm again (see Table 5 and Figure 8). This resulted in an average accuracy increase of 8.32%. Surprisingly, LinearDiscriminantAnalysis had a remarkable accuracy increase of 33.73% and moved from being one of the worst classifiers to one of the best. The two best algorithms at this stage are still the RandomForestClassifier and the ExtraTreesClassifier, which both gained about 5% in accuracy. The three SVC algorithms also had a significant accuracy increase, so they should not be counted out. Noticeably, the top five algorithms come from four different algorithm categories, so at this stage there is not one best algorithm category for this type of data. Overfitting also seems reduced after feature selection, although the ensemble methods still overfit as much as before.

Table 5

First comparison of the algorithms' performances after feature selection

Classifier                       Train Acc.   Test Acc.   Test Acc.   Execution    Diff. Baseline
                                 Mean         Mean        3×STD       Time (s)     Test Acc.
ExtraTreesClassifier             1.0000       0.4082      0.0485      1.11         +0.0576
RandomForestClassifier           1.0000       0.4046      0.0352      3.46         +0.0429
LinearDiscriminantAnalysis       0.6241       0.4010      0.0331      0.15         +0.3373
RidgeClassifier                  0.5819       0.3738      0.0365      0.05         +0.1121
LinearSVC                        0.7235       0.3628      0.0378      19.35        +0.1327
LogisticRegression               0.6270       0.3590      0.0392      1.34         +0.1172
NuSVC                            0.6961       0.3466      0.0431      10.33        +0.0874
MLPClassifier                    1.0000       0.3352      0.0514      49.49        +0.2068
GradientBoostingClassifier       1.0000       0.3233      0.0513      136.36       +0.0556
SVC                              0.4852       0.3019      0.0539      9.33         +0.1121
SGDClassifier                    0.4979       0.2864      0.0697      1.24         +0.0769
PassiveAggressiveClassifier      0.4466       0.2730      0.0779      0.69         +0.0367
BaggingClassifier                0.9964       0.2721      0.0317      4.45         +0.0195
NearestCentroid                  0.3252       0.2423      0.0359      0.02         +0.0506
GaussianNB                       0.3264       0.2380      0.0463      0.04         +0.1462
Perceptron                       0.3552       0.2294      0.1218      0.45         +0.0387
BernoulliNB                      0.3408       0.2278      0.0207      0.04         -0.0086
KNeighborsClassifier             0.4682       0.2092      0.0188      0.09         +0.0947
DecisionTreeClassifier           1.0000       0.1776      0.0486      0.68         +0.0039
AdaBoostClassifier               0.1337       0.1121      0.0622      3.54         +0.0237
QuadraticDiscriminantAnalysis    1.0000       0.0367      0.0142      0.11         +0.0026
Average difference                                                                 +0.0832

Note. The accuracies are compared to the baseline accuracies before feature selection.


Figure 8. Comparing algorithms after feature selection.

Hyperparameter Tuning

After feature selection, many hyperparameters were tested for each algorithm in order to find its maximum potential (see Appendix E for all tested hyperparameters). The final hyperparameters that led to the highest accuracy can be found in Appendix H. After tuning, every algorithm was run again using 10-fold cross validation with its best hyperparameters to see how much improvement was gained (see Table 6 and Figure 9).

Interestingly, the top three algorithms are all ensemble algorithms. The SVC algorithms still lag behind, as their accuracy is beaten by Logistic Regression and the RidgeClassifier. Furthermore, the linear algorithms perform moderately on average, while the neighbour algorithms and Naïve Bayes algorithms perform poorly. Surprisingly, among the discriminant algorithms, LinearDiscriminantAnalysis works very well but QuadraticDiscriminantAnalysis does not.


Table 6

Second comparison, comparing accuracies after hyperparameter tuning

Classifier                       Train Acc.   Test Acc.   Test Acc.   Execution    Diff. 1st Comp.
                                 Mean         Mean        3×STD       Time (s)     Test Acc.
RandomForestClassifier           1.0000       0.4715      0.0583      42.01        +0.0669
BaggingClassifier                1.0000       0.4497      0.0414      542.08       +0.1776
ExtraTreesClassifier             1.0000       0.4347      0.0471      1.16         +0.0265
LinearDiscriminantAnalysis       0.6244       0.4280      0.0392      0.19         +0.0270
RidgeClassifier                  0.5970       0.4044      0.0460      7.44         +0.0306
LogisticRegression               0.6888       0.3967      0.0254      17.60        +0.0377
MLPClassifier                    0.8381       0.3895      0.0384      30.39        +0.0543
SVC                              0.8387       0.3895      0.0455      15.11        +0.0662
NuSVC                            0.9198       0.3887      0.0473      16.29        +0.0421
LinearSVC                        0.7338       0.3872      0.0269      25.65        +0.0244
GradientBoostingClassifier       0.9936       0.3511      0.0409      155.29       +0.0278
SGDClassifier                    0.5511       0.3242      0.0496      3.05         +0.0378
NearestCentroid                  0.3593       0.2907      0.0429      0.03         +0.0483
AdaBoostClassifier               0.3736       0.2882      0.0301      76.51        +0.1761
PassiveAggressiveClassifier      0.4186       0.2700      0.0501      0.68         -0.0030
GaussianNB                       0.3461       0.2678      0.0325      0.05         +0.0299
BernoulliNB                      0.3625       0.2629      0.0467      0.04         +0.0537
KNeighborsClassifier             0.3663       0.2607      0.0445      0.13         +0.0831
Perceptron                       0.3822       0.2528      0.1079      0.44         +0.0234
DecisionTreeClassifier           0.4933       0.2210      0.0424      0.56         +0.0434
QuadraticDiscriminantAnalysis    0.1293       0.2149      0.0424      0.12         +0.1781
Average difference                                                                 +0.0596

Note. Accuracies are compared to the first comparison.


Figure 9. Comparing the performance of the algorithms after hyperparameter tuning.

Voting

A voting classifier was used to classify the data. Voting can result in a 'super' algorithm, although it does not guarantee a higher accuracy than the current best algorithm. Not all classifiers can be used in a voting algorithm: the classifier needs a function that can compute class probabilities for the samples in the featureset. The following classifiers used in this study could therefore not be included: PassiveAggressiveClassifier, RidgeClassifier, Perceptron, NearestCentroid and LinearSVC. For the algorithms that could be used, both soft and hard voting were done, in two ways. Firstly, algorithms were combined starting with the top two and then recursively adding the next one. Secondly, only the best algorithm of each category was used, such as Random Forest from the ensembles and Logistic Regression from the linear algorithms; again, first the top two algorithms were combined and then another one was added until all were used. Voting with the top five algorithms led to an accuracy of 47.61% using hard voting. Using only different algorithm categories, an accuracy of 45.02% was found with the best two algorithms and soft voting. The full results of both methods can be found in Appendix I.


Extra Algorithms

Relevance Vector Machines (RVC) were not used until now because their processing time would be too long. They have seldom been used in previous research, but they have shown promising results, performing as well as SVMs in some studies (Nguyen & Hai-Son, 2015). One reason they might not be used very often is that there is only one implementation of RVC in Python (Ritchie & Jonathan, 2019), and this package does not have any documentation. Another reason is that the fitting time of RVCs is usually longer than that of SVMs. Nevertheless, an RVC with default hyperparameters was tried. Even though it was not tuned, this algorithm managed to achieve an accuracy of 31.72%, which was lower than all SVC algorithms.

Since ensemble algorithms performed very well on this featureset, some state-of-the-art ensemble algorithms were also used. Firstly, the XGBoost algorithm was tried (Chen & Guestrin, 2016), which uses parallel tree boosting (XGBoost, 2016). Secondly, the LightGBM algorithm was used (Ke et al., 2017), a gradient-boosted decision tree algorithm created by Microsoft. Lastly, the most recent CatBoost algorithm was used (Dorogush, Ershov, & Gulin, 2018). One characteristic of these algorithms is that they can run on a GPU, which makes them very fast. Only one custom hyperparameter (n_estimators: 1000) was set to make them more comparable to the tuned algorithms; setting this hyperparameter had a large effect on the Random Forest during tuning, so a similar effect was expected here. The results show that all three algorithms perform very well, with CatBoost achieving a classification accuracy of 51.16%.
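For reference, the following sketch shows how the three libraries could be run with only n_estimators set to 1000, as described above. The data is a synthetic stand-in, and all other settings are left at the package defaults.

```python
# Sketch of the three boosting libraries with n_estimators=1000 only.
from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the selected featureset.
X, y = make_classification(n_samples=500, n_features=146, n_classes=5,
                           n_informative=20, random_state=0)

models = {"XGBoost": XGBClassifier(n_estimators=1000),
          "LightGBM": LGBMClassifier(n_estimators=1000),
          "CatBoost": CatBoostClassifier(n_estimators=1000, verbose=False)}

# All three wrappers follow the scikit-learn interface, so they can be
# evaluated with the same 10-fold cross validation as the other algorithms.
for name, model in models.items():
    print(name, cross_val_score(model, X, y, cv=10).mean())
```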

Finally, for practical use, a simple model was created that can serve as a baseline for future research. The most important features were selected to create a model with an acceptable accuracy. Random Forest was chosen as the classifier because it proved to be the best simple algorithm in this research. In total, the top 15 features were selected, resulting in a simple and fast model. Two hyperparameters were found to benefit the accuracy: max_depth=20 and n_estimators=1000. The resulting model achieves a gesture classification accuracy of 35%. The final performance of all algorithms used in the study can be seen in Table 7 and Figure 10.
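A minimal sketch of such a simple model is shown below, assuming scikit-learn and using synthetic data in place of the 15 highest-ranked features from Appendix G.

```python
# Sketch of the simple baseline model: top-15 features, tuned Random Forest.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the 15 most important time-series features.
X, y = make_classification(n_samples=500, n_features=15, n_classes=5,
                           n_informative=10, random_state=0)

# The two hyperparameters found to benefit accuracy in this study.
simple_model = RandomForestClassifier(max_depth=20, n_estimators=1000)
print(cross_val_score(simple_model, X, y, cv=10).mean())
```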


Table 7

Final result of all algorithms

Classifier | Train Accuracy Mean | Test Accuracy Mean | Test Accuracy 3*STD | Diff. Previous Research

CatBoost 0.9861 0.5116 0.0711 +0.2780

Voting A – Hard 1.0000 0.4761 0.0491 +0.2425

RandomForestClassifier 1.0000 0.4715 0.0583 +0.2379

LightGBM 1.0000 0.4617 0.0384 +0.2281

XGBoost 1.0000 0.4543 0.0463 +0.2207

VOTING B – Soft 0.9235 0.4502 0.0452 +0.2166

BaggingClassifier 1.0000 0.4497 0.0414 +0.2161

ExtraTreesClassifier 1.0000 0.4347 0.0471 +0.2011

LinearDiscriminantAnalysis 0.6244 0.4280 0.0392 +0.1944

RidgeClassifier 0.5970 0.4044 0.0460 +0.1708

LogisticRegression 0.6888 0.3967 0.0254 +0.1631

MLPClassifier 0.8381 0.3895 0.0384 +0.1559

SVC 0.8387 0.3895 0.0455 +0.1559

NuSVC 0.9198 0.3887 0.0473 +0.1551

LinearSVC 0.7338 0.3872 0.0269 +0.1536

GradientBoostingClassifier 0.9936 0.3511 0.0409 +0.1175

Simple Model 1.0000 0.3498 0.0373 +0.1162

SGDClassifier 0.5511 0.3242 0.0496 +0.0906

Relevance Vector Machine 0.6506 0.3172 0.0315 +0.0836

NearestCentroid 0.3593 0.2907 0.0429 +0.0571

AdaBoostClassifier 0.3736 0.2882 0.0301 +0.0546

PassiveAggressiveClassifier 0.4186 0.2700 0.0501 +0.0364

GaussianNB 0.3461 0.2678 0.0325 +0.0342

BernoulliNB 0.3625 0.2629 0.0467 +0.0293

KneighborsClassifier 0.3663 0.2607 0.0445 +0.0271

Perceptron 0.3822 0.2528 0.1079 +0.0192

BASELINE STUDY 0.2336 -

DecisionTreeClassifier 0.4933 0.2210 0.0424 -0.0126

QuadraticDiscriminantAnalysis 0.1293 0.2149 0.0424 -0.0187

Note. Accuracies are compared to previous research (de Wit et al., 2019a).

Voting A: RandomForestClassifier, BaggingClassifier, ExtraTreesClassifier,

LinearDiscriminantAnalysis and LogisticRegression.

Voting B: RandomForestClassifier and LinearDiscriminantAnalysis

Figure 10. Final accuracy result of all algorithms.

Gestures

Some gestures are easier to classify than others. To see how well each gesture was classified, the tuned Random Forest algorithm was run and a classification report was retrieved (see Table 8). For this run, the classification accuracy was 47.50%. The gestures were ordered by F1-score to see their classification performance. The top five gestures all had good precision, recall and F1-scores. However, the algorithm had problems classifying certain gestures: ‘pig’ and ‘boat’ had F1-scores of only 0.06 and 0.15, respectively. Although the precision scores of the bottom five gestures (except pig) were fine, their recall scores were low. This means that, when these gestures were classified, they were classified correctly in many cases; however, the algorithm had problems finding all relevant instances of these gestures in the dataset.

To see how much influence the two hardest-to-classify gestures had on the accuracy, the same algorithm was run again without these gestures in the featureset. This resulted in an accuracy of 53.40%, an improvement of 5.90 percentage points. There are a few possible reasons why the algorithm had problems classifying these gestures. Firstly, participants may have had trouble expressing them. Secondly, the poorly classified gestures may resemble other gestures too closely. Furthermore, participants may have taken different approaches to expressing a gesture; when expressing a bridge, for example, there are many types of bridges a participant can choose from. Moreover, when expressing a lamp, participants might have moved their hands only slightly, as a lamp is small, and such small movements may be hard for the sensor to pick up.
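For illustration, a per-gesture report like Table 8 could be produced as follows, assuming scikit-learn's classification_report and using synthetic stand-in data in place of the gesture featureset.

```python
# Sketch of producing per-class precision/recall/F1 scores on a 75/25 split.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the selected featureset and gesture labels.
X, y = make_classification(n_samples=500, n_features=146, n_classes=5,
                           n_informative=20, random_state=0)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
clf = RandomForestClassifier(max_depth=20, n_estimators=1000).fit(X_tr, y_tr)
print(classification_report(y_te, clf.predict(X_te)))  # one row per class
```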

Table 8

Easiest 5 and hardest 5 gestures for the RF algorithm to classify

Top 5 Precision Recall F1-score Support

Airplane 0.81 0.76 0.78 33

Crocodile 0.75 0.75 0.75 28

Violin 0.75 0.67 0.71 27

Train 0.83 0.59 0.69 32

Bird 0.59 0.81 0.68 16

Bottom 5 Precision Recall F1-score Support

Pig 0.11 0.04 0.06 26

Boat 0.38 0.10 0.15 31

Bridge 0.38 0.15 0.22 33

Stairs 0.50 0.16 0.24 32

Lamp 0.36 0.20 0.26 25

Note. A full classification report can be found in Appendix J. For this run, the data was split into a train (0.75) and test (0.25) set and no cross validation was done. Accuracy was 0.4750 on this run. Accuracy without ‘Boat’ and ‘Pig’ was 0.5340.


Discussion

The initial goal of this study was to see if time-series analysis could be used in supervised gesture recognition. This was done by focusing on time-series features and different machine learning algorithms. In this research, many steps were taken to use Kinect data for gesture classification. By using a unique approach to feature extraction, this research hopes to contribute to ongoing Human-Robot Interaction research. To answer the research question, the two sub questions will be answered first.

Features

A sub question of this research was: “How well do time-series features work for gesture classification?”. This study extracted time-series features from the data to classify gestures, which is new to this field. The reason for this choice was that time-series analysis showed promising results in other research fields (Wang, Smith, & Hyndman, 2006). Furthermore, previous research suggested focusing on time-series features in combination with conventional supervised machine learning algorithms (Bagnall, Davis, Hills, & Lines, 2012; Fulcher & Jones, 2014). Looking at the results, time-series features have been shown to work well for supervised gesture classification.

There are several benefits to using time-series features. Firstly, they are an easy way to reduce the size of the dataset, which also reduces processing time for the algorithms. Secondly, time-series features are simple to calculate, reducing pre-processing time. Moreover, they are easy to comprehend, as a mean or standard deviation is a basic statistic. Furthermore, they give information about which columns of the data provide the most information.

The simplest time-series features, such as ‘mean’, ‘standard deviation’, ‘maximum’, ‘minimum’ and ‘mean absolute change’, work very well for this kind of data. These kinds of statistics are also among the easiest to comprehend; the more advanced time-series features did not perform as well as the simpler ones.

In this study, Recursive Feature Elimination (RFE) was used as the feature selection method. Although processing time was long, it successfully selected the 146 most important features out of a featureset of 4,960. By ranking the features, the most important ones have been identified. What has been shown is that some limbs provide more information than others. The positions of the hands seem to be among the most informative signals the Kinect provides, which is in agreement with previous research (Escalera et al., 2013). Similarly, the orientations of the shoulders give a lot of information. In fact, the top 25 most important features all concern hand positions or shoulder orientations.
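A minimal sketch of this selection step is shown below, assuming scikit-learn's RFE with a Random Forest as the ranking estimator; the step size and the (much smaller) synthetic featureset are illustrative.

```python
# Sketch of recursive feature elimination down to 146 features.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Synthetic stand-in: 200 features instead of the full 4,960 for speed.
X, y = make_classification(n_samples=500, n_features=200, n_classes=5,
                           n_informative=20, random_state=0)

# Repeatedly drop the weakest features until 146 remain.
selector = RFE(RandomForestClassifier(n_estimators=100),
               n_features_to_select=146, step=10).fit(X, y)
X_selected = selector.transform(X)
print(X_selected.shape)
```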

One question arising from this research is how to detect outliers. In this study, outliers were detected based on the time it took participants to perform a gesture: recordings with a z-score beyond ±3 on the gesture’s duration were removed. Only a small part of the data was removed (2.21%), and this had a negligible effect on the algorithms’ performance. A better outlier detection method should lead to an increase in accuracy, as it will remove cases that deviate too much from the data (Osborne & Overbay, 2014). However, it is hard to define what an outlier actually is in a gesture dataset, since the people who performed the gestures in the previous study were free to choose how to produce them (de Wit et al., 2019a). Moreover, no outlier detection method is reported in similar research.
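A sketch of such a duration-based filter is shown below. The DataFrame layout and column names are hypothetical, and computing the z-score per gesture is one possible reading of the procedure described above.

```python
# Hypothetical sketch of the duration-based outlier filter.
import numpy as np
import pandas as pd
from scipy import stats

def remove_duration_outliers(recordings: pd.DataFrame) -> pd.DataFrame:
    """Drop recordings whose duration z-score within their own gesture
    class exceeds +/-3, mirroring the filter described above."""
    z = recordings.groupby("gesture")["duration"].transform(
        lambda d: stats.zscore(d))
    # Constant-duration groups yield NaN z-scores; keep those recordings.
    return recordings[z.fillna(0).abs() <= 3]
```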

Gestures are always assigned a class, even if the participant is performing a bogus gesture. One method that could potentially solve this applies during the recording of new gestures: when the classifier tries to classify a gesture, it could discard the recording if the probability of the predicted gesture is too low. This should happen when the gesture performed by a participant deviates too much from the similar gestures already recorded.
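The following is a hypothetical sketch of this rejection rule for any fitted classifier exposing predict_proba; the 0.30 threshold is purely illustrative and was not part of this study.

```python
# Hypothetical sketch of rejecting low-confidence recordings.
import numpy as np

def classify_or_reject(clf, features, threshold=0.30):
    """Return the predicted gesture, or None when the classifier's
    confidence is below the (illustrative) threshold."""
    proba = clf.predict_proba(np.asarray(features).reshape(1, -1))[0]
    best = int(np.argmax(proba))
    if proba[best] < threshold:
        return None  # deviates too much from known gestures: discard
    return clf.classes_[best]
```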

One problem in the current research is the detection of movement. Participants started moving some time after the Kinect started recording, resulting in idleness in the data, which was removed with a basic method. Although this method seemed to work fairly well, better methods of movement detection can be created. To prevent idleness in the data, some measures could be taken when recording gestures: for example, only begin recording once the hands have moved at least a certain distance. This should lead to less idleness in the data.
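The following hypothetical sketch illustrates such a trimming rule applied to a finished recording: leading frames are skipped until the right hand has moved a minimum distance from its starting position (column names and threshold are illustrative).

```python
# Hypothetical sketch of trimming leading idleness from a recording.
import numpy as np
import pandas as pd

def trim_leading_idleness(frames: pd.DataFrame,
                          hand_cols=("rhandposX", "rhandposY", "rhandposZ"),
                          min_move=0.05):
    """Drop frames recorded before the right hand first moves more than
    min_move from its starting position."""
    hand = frames.loc[:, list(hand_cols)].to_numpy()
    dist = np.linalg.norm(hand - hand[0], axis=1)
    moving = dist > min_move
    start = int(np.argmax(moving)) if moving.any() else 0
    return frames.iloc[start:]
```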

Algorithms

The second sub question of this research was: “What machine learning algorithms work best on gesture time-series features?”. To answer this question, many supervised machine learning algorithms were tested. Running each algorithm again after each processing step showed the effect of every step on every algorithm. The top five algorithms (voting excluded) were all ensemble algorithms, which seem to work best on this kind of data. A simple Random Forest works exceptionally well; the benefit of Random Forests is that they are fast and simple (Dietterich, 2000).

SVM algorithms did not perform as well as expected. All SVMs were outperformed by a simple Logistic Regression. This contrasts with previous research, which found that SVMs perform better than simpler algorithms (Hsu & Lin, 2002), and with previous Kinect-based gesture classification research, in which SVMs worked well (Orasa, Nukoolkit, & Watanapa, 2012). A disadvantage of SVMs is that they require more processing time. Since SVMs were outperformed by faster algorithms, using them on these kinds of features is not advisable.

Other types of algorithms lagged behind even further. This research shows that the best algorithm depends on the dataset, which is in agreement with previous research (Wolpert, 1996). The best practice of trying many algorithms to see what works has led to surprising results: it was not expected that simpler algorithms would outperform more advanced ones.

Combining algorithms into a super algorithm was not very successful. Combining the top five algorithms into one voting algorithm led only to a negligible accuracy increase of 0.46% over the Random Forest algorithm when hard voting was used, and combining only different types of algorithms did not lead to a higher accuracy than the Random Forest at all. Voting does not seem worth it, as it costs more processing time than using a single algorithm.

Testing the most modern ensemble algorithms has led to promising results. Even though XGBoost and LightGBM were not tuned, they performed similarly to a fully tuned Random Forest. As the ensemble algorithms gained about 2% to 6% in accuracy after hyperparameter tuning, these two algorithms are expected to beat a Random Forest once tuned. CatBoost showed excellent results: this latest ensemble algorithm outperformed the Random Forest by 4% even without tuning.

This study has shown that feature selection is a crucial step when using time-series features. It resulted in an average accuracy increase of 8.32% across the algorithms. Some algorithms were more sensitive to feature selection than others: while ensemble algorithms saw only a minor increase, SVMs saw a major increase in accuracy. More surprisingly, Linear Discriminant Analysis went from being one of the worst performing algorithms to one of the best. The greatest benefit of this step, however, is the decrease in fitting time.

Hyperparameter tuning was also key to achieving a better accuracy. This research saw an average accuracy increase of 5.96% after this step. Most algorithms gained only a few percent, but Bagging, AdaBoost and Quadratic Discriminant Analysis all saw increases of approximately 17%.

Limitations and Future Research

More time-series features should be tested. Currently, 32 of the 65 features that TSFRESH can calculate were used, and it is unknown how well the more comprehensive time-series features will perform. This research has shown that the simplest time-series features were the most important, but no comprehensive time-series features were used: the 33 features that were not calculated are more comprehensive, and computing them will negatively impact processing time.

The time-series features used in this study might be considered simple, but this also has benefits: they are very fast to calculate and easy to comprehend. Smarter features might lead to a higher accuracy. Features such as the joint angles with respect to the person’s torso, as used in previous research, might be more successful (Gu, Do, Ou, & Weihua, 2012). Similarly, calculating the velocity, acceleration and angle between different joints, as done in previous research, might improve classification accuracy (Saha, Shreya, Konar, & Nagar, 2013).

Supervised classification is a great method for gesture classification, but it has its limitations. This approach requires a lot of data, so it can only be applied after many gestures have been recorded and labelled; it will not be effective on little data. One-shot learning methods such as those used in previous research (Cabrera & Wachs, 2017) are therefore still required for gesture classification in new robots. Only after collecting a lot of data can this method be used to further improve gesture classification.

Even though the voting algorithms did not lead to a significant improvement in accuracy, more advanced methods of combining algorithms might be more successful; voting is not guaranteed to increase performance (Kuncheva & Rodríguez, 2014). Stacking is another way to combine algorithms (Du & Swamy, 2014), which has not been tried before on this kind of data. Research shows that stacking often leads to better results than using individual algorithms (Džeroski & Ženko, 2004).
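As an illustration, a stacked ensemble could be built with scikit-learn's StackingClassifier as sketched below. The base learners and meta-learner are illustrative choices, not models evaluated in this study.

```python
# Sketch of stacking as an alternative to voting, assuming scikit-learn.
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the selected featureset.
X, y = make_classification(n_samples=500, n_features=146, n_classes=5,
                           n_informative=20, random_state=0)

# The meta-learner (final_estimator) learns to combine the cross-validated
# predictions of the base learners, instead of a simple vote.
stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(n_estimators=100)),
                ("lda", LinearDiscriminantAnalysis())],
    final_estimator=LogisticRegression(max_iter=1000))

print(cross_val_score(stack, X, y, cv=10).mean())
```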

This research used the most conventional supervised learning algorithms to classify the data. Future research could focus on neural networks. Since the Kinect records a lot of data, Artificial Neural Networks should be a good choice. However, combining time-series features with ANNs might not be as successful, as time-series feature extraction turns a large dataset into a small featureset, while ANNs work better with more data (Alwosheel, van Cranenburgh, & Chorus, 2018).

This study has shown that the hand positions contain the most information. Future research could use a Kinect sensor in combination with a Leap Motion device. The Leap Motion device specialises in hands and is more accurate in recording these movements (Weichert, Bachmann, Rudak, & Fisseler, 2013). More accurate recordings might make it easier for algorithms to distinguish between gestures.

Conclusion

This research has shown a new method for classifying gestures based on a large Kinect dataset. Time-series statistics work well as features; their processing speed and simplicity make them a good way to extract features from the data. A strong point of this research is that many algorithms were tested after multiple processing steps. Furthermore, all algorithms were tested using 10-fold cross validation to obtain unbiased accuracy estimates.

Ensemble algorithms outperform all other algorithm types on this kind of data. A basic Random Forest works well after hyperparameter tuning. More complicated algorithms such as SVMs perform worse compared to other algorithms; since their processing time is longer, they should not be used.

For generalization purposes, a simple Random Forest model using 15 time-series features was created. This model still achieves an acceptable classification accuracy and can be used to establish a baseline for similar datasets. It is also suitable for practical applications, as it is very fast.

For maximum classification accuracy, the latest ensemble algorithm, CatBoost, should be used. It may be harder to implement, but its performance is very promising. Unlike the conventional algorithms, the latest ensemble algorithms can also run on a GPU, which speeds up processing considerably.

The creators of the original dataset achieved a classification accuracy of 23% (de Wit et al., 2019a). This research has shown that this accuracy can be improved to over 50% with supervised learning using simple time-series features. Using the most important features, a simple Random Forest is one of the best algorithms for this type of data, and the simplicity of both the features and the algorithm makes this a fast classification method.

In conclusion, time-series features can be successfully used in supervised gesture recognition. Their simplicity and computation speed make them an interesting choice. This research contributes to the ongoing research on human-robot interaction. Recognising gestures is an important step for robots in non-verbal communication.

Ultimately this will help robots communicate naturally with humans.


References

Adăscălitei, F., Doroftei, I., Lefeber, D., & Vanderborght, B. (2014). Controlling a social robot-performing nonverbal communication through facial expressions. Advanced Materials Research, 837, 525-530. doi:10.4028/www.scientific.net/AMR.837.525
Aggarwal, J. K., & Cai, Q. (1999). Human motion analysis: A review. Computer Vision and Image Understanding, 73(3), 428-440. doi:10.1006/cviu.1998.0744
Altman, E. I. (1968). Financial ratios, discriminant analysis and the prediction of corporate bankruptcy. The Journal of Finance, 23(4), 589-609. doi:10.1111/j.1540-6261.1968.tb00843.x
Alwosheel, A., van Cranenburgh, S., & Chorus, C. G. (2018). Is your dataset big enough? Sample size requirements when using artificial neural networks for discrete choice analysis. Journal of Choice Modelling, 28, 167-182. doi:10.1016/j.jocm.2018.07.002
Bagnall, A., Davis, L., Hills, J., & Lines, J. (2012). Transformation based ensembles for time series classification. Proceedings of the 2012 SIAM International Conference on Data Mining (pp. 308-318). Anaheim, CA: Society for Industrial and Applied Mathematics. doi:10.1137/1.9781611972825.27
Bagnall, A., Lines, J., Bostrom, A., Large, J., & Keogh, E. (2017). The great time series classification bake off: a review and experimental evaluation of recent algorithmic advances. Data Mining and Knowledge Discovery, 31(1), 606-660. doi:10.1007/s10618-016-0483-9
Beattie, G. (2004). Visible thought: The new psychology of body language. London: Routledge. doi:10.4324/9780203500026
Berndt, D. J., & Clifford, J. (1994). Using dynamic time warping to find patterns in time series. KDD Workshop, 10(16), 359-370.
Bhattacharya, S., Czejdo, B., & Perez, N. (2012). Gesture classification with machine learning using Kinect sensor data. 2012 Third International Conference on Emerging Applications of Information Technology (pp. 348-351). Kolkata, India: IEEE. doi:10.1109/EAIT.2012.6407958
Biswas, K., & Basu, S. K. (2011). Gesture recognition using Microsoft Kinect®. The 5th International Conference on Automation, Robotics and Applications (pp. 100-103). Wellington: IEEE. doi:10.1109/ICARA.2011.6144864
Cabrera, M. E., & Wachs, J. P. (2016). Embodied gesture learning from one-shot. 2016 25th IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN) (pp. 1092-1097). New York City, USA: IEEE. doi:10.1109/ROMAN.2016.7745244
Cabrera, M. E., & Wachs, J. P. (2017). A human-centered approach to one-shot gesture learning. Frontiers in Robotics and AI, 4, 8. doi:10.3389/frobt.2017.00008
Celebi, S., Aydin, A. S., Temiz, T. T., & Arici, T. (2013). Gesture recognition using skeleton data with weighted dynamic time warping. International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications (pp. 620-625). Barcelona, Spain: Springer.
Chen, S. M., & Hwang, J. R. (2000). Temperature prediction using fuzzy time series. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 30(2), 263-275. doi:10.1109/3477.836375
Chen, T., & Guestrin, C. (2016). XGBoost: A scalable tree boosting system. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 785-794). San Francisco, United States: ACM. doi:10.1145/2939672.2939785
Chen, X., & Koskela, M. (2015). Skeleton-based action recognition with extreme learning machines. Neurocomputing, 149, 387-396. doi:10.1016/j.neucom.2013.10.046
Cho, K., & Xi, C. (2014). Classifying and visualizing motion capture sequences using deep neural networks. International Conference on Computer Vision Theory and Applications (VISAPP) (pp. 122-130). Lisbon, Portugal: IEEE.
Christ, M., Braun, N., Neuffer, J., & Kempa-Liehr, A. W. (2018). Time Series FeatuRe Extraction on basis of Scalable Hypothesis tests (tsfresh -- A Python package). Neurocomputing, 307, 72-77. doi:10.1016/j.neucom.2018.03.067
D’Orazio, T., Marani, R., Renó, V., & Cicirelli, G. (2016). Recent trends in gesture recognition: how depth data has improved classical approaches. Image and Vision Computing, 56-72. doi:10.1016/j.imavis.2016.05.007
de Wit, J., de Haas, M., Krahmer, E., Vogt, P., Merckens, M., Oostdijk, R., . . . Wolfert, P. (2019a). Playing charades with a robot: Collecting a large dataset of human gestures through HRI. Companion Proceedings of the 2019 ACM/IEEE International Conference on Human-Robot Interaction (HRI 2019) (pp. 634-635). Daegu, South Korea: IEEE. doi:10.1109/HRI.2019.8673220
de Wit, J., Krahmer, E., & Vogt, P. (2019b). Social robots as language tutors: Challenges and opportunities. Workshop on the challenges of working on social robots that collaborate with people, ACM CHI Conference on Human Factors in Computing Systems. Glasgow: ACM.
Dietterich, T. G. (2000). Ensemble methods in machine learning. International Workshop on Multiple Classifier Systems (pp. 1-15). Berlin, Heidelberg: Springer. doi:10.1007/3-540-45014-9_1
Ding, H., Trajcevski, G., Scheuermann, P., Wang, X., & Keogh, E. (2008). Querying and mining of time series data: experimental comparison of representations and distance measures. Proceedings of the VLDB Endowment, 1(2), 1542-1552. doi:10.14778/1454159.1454226
Donoho, D. L. (2000). High-dimensional data analysis: The curses and blessings of dimensionality. AMS Math Challenges Lecture, 1(32), 375. doi:10.1.1.329.3392
Dorogush, A. V., Ershov, V., & Gulin, A. (2018). CatBoost: gradient boosting with categorical features support. arXiv preprint arXiv:1810.11363.
Du, K. L., & Swamy, M. N. (2014). Combining multiple learners: Data fusion and ensemble learning. Neural Networks and Statistical Learning (pp. 621-643). London, United Kingdom: Springer. doi:10.1007/978-1-4471-5571-3_20
Duan, K., Keerthi, S. S., & Poo, A. N. (2003). Evaluation of simple performance measures for tuning SVM hyperparameters. Neurocomputing, 51, 41-59. doi:10.1016/S0925-2312(02)00601-X
Džeroski, S., & Ženko, B. (2004). Is combining classifiers with stacking better than selecting the best one? Machine Learning, 54(3), 255-273. doi:10.1023/B:MACH.0000015881.36452.6e
Escalante, H. J., Guyon, I., Athitsos, V., Jangyodsuk, P., & Wan, J. (2017). Principal motion components for one-shot gesture recognition. Pattern Analysis and Applications, 20(1), 167-182. doi:10.1007/s10044-015-0481-3
Escalera, S., Gonzàlez, J., Baró, X., Reyes, M., Lopes, O., Guyon, I., . . . Escalante, H. (2013). Multi-modal gesture recognition challenge 2013: Dataset and results. Proceedings of the 15th ACM International Conference on Multimodal Interaction (pp. 445-452). Sydney, Australia: ACM. doi:10.1145/2522848.2532595
Fulcher, B. D. (2018). Feature-based time-series analysis. Boca Raton, FL: CRC Press.
Fulcher, B. D., & Jones, N. S. (2014). Highly comparative feature-based time-series classification. IEEE Transactions on Knowledge and Data Engineering, 26(12), 3026-3037. doi:10.1109/TKDE.2014.2316504
Graham, J., & Argyle, M. (1975). A cross-cultural study of the communication of extra-verbal meaning by gestures. International Journal of Psychology, 10(1), 56-67. doi:10.1080/00207597508247319
Graves, A. (2012). Supervised sequence labelling. Supervised Sequence Labelling with Recurrent Neural Networks (pp. 5-13). Berlin, Heidelberg, Germany: Springer. doi:10.1007/978-3-642-24797-2_2
Gu, Y., Do, H., Ou, Y., & Weihua, S. (2012). Human gesture recognition through a Kinect sensor. 2012 IEEE International Conference on Robotics and Biomimetics (ROBIO) (pp. 1379-1384). Guangzhou, China: IEEE. doi:10.1109/ROBIO.2012.6491161
Guyon, I., Weston, J., Barnhill, S., & Vapnik, V. (2002). Gene selection for cancer classification using support vector machines. Machine Learning, 46(1-3), 390-422. doi:10.1023/A:1012487302797
Hanna, G. B., Drew, T., Clinch, P., Shimi, S., Dunkley, P., Hau, C., & Cuschieri, A. (1997). Psychomotor skills for endoscopic manipulations: differing abilities between right and left-handed individuals. Annals of Surgery, 225(3), 33.
Hsu, C.-W., & Lin, C.-J. (2002). A comparison of methods for multiclass support vector machines. IEEE Transactions on Neural Networks, 13(2), 415-425. doi:10.1109/72.991427
Itauma, I. I., & Kivrak, H. K. (2012). Gesture imitation using machine learning techniques. 2012 20th Signal Processing and Communications Applications Conference (SIU) (pp. 1-4). Mugla, Turkey: IEEE. doi:10.1109/SIU.2012.6204822
Ji, X., & Liu, H. (2010). Advances in view-invariant human motion analysis: a review. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 40(1), 13-24. doi:10.1109/TSMCC.2009.2027608
John, G. H., & Langley, P. (1995). Estimating continuous distributions in Bayesian classifiers. Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence (pp. 338-345). Montréal, Québec, Canada: Morgan Kaufmann Publishers Inc.
Jolliffe, I. (2014). Principal component analysis. Berlin Heidelberg: Springer. doi:10.1007/978-3-642-04898-2_455
Jones, E., Oliphant, T., & Peterson, P. (2001-). SciPy: Open source scientific tools for Python. Retrieved April 10, 2019, from https://www.scipy.org/
Joshi, A., Ghosh, S., Betke, M., Sclaroff, S., & Pfister, H. (2017). Personalizing gesture recognition using hierarchical Bayesian neural networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 6513-6522). Hawaii, United States: IEEE. doi:10.1109/CVPR.2017.56
Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., & Liu, T. Y. (2017). LightGBM: A highly efficient gradient boosting decision tree. Advances in Neural Information Processing Systems (pp. 3146-3154). Long Beach, California, United States: NIPS.
Kendon, A. (1994). Do gestures communicate? A review. Research on Language and Social Interaction, 27(3), 175-200. doi:10.1207/s15327973rlsi2703_2
Konečný, J., & Hagara, M. (2014). One-shot-learning gesture recognition using HOG-HOF features. The Journal of Machine Learning Research, 15(1), 2513-2532. doi:10.1007/978-3-319-57021-1_12
Kuncheva, L. I., & Rodríguez, J. J. (2014). A weighted voting framework for classifiers ensembles. Knowledge and Information Systems, 38(2), 259-275. doi:10.1007/s10115-012-0586-6
Lai, K., Konrad, J., & Ishwar, P. (2012). A gesture-driven computer interface using Kinect. IEEE Southwest Symposium on Image Analysis and Interpretation (pp. 185-188). Santa Fe: IEEE. doi:10.1109/SSIAI.2012.6202484
Le, T.-L., & Nguyen, M.-Q. (2013). Human posture recognition using human skeleton provided by Kinect. 2013 International Conference on Computing, Management and Telecommunications (ComManTel) (pp. 340-345). Ho Chi Minh City, Vietnam: IEEE. doi:10.1109/ComManTel.2013.6482417
Leap Motion Inc. (2012). The Leap. Retrieved from https://www.leapmotion.com/
Lee, J., Lee, Y., Lee, E., & Hong, S. (2004). Hand region extraction and gesture recognition from video stream with complex background through entropy analysis. The 26th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (pp. 1513-1516). San Francisco, United States: IEEE. doi:10.1109/IEMBS.2004.1403464
Lui, Y. M. (2012). Human gesture recognition on product manifolds. Journal of Machine Learning Research, 13(Nov), 3297-3321.
Mahbub, U., Imtiaz, H., Roy, T., Rahman, M. S., & Ahad, M. A. (2013). A template matching approach to one-shot-learning gesture recognition. Pattern Recognition Letters, 34(15), 1780-1788. doi:10.1016/j.patrec.2012.09.014
Malgireddy, M. R., & Nwogu, I. G. (2013). Language-motivated approaches to action recognition. The Journal of Machine Learning Research, 14(1), 2189-2212.
Marin, G., Dominio, F., & Zanuttigh, P. (2014). Hand gesture recognition with leap motion and kinect devices. 2014 IEEE International Conference on Image Processing (ICIP) (pp. 1565-1569). Paris, France: IEEE. doi:10.1109/ICIP.2014.7025313
Metzen, J. H. (2018). Gaussian processes classification. Retrieved from https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/gaussian_process/gpc.py#L400
Microsoft. (2014, October 21). CameraSpacePoint Structure. Retrieved from Kinect for Windows SDK 2.0: https://docs.microsoft.com/en-us/previous-versions/windows/kinect/dn772836%28v%3dieb.10%29
Mitra, S., & Acharya, T. (2007). Gesture recognition: A survey. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 37(3), 311-324. doi:10.1109/TSMCC.2007.893280
Montemayor, J., Alborizi, H., Druin, A., Hendler, J., Pollack, D., Porteous, J., . . . Lal, A. (2000). From PETS to Storykit: Creating new technology with an intergenerational design team. Pittsburgh, PA, USA.
Montemerlo, M., Pineau, J., Roy, N., Thrun, S., & Verma, V. (2002). Experiences with a mobile robotic guide for the elderly. AAAI/IAAI, 587-592.
Morency, L.-P., Quattoni, A., & Darrell, T. (2007). Latent-dynamic discriminative models for continuous gesture recognition. 2007 IEEE Conference on Computer Vision and Pattern Recognition (pp. 1-8). Minneapolis, United States: IEEE. doi:10.1109/CVPR.2007.383299
Morris, J. (Producer), & Stanton, A. (Director). (2008). WALL·E [Motion Picture]. Walt Disney Studios Motion Pictures.
Moustra, M., Avraamides, M., & Christodoulou, C. (2011). Artificial neural networks for earthquake prediction using time series magnitude data or Seismic Electric Signals. Expert Systems with Applications, 38(12), 15032-15039. doi:10.1016/j.eswa.2011.05.043
Nanopoulos, A., Alcock, R., & Manolopoulos, Y. (2001). Feature-based classification of time-series data. International Journal of Computer Research, 10(3), 49-61.
Narayanan, V., Arora, I., & Bhatia, A. (2013). Fast and accurate sentiment classification using an enhanced Naive Bayes model. International Conference on Intelligent Data Engineering and Automated Learning (pp. 194-201). Berlin, Heidelberg: Springer. doi:10.1007/978-3-642-41278-3_24
Navot, A., Gilad-Bachrach, R., Navot, Y., & Tishby, N. (2005). Is feature selection still necessary? International Statistical and Optimization Perspectives Workshop "Subspace, Latent Structure and Feature Selection" (pp. 127-138). Berlin, Heidelberg: Springer. doi:10.1007/11752790_8
Nguyen, D.-D., & Hai-Son, L. (2015). Kinect gesture recognition: SVM vs. RVM. Seventh International Conference on Knowledge and Systems Engineering (KSE) (pp. 395-400). Ho Chi Minh City, Vietnam: IEEE. doi:10.1109/KSE.2015.35
Orasa, P., Nukoolkit, C., & Watanapa, B. (2012). Human gesture recognition using Kinect camera. 2012 Ninth International Conference on Computer Science and Software Engineering (JCSSE) (pp. 28-32). Bangkok, Thailand: IEEE. doi:10.1109/JCSSE.2012.6261920
Osborne, J. W., & Overbay, A. (2014). The power of outliers (and why researchers should always check for them). Practical Assessment, Research & Evaluation, 9(6), 1-12.
Patsadu, O., Nukoolkit, C., & Watanapa, B. (2012). Human gesture recognition using Kinect camera. 2012 Ninth International Conference on Computer Science and Software Engineering (JCSSE) (pp. 28-32). UTCC, Bangkok, Thailand: IEEE. doi:10.1007/s10462-012-9356-9
Pfister, A., West, A. M., & Noah, J. A. (2014). Comparative abilities of Microsoft Kinect and Vicon 3D motion capture for gait analysis. Journal of Medical Engineering & Technology, 38(5), 274-280. doi:10.3109/03091902.2014.909540
Pincus, S. M. (1991). Approximate entropy as a measure of system complexity. Proceedings of the National Academy of Sciences, 88(6), 2297-2301. doi:10.1073/pnas.88.6.2297
Preis, J., Kessel, M., Werner, M., & Linnhoff-Popien, C. (2012). Gait recognition with Kinect. 1st International Workshop on Kinect in Pervasive Computing (pp. 1-4). Newcastle, United Kingdom.
Rahman, M. H., & Afrin, J. (2013). Hand gesture recognition using multiclass support vector machine. International Journal of Computer Applications, 1, 39-43. doi:10.5120/12852-9367
Raptis, M., Kirovski, D., & Hoppe, H. (2011). Real-time classification of dance gestures from skeleton animation. Proceedings of the 2011 ACM SIGGRAPH/Eurographics Symposium on Computer Animation (pp. 147-156). Vancouver, British Columbia, Canada: ACM. doi:10.1145/2019406.2019426
Rasmussen, C. E. (2003). Gaussian processes in machine learning. Summer School on Machine Learning (pp. 63-71). Berlin, Heidelberg, Germany: Springer. doi:10.1007/978-3-540-28650-9_4
Ren, Z., Yuan, J., Meng, J., & Zhang, Z. (2013). Robust part-based hand gesture recognition using Kinect sensor. IEEE Transactions on Multimedia, 15(5), 1110-1120. doi:10.1109/TMM.2013.2246148
Reunanen, J. (2003). Overfitting in making comparisons between variable selection methods. Journal of Machine Learning Research, 3(Mar), 1371-1382.
Rish, I. (2001). An empirical study of the naive Bayes classifier. IJCAI 2001 Workshop on Empirical Methods in Artificial Intelligence, 3, 41-46. Seattle, Washington, United States: AAAI.
Ritchie, J., & Jonathan, F. (2019, April 16). Scikit-rvm: Relevance Vector Machine implementation using the scikit-learn API. Retrieved from https://github.com/JamesRitchie/scikit-rvm
Rosa-Pujazón, A., Barbancho, I., Tardón, L. J., & Barbancho, A. M. (2016). Fast-gesture recognition and classification using Kinect: an application for a virtual reality drumkit. Multimedia Tools and Applications, 75(4), 8137-8164. doi:10.1007/s11042-015-2729-8
Safavian, S. R., & Landgrebe, D. (1991). A survey of decision tree classifier methodology. IEEE Transactions on Systems, Man, and Cybernetics, 21(3), 660-674. doi:10.1109/21.97458
Saha, S., Datta, S., Konar, A., & Janarthanan, R. (2014). A study on emotion recognition from body gestures using Kinect sensor. 2014 International Conference on Communication and Signal Processing (pp. 56-60). Bangkok, Thailand: IEEE. doi:10.1109/ICCSP.2014.6949798
Saha, S., Shreya, G., Konar, A., & Nagar, A. K. (2013). Gesture recognition from Indian classical dance using Kinect sensor. 2013 Fifth International Conference on Computational Intelligence, Communication Systems and Networks (pp. 3-8). Madrid, Spain: IEEE. doi:10.1109/CICSYN.2013.11
Scikit-learn. (n.d.). Feature importances with forests of trees. Retrieved from https://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html#sphx-glr-auto-examples-ensemble-plot-forest-importances-py
Scikit-learn. (2019, April 15). Nearest Neighbors Classification. Retrieved from https://scikit-learn.org/stable/modules/neighbors.html#classification
Segal, M. R. (2003). Machine learning benchmarks and random forest regression. Kluwer Academic Publishers.
Seung, H. S., & Lee, D. D. (2000). The manifold ways of perception. Science, 290(5500), 2268-2269. doi:10.1126/science.290.5500.2268
Shotton, J., Fitzgibbon, A., Cook, M., Sharp, T., Finocchio, M., Moore, R., . . . Blake, A. (2011). Real-time human pose recognition in parts from single depth images. Conference on Computer Vision and Pattern Recognition (CVPR 2011) (pp. 1297-1304). Colorado Springs, United States: IEEE. doi:10.1109/CVPR.2011.5995316
Sinha, A., Chakravarty, K., & Bhowmick, B. (2013). Person identification using skeleton information from Kinect. Proceedings of the International Conference on Advances in Computer-Human Interactions (pp. 101-108). Nice, France: IARIA XPS Press.
Ten Holt, G. A., Reinders, M. J., & Hendriks, E. A. (2007). Multi-dimensional dynamic time warping for gesture recognition. Thirteenth Annual Conference of the Advanced School for Computing and Imaging, 300(1).
Tirelli, T., & Pessani, D. (2011). Importance of feature selection in decision-tree and artificial-neural-network ecological applications. Alburnus alburnus alborella: A practical example. Ecological Informatics, 6(5), 309-315. doi:10.1016/j.ecoinf.2010.11.001
Uddin, M. Z., Thang, N. D., & Kim, T.-S. (2010). Human activity recognition via 3-D joint angle features and Hidden Markov models. IEEE International Conference on Image Processing (pp. 713-716). Hong Kong, China: IEEE. doi:10.1109/ICIP.2010.5651953
Vogt, P., van den Berghe, R., de Haas, M., Hoffman, L., Kanero, J., Ezgi, M., . . . Kumar Pandey, A. (2019). Second language tutoring using social robots: A large-scale study. Proceedings of the 2019 ACM/IEEE International Conference on Human-Robot Interaction (HRI 2019) (pp. 497-505). Daegu, South Korea: IEEE. doi:10.1109/HRI.2019.8673077
Wang, X., Smith, K., & Hyndman, R. (2006). Characteristic-based clustering for time series data. Data Mining and Knowledge Discovery, 13(3), 335-364. doi:10.1007/s10618-005-0039-x
Weichert, F., Bachmann, D., Rudak, B., & Fisseler, D. (2013). Analysis of the accuracy and robustness of the leap motion controller. Sensors, 13(5), 6380-6393. doi:10.3390/s130506380
Wolpert, D. H. (1996). The lack of a priori distinctions between learning algorithms. Neural Computation, 8(7), 1341-1390. doi:10.1162/neco.1996.8.7.1341
Wu, D., Zhu, F., & Shao, L. (2012). One shot learning gesture recognition from RGBD images. 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (pp. 7-12). Rhode Island, United States: IEEE. doi:10.1109/CVPRW.2012.6239179
XGBoost. (2016). XGBoost Python Package. Retrieved from https://xgboost.readthedocs.io/en/latest/python/index.html
Xia, L., Chen, C. C., & Aggarwal, J. K. (2011, June). Human detection using depth information by Kinect. CVPR 2011 Workshops, 15-22. doi:10.1109/CVPRW.2011.5981811
Xing, E. P., Jordan, M. I., & Karp, R. M. (2001). Feature selection for high-dimensional genomic microarray data. ICML, 1, 601-608. doi:10.1.1.20.9408
Yan, K., & Zhang, D. (2015). Feature selection and analysis on correlated gas sensor data with recursive feature elimination. Sensors and Actuators B: Chemical, 212, 353-363. doi:10.1016/j.snb.2015.02.025
Zanfir, M., Leordeanu, M., & Sminchisescu, C. (2013). The moving pose: An efficient 3d kinematics descriptor for low-latency action recognition and detection. Proceedings of the IEEE International Conference on Computer Vision (pp. 2752-2759). Nice, France: IEEE.
Zhang, Z. (2012). Microsoft Kinect sensor and its effect. IEEE Multimedia, 19(2), 4-10. doi:10.1109/MMUL.2012.24


Appendix A

Columns of Kinect recording file

Name Description

time Seconds elapsed after recording began
headpos X, Y, Z
neckpos X, Y, Z
rshoulder X, Y, Z
relbowpos X, Y, Z
rwristpos X, Y, Z
lshoulderpos X, Y, Z
lelbospos X, Y, Z
lwristpos X, Y, Z
rhippos X, Y, Z
rkneepos X, Y, Z
rankeplos X, Y, Z
lhippos X, Y, Z
lkneepos X, Y, Z
lanklepos X, Y, Z
rfootpos X, Y, Z
lfootpos X, Y, Z
rhandpos X, Y, Z
lhandpos X, Y, Z
rhandtippos X, Y, Z
lhandtippos X, Y, Z
spinebasepos X, Y, Z
spidemidpos X, Y, Z
spineshoulderpos X, Y, Z
rhandstate Integer: (-1: Closed, 1: Open)
lhandstate Integer: (-1: Closed, 1: Open)
rhandconfidence String: (Low, High)
lhandconfidence String: (Low, High)
rthumbpos X, Y, Z
lthumbpos X, Y, Z
neckori X, Y, Z, W
rshoulderori X, Y, Z, W
relbowori X, Y, Z, W
rwristori X, Y, Z, W
lshouderori X, Y, Z, W
lelbowori X, Y, Z, W
lwristori X, Y, Z, W
rhipori X, Y, Z, W
rkneeori X, Y, Z, W
rankleori X, Y, Z, W
lhipori X, Y, Z, W
lkneeori X, Y, Z, W
lankleori X, Y, Z, W
rhandori X, Y, Z, W
lhandori X, Y, Z, W
spinebaseori X, Y, Z, W
spinemidori X, Y, Z, W
spineshoulderori X, Y, Z, W
facepitch Integer: +90° : -90°
faceyaw Integer: +90° : -90°
faceroll Integer: +90° : -90°
Note. All columns are float unless indicated otherwise.


Appendix B

List of features and their usage

Feature name | Number of columns with this feature selected

Abs energy 2

Absolute sum of changes 3

Count above mean 2

Count below mean 0

First location of maximum 0

First location of minimum 0

Has duplicate 0

Has duplicate max 0

Has duplicate min 0

Kurtosis 0

Last location of maximum 6

Last location of minimum 6

Length 0

Longest strike above mean 7

Longest strike below mean 4

Maximum 16

Mean 21

Mean abs change 27

Mean change 0

Mean second derivate central 0

Median 0

Minimum 8

Percentage of reoccurring datapoints to all datapoints 0

Percentage of reoccurring values to all values 0

Ratio value number to time series length 1

Skewness 7

Standard Deviation 23

Sum of reoccurring data points 1

Sum of reoccurring values 3

Sum values 8

Variance 1

Variance larger than standard deviation 0


Appendix C

Outliers removed per gesture

Gesture Outliers removed

Airplane 4

Bed 3

Bird 2

Boat 2

Book 2

Bridge 1

Bus 3

Car 2

Castle 1

Chair 2

Comb 2

Cow 1

Crocodile 3

Cup 2

Drumset 1

Fish 5

Guitar 2

Helicopter 2

Horse 4

Lamp 2

Motorcycle 2

Pencil 2

Piano 4

Pig 1

Scissors 2

Spoon 4

Stairs 2


Toothbrush 2

Tortoise 1

Train 4

Triangle 4

Trumpet 1

Violin 2

Xylophone 2

Total 83


Appendix D

Machine learning algorithms used in this study

Algorithm Category

Decision Tree Tree

Random Forest Ensemble (Random Forest)

Extra Trees Ensemble (Random Forest)

Bagging Ensemble (Bagging)

Gradient Boosting Ensemble (Boosting)

AdaBoostClassifier Ensemble (Boosting)

Logistic Regression Linear Regression

Ridge Regression Linear Regression

Passive Aggressive Regression Linear Regression

Perceptron Linear Regression

SGD Linear Regression

K Nearest Neighbors Neighbors

Nearest Centroid Neighbors

Radius Neighbor Neighbors

GaussianNB Naïve Bayes

BernoulliNB Naïve Bayes

SVC Support Vector Machine

NuSVC Support Vector Machine

LinearSVC Support Vector Machine

Multi-Layer Perceptron Neural Networks

Linear Discriminant Analysis Discriminant Analysis

Quadratic Discriminant Analysis Discriminant Analysis

Gaussian Process Gaussian Process

Relevance Vector Machine Relevance Vector Machine

XGBoost Ensemble (Boosting)

LightGBM Ensemble (Boosting)

CatBoost Ensemble (Boosting)


Appendix E

All tested hyperparameters per algorithm

Algorithm Hyperparameters Values tested

AdaBoostClassifier n_estimators 5, 100, 250, 500, 750, 1000

learning_rate .01, .03, .05, .1, .25

algorithm SAMME, SAMME.R

BaggingClassifier n_estimators 5, 100, 250, 500, 750, 1000

max_samples .1, .25, .5, .75, 1.0

ExtraTreesClassifier criterion gini, entropy

max_depth 5, 10, 20, 25, 30, 35, 40, 45, 50, 100

min_samples_split 2, 5, 10, .03, .05

min_samples_leaf 1, 5, 10, .03, .05

*GradientBoostingClassifier learning_rate .01, .03, .05, .1, .25

RandomForestClassifier n_estimators 5, 100, 250, 500, 750, 1000

criterion gini, entropy

max_depth 5, 10, 20, 25, 30, 35, 40, 45, 50, 100

PassiveAggressiveClassifier C 1, 2, 3, 4, 5

max_iter 5, 100, 250, 500, 750, 1000

RidgeClassifier alpha .0, .1, .25, .5, .75, 1.0

solver auto, svd, cholesky, lsqr,

sparse_cg, sag, saga

max_iter 5, 100, 250, 500, 750, 1000

LogisticRegression C 1, 2, 3, 4, 5

solver newton-cg, lbfgs, liblinear, sag,

saga

max_iter 5, 100, 250, 500, 750, 1000

SGDClassifier loss hinge, log, modified_huber,

squared_hinge, perceptron

max_iter 5, 100, 250, 500, 750, 1000

shuffle True, False

Perceptron max_iter 5, 100, 250, 500, 750, 1000

shuffle True, False

MLPClassifier hidden_layer_sizes (50,50,50), (50,100,50), (100,)

activation tanh, relu

solver sgd, adam

alpha 0.0001, 0.05

learning_rate constant, adaptive

BernoulliNB alpha .0, .1, .25, .5, .75, 1.0

GaussianNB No hyperparameters available for this classifier.

KNeighborsClassifier n_neighbors 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 14, 16,

18, 20

algorithm auto, ball_tree , kd_tree, brute

NearestCentroid metric euclidean, manhattan

SVC kernel linear, poly, rbf, sigmoid

C 1, 2, 3, 4, 5

gamma auto, scale, .1, .25, .5, .75, 1.0

NuSVC kernel linear, poly, rbf, sigmoid

gamma auto, scale, .1, .25, .5, .75, 1.0

nu .1, .2, .3, .4, .5

LinearSVC C 1, 2, 3, 4, 5

loss hinge, squared_hinge

DecisionTreeClassifier criterion gini, entropy

max_depth 5, 10, 20, 25, 30, 35, 40, 45, 50, 100

min_samples_split 2, 5, 10, .03, .05

min_samples_leaf 1, 5, 10, .03, .05

LinearDiscriminantAnalysis solver svd, lsqr, eigen

QuadraticDiscriminantAnalysis reg_param .0, .1, .25, .5, .75, 1.0


Appendix F

Number of features used after feature selection per column

Position X Y Z Total

headpos 1 3 2 6

neckpos 0 0 0 0

rshoulderpos 0 0 0 0

relbowpos 1 4 1 6

rwristpos 1 0 0 1

lshoulderpos 0 0 0 0

lelbowpos 0 1 0 1

lwristpos 0 0 0 0

rhippos 0 0 0 0

rkneepos 0 0 0 0

ranklepos 0 0 0 0

lhippos 0 1 0 1

lkneepos 0 0 0 0

lanklepos 0 0 0 0

rfootpos 0 0 0 0

rhandpos 4 3 3 10

lhandpos 1 6 3 10

rhandtippos 0 0 0 1

lhandtippos 0 0 0 0

spinebaspos 0 0 0 0

spinemidpos 0 0 1 1

spineshoulderpos 0 0 0 0

rthumbpos 1 0 0 1

lthumbpos 1 0 0 1

Total 39

Orientations X Y Z W Total

neckori 3 0 2 2 7

rshoulderori 0 1 2 1 5

relbowori 3 0 0 6 9

rwristori 1 3 0 3 7

lshoulderori 1 0 0 1 2

lelbowori 1 4 3 7 15

lwristori 1 3 0 3 7

rhipori 0 0 0 1 1

rkneeori 0 2 0 3 5

rankleori 0 0 0 0 0

lhipori 1 0 0 0 1

lkneeori 1 0 0 0 1

lankleori 0 0 0 0 0

rhandori 0 1 1 2 4

lhandori 1 0 1 2 4

spinebaseori 0 0 0 0 0

Total 68

Other Total

rhandstate 7

lhandstate 5

rhandconfidence 6

lhandconfidence 3

facepitch 3

faceyaw 0

faceroll 2

Total 26


Appendix G

All features selected after RFE and their importance

Rank Feature Importance STD

1 rhandState_standard_deviation 0.011884 0.003778

2 rhandconfidence_mean 0.011073 0.003478

3 lhandState_standard_deviation 0.010692 0.003753

4 rhandposY_mean_abs_change 0.010060 0.003034

5 relboworiW_maximum 0.009948 0.003927

6 relboworiX_mean 0.009646 0.003253

7 lelboworiX_mean 0.009517 0.003887

8 lelboworiW_mean 0.009510 0.003606

9 relboworiW_mean 0.009172 0.003212

10 relboworiW_standard_deviation 0.009162 0.003316

11 lhandconfidence_mean 0.009012 0.003217

12 lwristoriW_maximum 0.008988 0.003741

13 lelboworiZ_mean 0.008928 0.002686

14 lelboworiW_maximum 0.008840 0.003020

15 relbowposY_standard_deviation 0.008763 0.003306

16 rwristoriW_maximum 0.008519 0.002553

17 lelboworiW_standard_deviation 0.008458 0.003871

18 lelboworiY_maximum 0.008318 0.003066

19 relboxoriZ_mean 0.008286 0.002692

20 lhandposY_mean_abs_change 0.008251 0.003011

21 rhandState_mean 0.008186 0.002604

22 lhandState_mean 0.008069 0.003277

23 rhandconfidence_mean_abs_change 0.008047 0.002811

24 rhandState_sum_of_reoccurring_values 0.007917 0.002352

25 relboworiW_minimum 0.007838 0.002672

26 lelboworiW_minimum 0.007828 0.002537

27 lhandState_mean_abs_change 0.007805 0.002965

28 lelboworiY_standard_deviation 0.007749 0.002818

29 neckoriX_standard_deviation 0.007746 0.002870

30 lelbowposY_standard_deviation 0.007737 0.002923

31 lhandposY_maximum 0.007704 0.002892

32 lelboworiZ_maximum 0.007639 0.002638

33 lshoulderoriX_standard_deviation 0.007611 0.002312

34 facepitch_maximum 0.007593 0.002735

35 lhandoriW_maximum 0.007570 0.003164

36 rhandState_count_above_mean 0.007558 0.002473

37 lhandoriZ_mean 0.007557 0.002818

38 lhandposY_standard_deviation 0.007554 0.003183

39 relboxoriY_mean 0.007546 0.002557

40 relboxoriY_minimum 0.007464 0.002582

41 relboxoriZ_standard_deviation 0.007450 0.002780

42 rwristoriY_maximum 0.007442 0.002613

43 relboworiX_standard_deviation 0.007400 0.002527

44 relboxoriY_mean_abs_change 0.007385 0.002982

45 lwristoriW_mean 0.007370 0.002265

46 lhandposY_skewness 0.007368 0.002672

47 lhiporiX_standard_deviation 0.007290 0.002335

48 lhandState_count_above_mean 0.007262 0.002903

49 relboxoriY_standard_deviation 0.007255 0.002952

50 rhandconfidence_standard_deviation 0.007176 0.002800

51 relboxoriY_maximum 0.007169 0.002551

52 rshoulderoriW_minimum 0.007121 0.002513

53 rhandposZ_standard_deviation 0.007117 0.002645

54 neckoriZ_mean 0.007093 0.002764

55 lhandposX_mean 0.007093 0.002865

56 rshoulderoriY_maximum 0.007067 0.002655

57 neckoriZ_minimum 0.007057 0.002746

58 rhandposY_standard_deviation 0.007044 0.002511

59 rhandoriZ_mean_abs_change 0.007036 0.002522

60 lwristoriY_maximum 0.007022 0.002890

61 rshoulderoriZ_minimum 0.007004 0.002550

62 headposY_standard_deviation 0.006966 0.002913

63 lhandposY_absolute_sum_of_changes 0.006878 0.002439

64 rhandState_mean_abs_change 0.006853 0.002707

65 rshoulderoriY_mean 0.006729 0.001978

66 rhandoriW_mean 0.006695 0.002370

67 rkneeoriY_minimum 0.006682 0.002498

68 lelboworiY_mean 0.006654 0.002407

69 neckoriW_mean_abs_change 0.006606 0.001961

70 lhandposZ_standard_deviation 0.006601 0.002566

71 lhandconfidence_mean_abs_change 0.006600 0.002398

72 rwristoriX_mean_abs_change 0.006589 0.002426

73 rkneeoriW_minimum 0.006546 0.002613

74 rwristoriY_mean_abs_change 0.006524 0.002134

75 relbowposY_skewness 0.006516 0.002327

76 relbowposY_maximum 0.006511 0.002102

77 rhandoriY_mean_abs_change 0.006505 0.002540

78 rhandconfidence_longest_strike_above_mean 0.006465 0.002662

79 lwristoriY_mean_abs_change 0.006443 0.002195

80 facepitch_sum_of_reoccurring_values 0.006416 0.002267

81 rhandposX_maximum 0.006402 0.002195

82 rwristoriY_standard_deviation 0.006381 0.002426

83 headposY_mean_abs_change 0.006365 0.002387

84 lwristoriX_mean_abs_change 0.006316 0.002355

85 facepitch_mean 0.006315 0.002197

86 lwristoriW_mean_abs_change 0.006298 0.002663

87 lshoulderoriW_mean 0.006291 0.002403

88 rhandposZ_longest_strike_below_mean 0.006256 0.002328

89 rkneeoriW_standard_deviation 0.006248 0.002881

90 rwristoriW_mean_abs_change 0.006215 0.002175

91 relboxoriY_sum_values 0.006189 0.002173

92 lhandposZ_skewness 0.006183 0.002035

93 lelboworiW_mean_abs_change 0.006181 0.002472

94 relbowposZ_standard_deviation 0.006140 0.002132

95 rhandposY_longest_strike_above_mean 0.006131 0.002358

96 lelboworiW_sum_values 0.006116 0.002364

97 lhandoriW_mean_abs_change 0.006104 0.002381

98 relboworiW_mean_abs_change 0.006095 0.002168

99 rhandoriW_maximum 0.006093 0.002127

100 rhandposX_mean 0.006092 0.002015

101 rhandposZ_skewness 0.006065 0.002451

102 lhandState_skewness 0.006052 0.002170

103 lhandposY_longest_strike_above_mean 0.006029 0.002182

104 lhandoriX_mean_abs_change 0.006021 0.002030

105 headposZ_mean_abs_change 0.006020 0.002207

106 headposX_mean_abs_change 0.006015 0.001987

107 relboworiW_sum_values 0.006009 0.002483

108 neckoriX_mean_abs_change 0.005991 0.002673

109 rhandState_skewness 0.005983 0.001990

110 rhandconfidence_absolute_sum_of_changes 0.005916 0.002058

111 neckoriW_standard_deviation 0.005906 0.001925

112 lelboworiW_abs_energy 0.005857 0.002144

113 lhandconfidence_longest_strike_above_mean 0.005842 0.001926

114 rhandposX_skewness 0.005742 0.002105

115 rhandState_longest_strike_above_mean 0.005724 0.001969

116 relboxoriZ_mean_abs_change 0.005718 0.002054

117 rhandposX_longest_strike_below_mean 0.005710 0.002202

118 lkneeoriX_mean_abs_change 0.005699 0.002059

119 rshoulderoriW_sum_values 0.005698 0.002091

120 relbowposX_last_location_of_maximum 0.005685 0.001961

121 neckoriX_variance 0.005664 0.002023

122 lhandtipposZ_last_location_of_minimum 0.005630 0.001935

123 lhipposY_last_location_of_minimum 0.005618 0.002096

124 neckoriZ_sum_values 0.005562 0.002222

125 rwristposX_last_location_of_maximum 0.005561 0.001979

126 rhandtipposZ_last_location_of_maximum 0.005528 0.002075

127 rhandconfidence_sum_of_reoccurring_data_points 0.005519 0.002105

128 headposZ_last_location_of_minimum 0.005498 0.001974

129 rthumbposX_last_location_of_minimum 0.005487 0.001872

130 rhiporiW_last_location_of_minimum 0.005484 0.001835

131 rwristoriW_sum_values 0.005476 0.002098

132 lelboworiZ_longest_strike_above_mean 0.005473 0.002155

133 spinemidposZ_last_location_of_maximum 0.005441 0.001977

134 lthumbposX_last_location_of_maximum 0.005380 0.002021

135 rkneeoriW_mean_abs_change 0.005373 0.002097

136 lelboworiY_sum_values 0.005301 0.002137

137 lhandposZ_longest_strike_below_mean 0.005299 0.001909

138 relboworiX_longest_strike_below_mean 0.005298 0.001855

139 relbowposY_longest_strike_above_mean 0.005297 0.001971

140 headposY_absolute_sum_of_changes 0.005257 0.001986

141 faceroll_sum_of_reoccurring_values 0.005192 0.001947

142 faceroll_ratio_value_number_to_time_series_length 0.005176 0.001959

143 relboworiZ_last_location_of_minimum 0.005164 0.001998

144 rkneeoriY_last_location_of_maximum 0.004938 0.001873

146 neckoriZ_abs_energy 0.004851 0.001956

147 lwristoriY_sum_values 0.004693 0.001884

Note. Feature importances as produced by the Random Forest algorithm.
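For illustration, the following sketch (not the thesis code; scikit-learn is assumed and the data is a random placeholder) shows how a ranked importance table of this form can be produced. Reading the second numeric column as the standard deviation of each importance across the individual trees of the forest is an assumption made here, not a claim from the thesis.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Placeholder data standing in for the tsfresh feature matrix and the 35
# gesture labels; the shapes are illustrative only.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 50)),
                 columns=[f"feature_{i}" for i in range(50)])
y = rng.integers(0, 35, size=200)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

# Mean importance per feature, plus its spread across the individual trees;
# these would correspond to the two numeric columns of the table above.
importances = forest.feature_importances_
std = np.std([tree.feature_importances_ for tree in forest.estimators_], axis=0)

# Rank features from most to least important, as in the appendix.
for rank, idx in enumerate(np.argsort(importances)[::-1][:10], start=1):
    print(rank, X.columns[idx], round(importances[idx], 6), round(std[idx], 6))
```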


Appendix H

Best hyperparameters for each algorithm

Algorithm Hyperparameters Best value

AdaBoostClassifier n_estimators 1000

learning_rate 0.01

algorithm SAMME.R

BaggingClassifier n_estimators 1000

max_samples 1.0

ExtraTreesClassifier criterion gini

max_depth 25

min_samples_split 1

min_samples_leaf 5

GradientBoostingClassifier learning_rate 0.05

RandomForestClassifier n_estimators 1000

criterion gini

max_depth 20

PassiveAggressiveClassifier C 1

max_iter 100

RidgeClassifier alpha 0.0

solver sparse_cg

max_iter 100

LogisticRegression C 3

solver liblinear

max_iter 100

SGDClassifier loss log

max_iter 100

shuffle True

Perceptron max_iter 100

shuffle True

MLPClassifier hidden_layer_sizes (100,)

activation tanh

solver adam

alpha 0.05

learning_rate constant

BernoulliNB alpha 0.1

GaussianNB (no hyperparameters available for this classifier)

KNeighborsClassifier n_neighbors 20

algorithm auto

NearestCentroid metric manhattan

SVC kernel rbf

C 5

gamma scale

NuSVC kernel rbf

gamma scale

nu 0.3

LinearSVC C 1

loss squared_hinge

DecisionTreeClassifier criterion gini

max_depth 25

min_samples_split 10

min_samples_leaf 2

LinearDiscriminantAnalysis solver svd

QuadraticDiscriminantAnalysis reg_param 1.0
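As an illustration of how best values like those above could be obtained, the sketch below assumes an exhaustive grid search with scikit-learn's GridSearchCV, shown for the Random Forest rows only; the actual search strategy, candidate grids, and number of folds used in the thesis may have differed, and the data is a placeholder.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Placeholder data; the real input would be the selected time-series features.
X, y = make_classification(n_samples=300, n_features=20, n_classes=5,
                           n_informative=10, random_state=0)

# Candidate values bracketing the "best value" column for the Random Forest.
param_grid = {
    "n_estimators": [100, 500, 1000],
    "criterion": ["gini", "entropy"],
    "max_depth": [10, 20, 25, None],
}

search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid,
                      cv=5, n_jobs=-1)
search.fit(X, y)
print(search.best_params_)  # the tuned values would be reported in the table
```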


Appendix I

Full results of voting algorithms

Voting method A: Combining the top n performing algorithms.

Combination of algorithms   Soft: mean test accuracy   Soft: STD*3   Hard: mean test accuracy   Hard: STD*3

Top 2 0.4643 0.0464 0.4568 0.0499

Top 3 0.4700 0.0570 0.4679 0.0563

Top 4 0.4660 0.0486 0.4733 0.0553

Top 5 0.4593 0.0370 0.4761 0.0491

Top 6 0.4566 0.0336 0.4733 0.0477

Top 7 0.4550 0.0357 0.4698 0.0470

Top 8 0.4539 0.0369 0.4634 0.0485

Top 9 0.4581 0.0354 0.4665 0.0423

Top 10 0.4508 0.0404 0.4624 0.0394

Top 11 0.4509 0.0404 0.4623 0.0410

Top 12 0.4406 0.0405 0.4620 0.0395

Top 13 0.4317 0.0405 0.4582 0.0426

Top 14 0.4321 0.0422 0.4578 0.0469

Top 15 0.4339 0.0406 0.4583 0.0467

All 0.4319 0.0407 0.4534 0.0451

Rank Algorithm
1 Random Forest
2 Bagging Classifier
3 Extra Tree
4 Linear Discriminant Analysis
5 Logistic Regression
6 Multi-layer Perceptron
7 Support Vector Machine
8 Nu Support Vector Machine
9 Gradient Boosting
10 SGD
11 Ada Boost
12 GaussianNB
13 BernoulliNB
14 K-nearest-neighbour
15 Decision Trees
16 Quadratic Discriminant Analysis

Voting method B: Combining algorithms from different categories only.

Combination of algorithms   Soft: mean test accuracy   Soft: STD*3   Hard: mean test accuracy   Hard: STD*3

Best 2 0.4502 0.0452 0.4461 0.0512

Best 3 0.4441 0.0422 0.4492 0.0383

Best 4 0.4388 0.0280 0.4412 0.0344

Best 5 0.4399 0.0305 0.4444 0.0316

Best 6 0.4143 0.0377 0.4415 0.0385

Best 7 0.4159 0.0385 0.4415 0.0385

All 0.4222 0.0380 0.4442 0.0414

Rank Algorithm
1 Random Forest
2 Linear Discriminant Analysis
3 Logistic Regression
4 Multi-layer Perceptron
5 Support Vector Machine
6 GaussianNB
7 K-nearest-neighbour
8 Decision Trees
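As an illustration of the voting setup evaluated in this appendix, the sketch below assumes scikit-learn's VotingClassifier with a "best 3" style combination taken from the rankings above; the estimator settings are placeholders rather than the tuned models, and reading STD*3 as three times the standard deviation over cross-validation runs is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Placeholder data standing in for the selected time-series features.
X, y = make_classification(n_samples=300, n_features=20, n_classes=5,
                           n_informative=10, random_state=0)

# A "best 3" style combination drawn from the rankings above.
estimators = [
    ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
    ("lda", LinearDiscriminantAnalysis()),
    ("lr", LogisticRegression(max_iter=1000)),
]

# Soft voting averages predicted class probabilities; hard voting takes a
# majority vote over the predicted labels.
for voting in ("soft", "hard"):
    clf = VotingClassifier(estimators=estimators, voting=voting)
    scores = cross_val_score(clf, X, y, cv=5)
    print(voting, round(scores.mean(), 4), round(3 * scores.std(), 4))
```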


Appendix J

Classification report of the tuned Random Forest algorithm

Gesture Precision Recall F1-score Support

Airplane 0.81 0.76 0.78 33

Bed 0.50 0.72 0.59 25

Bird 0.59 0.81 0.68 16

Boat 0.38 0.10 0.15 31

Book 0.54 0.54 0.54 24

Bridge 0.38 0.15 0.22 33

Bus 0.54 0.43 0.48 30

Car 0.46 0.46 0.46 24

Castle 0.32 0.36 0.34 33

Chair 0.38 0.64 0.48 25

Comb 0.62 0.69 0.65 26

Cow 0.39 0.32 0.35 28

Crocodile 0.75 0.75 0.75 28

Cup 0.56 0.50 0.53 28

Drumset 0.42 0.77 0.55 22

Fish 0.40 0.68 0.51 28

Guitar 0.52 0.68 0.59 25

Helicopter 0.33 0.25 0.29 16

Horse 0.52 0.39 0.45 28

Lamp 0.36 0.20 0.26 25

Motorcycle 0.45 0.54 0.49 24

Pencil 0.38 0.32 0.35 31

Piano 0.42 0.72 0.53 25

Pig 0.11 0.04 0.06 26

Scissors 0.36 0.39 0.38 31

Spoon 0.37 0.38 0.38 26

Stairs 0.50 0.16 0.24 32

Table 0.41 0.50 0.45 28

Toothbrush 0.43 0.50 0.47 20

Tortoise 0.24 0.50 0.33 16

Train 0.83 0.59 0.69 32

Triangle 0.55 0.57 0.56 28

Trumpet 0.46 0.59 0.52 22

Violin 0.75 0.67 0.71 27

Xylophone 0.38 0.25 0.30 24

Micro avg. 0.47 0.47 0.48 920

Macro avg. 0.47 0.48 0.46 920

Weighted avg. 0.48 0.47 0.46 920

Note. For this run, the data was split into a training set (0.75) and a test set (0.25) and no cross-validation was performed. Accuracy on this run was 0.4750; excluding 'Boat' and 'Pig', accuracy was 0.5340.
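As an illustration, the sketch below assumes scikit-learn and shows how a per-gesture report of this form is produced from a single 75/25 split; the data and model settings are placeholders rather than the tuned Random Forest.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

# Placeholder data standing in for the featureset and the 35 gesture labels.
X, y = make_classification(n_samples=500, n_features=20, n_classes=5,
                           n_informative=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.75,
                                                    random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Per-class precision, recall, F1 and support, plus the averaged rows,
# as in the report above.
print(classification_report(y_test, y_pred))
print("Accuracy:", accuracy_score(y_test, y_pred))
```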