Running head: RECOGNISING GESTURES USING TIME-SERIES ANALYSIS 1
RECOGNISING GESTURES USING TIME-SERIES ANALYSIS
MICHAEL COOLEN
TILBURG UNIVERSITY
Student number: 2031591
Administration number: u741570
Email address:
Supervisor: dr. P.A. Vogt
Supervisor email address:
Second reader: dr. M. Alimardani
Course: 880502-M-18 (Master thesis/Data Science in Action)
Faculty: Tilburg School of Humanities and Digital Sciences
Department: Department of Cognitive Science and Artificial Intelligence
Program: Data Science & Society
Date: May 29th, 2019
Word count: 9,791
Preface
This thesis has been written for the master's program Data Science & Society at Tilburg University. Many thanks to my supervisor dr. Paul Vogt and to Jan de Wit for providing me with this interesting research opportunity. Thanks also to my fellow students for all the positive moments throughout the year. Finally, this thesis could not have been written without the support of my family and friends.
Abstract
Human-Robot Interaction is becoming more common. For robots to communicate naturally with humans, they will need non-verbal communication, and the first step in this process is recognising human gestures. Human gestures can be recorded using a motion-sensing device such as a Kinect. This research builds on a previous study in which a large gesture dataset was created through human-robot interaction (de Wit et al., 2019a). In that study, 35 different types of gestures were recorded using a Kinect, and one-shot learning classified them with an accuracy of 23%. The goal of the current research was to find a method that increases this gesture classification accuracy. Time-series analysis was used, which has not been done before in this research field. The dataset was transformed into a featureset with 4,960 features, of which feature selection identified 146 as important. In total, 23 machine learning algorithms were tested on the features, and ensemble-type algorithms were found to work best for these kinds of features. After hyperparameter tuning, a simple Random
Forest classified gestures best, with an accuracy of 47%. To increase this accuracy further, three state-of-the-art ensemble algorithms were tested; CatBoost reached a classification accuracy of over 50%. For generalisation purposes, a fast and simple model was created using the 15 most important time-series features. This model achieves a classification accuracy of 35%.
Keywords: Human-Robot Interaction, Gesture Recognition, Kinect, Time-series,
Machine Learning
Contents
Preface
Abstract
Introduction
Related Work
    Recording Gestures
    One-shot Learning
    Supervised Learning
    Time-series
    Supervised Learning Algorithms
Method
    Setup
    Dataset Description
    Pre-processing
    Feature Extraction
    Featureset Pre-processing
    Baseline
    Feature Selection
    Hyperparameter Tuning
    Voting
Results
    Algorithm Comparison
    Feature Selection
    First Comparison
    Hyperparameter Tuning
    Voting
    Extra Algorithms
    Gestures
Discussion
    Features
    Algorithms
    Limitations and Future Research
Conclusion
References
Appendix A
Appendix B
Appendix C
Appendix D
Appendix E
Appendix F
Appendix G
Appendix H
Appendix I
Appendix J
Introduction
Have you ever seen WALL·E? This popular Disney film featured a small, cute robot that was liked by many (Morris, 2008). Although this robot did not speak a single word during the film, you could still make out what it was trying to communicate. Instead of speech, this robot communicated entirely through body language. The film showed that, even without language, we can still understand robots just by looking at their body language.
Although hard to put into numbers, many researchers agree that non-verbal communication is more important than verbal communication (Beattie, 2004). By using gestures, the listener can better understand what you are trying to communicate (Kendon, 1994). While there are cultural differences, gestures appear to be used everywhere in the world (Graham & Argyle, 1975).
Additionally, we now live in a world where humans communicate more and more frequently with robots. Verbal communication with robots has partly been tackled:
our current phones, for instance, can recognise simple speech and provide us with information. In contrast, robots can neither read our non-verbal language nor produce it.
Small applications of simple non-verbal social robots exist, but currently no robot exists that can automatically recognise gestures and communicate naturally with humans
(Adăscălitei, Doroftei, Lefeber, & Vanderborght, 2014).
There would be many practical benefits if robots could communicate naturally with humans. This research field, called Human-Robot Interaction, has recently attracted much research attention. Some applications exist in the form of language tutoring for children (de Wit,
Krahmer, & Vogt, 2019b). Similarly, research on how robots can help children learn a second language is currently being done (Vogt et al., 2019). Another field in which social robots could play a significant role is elderly support (Montemerlo, Pineau, Roy, Thrun, & Verma,
2002). Furthermore, many applications exist in the entertainment industry, for example a dancing partner robot (Montemayor et al., 2000). More applications of social robots are expected in the future, as research continues.
The first step in this non-verbal communication problem is solving gesture recognition.
Robots do not have eyes, so they must use another method of seeing humans and their
gestures. Researchers often use motion sensing input devices such as the Microsoft Kinect™ to record human movement (Pfister, West, & Noah, 2014). Using this device, extensive information about human movement is recorded, which can then be used for data analysis.
Multiple approaches can be taken to gesture recognition. A newly created robot does not know any gestures, and creating many training examples for each gesture would take a lot of time. Therefore, one-shot learning is used as a method to recognise gestures (Escalante, Guyon, Athitsos, Jangyodsuk, & Wan, 2017). Once a lot of data has been collected, supervised learning can be used instead; machine-learning-based gesture classification improves as more training data becomes available.
The current research was inspired by previous research in which a large gesture dataset was created (de Wit et al., 2019a). By letting a robot play a game of charades with a human, many gestures were recorded. In that study, one-shot learning using the gist of the gesture was applied to classify gestures (Cabrera & Wachs, 2017), leading to a gesture classification accuracy of 23%. Now that this gesture dataset exists, supervised learning can be applied to it to improve classification accuracy. This study will use supervised machine learning methods to improve gesture recognition.
Supervised learning on Kinect data requires feature extraction. There is still no foolproof way to extract features from gesture data, as every study seems to use its own method (Biswas & Basu, 2011; Marin, Dominio, & Zanuttigh, 2014; Xia, Chen, & Aggarwal,
2011). In the same way, it is unknown which machine learning algorithms work well on this kind of data. Most researchers use similar algorithms, even though many more are available (Bhattacharya, Czejdo, & Perez, 2012; D’Orazio, Marani, Renó, & Cicirelli, 2016).
This study will attempt a new method of feature extraction to support the existing literature on this subject: it will attempt to classify gestures using time-series analysis.
Time-series features can be viewed as statistics that describe the data, for example the mean or standard deviation. Although time-series analysis has been applied successfully in other research fields (Pincus, 1991), it has not been applied in gesture recognition research.
This leads to the following research question:
Can time-series analysis be used for gesture recognition?
The two methods that this research will focus on can be converted into two additional sub questions.
a) How well do time-series features work for gesture classification?
Since gesture data has the dimension of time, simple statistics such as mean and variance can be calculated. These features, also called time-series features, are simple to calculate, but might work well for gesture recognition. No research has been found using this method of feature extraction in this research field, so it is unknown how well this method will work.
b) What machine learning algorithms work best on gesture time-series features?
This question will be answered by testing different machine learning algorithms to see which work best. Since "best" is a relative term, the goal is to achieve the highest possible accuracy. However, there may be a trade-off between speed and accuracy: if one algorithm achieves slightly better accuracy but a much faster algorithm comes close, then the faster algorithm is considered more efficient and is preferred.
Related Work
Some research has been done on recognising gestures. This is a multi-step process in which the choices made at each step affect the final result (Mitra & Acharya, 2007). First, gestures have to be recorded and turned into data that can be processed. Next, features have to be extracted from the data for analysis. Finally, an algorithm has to be used to perform classification.
Recording Gestures
Several methods exist for capturing human movement. One method is to recognise human movement from images or videos (Rahman & Afrin, 2013). Different algorithms can be used to process image data. Research shows that an acceptable classification accuracy is achievable, although often the same algorithms are used (Morency, Quattoni, & Darrell, 2007). Analysing videos of gestures is also possible, but research in this field is often limited to hand gestures or other simple gestures (Lee, Lee, Lee, & Hong, 2004).
In more recent years, motion sensing input devices such as the Microsoft Kinect™ have become available. Although originally developed as an Xbox gaming accessory, researchers have shown interest in these devices because of their ability to recognise human movement (Ren,
Yuan, Meng, & Zhang, 2013). The device can record human movement in 3D through time using an infrared laser projector and an RGB (red, green, blue) camera (Zhang, 2012).
There are multiple ways to utilise the Kinect device for research. Researchers can decide to use only the depth camera (Uddin, Thang, & Kim, 2010). The use of only the RGB sensor data is also seen in research (Biswas & Basu, 2011). Lastly, the Kinect can provide skeleton data, which gives the X, Y, and Z coordinates of a subject's skeleton joints through time (Le & Nguyen, 2013; Raptis, Kirovski, & Hoppe, 2011).
More recently, the Leap Motion device has become available (Leap Motion Inc, 2012). While it works similarly to a Kinect, this device specialises in recording hand and finger movement, and can record hand movements more accurately than a Kinect
(Weichert, Bachmann, Rudak, & Fisseler, 2013). Research shows that basic gestures can be recognised with an acceptable accuracy (Marin, Dominio, & Zanuttigh, 2014).
One-shot Learning
Gesture classification can be done using a few approaches. While regular machine learning tasks often use hundreds of training examples, one-shot learning refers to classifying gestures based on just a few training samples (Escalante, Guyon, Athitsos, Jangyodsuk, &
Wan, 2017). A newly developed robot has to start learning to discriminate between gestures.
Researchers often provide a few training examples so the robot can begin learning. Many methods for one-shot gesture classification have been developed.
One method of recognising gestures is using Template Matching (Aggarwal & Cai, 1999).
In this method, movement is identified by applying templates, or so-called meshes, to an image.
These templates can be used to compare and match gestures. One study used parameters extracted from the depth motion data to differentiate between gestures: first, the background was removed from frames using a grayscale threshold, and features were then extracted from these frames (Mahbub, Imtiaz, Roy, Rahman, & Ahad, 2013). Another study took a similar approach and used both the RGB and depth sensor data from a Kinect. Morphological denoising was applied to the depth images, and human silhouettes were segmented using temporal segmentation (Wu, Zhu, & Shao, 2012).
Another method for recognising gestures uses Action Recognition (Ji & Liu, 2010).
In this field, different approaches can be taken. One study used a language-motivated approach that works similarly to how the topics of documents can be detected from their contents. Using a hierarchical Bayesian model, visual features of videos of action poses can be connected with classes of activities (Malgireddy & Nwogu, 2013). Other researchers created a Moving Pose framework, a descriptor that uses pose information as well as the speed and acceleration of human body joints within a short time window (Zanfir, Leordeanu, & Sminchisescu,
2013). Another method selects only relevant frames from a video and then uses motion descriptors for temporal segmentation in combination with Dynamic Time Warping (Konečný
& Hagara, 2014).
Moreover, another method for gesture recognition is Manifold Learning, a machine learning approach in which dimensionality reduction is performed, as the datasets are considered high-dimensional (Seung & Lee, 2000). One study used the geometry of tensor
space for action recognition: by representing videos as tensors, only useful information is extracted, which can then be used for gesture recognition (Lui, 2012).
Another method of gesture recognition uses Principal Motion Components. Using only a single training video, a 2D map of motion energy is extracted from the video (Escalante,
Guyon, Athitsos, Jangyodsuk, & Wan, 2017). This 2D map can be used to categorise gestures.
The benefits of this method are its performance and efficiency.
Recent research on this topic focuses on how humans produce gestures rather than on the gesture itself (Cabrera & Wachs, 2017). From a single gesture recording, biomechanical features called the gist are extracted (Cabrera & Wachs, 2016). This is a natural approach, because the gist represents what humans remember after seeing a gesture, in combination with the cognitive processes used to replicate it. From a gist, many new realistic observations similar to the one provided are generated by adding meaningful variability to these features, with the goal of producing a large dataset of similar observations. After this process, classifiers can be used for recognition. This approach was also used in recent research (de Wit et al., 2019a).
Supervised Learning
Another approach in gesture recognition is Supervised Learning. This method can only be applied successfully if many training examples are available (Graves, 2012). Since the previous research (de Wit et al., 2019a) managed to create a large gesture database, gesture classification accuracy can potentially be improved using this method. Supervised learning requires effective features to be extracted from the data.
The data collected in the previous research also contains the Skeleton Feature Representation, which is often used for gesture analysis (Raptis, Kirovski, & Hoppe, 2011). This representation contains the subject's joint positions as X, Y, and Z coordinates over time (Shotton et al., 2011). The origin (x = 0, y = 0, z = 0) is located at the centre of the sensor.
Looking from the sensor's point of view (see Figure 1), the X dimension grows to the left, the Y dimension grows upwards, giving an indication of height, and the Z dimension grows out in the direction the camera faces (Microsoft, 2014).
Figure 1. Kinect Coordinate System (Microsoft, 2014).
Many methods are used to extract features from these skeleton representations. One way is to calculate joint angles with respect to the person's torso (Gu,
Do, Ou, & Weihua, 2012). Other researchers were able to distinguish between a positive and a negative emotional dance by extracting features based on the upper body, velocity, acceleration, and the angles between different joints (Saha, Shreya, Konar, & Nagar, 2013). Similarly, other researchers have tried to select the most important joints and then calculate the angles between them (Le & Nguyen, 2013). This worked well when a single frame was selected, but performed poorly when time series were used.
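The angle-based features described above can be made concrete with a small sketch. The function name and the example joint positions are my own illustration, not code from the cited studies:

```python
import numpy as np

def joint_angle(a, b, c):
    """Angle (in degrees) at joint b, formed by joints a and c,
    computed from their X, Y, Z positions.

    A generic sketch of the angle-based skeleton features
    discussed above (e.g. the angle at the elbow between
    shoulder and wrist)."""
    a, b, c = (np.asarray(p, dtype=float) for p in (a, b, c))
    v1, v2 = a - b, c - b
    cos = v1.dot(v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    # Clip guards against tiny floating-point overshoot outside [-1, 1]
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

# Hypothetical positions: a right angle at the origin joint
angle = joint_angle([1, 0, 0], [0, 0, 0], [0, 1, 0])
```

Computed per frame (or on selected key frames), such angles become columns of a feature table, mirroring the approach of the studies cited above.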
Looking at a more complicated way of feature extraction, Principal Component Analysis was used by some researchers to divide the skeleton positions into three sections: torso, first-degree joints, and second-degree joints (Raptis, Kirovski, & Hoppe, 2011). Another way of extracting features is to develop a Dynamic Time Warping template, wherein a similarity value is produced by warping time sequences of joint positions (Celebi, Aydin,
Temiz, & Arici, 2013). It is hard to compare these individual feature extraction methods, but most researchers agree that structural information around the hand joints is among the most useful features for discriminating between gestures (Escalera et al.,
2013).
Time-series
Time is an important component of human life, and humans have been analysing changing events for many centuries. For example, one of the best-known time-series analyses is that of the weather (Chen & Hwang, 2000): the temperature changes through time, and patterns can be
found by analysing this change. A more advanced example is earthquake prediction, where the time and date of earthquakes are recorded in order to predict where the next one will occur (Moustra, Avraamides, & Christodoulou, 2011). What these analyses have in common is the component of time: multiple observations are made over time in order to analyse the dynamics. Correspondingly, gesture recognition can be viewed as a time-series problem, as human movement is observed over some period of time during recording.
A time series can be characterised in multiple ways (Fulcher, 2018). For example, its values might follow a distribution such as the normal distribution, or successive values might be correlated with each other. Additionally, simple statistics such as the median, maximum, or minimum can describe characteristics of a time series. In short, the time series is characterised using statistics, which leads to a feature-based representation that can be used for analysis.
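As a minimal sketch of this feature-based representation, the following computes a handful of such statistics for one series; the function name and the particular statistics chosen are illustrative, not the thesis's exact feature list:

```python
import numpy as np

def ts_features(x):
    """Summarise a time series as a dictionary of simple statistics,
    illustrating the feature-based representation described above."""
    x = np.asarray(x, dtype=float)
    return {
        "mean": x.mean(),
        "std": x.std(),
        "median": float(np.median(x)),
        "min": x.min(),
        "max": x.max(),
        # lag-1 autocorrelation: how correlated successive values are
        "autocorr_lag1": float(np.corrcoef(x[:-1], x[1:])[0, 1]),
    }

# A smooth synthetic signal: two periods of a sine wave
y = np.sin(np.linspace(0, 4 * np.pi, 100))
feats = ts_features(y)
```

Any series, regardless of length, collapses into the same fixed set of features, which is what makes conventional classifiers applicable afterwards.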
Time-series statistics have been used many times before in previous studies. For example, researchers have shown that entropy can be successfully used for prediction (Pincus,
1991). Other researchers have shown that statistics such as mean, standard deviation, skewness, and kurtosis can be used to analyse control chart pattern data (Nanopoulos, Alcock,
& Manolopoulos, 2001). Other researchers used thirteen features such as skewness, kurtosis, chaos, and nonlinearity to summarise time-series (Wang, Smith, & Hyndman, 2006).
There are multiple ways to analyse time series. If all time series are of equal length, similarity distances such as the Euclidean distance can be computed between them (Ding,
Trajcevski, Scheuermann, Wang, & Keogh, 2008). If they are not of equal length, methods such as Dynamic Time Warping (DTW) can be used (Berndt & Clifford, 1994).
DTW is often used in combination with an algorithm such as nearest neighbour, often leading to great results (Bagnall, Lines, Bostrom, Large, & Keogh, 2017). However, the literature suggests that instead of creating new algorithms to analyse time series, research should focus on transforming time series into useful features and analysing them with proven algorithms
(Bagnall, Davis, Hills, & Lines, 2012; Fulcher & Jones, 2014). To summarise, research should focus on computing different statistics over time series and analysing them with conventional algorithms.
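To make the DTW idea concrete, here is the standard textbook dynamic-programming formulation for two 1-D sequences of possibly different lengths. This is a generic sketch, not code from the cited work, and real libraries use much faster constrained variants:

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic Time Warping distance between two 1-D sequences.

    D[i, j] holds the cost of the best alignment of a[:i] with b[:j];
    each cell extends the cheapest of the three neighbouring
    alignments (match, or repeat a frame in either sequence)."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j],      # stretch b
                                 D[i, j - 1],      # stretch a
                                 D[i - 1, j - 1])  # match
    return D[n, m]
```

Because frames may be repeated, a slowed-down copy of a gesture (e.g. `[0, 0, 1, 2]` vs `[0, 1, 2]`) still has distance zero, which is exactly why DTW suits sequences of unequal length.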
Looking at gesture recognition, some research has been done on applying time-series analysis. For example, DTW has been used to recognise sign language (Ten Holt,
Reinders, & Hendriks, 2007), and the Euclidean distance has been used to recognise gestures in Kinect video recordings (Ren, Yuan, Meng, & Zhang, 2013). Another method of analysing time series is Hidden Markov Models. This algorithm can easily be applied to time series such as gestures; it is relatively easy to implement, and research shows that a good accuracy can be achieved in recognising gestures (Uddin, Thang, & Kim,
2010). However, what is lacking in the literature is research combining time-series features with conventional algorithms, a direction that previous research suggests has a lot of potential (Bagnall, Davis, Hills, & Lines, 2012; Fulcher & Jones, 2014).
The upside of using conventional algorithms is that they are tested and proven to be reliable.
Supervised Learning Algorithms
Many machine learning algorithms are available for supervised gesture classification.
There is no single best algorithm for classification (Wolpert, 1996); therefore, researchers must try different algorithms to see what works. Focusing on research that uses features extracted from the skeleton feature representation, some supervised machine learning algorithms are commonly used.
Firstly, one of the most basic classification methods is the Decision Tree, which has been used to classify gestures (Patsadu, Nukoolkit, & Watanapa, 2012). Decision Trees are very fast and easy to comprehend (Safavian & Landgrebe,
1991), but they often lack performance compared to more advanced algorithms
(Bhattacharya, Czejdo, & Perez, 2012).
Secondly, Ensemble algorithms are slightly more comprehensive but still considered simple (Dietterich, 2000). They can be divided into three categories: Bagging, Boosting, and Random Forest. The most popular seems to be the Random Forest, which is also often used in gesture classification (Shotton et al., 2011) because it is simple, easy to implement, and uses few processing resources (Segal, 2003). Boosting algorithms such as AdaBoost have been used in previous research with good results (Saha, Datta, Konar, &
Janarthanan, 2014). Bagging algorithms, however, have not been used in previous research, although they should have some potential given their similarity to other ensemble algorithms (Dietterich,
2000). Although more advanced algorithms are available that are potentially better for gesture classification, ensemble algorithms seem a solid choice for this task (Shotton et al., 2011).
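As a hedged sketch of how such an ensemble would be applied to a gesture featureset, the following trains a Random Forest on synthetic data. The sample count, feature count, class count, and hyperparameters are placeholders, not the thesis's actual setup:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Stand-in for a gesture featureset: 300 "gestures", 50 time-series
# features, 5 gesture classes (all numbers are illustrative).
X, y = make_classification(n_samples=300, n_features=50,
                           n_informative=20, n_classes=5,
                           random_state=42)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                          random_state=42)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_tr, y_tr)
acc = clf.score(X_te, y_te)  # held-out classification accuracy
```

With 5 classes, guessing yields about 20% accuracy, so any score meaningfully above that indicates the features carry signal; the same pattern applies to the 35-class gesture problem.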
Furthermore, Nearest Neighbours is a simple supervised learning method that can classify data. The most common algorithm in this category, k-nearest neighbours, is often used in multi-class classification (Escalera et al., 2013). In gesture classification it is also popular because it is simple and fast (Lai, Konrad, & Ishwar, 2012). Some research shows that a good accuracy is achievable on a simple gesture classification task (Saha, Datta,
Konar, & Janarthanan, 2014). Related neighbour classifiers such as Nearest Centroid and Radius Neighbours have not been used in previous research on gestures, so it is not known whether they work as well as Nearest
Neighbours.
Moreover, Regression methods such as Logistic Regression seem to be sparsely used in gesture classification. One study shows that Logistic Regression can be used successfully for gesture recognition (Itauma & Kivrak, 2012). Why logistic regression is seldom used is unknown; it should have some potential, as one study found that it performed well compared to other basic machine learning algorithms (Rosa-Pujazón, Barbancho, Tardón,
& Barbancho, 2016).
Additionally, Naïve Bayes has not been used for classifying gestures. It should have some potential, however, as similar research has shown that humans can be recognised with this algorithm using Kinect data (Preis, Kessel, Werner, & Linnhoff-Popien,
2012). According to previous research, Naïve Bayes should compare well with other types of algorithms (Rish, 2001). Two variants that can be used for classification are Gaussian (John
& Langley, 1995) and Bernoulli (Narayanan, Arora, & Bhatia, 2013) Naïve Bayes. It is not known how well they will perform in classifying gestures.
Next, a solid machine learning algorithm often used for classification is the Support Vector
Machine. Although more demanding in processing power, SVMs often achieve better accuracy than simpler algorithms (Hsu & Lin, 2002). This algorithm has also been applied to gesture classification, often with good success (Orasa, Nukoolkit, & Watanapa, 2012).
In contrast to SVMs, Relevance Vector Machines are seldom applied to gesture recognition. One study shows that RVMs achieve results similar to SVMs (Nguyen & Hai-Son, 2015). It is unknown how well they will perform in time-series-based gesture recognition.
Finally, Artificial Neural Networks are often used for gesture classification (Joshi, Ghosh,
Betke, Sclaroff, & Pfister, 2017). The benefit of Neural Networks is that raw data can be used and no features need to be created (Chen & Koskela, 2015). On the other hand, they often take a long time to train, although they often achieve better accuracy (Cho & Xi,
2014). A basic form of an ANN is the Multi-Layer Perceptron (MLP), which has been shown to be usable for recognising people (Sinha, Chakravarty, & Bhowmick, 2013). More complicated
Neural Networks are available for gesture recognition, but they are beyond the scope of the current research.
Other algorithms that cannot be easily categorised are Gaussian Processes (Rasmussen, 2003) and
Discriminant Analysis (Altman, 1968). These have not been used in previous research with a Kinect; they will also be tested in the current research to see how well they perform.
Method
Setup
For the current research, the latest version of Python at the time of writing was used (3.7.3). Several basic
Python packages were used for data management and calculations (see Table 1); these were the latest versions available at the time of writing. Programming and calculations were done in Jupyter notebooks (version 5.7.4). It should be noted that computations were done on a Windows 10 system with a 3.40 GHz (4-core) processor and
16 GB of memory. For replication purposes, using a similar system is advised, as many computations took a long time to run and the system occasionally ran out of memory.
Table 1
Python packages and their version used in this study
Package Version
NumPy 1.16.2
Pandas 0.24.2
SciPy 1.2.1
Scikit-learn 0.20.3
IPython 7.4.0
Matplotlib 3.0.3
Dataset Description
The dataset used in the current research is the Lowlands/NEMO dataset, created in previous research (de Wit et al., 2019a). It was created through human-robot interaction, specifically by letting a robot play a game of charades with a human. The human performed a gesture, which was recorded using a Kinect, after which the robot tried to recognise the gesture that had been performed. Although that study comprised much more, this part of it resulted in a large gesture dataset.
The dataset consists of 3,760 files, and each file represents one participant performing a gesture. In total, gestures for 35 different concepts were recorded. Each recorded gesture contained the X, Y, and Z positions of body parts such as the head, spine, and hands.
Moreover, the files also contained the orientations of these body parts in the X, Y, Z, and W dimensions.
Lastly, they contained some other columns, such as the state of the hands (open or closed), the confidence of these states, and some information about the face (pitch, roll, and yaw).
The data files have 155 columns in total (see Appendix A for a description). Some participants performed an additional attempt in expressing the gesture. Whether the current gesture data was the result of the first or second attempt could be retrieved from the filename. Likewise, data was gathered at the NEMO or Lowlands venue and this information could also be retrieved from the filename.
Pre-processing
There was no missing data in the dataset. Before processing, all fields were converted to float, except the integer columns (rhandState, lhandState, facepitch, faceyaw, and faceroll). The columns 'rhandconfidence' and 'lhandconfidence' contained the values 'low' and 'high', which were recoded to 0 and 1.
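The recoding step could look roughly like this in pandas. The column names follow the thesis; the example values and the 0/1 mapping are assumed from the description above:

```python
import pandas as pd

# Toy frame mimicking the hand-confidence columns described above
# (values are made up for illustration).
df = pd.DataFrame({
    "rhandconfidence": ["low", "high", "high"],
    "lhandconfidence": ["high", "low", "low"],
})

# Recode the categorical confidence values to 0 (low) and 1 (high)
mapping = {"low": 0, "high": 1}
for col in ("rhandconfidence", "lhandconfidence"):
    df[col] = df[col].map(mapping)
```

After this step every column is numeric, which is a precondition for the time-series statistics computed later.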
After an inspection of the data, idleness was detected in some cases: the Kinect had started recording, but almost no movement was registered, probably because the participant had not started moving yet. Similarly, in some cases idleness was recorded at the end, presumably because the participant had finished the gesture while the Kinect was still recording (see Figures 2 and 3).
In order to remove this idleness, two cut-off points were calculated based on the up-and-down (Y) movement of the right hand. First, a movement threshold was calculated as threshold = 0.10 × |max − min|. The first cut-off point was then found by starting from the beginning and selecting the first observation that moved more than the threshold;
the second cut-off point was found by starting at the end and selecting the first observation that moved more than the threshold. Since the hands are presumed to be the limbs that give the most information (Escalera et al., 2013), using the up-and-down movement of the right hand seemed a rational choice. The right hand was chosen because most people are right-handed and achieve higher accuracy on movement tasks with this hand
(Hanna et al., 1997).
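One possible reading of this trimming procedure is sketched below. The text does not spell out what "moved more than the threshold" compares, so this version compares consecutive frames; the thesis may have defined movement slightly differently:

```python
import numpy as np

def trim_idle(y, frac=0.10):
    """Drop idle frames at both ends of a right-hand Y trace.

    A frame counts as moving when its change from the previous frame
    exceeds frac * |max - min| (an assumed interpretation of the
    thresholding described above)."""
    y = np.asarray(y, dtype=float)
    threshold = frac * abs(y.max() - y.min())
    moves = np.abs(np.diff(y)) > threshold
    if not moves.any():
        return y  # no movement detected; keep the whole trace
    first = int(np.argmax(moves))                       # first moving step
    last = len(moves) - 1 - int(np.argmax(moves[::-1])) # last moving step
    return y[first:last + 2]  # keep both frames of the last moving step

trimmed = trim_idle([0, 0, 0, 1, 2, 1, 0, 0])
```

The two flat runs at the start and end fall outside the cut-off points and are discarded, matching the behaviour shown in Figures 2 and 3.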
After additional inspection, some glitched cases were detected (the first case in Figure 2).
These cases are characterised by mixed-up rows: the time column was
not ordered, resulting in a strange ordering of the data. This could cause problems in some types of analysis. However, since this research uses time-series analysis, it is not a problem here: time-series statistics such as the mean and standard deviation calculated over a column do not change when the order of the rows changes. Thus, these cases were not excluded from further analysis.
Figure 2. Detection of right hand movement (Y dimension) for 10 random cases. Data within the two cut-off points is used.
Figure 3. Using two cut-off points to detect hand movement.
Feature Extraction
Since the data consist of time-series, it is possible to calculate time-series statistics on them, such as the mean, variance, or standard deviation. The Python package TSFRESH (version 0.11.2) was used to calculate these statistics in a convenient manner (Christ, Braun, Neuffer, & Kempa-Liehr, 2018). TSFRESH can currently calculate 65 different features on a time-series. In the current research, 32 of these features were chosen (see Appendix B for a full list), because they were the easiest to comprehend and much faster to compute than the other features.
For all 3,760 gestures, multiple actions were performed. Firstly, the chosen time-series features were calculated on each column. Since the gesture files have 155 columns, this resulted in a featureset of 32 × 155 = 4,960 columns. Thereafter, the current gesture (target), the origin of the data (Lowlands or NEMO, recoded to 0 and 1) and the attempt (first or second) were added as columns. These were not used in any model calculation, to prevent leakage. This resulted in a final featureset with 3,760 rows and 4,963 columns.
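A small stand-in for this extraction step, computing a handful of the chosen statistics per column, per gesture, with pandas only. In the thesis this was done with TSFRESH's `extract_features`; the gesture ids, column name, and the `column__statistic` naming below are illustrative.

```python
import numpy as np
import pandas as pd

def extract_stats(df, id_col="gesture_id"):
    """Compute a few time-series statistics per column, per gesture."""
    rows = {}
    for gid, g in df.groupby(id_col):
        feats = {}
        for col in g.columns.drop(id_col):
            x = g[col].to_numpy(dtype=float)
            feats[f"{col}__mean"] = x.mean()
            feats[f"{col}__standard_deviation"] = x.std()
            feats[f"{col}__maximum"] = x.max()
            feats[f"{col}__sum_values"] = x.sum()
            # Mean absolute frame-to-frame change.
            feats[f"{col}__mean_abs_change"] = np.abs(np.diff(x)).mean()
        rows[gid] = feats
    return pd.DataFrame.from_dict(rows, orient="index")

demo = pd.DataFrame({
    "gesture_id": [0, 0, 0, 1, 1, 1],
    "rhandposY": [0.0, 0.5, 1.0, 0.2, 0.2, 0.2],
})
features = extract_stats(demo)  # one row per gesture, one column per statistic
```

Applied to all 155 columns of all 3,760 gestures, this yields one wide feature row per gesture.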
A copy of this featureset was saved in which outliers were removed. The current study theorises that the time humans take to perform a gesture is normally distributed: if many people perform the same gesture, they are likely to take approximately the same amount of time. One caveat is that humans can take different approaches to the same gesture; for example, there are several ways to express an airplane.
Nonetheless, removing outliers based on gesture duration also removes erroneous cases such as those in which the recording went on for too long after the gesture ended.
For this reason, outliers were removed based on the time it took a participant to perform a gesture. Outliers were detected with the z-score, using the SciPy Python package (Jones, Oliphant, & Peterson, 2001-). Cases with a z-score above +3 or below −3 were removed, which eliminated 83 outliers in total (2.21% of the data). The largest number of cases removed for a single class was 5, for the gesture fish. This resulted in a final featureset of 3,677 rows (see Appendix C for the full result).
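The outlier rule can be sketched with `scipy.stats.zscore`, which the thesis's SciPy-based detection presumably relied on; the duration values below are made up for illustration.

```python
import numpy as np
from scipy.stats import zscore

# Hypothetical gesture durations (seconds); the last one is far too long,
# e.g. the Kinect kept recording after the gesture ended.
durations = np.array([2.0] * 10 + [15.0])

z = zscore(durations)          # standardise the durations
keep = np.abs(z) < 3           # drop cases beyond +/-3 standard deviations
cleaned = durations[keep]
```

In the thesis the same rule was applied to the per-gesture recording times, removing 83 of the 3,760 cases.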
Featureset Pre-processing
Correlated features were removed for two reasons. Firstly, the featureset had too many features (4,963); since the number of features exceeds the number of cases (3,760), the featureset is high-dimensional, which causes problems in modelling (Donoho, 2000). One of these problems is overfitting, where the model learns too many characteristics of the training set and therefore cannot generalise well to the test set or to new data (Xing, Jordan, & Karp, 2001). Another problem is that fitting the algorithms takes too long with this many features. Hence, features with a correlation of .90 or higher were removed. In total 2,521 features were removed, resulting in a featureset of 2,440 columns.
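One common way to implement this pruning is to inspect the upper triangle of the absolute correlation matrix and drop one feature from every pair at or above the threshold. The thesis does not state its exact procedure, so the function below is a sketch with illustrative column names.

```python
import numpy as np
import pandas as pd

def drop_correlated(df, threshold=0.90):
    """Drop one feature from every pair with |r| >= threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [c for c in upper.columns if (upper[c] >= threshold).any()]
    return df.drop(columns=to_drop)

rng = np.random.default_rng(0)
a = rng.normal(size=100)
demo = pd.DataFrame({
    "a": a,
    "b": a * 2 + 0.001 * rng.normal(size=100),  # near-duplicate of "a"
    "c": rng.normal(size=100),                  # independent feature
})
pruned = drop_correlated(demo)  # "b" is removed, "a" and "c" survive
```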
Baseline
In the original research, gesture classification was done using one-shot learning. Using a k-nearest neighbours approach, the researchers classified the gestures. This resulted in an accuracy of 23%, which is used as the baseline for the current study.
In the current study, 23 different machine learning algorithms in total were used to model the data. At least one algorithm was chosen from every available category of supervised classification algorithms. Ensemble algorithms such as AdaBoost and Random Forest were included, as they are a popular choice in gesture recognition (Shotton et al., 2011). From the linear algorithms, classifiers such as Logistic Regression and Ridge Regression were used; linear algorithms were chosen because they are seldom used in previous research. Additionally, several Support Vector Machine classifiers were used (SVC, NuSVC and LinearSVC), as they have been used in previous research and resulted in good accuracy (Hsu & Lin, 2002). A full list of the algorithms used in this research can be found in Appendix D. For each algorithm, a 10-fold cross-validation was performed in order to obtain an unbiased accuracy estimate. Furthermore, the time it took each algorithm to run was recorded.
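The comparison loop can be sketched with scikit-learn's `cross_val_score`; the synthetic data and the two example classifiers below stand in for the real featureset and the full list of 23 algorithms.

```python
import time
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in; the real featureset had 3,760 gestures and 35 classes.
X, y = make_classification(n_samples=300, n_features=20, n_informative=10,
                           n_classes=5, random_state=0)

results = {}
for clf in (RandomForestClassifier(random_state=0), RidgeClassifier()):
    t0 = time.time()
    # 10-fold cross-validation yields an unbiased test accuracy per algorithm.
    scores = cross_val_score(clf, X, y, cv=10)
    results[type(clf).__name__] = (scores.mean(), time.time() - t0)
```

Each entry holds the mean test accuracy and the wall-clock time, mirroring the columns reported in Table 2.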
Feature Selection
There are multiple ways to select features in classification. One established method is Principal Component Analysis (PCA), which is often used on large datasets to transform the variables into a set of independent components (Jolliffe, 2014). It reduces the number of variables by combining them into components that explain most of the variance in the data. A drawback of PCA is that the resulting principal components can be difficult to interpret; applied to the current featureset, they would be extremely hard to interpret. Furthermore, PCA would take a long time to run on a featureset of this size. PCA was therefore deemed not suitable for the current featureset.
Another way to select features is by using feature importance. Some ensemble machine learning algorithms, such as Random Forest, have a built-in method of ranking features (Scikit-learn, n.d.). With this method the importance of features can be compared. However, it does not indicate how many features should be selected for further analysis.
A good method to select features is Recursive Feature Elimination (RFE). RFE ranks features by recursively fitting models on subsets of features in order to find the subset that yields the highest accuracy (Guyon, Weston, Barnhill, & Vapnik, 2002). It starts with the base model and all the features, then repeatedly drops the least important feature and refits the model without it, stopping when no further improvement in accuracy is found (Yan & Zhang, 2015). The only drawback of using RFE on this featureset is its long running time.
The base featureset contains 2,512 features, so many models must be run to find the most important ones. To increase processing speed, the featureset was split into four subsets, and RFE was applied to each subset to find its best features. This resulted in four different sets of important features, which were combined into one set; RFE was then applied to this combined set to find the most important features overall.
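The two-stage procedure can be sketched with scikit-learn's `RFE` (note that `RFE` eliminates down to a fixed number of features rather than stopping at "no improvement"; the subset sizes below are scaled-down assumptions for illustration).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import RFE

# Scaled-down stand-in for the featureset.
X, y = make_classification(n_samples=200, n_features=40, n_informative=8,
                           random_state=0)

# Stage 1: split the columns into four blocks and keep each block's best.
survivors = []
for block in np.array_split(np.arange(X.shape[1]), 4):
    rfe = RFE(ExtraTreesClassifier(n_estimators=50, random_state=0),
              n_features_to_select=5)
    rfe.fit(X[:, block], y)
    survivors.extend(block[rfe.support_])

# Stage 2: re-run RFE on the combined survivors.
final = RFE(ExtraTreesClassifier(n_estimators=50, random_state=0),
            n_features_to_select=8)
final.fit(X[:, survivors], y)
selected = np.array(survivors)[final.support_]  # final feature indices
```

Splitting first keeps each RFE run cheap; the price is that interactions between features in different blocks are only considered in the second stage.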
After feature selection, all the algorithms were tested again with the selected features. This first comparison shows how effective feature selection is for each individual algorithm. Feature selection could be skipped, since its processing time is large and the gains can be small; it was performed in this study in order to show the maximum potential of the algorithms. Research has shown that Decision Trees and SVMs often benefit from feature selection (Navot, Gilad-Bachrach, Navot, & Tishby, 2005; Tirelli & Pessani, 2011). Furthermore, feature selection can prevent overfitting (Reunanen, 2003).
Hyperparameter Tuning
Some algorithms do not perform well until their hyperparameters are tuned; using default hyperparameters does not show an algorithm's full potential. Tuning hyperparameters can also prevent overfitting. To determine which hyperparameters work best, different values were tested for each algorithm. For algorithms with a long fitting time, only a few hyperparameters could be tested, as processing time would otherwise become too long. A list of the hyperparameters tested in this study can be found in Appendix E.
Voting
To further improve the accuracy, a majority voting classifier will be tested. This type of classifier combines multiple algorithms into one ‘super’ algorithm for the prediction of classes (Kuncheva & Rodríguez, 2014). A benefit of this is that it can balance out individual algorithms’ weaknesses. Two methods of voting are available. Firstly, hard voting is based on the majority vote: if ten algorithms are combined in a voting classifier and six of them predict that a gesture is a crocodile, then the voting classifier predicts crocodile. Secondly, soft voting uses probabilities, choosing the class with the highest average probability (see Figure 4).
Hard Voting
Classifier:   KNN | Decision Tree | Random Forest | Gradient Boost | Logistic Regression | SVC
Prediction:   Bus | Motorcycle | Motorcycle | Bus | Motorcycle | Car
Final prediction: Motorcycle

Soft Voting
Classifier:   KNN | Decision Tree | Random Forest | Gradient Boost | Logistic Regression | SVC
Prediction:   Bus | Motorcycle | Motorcycle | Bus | Motorcycle | Car
Probability:  67% | 60% | 17% | 32% | 20% | 51%
Average probability per class: Bus: (67 + 32) / 2 = 49.5%; Motorcycle: (60 + 17 + 20) / 3 = 32.33%; Car: 51%
Final prediction: Car

Figure 4. Hard and soft voting algorithm.
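Both voting schemes are available in scikit-learn's `VotingClassifier`; the toy data and the three member classifiers below are illustrative, not the study's actual ensemble.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

estimators = [("rf", RandomForestClassifier(random_state=0)),
              ("lr", LogisticRegression(max_iter=1000)),
              ("dt", DecisionTreeClassifier(random_state=0))]

# Hard voting: majority vote over the predicted classes.
hard = VotingClassifier(estimators, voting="hard").fit(X, y)
# Soft voting: argmax over the averaged class probabilities.
soft = VotingClassifier(estimators, voting="soft").fit(X, y)

hard_pred = hard.predict(X)
soft_pred = soft.predict(X)
```

Soft voting requires every member to implement `predict_proba`, which is why some classifiers cannot take part in it.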
Results
Algorithm Comparison
Two feature sets were tested to establish a baseline: one with outliers and one without. In total, 23 algorithms were run on the feature sets, each using 10-fold cross-validation to obtain an unbiased test accuracy and each with its default hyperparameters. This resulted in the baseline shown in Table 2 and Figure 5.
Looking at the results, it is noticeable that many algorithms overfit on the training data even though 10-fold cross-validation was used. This can be an indication that too many features are in use. The algorithm with the highest accuracy is the RandomForestClassifier. Another goal of this comparison was to find an algorithm suitable for feature selection: a fast model is needed, because many models are fitted during Recursive Feature Elimination.
Surprisingly, the Support Vector Machine algorithms did not perform very well. This can be explained by the fact that SVMs are often very sensitive to hyperparameter tuning, which was not done at this step (Duan, Keerthi, & Poo, 2003). Moreover, the top five methods are all ensemble algorithms, which also explains their overfitting, as ensemble algorithms have a tendency to overfit (Dietterich, 2000).
The average difference between the accuracy of the featureset with outliers and without the outliers was 0.13%, which is negligible. As 2.21% of the data was removed from the no outliers featureset during pre-processing, it seems no important information is lost. In the next analyses only the featureset without outliers was used.
Two algorithms did not successfully run during the first baseline test. Firstly, GaussianProcessClassifier resulted in a ‘Memory Error’. An explanation is that the algorithm tries to create a matrix of shape (n, n), in this case (3677, 3677). This results in a matrix with 13,520,329 elements, which is too large for the memory (Metzen, 2018). Secondly, RadiusNeighborsClassifier resulted in the error “No neighbours found” using
Table 2
A baseline of the algorithms’ performances
Classifier | Train Acc. (Mean) | Test Acc. (Mean) | Test Acc. (3×STD) | Execution Time (s) | Diff. Outliers (Test Acc.)
RandomForestClassifier 1.0000 0.3617 0.0440 10.63 -0.0013
ExtraTreesClassifier 1.0000 0.3506 0.0583 4.71 +0.0026
GradientBoostingClassifier 1.0000 0.2677 0.0419 1327.45 -0.0153
RidgeClassifier 1.0000 0.2617 0.0330 1.51 +0.0024
NuSVC 0.9686 0.2591 0.0229 189.82 +0.0071
BaggingClassifier 0.9965 0.2526 0.0273 65.18 -0.0075
LogisticRegression 1.0000 0.2418 0.0298 9.37 +0.0118
BernoulliNB 0.5535 0.2364 0.0450 0.49 -0.0022
PassiveAggressiveClassifier 0.9975 0.2363 0.0425 23.38 +0.0087
LinearSVC 1.0000 0.2302 0.0309 183.65 +0.0072
SGDClassifier 0.9003 0.2095 0.0863 8.76 +0.0140
NearestCentroid 0.4037 0.1918 0.0198 0.32 +0.0022
Perceptron 0.7930 0.1908 0.0574 7.83 +0.0104
SVC 0.5111 0.1898 0.0326 197.00 +0.0050
DecisionTreeClassifier 1.0000 0.1737 0.0275 9.78 +0.0007
MLPClassifier 0.2797 0.1284 0.1509 160.11 -0.0445
KneighborsClassifier 0.3708 0.1145 0.0188 1.14 +0.0039
GaussianNB 0.4597 0.0918 0.0133 0.52 -0.0012
AdaBoostClassifier 0.1078 0.0884 0.0657 32.43 -0.0088
LinearDiscriminantAnalysis 1.0000 0.0637 0.0190 19.56 -0.0259
QuadraticDiscriminantAnalysis 1.0000 0.0341 0.0181 0.98 +0.0039
Average difference -0.0013
Note. Execution time is the average time for fitting the estimator on the train set. Since the accuracies are normally distributed, a range of ±3 STD statistically captures 99.73% of the scores, so accuracy − 3×STD gives an indication of the worst-case scenario.
default hyperparameters. This can be explained by the way the classifier works: it classifies a sample based on the training neighbours that fall within a fixed radius (Scikit-learn, 2019). The default radius of 1.0 is too small to find any neighbours in some cases, which results in this error. Both algorithms were excluded from further analysis.
Figure 5. Comparing baseline accuracy of the algorithms.
Feature Selection
Recursive Feature Elimination was used to extract the best subset of features. This method recursively fits models until it finds the subset of features that results in the highest accuracy. Since many models are fitted, it is advisable to use a very fast algorithm in order to minimise processing time. The ExtraTreesClassifier was selected for RFE, as it is one of the fastest algorithms: it performed only slightly worse than the most accurate algorithm, but its execution time is much shorter.
Before Recursive Feature Elimination was used, the featureset was split into four subsets to speed up processing. RFE was run on each individual subset to find its best features, after which the best features of all subsets were combined back into one featureset. This resulted in a featureset with 216 features that were considered most important. Finally, RFE was run again on this featureset, yielding 146 features that were considered the most important (see Table 3).
Table 3
The results of Recursive Feature Elimination
# Features Featureset 1 Featureset 2 Featureset 3 Featureset 4
Before RFE 1398 1832 1397 777
After RFE 80 50 59 27
# Features Final featureset
Before RFE 216
After RFE 146
Note. Feature sets were split up into four sets to speed up processing. RFE was run on the
final featureset to find the most important features.
This study started with 32 different time-series features calculated on the columns. After RFE, only 18 of those time-series features remained in use (see Figure 6). The most important ones, with the number of columns on which each statistic is used, are: mean absolute change (27), standard deviation (23), mean (21), maximum (16) and sum of values (8). It is noteworthy that the time-series features that survived are all simple statistics; more complicated features such as ‘Mean second derivative central’ or ‘First location of maximum’ were not selected.
Figure 6. Number of time-series features per type selected by RFE. E.g. the final featureset contained the statistic ‘Mean absolute change’ for 26 different columns.
Looking at the columns of the data, some clearly matter more than others. The most important columns, with the number of time-series features used on each, are: lelbowori (15), rhandpos (10), lhandpos (10) and relbowori (9) (see Figure 7). Notably, the most important columns are the orientations of the elbows and the positions of the hands, whereas columns that change little during a gesture, such as the positions of the shoulders and feet, are rarely selected. According to RFE, orientation-based features were selected more often than position-based ones: 68 versus 39, respectively (see Appendix F for a full list). Furthermore, other columns of the data are used as well, such as rhandstate (7) and rhandconfidence (6). These columns may often be overlooked by researchers, but they seem to provide important information.
Figure 7. The occurrence of columns in the final featureset. E.g. the final featureset contained 15 different time-series statistics about ‘lelbowori’.
After running an ensemble algorithm, the importance of the features can be retrieved. According to the ExtraTreesClassifier, the features in Table 4 are the most important. As expected, the top eleven features all concern the hands or elbows (see Appendix G for a full list of features). Surprisingly, the standard deviation of the state of the right hand (open or closed) is the most important feature. Overall, the mean and standard deviation, which are among the most basic time-series features, seem to give the most information for classifying gestures.
Table 4
The 8 most important features of the featureset
Feature | Importance | 3×STD
rhandState_standard_deviation 0.011884 0.003778
rhandconfidence_mean 0.011073 0.003478
lhandState_standard_deviation 0.010692 0.003753
rhandposY_mean_abs_change 0.010060 0.003034
relboworiW_maximum 0.009948 0.003927
relboworiX_mean 0.009646 0.003253
lelboworiX_mean 0.009517 0.003887
lelboworiW_mean 0.009510 0.003606
Note. According to the feature importance method of
the Extra Tree classifier. The importance is the
average of a 10-fold cross validation.
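Retrieving and ranking the importances is a one-liner on a fitted scikit-learn ensemble; the synthetic data below stands in for the real featureset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic stand-in for the featureset.
X, y = make_classification(n_samples=300, n_features=10, n_informative=3,
                           random_state=0)

clf = ExtraTreesClassifier(random_state=0).fit(X, y)
importances = clf.feature_importances_      # impurity-based, sums to 1
# Feature indices ranked from most to least important.
ranked = np.argsort(importances)[::-1]
```

In the thesis the reported importances were additionally averaged over the 10 cross-validation folds, which the note under Table 4 describes.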
First Comparison
After feature selection, the first comparison was made by running each algorithm again (see Table 5 and Figure 8). This resulted in an average accuracy increase of 8.32%. Surprisingly, LinearDiscriminantAnalysis showed a remarkable accuracy increase of 33.73% and moved from being one of the worst classifiers to one of the best. The two best algorithms at this stage are still the RandomForestClassifier and the ExtraTreesClassifier, which both gained about 5% accuracy. The three SVC algorithms also increased considerably, so they should not be counted out. Notably, the top five algorithms span four different algorithm categories, so at this stage there does not appear to be a single best category for this type of data. Overfitting also seems reduced after feature selection, although the ensemble methods still overfit as much as before.
Table 5
First comparison of the performances of the algorithms after feature selection
Classifier | Train Acc. (Mean) | Test Acc. (Mean) | Test Acc. (3×STD) | Execution Time (s) | Diff. Baseline (Test Acc.)
ExtraTreesClassifier 1.0000 0.4082 0.0485 1.11 +0.0576
RandomForestClassifier 1.0000 0.4046 0.0352 3.46 +0.0429
LinearDiscriminantAnalysis 0.6241 0.4010 0.0331 0.15 +0.3373
RidgeClassifier 0.5819 0.3738 0.0365 0.05 +0.1121
LinearSVC 0.7235 0.3628 0.0378 19.35 +0.1327
LogisticRegression 0.6270 0.3590 0.0392 1.34 +0.1172
NuSVC 0.6961 0.3466 0.0431 10.33 +0.0874
MLPClassifier 1.0000 0.3352 0.0514 49.49 +0.2068
GradientBoostingClassifier 1.0000 0.3233 0.0513 136.36 +0.0556
SVC 0.4852 0.3019 0.0539 9.33 +0.1121
SGDClassifier 0.4979 0.2864 0.0697 1.24 +0.0769
PassiveAggressiveClassifier 0.4466 0.2730 0.0779 0.69 +0.0367
BaggingClassifier 0.9964 0.2721 0.0317 4.45 +0.0195
NearestCentroid 0.3252 0.2423 0.0359 0.02 +0.0506
GaussianNB 0.3264 0.2380 0.0463 0.04 +0.1462
Perceptron 0.3552 0.2294 0.1218 0.45 +0.0387
BernoulliNB 0.3408 0.2278 0.0207 0.04 -0.0086
KneighborsClassifier 0.4682 0.2092 0.0188 0.09 +0.0947
DecisionTreeClassifier 1.0000 0.1776 0.0486 0.68 +0.0039
AdaBoostClassifier 0.1337 0.1121 0.0622 3.54 +0.0237
QuadraticDiscriminantAnalysis 1.0000 0.0367 0.0142 0.11 +0.0026
Average difference +0.0832
Note. The accuracies are compared to the baseline accuracies before feature selection.
Figure 8. Comparing algorithms after feature selection.
Hyperparameter Tuning
After feature selection, many hyperparameters for each algorithm were tested in order to find each algorithm’s maximum potential. For each algorithm, different hyperparameters were tested (see Appendix E for all tested hyperparameters). The final hyperparameters that led to the highest accuracy can be found in Appendix H. After hyperparameter tuning, every algorithm was run again using 10-fold cross validation with the best hyperparameters in order to see how much improvement was gained (see Table 6 and Figure 9).
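The thesis does not state which search strategy was used; a grid search over a small candidate grid, as sketched below with scikit-learn's `GridSearchCV` and 10-fold cross-validation, is one common way to implement this step. The grid values are illustrative (the real grids are in Appendix E).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, n_features=15, random_state=0)

# A deliberately small grid; the real search covered more hyperparameters.
grid = {"n_estimators": [50, 200], "max_depth": [5, 20]}
search = GridSearchCV(RandomForestClassifier(random_state=0), grid, cv=10)
search.fit(X, y)

best = search.best_params_  # hyperparameters of the best CV score
```

For slow algorithms the grid must stay small, which matches the remark above about long fitting times.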
Interestingly, the top three algorithms are all ensemble algorithms. The SVC algorithms still lag behind, as their accuracy is beaten by Logistic Regression and the Ridge classifier. Furthermore, the linear algorithms perform moderately on average, while the neighbour-based and Naïve Bayes algorithms perform poorly. Surprisingly, among the discriminant algorithms, LinearDiscriminantAnalysis works very well but QuadraticDiscriminantAnalysis does not.
Table 6
Second comparison, comparing accuracies after hyperparameter tuning
Classifier | Train Acc. (Mean) | Test Acc. (Mean) | Test Acc. (3×STD) | Execution Time (s) | Diff. 1st Comp. (Test Acc.)
RandomForestClassifier 1.0000 0.4715 0.0583 42.01 +0.0669
BaggingClassifier 1.0000 0.4497 0.0414 542.08 +0.1776
ExtraTreesClassifier 1.0000 0.4347 0.0471 1.16 +0.0265
LinearDiscriminantAnalysis 0.6244 0.4280 0.0392 0.189 +0.0270
RidgeClassifier 0.5970 0.4044 0.0460 7.44 +0.0306
LogisticRegression 0.6888 0.3967 0.0254 17.60 +0.0377
MLPClassifier 0.8381 0.3895 0.0384 30.39 +0.0543
SVC 0.8387 0.3895 0.0455 15.11 +0.0662
NuSVC 0.9198 0.3887 0.0473 16.29 +0.0421
LinearSVC 0.7338 0.3872 0.0269 25.65 +0.0244
GradientBoostingClassifier 0.9936 0.3511 0.0409 155.29 +0.0278
SGDClassifier 0.5511 0.3242 0.0496 3.05 +0.0378
NearestCentroid 0.3593 0.2907 0.0429 0.03 +0.0483
AdaBoostClassifier 0.3736 0.2882 0.0301 76.51 +0.1761
PassiveAggressiveClassifier 0.4186 0.2700 0.0501 0.68 -0.0030
GaussianNB 0.3461 0.2678 0.0325 0.05 +0.0299
BernoulliNB 0.3625 0.2629 0.0467 0.04 +0.0537
KneighborsClassifier 0.3663 0.2607 0.0445 0.13 +0.0831
Perceptron 0.3822 0.2528 0.1079 0.44 +0.0234
DecisionTreeClassifier 0.4933 0.2210 0.0424 0.56 +0.0434
QuadraticDiscriminantAnalysis 0.1293 0.2149 0.0424 0.12 +0.1781
Average difference +0.0596
Note. Accuracies are compared to the first comparison.
Figure 9. Comparing the performance of the algorithms after hyperparameter tuning.
Voting
A voting classifier was used to classify the data. Voting can produce a ‘super’ algorithm, although it does not guarantee a higher accuracy than the current best algorithm. Not all classifiers can take part in a voting algorithm: for soft voting, an algorithm needs a function that computes class probabilities for the samples. The following classifiers used in this study could therefore not be used for voting: PassiveAggressiveClassifier, RidgeClassifier, Perceptron, NearestCentroid and LinearSVC. For the remaining algorithms, both soft and hard voting were tried, in two ways. Firstly, algorithms were combined starting with the top two and then recursively adding the next best one. Secondly, only the best algorithm of each category was used, such as Random Forest from the ensemble category and Logistic Regression from the linear category; again, the top two were combined first and further algorithms were added one at a time. Voting with the top five algorithms led to an accuracy of 47.61% using hard voting. Using only different algorithm categories, an accuracy of 45.02% was reached with the best two algorithms and soft voting. Full results of both methods can be found in Appendix I.
Extra Algorithms
Relevance Vector Machines (RVC) were not used until now because their processing time would be too long. They have seldom been used in previous research, but they have shown promising results, performing as well as SVMs in some studies (Nguyen & Hai-Son, 2015). One reason they may not be used very often is that there is only one implementation of RVC in Python (Ritchie & Jonathan, 2019), and this package has no documentation. Another reason is that the fitting time of RVCs is usually longer than that of SVMs. Nevertheless, an RVC with default hyperparameters was tried. Even though it was not tuned, this algorithm achieved an accuracy of 31.72%, which was lower than all SVC algorithms.
Since ensemble algorithms performed very well on this featureset, some state-of-the-art ensemble algorithms were also tried. Firstly, the XGBoost algorithm was used (Chen & Guestrin, 2016), which performs parallel tree boosting (XGBoost, 2016). Secondly, the LightGBM algorithm was used (Ke et al., 2017), a gradient boosting decision tree algorithm created by Microsoft. Lastly, the recent CatBoost algorithm was used (Dorogush, Ershov, & Gulin, 2018). A characteristic of these algorithms is that they can run on a GPU, which makes them very fast. Only one custom hyperparameter (n_estimators: 1000) was set, to make them more comparable to the tuned algorithms; this hyperparameter had a large effect on the Random Forest algorithm during hyperparameter tuning, so a similar effect was expected here. The results show that all these algorithms perform very well, with CatBoost achieving a classification accuracy of 51.16%.
Finally, a simple model was created for practical use, which can serve as a baseline for future research. The most important features were selected to build a model with acceptable accuracy. Random Forest was chosen as the classifier because it proved to be the best simple algorithm in this research. The top 15 features were selected, resulting in a simple and fast model. Two hyperparameters were found to improve the accuracy: max_depth=20 and n_estimators=1000. The resulting model achieved a gesture classification accuracy of 35%. The final performance of all algorithms used in the study can be seen in Table 7 and Figure 10.
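The simple model described above can be sketched as follows; the synthetic data and the placeholder `top_15` index list are assumptions standing in for the real featureset and the 15 selected features, while the two hyperparameters are the ones reported in the thesis.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the featureset.
X, y = make_classification(n_samples=300, n_features=30, n_informative=10,
                           random_state=0)

# Placeholder for the indices of the 15 most important features.
top_15 = list(range(15))

# Random Forest with the two tuned hyperparameters from the thesis.
simple = RandomForestClassifier(max_depth=20, n_estimators=1000,
                                random_state=0)
simple.fit(X[:, top_15], y)
predictions = simple.predict(X[:, top_15])
```

Restricting the input to 15 features keeps both fitting and prediction fast, which is the point of this baseline model.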
Table 7
Final result of all algorithms
Classifier | Train Acc. (Mean) | Test Acc. (Mean) | Test Acc. (3×STD) | Diff. Previous Research (Test Acc.)
CatBoost 0.9861 0.5116 0.0711 +0.2780
Voting A – Hard 1.0000 0.4761 0.0491 +0.2425
RandomForestClassifier 1.0000 0.4715 0.0583 +0.2379
LightGBM 1.0000 0.4617 0.0384 +0.2281
XGBoost 1.0000 0.4543 0.0463 +0.2207
VOTING B – Soft 0.9235 0.4502 0.0452 +0.2166
BaggingClassifier 1.0000 0.4497 0.0414 +0.2161
ExtraTreesClassifier 1.0000 0.4347 0.0471 +0.2011
LinearDiscriminantAnalysis 0.6244 0.4280 0.0392 +0.1944
RidgeClassifier 0.5970 0.4044 0.0460 +0.1708
LogisticRegression 0.6888 0.3967 0.0254 +0.1631
MLPClassifier 0.8381 0.3895 0.0384 +0.1559
SVC 0.8387 0.3895 0.0455 +0.1559
NuSVC 0.9198 0.3887 0.0473 +0.1551
LinearSVC 0.7338 0.3872 0.0269 +0.1536
GradientBoostingClassifier 0.9936 0.3511 0.0409 +0.1175
Simple Model 1.0000 0.3498 0.0373 +0.1162
SGDClassifier 0.5511 0.3242 0.0496 +0.0906
Relevance Vector Machine 0.6506 0.3172 0.0315 +0.0836
NearestCentroid 0.3593 0.2907 0.0429 +0.0571
AdaBoostClassifier 0.3736 0.2882 0.0301 +0.0546
PassiveAggressiveClassifier 0.4186 0.2700 0.0501 +0.0364
GaussianNB 0.3461 0.2678 0.0325 +0.0342
BernoulliNB 0.3625 0.2629 0.0467 +0.0293
KneighborsClassifier 0.3663 0.2607 0.0445 +0.0271
Perceptron 0.3822 0.2528 0.1079 +0.0192
BASELINE STUDY 0.2336 -
DecisionTreeClassifier 0.4933 0.2210 0.0424 -0.0126
QuadraticDiscriminantAnalysis 0.1293 0.2149 0.0424 -0.0187
Note. Accuracies are compared to previous research (de Wit et al., 2019a).
Voting A: RandomForestClassifier, BaggingClassifier, ExtraTreeClassifier,
LinearDiscriminantAnalysis and LogisticRegression.
Voting B: RandomForestClassifier and LinearDiscriminantAnalysis
Figure 10. Final accuracy result of all algorithms.
Gestures
Some gestures are easier to classify than others. To see how well each gesture was classified, the tuned Random Forest algorithm was run and a classification report was retrieved (see Table 8); the classification accuracy on this run was 47.50%. The gestures were ordered by their F1-score. The top five gestures all had good precision, recall and F1-scores. However, the algorithm had problems with certain gestures: ‘pig’ and ‘boat’ had F1-scores of only 0.06 and 0.15, respectively. Although the precision scores of the bottom five gestures (except pig) were reasonable, their recall scores were low. This means that when these gestures were classified as such, the classification was often correct, but the algorithm had problems finding all instances of these gestures in the dataset.
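The per-gesture evaluation follows the recipe in the note under Table 8: a 0.75/0.25 train/test split without cross-validation, then a per-class report. The sketch below uses scikit-learn's `classification_report` on synthetic data standing in for the 35 gesture classes.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the gesture featureset.
X, y = make_classification(n_samples=400, n_features=10, n_classes=4,
                           n_informative=6, random_state=0)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                          random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

# Per-class precision, recall, F1-score and support, as in Table 8.
report = classification_report(y_te, clf.predict(X_te), output_dict=True)
```

Sorting the per-class entries by their F1-score reproduces the "easiest/hardest gestures" ranking used here.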
To see how much influence the two hardest-to-classify gestures had on the accuracy, the same algorithm was run again without these gestures in the featureset. This resulted in an accuracy of 53.40%, an improvement of 5.90 percentage points. There are several possible reasons why the algorithm had problems with these gestures. Firstly, participants may have had trouble expressing them. Secondly, the poorly classified gestures may resemble other gestures too closely. Furthermore, participants may have taken different approaches to expressing a gesture; when expressing a bridge, for example, there are many types of bridges a participant can choose. Moreover, when expressing a lamp, participants might have moved their hands only slightly, as a lamp is small, and such small movements may be hard for the sensor to pick up.
Table 8
Easiest 5 and hardest 5 gestures for the RF algorithm to classify
Top 5 Precision Recall F1-score Support
Airplane 0.81 0.76 0.78 33
Crocodile 0.75 0.75 0.75 28
Violin 0.75 0.67 0.71 27
Train 0.83 0.59 0.69 32
Bird 0.59 0.81 0.68 16
Bottom 5 Precision Recall F1-score Support
Pig 0.11 0.04 0.06 26
Boat 0.38 0.10 0.15 31
Bridge 0.38 0.15 0.22 33
Stairs 0.50 0.16 0.24 32
Lamp 0.36 0.20 0.26 25
Note. A full classification report can be found in Appendix J. For this run, the data was
split into a train (0.75) and test (0.25) set and no cross validation was done. Accuracy was
0.4750 on this run. Accuracy without ‘Boat’ and ‘Pig’ was 0.5340.
Discussion
The initial goal of this study was to see whether time-series analysis could be used for supervised gesture recognition. This was done by focusing on time-series features and a range of machine learning algorithms. Many steps were taken to use Kinect data for gesture classification. By using a unique approach to feature extraction, this research hopes to contribute to the ongoing Human-Robot Interaction research. To answer the research question, the two sub-questions will be answered first.
Features
The first sub-question of this research was: “How well do time-series features work for gesture classification?”. This study extracted time-series features from the data to classify gestures, which is unique to this field. The reason for this choice was that time-series analysis showed promising results in other research fields (Wang, Smith, & Hyndman, 2006). Furthermore, previous research suggested that researchers should focus on using time-series features in combination with conventional supervised machine learning algorithms (Bagnall, Davis, Hills, & Lines, 2012; Fulcher & Jones, 2014). Looking at the results, time-series features have been shown to work well for supervised gesture classification.
There are several benefits to using time-series features. Firstly, they are an easy way to reduce the size of the dataset, which also reduces processing time for the algorithms. Secondly, time-series features are simple to calculate, reducing pre-processing time. Moreover, time-series features are easy to comprehend, as the mean and standard deviation are basic statistics. Furthermore, they also reveal which columns of the data provide the most information.
The simplest time-series features, such as ‘mean’, ‘standard deviation’, ‘maximum’, ‘minimum’ and ‘mean absolute change’, work very well for this kind of data. Additionally, these kinds of statistics are among the easiest to comprehend; the more advanced time-series features did not perform as well as the simpler ones.
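As an illustration, these basic statistics could be computed per joint coordinate with only NumPy and pandas. The sketch below uses a hypothetical hand-coordinate column name and mimics TSFRESH’s double-underscore naming convention; it is a minimal stand-in, not the TSFRESH pipeline used in this study:

```python
import numpy as np
import pandas as pd

def extract_simple_features(df):
    """Compute basic time-series features for each numeric column of one
    gesture recording (rows = frames, columns = joint coordinates)."""
    feats = {}
    for col in df.columns:
        x = df[col].to_numpy(dtype=float)
        feats[f"{col}__mean"] = x.mean()
        feats[f"{col}__std"] = x.std()
        feats[f"{col}__max"] = x.max()
        feats[f"{col}__min"] = x.min()
        # mean absolute change: average |x[t+1] - x[t]| over the recording
        feats[f"{col}__mean_abs_change"] = np.abs(np.diff(x)).mean()
    return pd.Series(feats)

# toy recording: 5 frames of a single (hypothetical) right-hand x coordinate
recording = pd.DataFrame({"hand_right_x": [0.0, 0.1, 0.3, 0.2, 0.4]})
features = extract_simple_features(recording)
```

Applied per gesture recording, this turns a variable-length sequence of frames into one fixed-length feature row, which is the form the supervised algorithms expect.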
In this study, Recursive Feature Elimination was used as a feature selection method.
Although processing time was long, it successfully selected the most important features, reducing a featureset of 4,960 features to the 146 most important ones. By ranking the features, the most important ones have been identified. What has been shown is that some limbs provide more information than others. The positions of the hands seem to be among the most informative signals that the Kinect provides, which is in agreement with previous research (Escalera et al., 2013). Similarly, the orientations of the shoulders give a lot of information as well. In fact, the top 25 most important features concern only the hand positions or shoulder orientations.
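A minimal sketch of this selection step uses scikit-learn’s RFE with a Random Forest; the synthetic data and the sizes below (50 features reduced to 10) are illustrative, not those of the actual featureset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# synthetic stand-in for the gesture featureset
X, y = make_classification(n_samples=200, n_features=50, n_informative=8,
                           random_state=0)

# recursively drop the least important features (5 per round) until 10 remain
selector = RFE(estimator=RandomForestClassifier(n_estimators=50, random_state=0),
               n_features_to_select=10, step=5)
selector.fit(X, y)
X_selected = selector.transform(X)
```

The `ranking_` attribute of the fitted selector gives the elimination order, which is how a feature ranking like the one discussed above can be obtained.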
One question arising from this research is how to detect outliers. In this study, outliers were detected based on the time it took participants to perform the gesture: recordings with a duration z-score beyond +3 or -3 were removed. Only a small part of the data was removed (2.21%), and this had a negligible effect on the algorithms’ performance. A better outlier detection method should lead to an increase in accuracy, as it would remove cases that deviate too much from the data (Osborne & Overbay, 2014). However, it is hard to define what an outlier actually is in a gesture dataset, since the people who performed the gestures in the previous study were free to choose how to produce a gesture (de Wit et al., 2019a). Also, no method of outlier detection was used in similar research.
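The duration-based criterion described above can be sketched as follows; the durations are toy values, not the recorded data:

```python
import numpy as np

def remove_duration_outliers(durations, threshold=3.0):
    """Return a boolean mask keeping gestures whose duration z-score
    lies within +/- threshold."""
    durations = np.asarray(durations, dtype=float)
    z = (durations - durations.mean()) / durations.std()
    return np.abs(z) <= threshold

# toy durations in seconds; the 60 s recording is an obvious outlier
durations = [3.0] * 20 + [60.0]
mask = remove_duration_outliers(durations)
```

Note that with very few recordings per gesture a single outlier can never reach |z| = 3, so a criterion like this only behaves sensibly on a reasonably sized sample.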
Gestures are always classified, even if the participant is performing a bogus gesture. One method that could potentially solve this applies during the recording of new gestures: if the probability of the predicted gesture is too low, the classifier could discard the recording. This should occur when the gesture performed by a participant deviates too much from the similar gestures already recorded.
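One way such a rejection step could be sketched, using scikit-learn’s `predict_proba` on synthetic data (the 0.5 threshold is an arbitrary illustration, not a value from this study):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# synthetic 3-class stand-in for the gesture featureset
X, y = make_classification(n_samples=300, n_features=20, n_informative=10,
                           n_classes=3, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

def classify_or_reject(clf, sample, threshold=0.5):
    """Return the predicted label, or None when the classifier's highest
    class probability falls below the threshold (likely a bogus gesture)."""
    proba = clf.predict_proba(sample.reshape(1, -1))[0]
    if proba.max() < threshold:
        return None  # discard the recording
    return int(clf.classes_[proba.argmax()])

label = classify_or_reject(clf, X[0])
```

The threshold trades off false rejections of genuine gestures against accepting bogus ones, and would have to be calibrated on held-out data.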
One problem in the current research is the detection of movement. Participants started moving some time after the Kinect started recording. This resulted in idleness in the data, which was removed with a basic method. Although this method seemed to work fairly well, better methods of movement detection could be developed. To prevent idleness in the data, some measures could be taken when recording gestures, for example, only beginning the recording once the hands have moved at least a certain distance. This should lead to less idleness in the data.
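A possible sketch of such a trimming step, assuming a single hand joint recorded as (x, y, z) positions and an illustrative displacement threshold:

```python
import numpy as np

def trim_leading_idleness(frames, min_displacement=0.05):
    """Drop leading frames recorded before the hand has moved at least
    `min_displacement` (in the sensor's coordinate units) from its starting
    position. `frames` is an (n_frames, 3) array of one joint over time."""
    frames = np.asarray(frames, dtype=float)
    displacement = np.linalg.norm(frames - frames[0], axis=1)
    moving = np.nonzero(displacement >= min_displacement)[0]
    if moving.size == 0:
        return frames  # no movement detected: keep the recording as-is
    return frames[moving[0]:]

# toy hand track: three near-idle frames, then movement
track = np.array([[0.000, 0.0, 0.0],
                  [0.001, 0.0, 0.0],
                  [0.002, 0.0, 0.0],
                  [0.100, 0.0, 0.0],
                  [0.200, 0.0, 0.0]])
trimmed = trim_leading_idleness(track)
```

The same idea could run online during recording: the Kinect stream is buffered and frames are only committed once the displacement threshold is crossed.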
Algorithms
The second sub-question of this research was “What machine learning algorithms work best on gesture time-series features?”. To answer this question, many supervised machine learning algorithms were tested in this study. Running each algorithm again after each processing step showed the effect of every step on every algorithm. The top five algorithms (voting excluded) in this study were all ensemble algorithms. These algorithms seem to work best on this kind of data. A simple Random Forest seems to work exceptionally well. The benefits of using Random Forests are that they are fast and simple (Dietterich, 2000).
SVM algorithms did not perform as well as expected. All SVMs were outperformed by a simple Logistic Regression. This is in contrast to previous research, which found that SVMs perform better than simpler algorithms (Hsu & Lin, 2002). Moreover, in previous gesture classification research using a Kinect, it was found that SVMs worked well (Orasa, Nukoolkit,
& Watanapa, 2012). A disadvantage of SVMs is that they require more processing time. Since SVMs were outperformed by faster algorithms, it would not be advisable to use them on these kinds of features.
Other types of algorithms lagged behind even further. This research shows that the best algorithm depends on the dataset, which is in agreement with previous research (Wolpert, 1996). The best practice of trying many algorithms to see what works has led to surprising results. It was not expected that simpler algorithms would outperform more advanced ones.
Combining algorithms into a super algorithm was not very successful. Combining the top five algorithms into one voting algorithm only led to a negligible accuracy increase of 0.46% over the Random Forest algorithm when hard voting was used. Combining only different types of algorithms with each other did not lead to a higher accuracy than the Random Forest algorithm. It seems that voting is not worth it, as it costs more processing time than using a single algorithm.
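For reference, hard voting can be set up in scikit-learn along these lines; the base estimators and data below are illustrative, not the tuned top-five ensemble from this study:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (RandomForestClassifier, ExtraTreesClassifier,
                              VotingClassifier)
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# hard voting: each classifier casts one vote, the majority label wins
voter = VotingClassifier(
    estimators=[("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
                ("et", ExtraTreesClassifier(n_estimators=50, random_state=0)),
                ("lr", LogisticRegression(max_iter=1000))],
    voting="hard")
voter.fit(X, y)
preds = voter.predict(X)
```

The extra processing cost noted above follows directly from this construction: every base estimator must be fitted and queried, so the ensemble is at least as expensive as its slowest member.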
Testing the most modern ensemble algorithms has led to some promising results. Even though XGBoost and LightGBM were not tuned, they performed similarly to a fully tuned Random Forest. As ensemble algorithms gained about 2% to 6% in accuracy after hyperparameter tuning, it is expected that these two algorithms would outperform a Random Forest after tuning. CatBoost has shown excellent results: this latest ensemble algorithm outperformed the Random Forest algorithm by 4% even without tuning.
This study has shown that feature selection is a crucial step when using time-series features. It resulted in an average accuracy increase of 8.32% across the algorithms. Some algorithms were more sensitive to feature selection than others: while ensemble algorithms only saw a minor increase, SVMs saw a major increase in accuracy. More surprisingly, Linear Discriminant Analysis went from being one of the worst-performing algorithms to being one of the best. But the greatest benefit of this step is the decrease in fitting time.
Hyperparameter tuning was also key to achieving better accuracy. This research saw an average accuracy increase of 5.96% after this step. Most algorithms saw only a few percent gain, but ‘Bagging’, ‘AdaBoost’ and ‘Quadratic Discriminant Analysis’ all saw an increase of approximately 17%.
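A tuning step of this kind could look as follows with scikit-learn’s GridSearchCV; the parameter grid below is a small illustration, not the search space used in this study:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# exhaustively evaluate each parameter combination with 3-fold CV;
# a real search would cover more values and more parameters
grid = GridSearchCV(RandomForestClassifier(random_state=0),
                    param_grid={"n_estimators": [50, 100],
                                "max_depth": [None, 10]},
                    cv=3)
grid.fit(X, y)
best = grid.best_params_
```

For large grids, a randomized search over the same parameter space is usually cheaper while finding comparably good settings.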
Limitations and Future Research
More time-series features should be tested. Currently, 32 of the 65 features that TSFRESH can calculate were used. It is unknown how well the more comprehensive time-series features will perform. What this research has shown is that the simplest time-series features were the most important; however, no comprehensive time-series features were used. The other 33 time-series features that were not calculated are more comprehensive, and computing them will consequently increase processing time.
The time-series features used in this study might be considered simple. However, this also has benefits: they are very fast to calculate and easy to comprehend. Smarter features might lead to a higher accuracy. Features such as the joint angles with respect to the person’s torso, as used in previous research, might be more successful (Gu, Do, Ou, & Weihua, 2012). Similarly, calculating the velocity, acceleration and angles between different joints, as done in previous research, might improve classification accuracy (Saha, Shreya, Konar, & Nagar, 2013).
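The joint-angle idea could be sketched as follows; the joint positions are toy values, and this is a generic angle computation, not code from the cited studies:

```python
import numpy as np

def joint_angle(a, b, c):
    """Angle (in degrees) at joint b formed by segments b->a and b->c,
    e.g. the elbow angle from shoulder, elbow and wrist positions."""
    a, b, c = (np.asarray(p, dtype=float) for p in (a, b, c))
    v1, v2 = a - b, c - b
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    # clip guards against rounding slightly outside [-1, 1]
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

# right angle: shoulder to the side of the elbow, wrist straight below it
angle = joint_angle([1.0, 0.0, 0.0],   # shoulder
                    [0.0, 0.0, 0.0],   # elbow
                    [0.0, -1.0, 0.0])  # wrist
```

Computed per frame, such angles yield new time series over which the same simple statistics (mean, standard deviation, and so on) could then be extracted.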
Supervised classification is a great method for gesture classification, but it has its limitations. This approach requires a lot of data, so it can only be applied after many gestures have been recorded and labelled. It will not be effective on little data. One-shot learning methods, such as those used in previous research (Cabrera & Wachs, 2017), are still required for gesture classification in new robots. Only after a lot of data has been collected can this method be used to further improve gesture classification.
Even though the use of voting algorithms did not lead to a significant improvement in accuracy, more advanced methods to combine algorithms might be more successful. Voting is also not guaranteed to increase performance (Kuncheva & Rodríguez, 2014). Stacking is another way to combine algorithms (Du & Swamy, 2014), which has not been tried before on this kind of data. Research shows that stacking often leads to better results than using individual algorithms (Džeroski & Ženko, 2004).
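Stacking could be sketched with scikit-learn’s StackingClassifier; the base models and data below are illustrative only:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# a meta-learner (here logistic regression) is trained on the base models'
# cross-validated predictions instead of simply counting their votes
stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
                ("svc", SVC(probability=True, random_state=0))],
    final_estimator=LogisticRegression(max_iter=1000), cv=3)
stack.fit(X, y)
preds = stack.predict(X)
```

Unlike voting, the meta-learner can learn to weight the base models differently per region of the feature space, which is why stacking often beats simple majority voting.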
This research used the most conventional supervised learning algorithms to classify the data. Future research could focus on Neural Networks. Since the Kinect records a lot of data, Artificial Neural Networks should be a good choice. However, using time-series features in combination with ANNs might not be as successful, as time-series features turn a large dataset into a small featureset, while ANNs work better with more data (Alwosheel, van Cranenburgh, & Chorus, 2018).
This study has shown that the hand positions contain the most information. Future research could use a Kinect sensor in combination with a Leap Motion device. The Leap Motion device specialises in tracking the hands and records their movements more accurately (Weichert, Bachmann, Rudak, & Fisseler, 2013). More accurate recordings might make it easier for algorithms to distinguish between gestures.
Conclusion
This research has presented a new method for classifying gestures based on a large Kinect dataset. Time-series statistics work well as features; their processing speed and simplicity make them a good way to extract features from the data. A strong point of this research is that many algorithms were tested after multiple processing steps. Furthermore, all algorithms were tested using 10-fold cross-validation to obtain unbiased accuracies.
Ensemble algorithms outperform all other algorithm types on this kind of data. A basic Random Forest works well after hyperparameter tuning. More complicated algorithms such as SVMs perform worse than other algorithms; since their processing time is longer, they should not be used.
For generalisation purposes, a simple Random Forest model using 15 time-series features was created. This model can still achieve an acceptable classification accuracy. For similar datasets, it can be used to establish a baseline. It can also be used in practical applications, as it is very fast.
For maximum classification accuracy, the latest ensemble algorithm, CatBoost, should be used. It may be harder to implement, but its performance is very promising. Unlike most conventional algorithms, the latest ensemble algorithms can also use a GPU, which speeds up processing considerably.
The creators of the original dataset managed to achieve a classification accuracy of 23% (de Wit et al., 2019a). This research has shown that this accuracy can be improved to over 50% with supervised learning using simple time-series features. Using the most important features, a simple Random Forest seems to be one of the best algorithms for this type of data. The simplicity of the features and the algorithm makes classification fast.
In conclusion, time-series features can be successfully used in supervised gesture recognition. Their simplicity and computation speed should make using these features an interesting choice. This research contributes to the ongoing research of human-robot interaction. Recognising gestures is an important step for robots in non-verbal communication.
Ultimately this will help robots communicate naturally with humans.
References
Adăscălitei, F., Doroftei, I., Lefeber, D., & Vanderborght, B. (2014). Controlling a social robot-
performing nonverbal communication through facial expressions. Advanced Materials Research, 837, 525-530. doi:10.4028/www.scientific.net/AMR.837.525
Aggarwal, J. K., & Cai, Q. (1999). Human motion analysis: A review. Computer vision and
image understanding, 73(3), 428-440. doi:10.1006/cviu.1998.0744
Altman, E. I. (1968). Financial ratios, discriminant analysis and the prediction of corporate
bankruptcy. The journal of finance, 23(4), 589-609. doi:10.1111/j.1540-6261.1968.tb00843.x
Alwosheel, A., van Cranenburgh, S., & Chorus, C. G. (2018). Is your dataset big enough?
Sample size requirements when using artificial neural networks for discrete choice
analysis. Journal of choice modelling, 28, 167-182. doi:10.1016/j.jocm.2018.07.002
Bagnall, A., Davis, L., Hills, J., & Lines, J. (2012). Transformation based ensembles for time
series classification. Proceedings of the 2012 SIAM international conference on data mining
(pp. 308-318). Anaheim, CA: Society for Industrial and Applied Mathematics.
doi:10.1137/1.9781611972825.27
Bagnall, A., Lines, J., Bostrom, A., Large, J., & Keogh, E. (2017). The great time series
classification bake off: a review and experimental evaluation of recent algorithmic
advances. Data Mining and Knowledge Discovery, 31(1), 606-660. doi:10.1007/s10618-
016-0483-9
Beattie, G. (2004). Visible thought: The new psychology of body language. London: Routledge.
doi:10.4324/9780203500026
Berndt, D. J., & Clifford, J. (1994). Using dynamic time warping to find patterns in time
series. KDD Workshop, 10(16), 359-370.
Bhattacharya, S., Czejdo, B., & Perez, N. (2012). Gesture classification with machine learning
using kinect sensor data. 2012 Third International Conference on Emerging Applications of
Information Technology (pp. 348-351). Kolkata, India: IEEE.
doi:10.1109/EAIT.2012.6407958
Biswas, K., & Basu, S. K. (2011). Gesture recognition using microsoft kinect®. In The 5th
international conference on automation, robotics and applications. (pp. 100-103).
Wellington: IEEE. doi: 10.1109/ICARA.2011.6144864
Cabrera, M. E., & Wachs, J. P. (2016). Embodied gesture learning from one-shot. 2016 25th
IEEE International Symposium on Robot and Human Interactive Communication (RO-
MAN) (pp. 1092-1097). New York City, USA : IEEE.
doi:10.1109/ROMAN.2016.7745244
Cabrera, M. E., & Wachs, J. P. (2017). A Human-Centered Approach to One-Shot Gesture
Learning. Frontiers in Robotics and AI, 4-8. doi:10.3389/frobt.2017.00008
Celebi, S., Aydin, A. S., Temiz, T. T., & Arici, T. (2013). Gesture recognition using skeleton
data with weighted dynamic time warping. International Joint Conference on Computer
Vision, Imaging and Computer Graphics Theory and Applications (pp. 620-625). Barcelona,
United States: Springer.
Chen, S. M., & Hwang, J. R. (2000). Temperature prediction using fuzzy time series. IEEE
Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 30(2), 263-275.
doi:10.1109/3477.836375
Chen, T., & Guestrin, C. (2016). Xgboost: A scalable tree boosting system. Proceedings of the
22nd acm sigkdd international conference on knowledge discovery and data mining (pp. 785-
794). San Francisco, United States: ACM. doi:10.1145/2939672.2939785
Chen, X., & Koskela, M. (2015, February). Skeleton-based action recognition with extreme
learning machines. Neurocomputing, 149, 387-396. doi:10.1016/j.neucom.2013.10.046
Cho, K., & Xi, C. (2014). Classifying and visualizing motion capture sequences using deep
neural networks. International Conference on Computer Vision Theory and Applications
(VISAPP) (pp. 122-130). Lisbon, Portugal: IEEE.
Christ, M., Braun, N., Neuffer, J., & Kempa-Liehr, A. W. (2018). Time Series FeatuRe
Extraction on basis of Scalable Hypothesis tests (tsfresh -- A Python package).
Neurocomputing, 307, 72-77. doi:10.1016/j.neucom.2018.03.067
D’Orazio, T., Marani, R., Renó, V., & Cicirelli, G. (2016). Recent trends in gesture recognition:
how depth data has improved classical approaches. Image and Vision Computing, 56-
72. doi:10.1016/j.imavis.2016.05.007
de Wit, J., de Haas, M., Krahmer, E., Vogt, P., Merckens, M., Oostdijk, R., . . . Wolfert, P.
(2019a). Playing charades with a robot: Collecting a large dataset of human gestures
through HRI. Companion Proceedings of the 2019 ACM/IEEE International Conference on
Human-Robot Interaction (HRI 2019) (pp. 634-635). Daegu, South Korea: IEEE.
doi:10.1109/HRI.2019.8673220
de Wit, J., Krahmer, E., & Vogt, P. (2019b). Social robots as language tutors: Challenges and
opportunities. Workshop on the challenges of working on social robots that
collaborate with people. ACM CHI Conference on Human Factors in Computing Systems.
Glasgow: ACM.
Dietterich, T. G. (2000). Ensemble methods in machine learning. International workshop on
multiple classifier systems (pp. 1-15). Berlin, Heidelberg: Springer. doi:10.1007/3-540-
45014-9_1
Ding, H., Trajcevski, G., Scheuermann, P., Wang, X., & Keogh, E. (2008). Querying and
mining of time series data: experimental comparison of representations and distance
measures. Proceedings of the VLDB Endowment, 1(2), 1542-1552.
doi:10.14778/1454159.1454226
Donoho, D. L. (2000). High-dimensional data analysis: The curses and blessings of
dimensionality. AMS math challenges lecture, 1(32), 375. doi:10.1.1.329.3392
Dorogush, A. V., Ershov, V., & Gulin, A. (2018). CatBoost: gradient boosting with categorical
features support. arXiv preprint arXiv:1810.11363.
Du, K. L., & Swamy, M. N. (2014). Combining Multiple Learners: Data Fusion and Emsemble
Learning. Neural Networks and Statistical Learning (pp. 621-643). London, United
Kingdom: Springer. doi:10.1007/978-1-4471-5571-3_20
Duan, K., Keerthi, S. S., & Poo, A. N. (2003). Evaluation of simple performance measures for
tuning SVM. Neurocomputing, 51, 41-59. doi:10.1016/S0925-2312(02)00601-X
Džeroski, S., & Ženko, B. (2004). Is combining classifiers with stacking better than selecting
the best one? Machine learning, 54(3), 255-273.
doi:10.1023/B:MACH.0000015881.36452.6e
Escalante, H. J., Guyon, I., Athitsos, V., Jangyodsuk, P., & Wan, J. (2017). Principal motion
components for one-shot gesture recognition. Pattern Analysis and Applications, 20(1),
167-182. doi:10.1007/s10044-015-0481-3
Escalera, S., Gonzàlez, J., Baró, X., Reyes, M., Lopes, O., Guyon, I., . . . Escalante, H. (2013).
Multi-modal gesture recognition challenge 2013: Dataset and results. Proceedings of the
15th ACM on International conference on multimodal interaction (pp. 445-452). Sydney,
Australia: ACM. doi:10.1145/2522848.2532595
Fulcher, B. D. (2018). Feature-based time-series analysis. Boca Raton, FL: CRC Press.
Fulcher, B. D., & Jones, N. S. (2014). Highly comparative feature-based time-series
classification. IEEE Transactions on Knowledge and Data Engineering, 26(12), 3026-3037.
doi:10.1109/TKDE.2014.2316504
Graham, J., & Argyle, M. (1975). A cross‐cultural study of the communication of extra‐verbal
meaning by gestures. International Journal of Psychology, 10(1), 56-67.
doi:10.1080/00207597508247319
Graves, A. (2012). Supervised sequence labelling. Supervised sequence labelling with recurrent
neural networks (pp. 5-13). Berlin, Heidelberg, Germany: Springer. doi:10.1007/978-3-
642-24797-2_2
Gu, Y., Do, H., Ou, Y., & Weihua, S. (2012). Human gesture recognition through a kinect
sensor. 2012 IEEE International Conference on Robotics and Biomimetics (ROBIO) (pp.
1379-1384). Guangzhou, China: IEEE. doi:10.1109/ROBIO.2012.6491161
Guyon, I., Weston, J., Barnhill, S., & Vapnik, V. (2002). Gene selection for cancer classification
using support vector machines. Machine learning, 46(1-3), 390-422.
doi:10.1023/A:1012487302797
Hanna, G. B., Drew, T., Clinch, P., Shimi, S., Dunkley, P., Hau, C., & Cuschieri, A. (1997).
Psychomotor skills for endoscopic manipulations: differing abilities between right
and left-handed individuals. Annals of surgery, 225(3), 33.
Hsu, C.-W., & Lin, C.-J. (2002). A comparison of methods for multiclass support vector
machines. IEEE transactions on Neural Networks, 13(2), 415-425. doi:10.1109/72.991427
Itauma, I. I., & Kivrak, H. K. (2012). Gesture imitation using machine learning techniques.
2012 20th Signal Processing and Communications Applications Conference (SIU) (pp. 1-4).
Mugla, Turkey: IEEE. doi:10.1109/SIU.2012.6204822
Ji, X., & Liu, H. (2010). Advances in view-invariant human motion analysis: a review. IEEE
Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 40(1),
13-24. doi:10.1109/TSMCC.2009.2027608
John, G. H., & Langley, P. (1995). Estimating continuous distributions in Bayesian classifiers.
Proceedings of the Eleventh conference on Uncertainty in artificial intelligence (pp. 338-345).
Montréal, Qué, Canada: Morgan Kaufmann Publishers Inc.
Jolliffe, I. (2014). Principal component analysis. Berlin Heidelberg: Springer. doi:10.1007/978-3-
642-04898-2_455
Jones, E., Oliphant, T., & Peterson, P. (2001-). SciPy: Open Source Scientific Tools for Python.
Retrieved April 10, 2019, from https://www.scipy.org/
Joshi, A., Ghosh, S., Betke, M., Sclaroff, S., & Pfister, H. (2017). Personalizing gesture
recognition using hierarchical Bayesian neural networks. Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition (pp. 6513-6522). Hawaii, United
States: IEEE. doi: 10.1109/CVPR.2017.56
Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., & Liu, T. Y. (2017). Lightgbm: A
highly efficient gradient boosting decision tree. Advances in Neural Information
Processing Systems (pp. 3146-3154). Long Beach, California, United States: NIPS.
Kendon, A. (1994). Do gestures communicate? A review. Research on language and social
interaction, 27(3), 175-200. doi:10.1207/s15327973rlsi2703_2
Konečný, J., & Hagara, M. (2014). One-shot-learning gesture recognition using hog-hof
features. The Journal of Machine Learning Research, 15(1), 2513-2532. doi:10.1007/978-3-
319-57021-1_12
Kuncheva, L. I., & Rodríguez, J. J. (2014). A weighted voting framework for classifiers
ensembles. Knowledge and Information Systems, 38(2), 259-275. doi:10.1007/s10115-012-
0586-6
Lai, K., Konrad, J., & Ishwar, P. (2012). A gesture-driven computer interface using Kinect.
IEEE Southwest Symposium on Image Analysis and Interpretation (pp. 185-188). Santa Fe:
IEEE. doi:10.1109/SSIAI.2012.6202484
Le, T.-L., & Nguyen, M.-Q. (2013). Human posture recognition using human skeleton
provided by Kinect. In 2013 international conference on computing, management and
telecommunications (ComManTel) (pp. 340-345). Ho Chi Minh City, Vietnam: IEEE.
doi:10.1109/ComManTel.2013.6482417
Leap Motion Inc. (2012). The Leap. Retrieved from https://www.leapmotion.com/
Lee, J., Lee, Y., Lee, E., & Hong, S. (2004). Hand region extraction and gesture recognition
from video stream with complex background through entropy analysis. The 26th
Annual International Conference of the IEEE Engineering in Medicine and Biology Society
(pp. 1513-1516). San Francisco, United States: IEEE. doi:10.1109/IEMBS.2004.1403464
Lui, Y. M. (2012). Human gesture recognition on product manifolds. Journal of Machine
Learning Research, 13(Nov), 3297-3321.
Mahbub, U., Imtiaz, H., Roy, T., Rahman, M. S., & Ahad, M. A. (2013). A template matching
approach to one-shot-learning gesture recognition. Pattern Recognition Letters, 34(15),
1780–1788. doi:10.1016/j.patrec.2012.09.014
Malgireddy, M. R., & Nwogu, I. G. (2013). Language-motivated approaches to action
recognition. The Journal of Machine Learning Research, 14(1), 2189-2212.
Marin, G., Dominio, F., & Zanuttigh, P. (2014, October). Hand gesture recognition with leap
motion and kinect devices. In 2014 IEEE International Conference on Image Processing
(ICIP) (pp. 1565-1569). Paris, France: IEEE. doi:10.1109/ICIP.2014.7025313
Metzen, J. H. (2018). Gaussian processes classification. Retrieved from
https://github.com/scikit-learn/scikit-
learn/blob/master/sklearn/gaussian_process/gpc.py#L400
Microsoft. (2014, October 21). CameraSpacePoint Structure. Retrieved from Kinect for
Windows SDK 2.0: https://docs.microsoft.com/en-us/previous-
versions/windows/kinect/dn772836%28v%3dieb.10%29
Mitra, S., & Acharya, T. (2007). Gesture recognition: A survey. IEEE Transactions on Systems,
Man, and Cybernetics, Part C (Applications and Reviews), 37(3), 311-324.
doi:10.1109/TSMCC.2007.893280
Montemayor, J., Alborizi, H., Druin, A., Hendler, J., Pollack, D., Porteous, J., . . . Lal, A.
(2000). From PETS to Storykit: Creating new technology with an intergenerational design
team. Pittsburgh, PA, USA.
Montemerlo, M., Pineau, J., Roy, N., Thrun, S., & Verma, V. (2002). Experiences with a mobile
robotic guide for the elderly. AAAI/IAAI, 587-592.
Morency, L.-P., Quattoni, A., & Darrell, T. (2007). Latent-dynamic discriminative models for
continuous gesture recognition. 2007 IEEE conference on computer vision and pattern
recognition. (pp. 1-8). Minneapolis, United States: IEEE. doi:10.1109/CVPR.2007.383299
Morris, J. (Producer), & Stanton, A. (Director). (2008). WALL·E [Motion Picture]. Walt Disney
Studios Motion Pictures.
Moustra, M., Avraamides, M., & Christodoulou, C. (2011). Artificial neural networks for
earthquake prediction using time series magnitude data or Seismic Electric Signals.
Expert systems with applications, 38(12), 15032-15039. doi:10.1016/j.eswa.2011.05.043
Nanopoulos, A., Alcock, R., & Manolopoulos, Y. (2001). Feature-based classification of time-
series data. International Journal of Computer Research, 10(3), 49-61.
Narayanan, V., Arora, I., & Bhatia, A. (2013). Fast and accurate sentiment classification using
an enhanced Naive Bayes model. International Conference on Intelligent Data
Engineering and Automated Learning (pp. 194-201). Berlin, Heidelberg: Springer.
doi:10.1007/978-3-642-41278-3_24
Navot, A., Gilad-Bachrach, R., Navot, Y., & Tishby, N. (2005). Is feature selection still
necessary? International Statistical and Optimization Perspectives Workshop "Subspace,
Latent Structure and Feature Selection" (pp. 127-138). Berlin, Heidelberg: Springer.
doi:10.1007/11752790_8
Nguyen, D.-D., & Hai-Son, L. (2015). Kinect gesture recognition: Svm vs. rvm. Seventh
International Conference on Knowledge and Systems Engineering (KSE) (pp. 395-400). Ho
Chi Minh City, Vietnam: IEEE. doi:10.1109/KSE.2015.35
Orasa, P., Nukoolkit, C., & Watanapa, B. (2012). Human gesture recognition using Kinect
camera. 2012 Ninth International Conference on Computer Science and Software
Engineering (JCSSE) (pp. 28-32). Bangkok, Thailand: IEEE.
doi:10.1109/JCSSE.2012.6261920
Osborne, J. W., & Overbay, A. (2014). The power of outliers (and why researchers should
always check for them). Practical assessment, research & evaluation, 9(6), 1-12.
Patsadu, O., Nukoolkit, C., & Watanapa, B. (2012). Human gesture recognition using Kinect
camera. 2012 Ninth International Conference on Computer Science and Software
Engineering (JCSSE) (pp. 28-32). UTCC, Bangkok Thailand: IEEE. doi:10.1007/s10462-
012-9356-9
Pfister, A., West, A. M., & Noah, J. A. (2014). Comparative abilities of Microsoft Kinect and
Vicon 3D motion capture for gait analysis. Journal of medical engineering & technology,
38(5), 274-280. doi:10.3109/03091902.2014.909540
Pincus, S. M. (1991). Approximate entropy as a measure of system complexity. Proceedings of
the National Academy of Sciences, 88(6), 2297-2301. doi:10.1073/pnas.88.6.2297
Preis, J., Kessel, M., Werner, M., & Linnhoff-Popien, C. (2012). Gait recognition with kinect.
1st international workshop on kinect in pervasive computing, (pp. 1-4). New Castle, United
Kingdom.
Rahman, M. H., & Afrin, J. (2013). Hand gesture recognition using multiclass support vector
machine. International Journal of Computer Applications., 1, 39-43. doi:10.5120/12852-
9367
Raptis, M., Kirovski, D., & Hoppe, H. (2011). Real-time classification of dance gestures from
skeleton animation. Proceedings of the 2011 ACM SIGGRAPH/Eurographics symposium
on computer animation (pp. 147-156). Vancouver, British Columbia, Canada: ACM.
doi:10.1145/2019406.2019426
Rasmussen, C. E. (2003). Gaussian processes in machine learning. Summer School on Machine
Learning (pp. 63-71). Berlin, Heidelberg, Germany: Springer. doi:10.1007/978-3-540-
28650-9_4
Ren, Z., Yuan, J., Meng, J., & Zhang, Z. (2013). Robust part-based hand gesture recognition
using kinect sensor. IEEE transactions on multimedia, 15(5), 1110-1120.
doi:10.1109/TMM.2013.2246148
Reunanen, J. (2003). Overfitting in making comparisons between variable selection methods.
Journal of Machine Learning Research, 3(Mar), 1371-1382.
Rish, I. (2001). An empirical study of the naive Bayes classifier. IJCAI 2001 workshop on
empirical methods in artificial intelligence. 3, pp. 41-46. Seattle, Washington, United
States: AAAI.
Ritchie, J., & Jonathan, F. (2019, 04 16). Scikit-rvm. Retrieved from Relevance Vector Machine
implementation using the scikit-learn API: https://github.com/JamesRitchie/scikit-
rvm
Rosa-Pujazón, A., Barbancho, I., Tardón, L. J., & Barbancho, A. M. (2016). Fast-gesture
recognition and classification using Kinect: an application for a virtual reality
drumkit. Multimedia Tools and Applications, 75(4), 8137-8164. doi:10.1007/s11042-015-
2729-8
Safavian, S. R., & Landgrebe, D. (1991). A survey of decision tree classifier methodology.
IEEE transactions on systems, man, and cybernetics, 21(3), 660-674. doi:10.1109/21.97458
Saha, S., Datta, S., Konar, A., & Janarthanan, R. (2014). A study on emotion recognition from
body gestures using Kinect sensor. 2014 International Conference on Communication and
Signal Processing (pp. 56-60). Bangkok, Thailand: IEEE.
doi:10.1109/ICCSP.2014.6949798
Saha, S., Shreya, G., Konar, A., & Nagar, A. K. (2013). Gesture recognition from indian
classical dance using kinect sensor. 2013 Fifth International Conference on Computational
Intelligence, Communication Systems and Networks (pp. 3-8). Madrid, Spain: IEEE.
doi:10.1109/CICSYN.2013.11
Scikit-learn. (n.d.). Feature importances with forests of trees. Retrieved from scikit-learn:
https://scikit-
learn.org/stable/auto_examples/ensemble/plot_forest_importances.html#sphx-glr-
auto-examples-ensemble-plot-forest-importances-py
Scikit-learn. (2019, 04 15). Nearest Neighbors. Retrieved from Nearest Neighbors Classification:
https://scikit-learn.org/stable/modules/neighbors.html#classification
Segal, M. R. (2003). Machine learning benchmarks and random forest regression. Kluwer
Academic Publisher.
Seung, H. S., & Lee, D. D. (2000). The manifold ways of perception. Science, 290(5500), 2268-
2269. doi:10.1126/science.290.5500.2268
Shotton, J., Fitzgibbon, A., Cook, M., Sharp, T., Finocchio, M., Moore, R., . . . Blake, A. (2011).
Real-time human pose recognition in parts from single depth images. Conference on
Computer Vision and Pattern Recognition (CVPR 2011) (pp. 1297-1304). Colorado
Springs, United States: IEEE. doi:10.1109/CVPR.2011.5995316
Sinha, A., Chakravarty, K., & Bhowmick, B. (2013). Person identification using skeleton
information from kinect. Proceedings of the International Conference on Advances in
Computer-Human Interactions (pp. 101-108). Nice, France: IARIA XPS Press.
Ten Holt, G. A., Reinders, M. J., & Hendriks, E. A. (2007). Multi-dimensional dynamic time
warping for gesture recognition. Thirteenth annual conference of the Advanced School for
Computing and Imaging, 300(1).
Tirelli, T., & Pessani, D. (2011). Importance of feature selection in decision-tree and artificial-
neural-network ecological applications. Alburnus alburnus alborella: A practical
example. Ecological informatics, 6(5), 309-315. doi:10.1016/j.ecoinf.2010.11.001
Uddin, M. Z., Thang, N. D., & Kim, T.-S. (2010). Human activity recognition via 3-D joint
angle features and Hidden Markov models. IEEE International Conference on Image
Processing (pp. 713-716). Hong Kong, China: IEEE. doi:10.1109/ICIP.2010.5651953
Vogt, P., van den Berghe, R., de Haas, M., Hoffman, L., Kanero, J., Mamus, E., . . . Kumar Pandey, A. (2019). Second language tutoring using social robots: A large-scale study.
Proceedings of the 2019 ACM/IEEE International Conference on Human-Robot Interaction
(HRI 2019) (pp. 497-505). Daegu, South Korea: IEEE. doi:10.1109/HRI.2019.8673077
Wang, X., Smith, K., & Hyndman, R. (2006). Characteristic-based clustering for time series data. Data Mining and Knowledge Discovery, 13(3), 335-364. doi:10.1007/s10618-005-0039-x
Weichert, F., Bachmann, D., Rudak, B., & Fisseler, D. (2013). Analysis of the accuracy and
robustness of the leap motion controller. Sensors, 13(5), 6380-6393.
doi:10.3390/s130506380
Wolpert, D. H. (1996). The lack of a priori distinctions between learning algorithms. Neural
computation, 8(7), 1341-1390. doi:10.1162/neco.1996.8.7.1341
Wu, D., Zhu, F., & Shao, L. (2012). One shot learning gesture recognition from rgbd images.
2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition
Workshops (pp. 7-12). Rhode Island, United States: IEEE.
doi:10.1109/CVPRW.2012.6239179
XGBoost. (2016). XGBoost. Retrieved from XGBoost Python Package:
https://xgboost.readthedocs.io/en/latest/python/index.html
Xia, L., Chen, C. C., & Aggarwal, J. K. (2011, June). Human detection using depth
information by kinect. CVPR 2011 workshops, 15-22. doi:10.1109/CVPRW.2011.5981811
Xing, E. P., Jordan, M. I., & Karp, R. M. (2001). Feature selection for high-dimensional genomic microarray data. ICML, 1, 601-608.
Yan, K., & Zhang, D. (2015). Feature selection and analysis on correlated gas sensor data with
recursive feature elimination. Sensors and Actuators B: Chemical, 212, 353-363.
doi:10.1016/j.snb.2015.02.025
Zanfir, M., Leordeanu, M., & Sminchisescu, C. (2013). The moving pose: An efficient 3d
kinematics descriptor for low-latency action recognition and detection. Proceedings of
the IEEE international conference on computer vision (pp. 2752-2759). Nice, France: IEEE.
Zhang, Z. (2012). Microsoft kinect sensor and its effect. IEEE multimedia, 19(2), 4-10.
doi:10.1109/MMUL.2012.24
Appendix A
Columns of Kinect recording file
Name              Description
time              Seconds elapsed after recording began
headpos           X, Y, Z
neckpos           X, Y, Z
rshoulderpos      X, Y, Z
relbowpos         X, Y, Z
rwristpos         X, Y, Z
lshoulderpos      X, Y, Z
lelbowpos         X, Y, Z
lwristpos         X, Y, Z
rhippos           X, Y, Z
rkneepos          X, Y, Z
ranklepos         X, Y, Z
lhippos           X, Y, Z
lkneepos          X, Y, Z
lanklepos         X, Y, Z
rfootpos          X, Y, Z
lfootpos          X, Y, Z
rhandpos          X, Y, Z
lhandpos          X, Y, Z
rhandtippos       X, Y, Z
lhandtippos       X, Y, Z
spinebasepos      X, Y, Z
spinemidpos       X, Y, Z
spineshoulderpos  X, Y, Z
rhandstate        Integer: (-1: Closed, 1: Open)
lhandstate        Integer: (-1: Closed, 1: Open)
rhandconfidence   String: (Low, High)
lhandconfidence   String: (Low, High)
rthumbpos         X, Y, Z
lthumbpos         X, Y, Z
neckori           X, Y, Z, W
rshoulderori      X, Y, Z, W
relbowori         X, Y, Z, W
rwristori         X, Y, Z, W
lshoulderori      X, Y, Z, W
lelbowori         X, Y, Z, W
lwristori         X, Y, Z, W
rhipori           X, Y, Z, W
rkneeori          X, Y, Z, W
rankleori         X, Y, Z, W
lhipori           X, Y, Z, W
lkneeori          X, Y, Z, W
lankleori         X, Y, Z, W
rhandori          X, Y, Z, W
lhandori          X, Y, Z, W
spinebaseori      X, Y, Z, W
spinemidori       X, Y, Z, W
spineshoulderori  X, Y, Z, W
facepitch         Integer: +90° : -90°
faceyaw           Integer: +90° : -90°
faceroll          Integer: +90° : -90°
Note. All columns are float unless indicated otherwise.
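The per-axis columns above are flattened into names such as rhandposY or relboworiW in the feature tables of Appendices F and G. A minimal sketch of generating such flat labels (the joint subsets shown here are illustrative, not the full recording schema):

```python
# Expand grouped Kinect columns (e.g. headpos -> headposX/Y/Z) into the flat
# per-axis labels used by the feature names elsewhere in this thesis.
POS_COLUMNS = ["headpos", "neckpos", "rhandpos", "lhandpos"]   # subset for brevity
ORI_COLUMNS = ["neckori", "rshoulderori", "relbowori"]         # subset for brevity

def flatten_columns(pos_cols, ori_cols):
    flat = ["time"]
    for name in pos_cols:                      # positions are 3-D points
        flat += [name + axis for axis in ("X", "Y", "Z")]
    for name in ori_cols:                      # orientations are quaternions
        flat += [name + axis for axis in ("X", "Y", "Z", "W")]
    return flat

columns = flatten_columns(POS_COLUMNS, ORI_COLUMNS)
```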
Appendix B
List of features and their usage
Feature name                    Number of columns with this feature selected
Abs energy 2
Absolute sum of changes 3
Count above mean 2
Count below mean 0
First location of maximum 0
First location of minimum 0
Has duplicate 0
Has duplicate max 0
Has duplicate min 0
Kurtosis 0
Last location of maximum 6
Last location of minimum 6
Length 0
Longest strike above mean 7
Longest strike below mean 4
Maximum 16
Mean 21
Mean abs change 27
Mean change 0
Mean second derivative central 0
Median 0
Minimum 8
Percentage of reoccurring datapoints to all datapoints 0
Percentage of reoccurring values to all values 0
Ratio value number to time series length 1
Skewness 7
Standard Deviation 23
Sum of reoccurring data points 1
Sum of reoccurring values 3
Sum values 8
Variance 1
Variance larger than standard deviation 0
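The feature names above match the tsfresh feature-calculator naming convention. As an illustration (a plain NumPy sketch, not the thesis code), two of the frequently selected calculators can be computed as follows:

```python
import numpy as np

def mean_abs_change(x):
    """Average absolute difference between consecutive samples."""
    return float(np.mean(np.abs(np.diff(x))))

def longest_strike_above_mean(x):
    """Length of the longest run of consecutive values above the series mean."""
    above = np.asarray(x) > np.mean(x)
    best = run = 0
    for flag in above:
        run = run + 1 if flag else 0
        best = max(best, run)
    return best

signal = [0.0, 1.0, 3.0, 3.0, 2.0, 0.0]
# mean = 1.5; values above the mean form one run of length 3 (3.0, 3.0, 2.0)
```

Applied per column of a gesture recording, each calculator collapses a joint's time series into a single scalar feature.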
Appendix C
Outliers removed per gesture
Gesture Outliers removed
Airplane 4
Bed 3
Bird 2
Boat 2
Book 2
Bridge 1
Bus 3
Car 2
Castle 1
Chair 2
Comb 2
Cow 1
Crocodile 3
Cup 2
Drumset 1
Fish 5
Guitar 2
Helicopter 2
Horse 4
Lamp 2
Motorcycle 2
Pencil 2
Piano 4
Pig 1
Scissors 2
Spoon 4
Stairs 2
Table 4
Toothbrush 2
Tortoise 1
Train 4
Triangle 4
Trumpet 1
Violin 2
Xylophone 2
Total 83
Appendix D
Machine learning algorithms used in this study
Algorithm Category
Decision Tree Tree
Random Forest Ensemble (Random Forest)
Extra Trees Ensemble (Random Forest)
Bagging Ensemble (Bagging)
Gradient Boosting Ensemble (Boosting)
AdaBoostClassifier Ensemble (Boosting)
Logistic Regression Linear model
Ridge Classifier Linear model
Passive Aggressive Classifier Linear model
Perceptron Linear model
SGD Linear model
K Nearest Neighbors Neighbors
Nearest Centroid Neighbors
Radius Neighbor Neighbors
GaussianNB Naïve Bayes
BernoulliNB Naïve Bayes
SVC Support Vector Machine
NuSVC Support Vector Machine
LinearSVC Support Vector Machine
Multi-Layer Perceptron Neural Networks
Linear Discriminant Analysis Discriminant Analysis
Quadratic Discriminant Analysis Discriminant Analysis
Gaussian Process Gaussian Process
Relevance Vector Machine Relevance Vector Machine
XGBoost Ensemble (Boosting)
LightGBM Ensemble (Boosting)
CatBoost Ensemble (Boosting)
Appendix E
All tested hyperparameters per algorithm
Algorithm Hyperparameters Values tested
AdaBoostClassifier n_estimators 5, 100, 250, 500, 750, 1000
learning_rate .01, .03, .05, .1, .25
algorithm SAMME, SAMME.R
BaggingClassifier n_estimators 5, 100, 250, 500, 750, 1000
max_samples .1, .25, .5, .75, 1.0
ExtraTreesClassifier criterion gini, entropy
max_depth 5, 10, 20, 25, 30, 35, 40, 45, 50, 100
min_samples_split 2, 5, 10, .03, .05
min_samples_leaf 1, 5, 10, .03, .05
*GradientBoostingClassifier learning_rate .01, .03, .05, .1, .25
RandomForestClassifier n_estimators 5, 100, 250, 500, 750, 1000
criterion gini, entropy
max_depth 5, 10, 20, 25, 30, 35, 40, 45, 50, 100
PassiveAggressiveClassifier C 1, 2, 3, 4, 5
max_iter 5, 100, 250, 500, 750, 1000
RidgeClassifier alpha .0, .1, .25, .5, .75, 1.0
solver auto, svd, cholesky, lsqr,
sparse_cg, sag, saga
max_iter 5, 100, 250, 500, 750, 1000
LogisticRegression C 1, 2, 3, 4, 5
solver newton-cg, lbfgs, liblinear, sag,
saga
max_iter 5, 100, 250, 500, 750, 1000
SGDClassifier loss hinge, log, modified_huber,
squared_hinge, perceptron
max_iter 5, 100, 250, 500, 750, 1000
shuffle True, False
Perceptron max_iter 5, 100, 250, 500, 750, 1000
shuffle True, False
MLPClassifier hidden_layer_sizes (50,50,50), (50,100,50), (100,)
activation tanh, relu
solver sgd, adam
alpha 0.0001, 0.05
learning_rate constant, adaptive
BernoulliNB alpha .0, .1, .25, .5, .75, 1.0
GaussianNB No hyperparameters available for this classifier.
KNeighborsClassifier n_neighbors 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 14, 16,
18, 20
algorithm auto, ball_tree, kd_tree, brute
NearestCentroid metric euclidean, manhattan
SVC kernel linear, poly, rbf, sigmoid
C 1, 2, 3, 4, 5
gamma auto, scale, .1, .25, .5, .75, 1.0
NuSVC kernel linear, poly, rbf, sigmoid
gamma auto, scale, .1, .25, .5, .75, 1.0
nu .1, .2, .3, .4, .5
LinearSVC C 1, 2, 3, 4, 5
loss hinge, squared_hinge
DecisionTreeClassifier criterion gini, entropy
max_depth 5, 10, 20, 25, 30, 35, 40, 45, 50, 100
min_samples_split 2, 5, 10, .03, .05
min_samples_leaf 1, 5, 10, .03, .05
LinearDiscriminantAnalysis solver svd, lsqr, eigen
QuadraticDiscriminantAnalysis reg_param .0, .1, .25, .5, .75, 1.0
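Grids like the ones above are typically searched exhaustively with cross-validation. A minimal sketch using scikit-learn's GridSearchCV on a reduced version of the RandomForestClassifier grid (synthetic data stands in for the gesture features; this is not the thesis pipeline):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic multi-class data standing in for the selected gesture features.
X, y = make_classification(n_samples=200, n_features=20, n_informative=10,
                           n_classes=4, random_state=0)

param_grid = {                      # a reduced version of the grid above
    "n_estimators": [5, 100],
    "criterion": ["gini", "entropy"],
    "max_depth": [5, 20],
}
# Every combination is fitted and scored with 3-fold cross-validation.
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=3, scoring="accuracy")
search.fit(X, y)
```

`search.best_params_` then holds the winning combination, analogous to the "Best value" column of Appendix H.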
Appendix F
Number of features used after feature selection per column
Position X Y Z Total
headpos 1 3 2 6
neckpos 0 0 0 0
rshoulderpos 0 0 0 0
relbowpos 1 4 1 6
rwristpos 1 0 0 1
lshoulderpos 0 0 0 0
lelbowpos 0 1 0 1
lwristpos 0 0 0 0
rhippos 0 0 0 0
rkneepos 0 0 0 0
ranklepos 0 0 0 0
lhippos 0 1 0 1
lkneepos 0 0 0 0
lanklepos 0 0 0 0
rfootpos 0 0 0 0
rhandpos 4 3 3 10
lhandpos 1 6 3 10
rhandtippos 0 0 0 1
lhandtippos 0 0 0 0
spinebasepos 0 0 0 0
spinemidpos 0 0 1 1
spineshoulderpos 0 0 0 0
rthumbpos 1 0 0 1
lthumbpos 1 0 0 1
Total 39
Orientations X Y Z W Total
neckori 3 0 2 2 7
rshoulderori 0 1 2 1 5
relbowori 3 0 0 6 9
rwristori 1 3 0 3 7
lshoulderori 1 0 0 1 2
lelbowori 1 4 3 7 15
lwristori 1 3 0 3 7
rhipori 0 0 0 1 1
rkneeori 0 2 0 3 5
rankleori 0 0 0 0 0
lhipori 1 0 0 0 1
lkneeori 1 0 0 0 1
lankleori 0 0 0 0 0
rhandori 0 1 1 2 4
lhandori 1 0 1 2 4
spinebaseori 0 0 0 0 0
Total 68
Other Total
rhandstate 7
lhandstate 5
rhandconfidence 6
lhandconfidence 3
facepitch 3
faceyaw 0
faceroll 2
Total 26
Appendix G
All features selected after RFE and their importance
Rank Feature Importance STD
1 rhandState_standard_deviation 0.011884 0.003778
2 rhandconfidence_mean 0.011073 0.003478
3 lhandState_standard_deviation 0.010692 0.003753
4 rhandposY_mean_abs_change 0.010060 0.003034
5 relboworiW_maximum 0.009948 0.003927
6 relboworiX_mean 0.009646 0.003253
7 lelboworiX_mean 0.009517 0.003887
8 lelboworiW_mean 0.009510 0.003606
9 relboworiW_mean 0.009172 0.003212
10 relboworiW_standard_deviation 0.009162 0.003316
11 lhandconfidence_mean 0.009012 0.003217
12 lwristoriW_maximum 0.008988 0.003741
13 lelboworiZ_mean 0.008928 0.002686
14 lelboworiW_maximum 0.008840 0.003020
15 relbowposY_standard_deviation 0.008763 0.003306
16 rwristoriW_maximum 0.008519 0.002553
17 lelboworiW_standard_deviation 0.008458 0.003871
18 lelboworiY_maximum 0.008318 0.003066
19 relboworiZ_mean 0.008286 0.002692
20 lhandposY_mean_abs_change 0.008251 0.003011
21 rhandState_mean 0.008186 0.002604
22 lhandState_mean 0.008069 0.003277
23 rhandconfidence_mean_abs_change 0.008047 0.002811
24 rhandState_sum_of_reoccurring_values 0.007917 0.002352
25 relboworiW_minimum 0.007838 0.002672
26 lelboworiW_minimum 0.007828 0.002537
27 lhandState_mean_abs_change 0.007805 0.002965
28 lelboworiY_standard_deviation 0.007749 0.002818
29 neckoriX_standard_deviation 0.007746 0.002870
30 lelbowposY_standard_deviation 0.007737 0.002923
31 lhandposY_maximum 0.007704 0.002892
32 lelboworiZ_maximum 0.007639 0.002638
33 lshoulderoriX_standard_deviation 0.007611 0.002312
34 facepitch_maximum 0.007593 0.002735
35 lhandoriW_maximum 0.007570 0.003164
36 rhandState_count_above_mean 0.007558 0.002473
37 lhandoriZ_mean 0.007557 0.002818
38 lhandposY_standard_deviation 0.007554 0.003183
39 relboworiY_mean 0.007546 0.002557
40 relboworiY_minimum 0.007464 0.002582
41 relboworiZ_standard_deviation 0.007450 0.002780
42 rwristoriY_maximum 0.007442 0.002613
43 relboworiX_standard_deviation 0.007400 0.002527
44 relboworiY_mean_abs_change 0.007385 0.002982
45 lwristoriW_mean 0.007370 0.002265
46 lhandposY_skewness 0.007368 0.002672
47 lhiporiX_standard_deviation 0.007290 0.002335
48 lhandState_count_above_mean 0.007262 0.002903
49 relboworiY_standard_deviation 0.007255 0.002952
50 rhandconfidence_standard_deviation 0.007176 0.002800
51 relboworiY_maximum 0.007169 0.002551
52 rshoulderoriW_minimum 0.007121 0.002513
53 rhandposZ_standard_deviation 0.007117 0.002645
54 neckoriZ_mean 0.007093 0.002764
55 lhandposX_mean 0.007093 0.002865
56 rshoulderoriY_maximum 0.007067 0.002655
57 neckoriZ_minimum 0.007057 0.002746
58 rhandposY_standard_deviation 0.007044 0.002511
59 rhandoriZ_mean_abs_change 0.007036 0.002522
60 lwristoriY_maximum 0.007022 0.002890
61 rshoulderoriZ_minimum 0.007004 0.002550
62 headposY_standard_deviation 0.006966 0.002913
63 lhandposY_absolute_sum_of_changes 0.006878 0.002439
64 rhandState_mean_abs_change 0.006853 0.002707
65 rshoulderoriY_mean 0.006729 0.001978
66 rhandoriW_mean 0.006695 0.002370
67 rkneeoriY_minimum 0.006682 0.002498
68 lelboworiY_mean 0.006654 0.002407
69 neckoriW_mean_abs_change 0.006606 0.001961
70 lhandposZ_standard_deviation 0.006601 0.002566
71 lhandconfidence_mean_abs_change 0.006600 0.002398
72 rwristoriX_mean_abs_change 0.006589 0.002426
73 rkneeoriW_minimum 0.006546 0.002613
74 rwristoriY_mean_abs_change 0.006524 0.002134
75 relbowposY_skewness 0.006516 0.002327
76 relbowposY_maximum 0.006511 0.002102
77 rhandoriY_mean_abs_change 0.006505 0.002540
78 rhandconfidence_longest_strike_above_mean 0.006465 0.002662
79 lwristoriY_mean_abs_change 0.006443 0.002195
80 facepitch_sum_of_reoccurring_values 0.006416 0.002267
81 rhandposX_maximum 0.006402 0.002195
82 rwristoriY_standard_deviation 0.006381 0.002426
83 headposY_mean_abs_change 0.006365 0.002387
84 lwristoriX_mean_abs_change 0.006316 0.002355
85 facepitch_mean 0.006315 0.002197
86 lwristoriW_mean_abs_change 0.006298 0.002663
87 lshoulderoriW_mean 0.006291 0.002403
88 rhandposZ_longest_strike_below_mean 0.006256 0.002328
89 rkneeoriW_standard_deviation 0.006248 0.002881
90 rwristoriW_mean_abs_change 0.006215 0.002175
91 relboworiY_sum_values 0.006189 0.002173
92 lhandposZ_skewness 0.006183 0.002035
93 lelboworiW_mean_abs_change 0.006181 0.002472
94 relbowposZ_standard_deviation 0.006140 0.002132
95 rhandposY_longest_strike_above_mean 0.006131 0.002358
96 lelboworiW_sum_values 0.006116 0.002364
97 lhandoriW_mean_abs_change 0.006104 0.002381
98 relboworiW_mean_abs_change 0.006095 0.002168
99 rhandoriW_maximum 0.006093 0.002127
100 rhandposX_mean 0.006092 0.002015
101 rhandposZ_skewness 0.006065 0.002451
102 lhandState_skewness 0.006052 0.002170
103 lhandposY_longest_strike_above_mean 0.006029 0.002182
104 lhandoriX_mean_abs_change 0.006021 0.002030
105 headposZ_mean_abs_change 0.006020 0.002207
106 headposX_mean_abs_change 0.006015 0.001987
107 relboworiW_sum_values 0.006009 0.002483
108 neckoriX_mean_abs_change 0.005991 0.002673
109 rhandState_skewness 0.005983 0.001990
110 rhandconfidence_absolute_sum_of_changes 0.005916 0.002058
111 neckoriW_standard_deviation 0.005906 0.001925
112 lelboworiW_abs_energy 0.005857 0.002144
113 lhandconfidence_longest_strike_above_mean 0.005842 0.001926
114 rhandposX_skewness 0.005742 0.002105
115 rhandState_longest_strike_above_mean 0.005724 0.001969
116 relboworiZ_mean_abs_change 0.005718 0.002054
117 rhandposX_longest_strike_below_mean 0.005710 0.002202
118 lkneeoriX_mean_abs_change 0.005699 0.002059
119 rshoulderoriW_sum_values 0.005698 0.002091
120 relbowposX_last_location_of_maximum 0.005685 0.001961
121 neckoriX_variance 0.005664 0.002023
122 lhandtipposZ_last_location_of_minimum 0.005630 0.001935
123 lhipposY_last_location_of_minimum 0.005618 0.002096
124 neckoriZ_sum_values 0.005562 0.002222
125 rwristposX_last_location_of_maximum 0.005561 0.001979
126 rhandtipposZ_last_location_of_maximum 0.005528 0.002075
127 rhandconfidence_sum_of_reoccurring_data_points 0.005519 0.002105
128 headposZ_last_location_of_minimum 0.005498 0.001974
129 rthumbposX_last_location_of_minimum 0.005487 0.001872
130 rhiporiW_last_location_of_minimum 0.005484 0.001835
131 rwristoriW_sum_values 0.005476 0.002098
132 lelboworiZ_longest_strike_above_mean 0.005473 0.002155
133 spinemidposZ_last_location_of_maximum 0.005441 0.001977
134 lthumbposX_last_location_of_maximum 0.005380 0.002021
135 rkneeoriW_mean_abs_change 0.005373 0.002097
136 lelboworiY_sum_values 0.005301 0.002137
137 lhandposZ_longest_strike_below_mean 0.005299 0.001909
138 relboworiX_longest_strike_below_mean 0.005298 0.001855
139 relbowposY_longest_strike_above_mean 0.005297 0.001971
140 headposY_absolute_sum_of_changes 0.005257 0.001986
141 faceroll_sum_of_reoccurring_values 0.005192 0.001947
142 faceroll_ratio_value_number_to_time_series_length 0.005176 0.001959
143 relboworiZ_last_location_of_minimum 0.005164 0.001998
144 rkneeoriY_last_location_of_maximum 0.004938 0.001873
145 neckoriZ_abs_energy 0.004851 0.001956
146 lwristoriY_sum_values 0.004693 0.001884
Note. Importances as computed by the Random Forest algorithm.
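The ranking above combines recursive feature elimination (RFE) with Random-Forest importances; the STD column is the spread of an importance over the individual trees, as in the scikit-learn forest-importances example cited in the references. A sketch on synthetic data (not the thesis code):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Synthetic stand-in for the 4,960-feature gesture set (much smaller here).
X, y = make_classification(n_samples=300, n_features=30, n_informative=8,
                           random_state=0)

# Recursive feature elimination with a forest as the ranking estimator.
rfe = RFE(RandomForestClassifier(n_estimators=50, random_state=0),
          n_features_to_select=10)
rfe.fit(X, y)
X_selected = X[:, rfe.support_]

# Refit on the surviving features; the std over individual trees gives the
# spread reported in the STD column above.
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_selected, y)
importances = forest.feature_importances_
std = np.std([tree.feature_importances_ for tree in forest.estimators_], axis=0)
ranking = np.argsort(importances)[::-1]  # rank 1 = largest importance
```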
Appendix H
Best hyperparameters for each algorithm
Algorithm Hyperparameters Best value
AdaBoostClassifier n_estimators 1000
learning_rate .01
algorithm SAMME.R
BaggingClassifier n_estimators 1000
max_samples 1.0
ExtraTreesClassifier criterion gini
max_depth 25
min_samples_split 1
min_samples_leaf 5
GradientBoostingClassifier learning_rate 0.05
RandomForestClassifier n_estimators 1000
criterion gini
max_depth 20
PassiveAggressiveClassifier C 1
max_iter 100
RidgeClassifier alpha .0
solver sparse_cg
max_iter 100
LogisticRegression C 3
solver liblinear
max_iter 100
SGDClassifier loss log
max_iter 100
shuffle True
Perceptron max_iter 100
shuffle True
MLPClassifier hidden_layer_sizes (100,)
activation tanh
solver adam
alpha 0.05
learning_rate constant
BernoulliNB alpha 0.1
GaussianNB No hyperparameters available for this classifier.
KNeighborsClassifier n_neighbors 20
algorithm auto
NearestCentroid metric manhattan
SVC kernel rbf
C 5
gamma scale
NuSVC kernel rbf
gamma scale
nu .3
LinearSVC C 1
loss squared_hinge
DecisionTreeClassifier criterion gini
max_depth 25
min_samples_split 10
min_samples_leaf 2
LinearDiscriminantAnalysis solver svd
QuadraticDiscriminantAnalysis reg_param 1.0
Appendix I
Full results of voting algorithms
Voting method A: combining the top n performing algorithms.
Combination of algorithms    Soft test accuracy (mean)    Soft test accuracy (STD*3)    Hard test accuracy (mean)    Hard test accuracy (STD*3)
Top 2 0.4643 0.0464 0.4568 0.0499
Top 3 0.4700 0.0570 0.4679 0.0563
Top 4 0.4660 0.0486 0.4733 0.0553
Top 5 0.4593 0.0370 0.4761 0.0491
Top 6 0.4566 0.0336 0.4733 0.0477
Top 7 0.4550 0.0357 0.4698 0.0470
Top 8 0.4539 0.0369 0.4634 0.0485
Top 9 0.4581 0.0354 0.4665 0.0423
Top 10 0.4508 0.0404 0.4624 0.0394
Top 11 0.4509 0.0404 0.4623 0.0410
Top 12 0.4406 0.0405 0.4620 0.0395
Top 13 0.4317 0.0405 0.4582 0.0426
Top 14 0.4321 0.0422 0.4578 0.0469
Top 15 0.4339 0.0406 0.4583 0.0467
All 0.4319 0.0407 0.4534 0.0451
Rank Algorithm Rank Algorithm
1 Random Forest 9 Gradient Boosting
2 Bagging Classifier 10 SGD
3 Extra Tree 11 Ada Boost
4 Linear Discriminant Analysis 12 GaussianNB
5 Logistic Regression 13 BernoulliNB
6 Multi-layer Perceptron 14 K-nearest-neighbour
7 Support Vector Machine 15 Decision Trees
8 Nu Support Vector Machine 16 Quadratic Discriminant Analysis
Voting method B: combining only the best-performing algorithm from each category.
Combination of algorithms    Soft test accuracy (mean)    Soft test accuracy (STD*3)    Hard test accuracy (mean)    Hard test accuracy (STD*3)
Best 2 0.4502 0.0452 0.4461 0.0512
Best 3 0.4441 0.0422 0.4492 0.0383
Best 4 0.4388 0.0280 0.4412 0.0344
Best 5 0.4399 0.0305 0.4444 0.0316
Best 6 0.4143 0.0377 0.4415 0.0385
Best 7 0.4159 0.0385 0.4415 0.0385
All 0.4222 0.0380 0.4442 0.0414
Rank Algorithm Rank Algorithm
1 Random Forest 5 Support Vector Machine
2 Linear Discriminant Analysis 6 GaussianNB
3 Logistic Regression 7 K-nearest-neighbour
4 Multi-layer Perceptron 8 Decision Trees
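Soft voting averages the classifiers' predicted class probabilities, while hard voting takes a majority vote over their predicted labels. A sketch with scikit-learn's VotingClassifier over three of the algorithm families above (synthetic data; not the thesis pipeline):

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=20, n_informative=10,
                           n_classes=3, random_state=0)

estimators = [
    ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
    ("lda", LinearDiscriminantAnalysis()),
    ("lr", LogisticRegression(max_iter=1000)),
]
# "soft" averages predict_proba outputs; "hard" takes a majority label vote.
soft = VotingClassifier(estimators, voting="soft")
hard = VotingClassifier(estimators, voting="hard")
soft_acc = cross_val_score(soft, X, y, cv=3).mean()
hard_acc = cross_val_score(hard, X, y, cv=3).mean()
```

Note that soft voting requires every member to implement predict_proba, which is why probability-free classifiers would need calibration before joining a soft ensemble.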
Appendix J
Classification report of the tuned Random Forest algorithm
Gesture Precision Recall F1-score Support
Airplane 0.81 0.76 0.78 33
Bed 0.50 0.72 0.59 25
Bird 0.59 0.81 0.68 16
Boat 0.38 0.10 0.15 31
Book 0.54 0.54 0.54 24
Bridge 0.38 0.15 0.22 33
Bus 0.54 0.43 0.48 30
Car 0.46 0.46 0.46 24
Castle 0.32 0.36 0.34 33
Chair 0.38 0.64 0.48 25
Comb 0.62 0.69 0.65 26
Cow 0.39 0.32 0.35 28
Crocodile 0.75 0.75 0.75 28
Cup 0.56 0.50 0.53 28
Drumset 0.42 0.77 0.55 22
Fish 0.40 0.68 0.51 28
Guitar 0.52 0.68 0.59 25
Helicopter 0.33 0.25 0.29 16
Horse 0.52 0.39 0.45 28
Lamp 0.36 0.20 0.26 25
Motorcycle 0.45 0.54 0.49 24
Pencil 0.38 0.32 0.35 31
Piano 0.42 0.72 0.53 25
Pig 0.11 0.04 0.06 26
Scissors 0.36 0.39 0.38 31
Spoon 0.37 0.38 0.38 26
Stairs 0.50 0.16 0.24 32
Table 0.41 0.50 0.45 28
Toothbrush 0.43 0.50 0.47 20
Tortoise 0.24 0.50 0.33 16
Train 0.83 0.59 0.69 32
Triangle 0.55 0.57 0.56 28
Trumpet 0.46 0.59 0.52 22
Violin 0.75 0.67 0.71 27
Xylophone 0.38 0.25 0.30 24
Micro avg. 0.47 0.47 0.48 920
Macro avg. 0.47 0.48 0.46 920
Weighted avg. 0.48 0.47 0.46 920
Note. For this run, the data was split into a training (75%) and test (25%) set, and no cross-validation was performed. Accuracy on this run was 0.4750; without ‘Boat’ and ‘Pig’ it was 0.5340.
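Per-class tables like the one above are produced by scikit-learn's classification_report on a held-out split. A minimal sketch of the evaluation protocol described in the note (synthetic data; not the thesis code):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=20, n_informative=10,
                           n_classes=4, random_state=0)
# Same protocol as the note above: a single 75/25 split, no cross-validation.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
y_pred = clf.predict(X_te)
report = classification_report(y_te, y_pred)  # per-class precision/recall/F1/support
acc = accuracy_score(y_te, y_pred)
```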