University of Nevada, Reno

Learning, Recognizing and Early Classification of Spatio-Temporal Patterns using Spike Timing Neural Networks

A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science and Engineering

by Banafsheh Rekabdar

Dr. Monica Nicolescu/Thesis Advisor

Dr. Mircea Nicolescu/Thesis Co-Advisor

May 2017

THE GRADUATE SCHOOL

We recommend that the dissertation prepared under our supervision by

BANAFSHEH REKABDAR

Entitled

Learning, Recognizing and Early Classification Of Spatio-Temporal Patterns Using Spike Timing Neural Networks

be accepted in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Monica Nicolescu, Ph.D., Advisor

Mircea Nicolescu, Ph.D., Committee Member

George Bebis, Ph.D., Committee Member

Sushil Louis, Ph.D., Committee Member

Raul Rojas, Ph.D., Graduate School Representative

David W. Zeh, Ph.D., Dean, Graduate School

May 2017


Abstract

by Banafsheh Rekabdar

Learning and recognizing spatio-temporal patterns is an important problem for all biological systems. Gestures, movements and activities all encompass both spatial and temporal information that is critical for implicit communication and learning. This dissertation presents a novel, unsupervised approach for learning, recognizing and early classification of spatio-temporal patterns using spiking neural networks for human-robot domains. The proposed spiking approach has five variations, which have been validated on images of handwritten digits and on human hand gestures and motions.

The main contributions of this work are as follows: i) it requires a very small number of training examples, ii) it enables early recognition from only partial information of the pattern, iii) it learns patterns in an unsupervised manner, iv) it accepts variable-sized input patterns, v) it is invariant to scale and translation, vi) it can recognize patterns in real time and, vii) it is suitable for human-robot interaction applications and has been successfully tested on a PR2 robot. This dissertation presents comparisons between all variations of this approach and well-known supervised and unsupervised machine learning techniques on in-house and publicly available datasets. Although the approaches proposed in this dissertation are unsupervised, they outperform other state-of-the-art methods or, in some cases, provide comparable results.

Acknowledgements

The research journey is the most challenging and valuable experience I have ever had in my life. This dissertation is the milestone that marks all my efforts and achievements along that journey. The friendships I have made during these years at the University of Nevada, Reno are equally important. This dissertation was made possible by the generous help of so many people in my life. Without them I would not be where I am today. Their support means the most to me. I would like to extend my thanks to all of them.

I would like to express my sincerest respect and gratitude toward my advisor, Dr. Monica Nicolescu. I am grateful for her insights and advice in guiding me through research problems. I also appreciate her patience in allowing me to explore and work on research topics that I am interested in. Her teaching of how to write and present in a scholarly manner has been very beneficial. I am lucky to have received this amount of training under the expertise of Dr. Nicolescu. She has really been there for me and I cannot thank her enough.

My special thanks go out to my co-advisor, Dr. Mircea Nicolescu, who has provided me with inspiration and motivation to strive to produce better work every day, and whose generous support and patience I deeply appreciate.

I would like to thank the professors who shared their precious time to be on my dissertation committee: Dr. George Bebis, Dr. Sushil Louis and Dr. Raul Rojas. I also would like to thank those in the CSE department who were always kind enough to offer help: Lisa Cody and Heather Lara.

Last but not least, I would like to thank my parents for their endless love and support since the day I was born. I also want to thank my lovely sister Nasim for her encouragement and company. My family makes me who I am today. Without their support, this work would not have been completed.

Contents

Abstract i

Acknowledgements ii

List of Figures vii

List of Tables ix

1 Introduction 1
  1.1 Spatio-Temporal Pattern Classification ...... 1
  1.2 Early Detection Problem ...... 2
  1.3 Early Detection of Human Hand Gestures in Human-Robot Interaction Systems ...... 3
  1.4 Contribution ...... 4
  1.5 Conclusion ...... 8

2 Previous Work 9
  2.1 Statistical Approaches ...... 10
  2.2 Gesture Recognition ...... 11
  2.3 Spatio-Temporal Patterns ...... 12
  2.4 Biological Neural Networks ...... 12
  2.5 Conclusion ...... 14

3 SNN Approach 1 for Hand-Written Digits 15
  3.1 Approach ...... 16
    3.1.1 Network Structure ...... 16
    3.1.2 Temporal Structure of Data ...... 18
    3.1.3 Network Training Approach ...... 19
    3.1.4 Finding Polychronous Neuronal Groups (PNGs) ...... 21
    3.1.5 Classification Algorithm ...... 24
    3.1.6 Early Detection ...... 25
  3.2 Experimental Results ...... 27
    3.2.1 Training Stage ...... 27
    3.2.2 Generalization Results ...... 28
    3.2.3 Comparison Results ...... 29
    3.2.4 Early Detection ...... 32
  3.3 Conclusion ...... 37

4 SNN Approach 2 for Hand-Written Digits 39
  4.1 Approach ...... 40
    4.1.1 Spike Timing Neural Network Structure ...... 40
    4.1.2 Temporal Structure of Data ...... 40
    4.1.3 Training Spike Timing Neural Networks ...... 41
    4.1.4 Modeling Data with Temporal Patterns of Firing Neurons ...... 41
    4.1.5 Classification Algorithm ...... 43
    4.1.6 Parallelizing the Classification Algorithm ...... 47
    4.1.7 Early Classification ...... 47
  4.2 Experimental Results ...... 48
    4.2.1 Comparison with Other Approaches ...... 51
      4.2.1.1 Unsupervised Spike Timing Neural Network with Jaccard Index ...... 52
      4.2.1.2 Support Vector Machines ...... 52
      4.2.1.3 Regularized Logistic Regression ...... 52
      4.2.1.4 Ensemble Neural Networks ...... 52
      4.2.1.5 Bayes Network ...... 53
      4.2.1.6 Multilayer Perceptron ...... 54
      4.2.1.7 Radial Basis Function Network ...... 54
      4.2.1.8 Random Forest ...... 55
      4.2.1.9 Stacked Denoising Auto Encoder ...... 55
    4.2.2 Early Detection Results ...... 57
  4.3 Conclusion ...... 62

5 SNN Approach 1 for Human Hand Gestures 64
  5.1 Approach ...... 65
    5.1.1 Mapping of Spatio-Temporal Patterns to Spike Trains ...... 65
    5.1.2 Finding Polychronous Neuronal Groups (PNGs) ...... 67
    5.1.3 Classification Algorithm ...... 70
    5.1.4 Early Detection ...... 70
  5.2 Experimental Results ...... 71
    5.2.1 Classification Results ...... 71
    5.2.2 Early Detection Results ...... 75
    5.2.3 Comparison with Other Approaches ...... 79
  5.3 Discussion and Future Work ...... 84
  5.4 Conclusion ...... 85

6 LCS-based SNN Approach 2 for Human Hand Gestures 86
  6.1 Approach ...... 87
    6.1.1 Mapping of Spatio-Temporal Patterns to Spike Trains ...... 87
    6.1.2 Finding Polychronous Neuronal Groups (PNGs) ...... 87
    6.1.3 Classification Algorithm ...... 87
  6.2 Early Detection ...... 88
    6.2.1 Experimental Results ...... 88
  6.3 Conclusion ...... 95

7 Real-time SNN Approach 3 for Human Hand Gestures 96
  7.1 General Approach ...... 97
    7.1.1 Mapping of Spatio-Temporal Patterns to Spike Trains ...... 98
    7.1.2 Finding Polychronous Neuronal Groups (PNGs) ...... 98
    7.1.3 Classification Algorithm ...... 100
    7.1.4 Real-Time Recognition ...... 100
    7.1.5 Early Classification (Intent Recognition) ...... 102
    7.1.6 Early Classification Measurements ...... 104
  7.2 Results ...... 104
    7.2.1 In-house Dataset ...... 104
    7.2.2 6dmg Dataset ...... 105
    7.2.3 Real-Time Classifier ...... 105
    7.2.4 Comparison with Other Approaches ...... 110
    7.2.5 Early Detection Results ...... 114
  7.3 Conclusion ...... 115

8 Conclusion and Future Work 117

Bibliography 119

List of Figures

1.1 Training procedure ...... 5
1.2 Testing procedure ...... 6

3.1 Spatio-temporal pattern representation; Left: a sample pattern; Middle: pixel intensity values; Right: corresponding spiking pattern ...... 18
3.2 Spike timing dependent plasticity (STDP) ...... 20
3.3 Training algorithm ...... 21
3.4 Possible combinations of 3 anchor neurons for a pattern ...... 22
3.5 Algorithm for building the models for the training patterns ...... 24
3.6 Classification algorithm ...... 25
3.7 Early detection approach ...... 26
3.8 Digits used for training ...... 27
3.9 Confusion matrix for multi-class problem ...... 29
3.10 A subset of correctly classified testing samples ...... 30
3.11 A subset of misclassified testing samples ...... 30
3.12 Generic concepts for point of first detection and point of early confident detection ...... 32
3.13 Average of early detection rates (percentages) ...... 34
3.14 Average of correct duration (percentages) ...... 35
3.15 Comparison between averages of point of first detection (%) and point of early confident detection (%) ...... 35
3.16 Left: Early recognition results for a digit one; Right: Early recognition results for a digit five ...... 36
3.17 Early recognition results for a digit eight (incorrectly classified) ...... 37

4.1 Two model strings a = (a1 a2 a3) and b = (b1 b2 b3). Each character contains a set of neuron numbers which fired at the same time step. The values shown on the edges describe the Jaccard similarity between two characters, i.e., the Jaccard similarity for a1 and b1 equals 0.33. The overall LCS similarity for a and b is 0.66 ...... 44
4.2 Confusion matrix of classification results ...... 49
4.3 Confusion matrix of top-3 accuracy ...... 50
4.4 Accuracy vs. learning rate for SDAE ...... 57
4.5 Average of early detection rates (percentages) ...... 58
4.6 Average of correct duration (percentages) ...... 59
4.7 Comparison between averages of point of first detection (%) and point of early confident detection (%) ...... 59
4.8 Early recognition results for a digit eight (correctly classified) ...... 60
4.9 Early recognition results for a digit nine (correctly classified) ...... 61
4.10 Early recognition results for a digit nine (incorrectly classified) ...... 61
4.11 Early recognition results for a digit two (incorrectly classified) ...... 62
4.12 Early recognition results for a digit three (a tie) ...... 63

5.1 Representation of spatio-temporal patterns ...... 66
5.2 Training data: first and third row; correctly classified testing data: second and fourth row ...... 72
5.3 The top-3 candidates for similarity ...... 73
5.4 Overlap (in number of PNGs) between all pairs of digit models ...... 74
5.5 Confusion matrix ...... 75
5.6 Average of the number of PNGs for each model, with standard deviations ...... 76
5.7 Average of early detection rates (percentages) ...... 76
5.8 Average of correct duration (percentages) ...... 77
5.9 Comparison between averages of point of first detection (%) and point of early confident detection (%) ...... 78
5.10 Similarity metric: a digit one sample ...... 79
5.11 Similarity metric: a digit eight sample ...... 80

6.1 Confusion matrix of classification results ...... 91
6.2 Confusion matrix of top-3 accuracy ...... 92
6.3 Average of early detection rates (percentages) ...... 92
6.4 Average of correct duration (percentages) ...... 93
6.5 Comparison between averages of point of first detection (%) and point of early confident detection (%) ...... 93

7.1 The process of finding PNGs ...... 99
7.2 Real-time recognition process ...... 103
7.3 Screenshot of our real-time classifier ...... 107
7.4 Training data: first and third row; correctly classified testing data: second and fourth row (in-house dataset) ...... 108
7.5 Average of early detection rates (percentages) ...... 115
7.6 Average of correct duration (percentages) ...... 115

List of Tables

1.1 Capabilities of the proposed approaches (unordered PNGs-Jaccard-digits: uPNG-Jac-d, unordered PNGs-Jaccard-hand gestures: uPNG-Jac-h, ordered PNGs-LCS-hand gestures: oPNG-LCS-h, characters-Jaccard-LCS-digits: ch-JacLCS-d, unordered PNGs-Jaccard-cuda: uPNG-Jac-c, Small training set: Small tr set, Early classification: Early class, Scale invariance: Scale invar, Real-time classifier: Real-time cl) ...... 8
1.2 Features of the proposed approaches (unordered PNGs-Jaccard-digits: uPNG-Jac-d, unordered PNGs-Jaccard-hand gestures: uPNG-Jac-h, ordered PNGs-LCS-hand gestures: oPNG-LCS-h, characters-Jaccard-LCS-digits: ch-JacLCS-d, unordered PNGs-Jaccard-cuda: uPNG-Jac-c, Network Response: Net Res, Encoding Method: Enc Met, Classification Method: Clas Met) ...... 8

3.1 Multi-class classification results for a network with Gaussian distribution for each individual digit ...... 28
3.2 Multi-class classification results for a network with Gaussian distribution for all digits combined ...... 29
3.3 Result of comparison experiment (unit: %) ...... 31

4.1 Classification results for each individual digit. SR: success rate, ER: error rate, RR: rejection rate ...... 48
4.2 Classification results for all digits combined ...... 48
4.3 Classification results with different threshold values. Thr: Threshold, Acc: Accuracy ...... 51
4.4 Classification results for all digits combined for top-3 accuracy ...... 51
4.5 Specification of Bayes Network ...... 54
4.6 Specification of naïve Bayes ...... 54
4.7 Specification of RBF network ...... 55
4.8 Specification of random forest ...... 55
4.9 Parameters of SDAE. LR: Learning Rate, HL: Hidden Layers, HN: Hidden Nodes ...... 56
4.10 Parameters of feedforward neural network (initialized by SDAE weights). LR: Learning Rate, HL: Hidden Layers, HN: Hidden Nodes ...... 56
4.11 Accuracy result of comparison experiment (unit: %) ...... 56

5.1 Classification results for each individual digit and all digits combined. SR: success rate, ER: error rate, RR: rejection rate ...... 72
5.2 Classification results of top-3 accuracy for each individual digit and for all digits combined. SR: success rate, ER: error rate, RR: rejection rate ...... 79
5.3 Parameters of SVM ...... 81
5.4 Parameters of ENN. (LR: Learning Rate, N: Nodes) ...... 82
5.5 Parameters of DBN ...... 82
5.6 Parameters of SDAE ...... 83
5.7 Parameters of feedforward neural network (initialized by SDAE weights, LR: Learning Rate, N: Nodes) ...... 83
5.8 Comparison results of the proposed approach with other approaches ...... 84

6.1 Classification results for each individual digit. SR: success rate (%), ER: error rate (%), RR: rejection rate (%) ...... 89
6.2 Classification results for all digits ...... 89
6.3 Classification results for each individual digit for top-3 accuracy. SR: success rate, ER: error rate, RR: rejection rate ...... 90
6.4 Classification results for all digits combined for top-3 accuracy ...... 90
6.5 Comparison results of the proposed approach with other approaches (PR: Proposed Approach) ...... 91
6.6 Parameters of SVM ...... 91
6.7 Parameters of ENN. (LR: Learning Rate) ...... 94
6.8 Parameters of DBN ...... 94
6.9 Parameters of feedforward neural network (initialized by DBN weights) ...... 94
6.10 Parameters of SDAE ...... 95
6.11 Parameters of feedforward neural network (initialized by SDAE weights) ...... 95

7.1 Tracked and stored information for each gesture in the 6dmg dataset ...... 105
7.2 Classification results for each individual digit (%). SR: success rate, ER: error rate, RR: rejection rate (in-house dataset | 6dmg digit dataset) ...... 107
7.3 Classification results for all digits (%). SR: success rate, ER: error rate, RR: rejection rate (in-house dataset | 6dmg digit dataset) ...... 107
7.4 Top-3 classification results for each individual and all digits (%). SR: success rate, ER: error rate, RR: rejection rate (in-house dataset | 6dmg digit dataset) ...... 109
7.5 Top-3 classification results for all digits (%). SR: success rate, ER: error rate, RR: rejection rate (in-house dataset | 6dmg digit dataset) ...... 109
7.6 Specification of recurrent neural network. (LR: Learning rate, BS: Batch size, LF: Loss function, NE: Number of epochs, NN: Number of neurons, OS: Output size, CrEn: Cross entropy) ...... 111
7.7 Specification of continuous hidden Markov model. (NE: Number of epochs, NC: Number of components, EF: Estimating function, Cov-t: Covariance type, 6dmg-d: 6dmg digit dataset, 6dmg-c: 6dmg character dataset, EM: Expectation maximization algorithm, BW: Baum-Welch) ...... 112
7.8 Specification of discrete hidden Markov model. (NS: Number of states, NO: Number of outputs, EF: Estimating function, 6dmg-d: 6dmg digit dataset, 6dmg-c: 6dmg character dataset, EM: Expectation maximization algorithm, BW: Baum-Welch) ...... 112
7.9 Specification of spiking neural network. (NN: Number of neurons) ...... 112
7.10 Comparison results, 6dmg digit dataset (%) ...... 113
7.11 Comparison results, in-house dataset, digits, angle features (%) ...... 113
7.12 Comparison results, 6dmg character dataset (%) ...... 114

Chapter 1

Introduction

1.1 Spatio-Temporal Pattern Classification

Spatio-temporal pattern recognition is an important problem for all biological systems. Activities like gestures and movements contain both spatial and temporal information. This information is essential in communication, collaboration and learning by observation and demonstration. Spatio-temporal patterns and gestures are extensively used in social interactions, as they carry the intentional meanings of a person's own goals. While people can quickly recognize such gestures and predict others' intentions based on the observed movement patterns, the same task is more challenging for an autonomous robot system. In order to facilitate and support such interactions, which naturally happen between people, in domains in which humans and robots interact with each other, the ability for real-time, early recognition of human gestures becomes of key importance.

Existing approaches to learning spatio-temporal patterns are typically supervised and offline, rely on extensive amounts of training data, require observation of the entire pattern for recognition, cannot process variable-sized input patterns, can only handle patterns in a fixed frame, and are not scale/translation invariant. In contrast, the proposed methods are mostly unsupervised and are robustly trainable on a small training set with a limited number of samples; they are able to classify variable-sized inputs at different scales/locations, and they are able to classify patterns early. In addition, a real-time version of the spiking networks has been developed, which is suitable for human-robot interaction (HRI) applications.

1.2 Early Detection Problem

This research is motivated by two robotic problems that rely on an autonomous system's ability to encode and recognize spatio-temporal patterns: intent recognition [1] and imitation learning [2][3]. Evidence from psychology indicates that people's ability to recognize the intent of others relies on a mechanism for representing, predicting and interpreting the actions of other agents [4]. In a similar way, learning from demonstration focuses on the problem of understanding and encoding observations of a teacher's actions. In both domains, the activities observed by the autonomous system contain both spatial and temporal information. What is more important, however, is that both domains require that these patterns be encoded in a way that enables early recognition, even before the entire pattern is observed. For example, people can quickly infer a person's intention of grasping a cup simply by observing their hand posture and movement toward a cup on the table, well before the actual grasp is finalized.

1.3 Early Detection of Human Hand Gestures in Human-Robot Interaction Systems

For natural human-robot interaction, a robotic system should understand gesture-based non-verbal communication. This requirement is further emphasized in collaborative processes involving cooperative actions, such as smooth passing of objects between robots and human counterparts [5], navigation in crowded environments [6], and assistive applications of robotics [7]. The first step toward this goal is to be able to understand gestures in real time. In collaborative scenarios, it is crucial for robots to understand what their human teammate is doing early on, in order to proactively help them complete the task. This requires early classification of patterns. Therefore, both real-time processing and early classification are important features of a robotic system in human-robot interaction domains. One of the most complex robotic platforms designed for research, especially in the human-robot interaction domain, is the PR2 robot by Willow Garage. The numerous sensors on this robot make it very suitable for tasks that require high-resolution sensing. I chose to use the PR2 as the implementation and evaluation platform, mainly because of its sensing capabilities.

Furthermore, the arms on the PR2 are capable of complex, dexterous movements. The arms are suitable for human-robot interaction, not for industrial, heavy-weight tasks.

1.4 Contribution

Spike timing neural networks (SNNs) are well suited to modeling spatio-temporal patterns. In this dissertation I propose five methods based on SNNs with axonal conductance delays [8] that address the problem of learning, recognizing and early classification of spatio-temporal patterns. These patterns are typically encountered when representing gestures or other human actions.

All the proposed techniques have two phases: training and testing. The two phases are shown in Figure 1.1 and Figure 1.2. The first step in the training phase is to map spatio-temporal patterns (the input data) onto the neurons of the spike timing network. This is achieved by encoding the spatio-temporal patterns into neural spike trains, which are then used to stimulate the spike timing network. In my work the input patterns are images of handwritten digits, videos of human hand gestures, and real-time data.

Figure 1.1: Training procedure

Figure 1.2: Testing procedure

For input consisting of images, each pixel is assigned to one neuron of the spike timing network. This is a one-to-one mapping. The firing time of the neurons is based on the temporal information encoded in the input patterns, as described in Chapter 3 and Chapter 4. For input consisting of videos of human gestures and real-time data, the hand positions are extracted, and the angle between two consecutive hand positions is computed. Each angle is assigned to a group of five neurons. This is a one-to-five mapping. The firing time of the neurons is based on the order of their corresponding angles, as described in Chapter 5, Chapter 6 and Chapter 7. The training samples are not labeled, as the training part of the proposed approaches is unsupervised. The synaptic weights in the spike timing network are updated during the training process based on the spike-timing dependent plasticity (STDP) rule. After the network is fully trained, each training pattern is presented to the network, and the network's response is considered the output model corresponding to the input pattern. Hence, one output model is created corresponding to each training class. These models are later used for classification. To classify an unseen pattern in the classification phase (Figure 1.2), it is mapped into a spike train.

Then an output model is created based on the network's response to the pattern.

The classification decision is made based on a comparison between the testing model and all the training models.
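As an illustration of the angle-based encoding described above, the sketch below maps a sequence of 2-D hand positions to a spike train. The bin count and the contiguous layout of neuron groups are illustrative assumptions, not the actual network configuration, which is given in later chapters.

```python
import math

NEURONS_PER_ANGLE = 5  # each quantized angle drives a group of five neurons
NUM_BINS = 8           # assumed number of angle bins; the actual count may differ

def encode_gesture(positions):
    """Map a sequence of 2-D hand positions to a spike train.

    The angle between each pair of consecutive positions selects a group of
    five neurons, and the group's firing time is the order (index) of that
    angle within the gesture. Returns a list of (neuron_id, firing_time).
    """
    spikes = []
    for t, ((x0, y0), (x1, y1)) in enumerate(zip(positions, positions[1:])):
        angle = math.atan2(y1 - y0, x1 - x0) % (2 * math.pi)
        bin_id = int(angle / (2 * math.pi) * NUM_BINS) % NUM_BINS
        base = bin_id * NEURONS_PER_ANGLE
        # one-to-five mapping: all five neurons in the group fire at step t
        spikes.extend((base + k, t) for k in range(NEURONS_PER_ANGLE))
    return spikes

# a short rightward-then-upward stroke produces two groups of five spikes
train = encode_gesture([(0, 0), (1, 0), (1, 1)])
```

The resulting spike train would then stimulate the network, whose response forms the model used for classification.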

In contrast with other types of neural networks, a spike timing neural network does not have input and output layers. Hence the overall behaviour of the network in response to a specific input is considered the output. Three different types of responses are considered in the proposed approaches: 1) unordered sets of polychronous neural groups (PNGs), which encode stereotypical time-locked firing patterns (Chapter 3, Chapter 5 and Chapter 7), 2) ordered sets of PNGs (Chapter 6), and 3) a string of "characters", in which each character is a set of neurons that fired at a particular time step (Chapter 4). In the classification phase, three similarity criteria are defined. For unordered sets of PNGs, the Jaccard index is used (unordered PNGs-Jaccard-digits, unordered PNGs-Jaccard-hand gestures). For ordered sets of PNGs, the longest common subsequence (LCS) is used (ordered PNGs-LCS-hand gestures). A combination of Jaccard and LCS (characters-Jaccard-LCS-digits) is used for the strings of characters. All approaches are capable of early detection of the patterns. For real-time data, a real-time classifier is proposed (unordered PNGs-Jaccard-cuda, Chapter 7). This approach is a parallelized CUDA implementation of the unordered PNGs-Jaccard-digits approach. The proposed approaches and their features are shown in Table 1.2, and Table 1.1 shows the capabilities of the five proposed methods.
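The two similarity measures named above can be sketched in a few lines. This is a minimal illustration of the Jaccard index over unordered PNG sets and the LCS length over ordered PNG sequences; the exact normalization used in the classification algorithms is defined in the corresponding chapters.

```python
def jaccard(a, b):
    """Jaccard index of two sets: |A & B| / |A | B|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a or b else 1.0

def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences,
    computed with the standard dynamic-programming recurrence."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

# unordered PNG sets: two of four distinct PNGs are shared
sim = jaccard({1, 2, 3}, {2, 3, 4})      # 0.5
# ordered PNG sequences: the common subsequence (2, 3, 4) has length 3
common = lcs_length([1, 2, 3, 4], [2, 3, 5, 4])
```

A testing model is compared against every training model with the appropriate measure, and the closest training class is reported.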

Table 1.1: Capabilities of the proposed approaches (unordered PNGs-Jaccard-digits: uPNG-Jac-d, unordered PNGs-Jaccard-hand gestures: uPNG-Jac-h, ordered PNGs-LCS-hand gestures: oPNG-LCS-h, characters-Jaccard-LCS-digits: ch-JacLCS-d, unordered PNGs-Jaccard-cuda: uPNG-Jac-c, Small training set: Small tr set, Early classification: Early class, Scale invariance: Scale invar, Real-time classifier: Real-time cl)

              uPNG-Jac-d  uPNG-Jac-h  oPNG-LCS-h  ch-JacLCS-d  uPNG-Jac-c
Small tr set      X           X           X            X           X
Unsupervised      X           X           X            X           X
Early class       X           X           X            X           X
Scale invar       -           X           X            -           X
Real-time cl      -           -           -            -           X
Dataset         Digits     Gestures    Gestures     Digits      Gestures
Chapter #         3           5           6            4           7

Table 1.2: Features of the proposed approaches (unordered PNGs-Jaccard-digits: uPNG-Jac-d, unordered PNGs-Jaccard-hand gestures: uPNG-Jac-h, ordered PNGs-LCS-hand gestures: oPNG-LCS-h, characters-Jaccard-LCS-digits: ch-JacLCS-d, unordered PNGs-Jaccard-cuda: uPNG-Jac-c, Network Response: Net Res, Encoding Method: Enc Met, Classification Method: Clas Met)

              Net Res         Input  Enc Met  Clas Met
uPNG-Jac-d    Unordered PNGs  Pixel  1 to 1   Jaccard
uPNG-Jac-h    Unordered PNGs  Angle  1 to 5   Jaccard
oPNG-LCS-h    Ordered PNGs    Angle  1 to 5   LCS
ch-JacLCS-d   Fired Neurons   Pixel  1 to 1   Jaccard+LCS
uPNG-Jac-c    Unordered PNGs  Angle  1 to 5   Jaccard

1.5 Conclusion

The rest of this dissertation is organized as follows. Chapter 2 presents a literature review of the spatio-temporal pattern recognition problem and existing approaches to solving it. Chapters 3, 4, 5, 6 and 7 present the details of the five proposed approaches, followed by their experimental results. Finally, Chapter 8 concludes this dissertation and provides future directions for extending this work.

Chapter 2

Previous Work

Researchers in many fields have been interested in the problem of classifying sequence data. In my case, the problem of sequence classification is motivated by intent recognition, the problem of inferring a human's unseen mental states from their visible actions [9]. The ability to infer intentions from sensor streams is critical if robots are to operate in unstructured social environments that are ubiquitous wherever humans are found. Roboticists have explored a number of potential solutions to the intent recognition problem, including symbolic approaches [10] and probabilistic methods [1].

2.1 Statistical Approaches

In recent years, the most successful methods have relied on statistical modeling, in particular hidden Markov models (HMMs). However, even methods not designed for sequence analysis have been used for the task. By passing a fixed-width window over a sequence, one can obtain a set of finite-dimensional vectors; it is then possible to pass these vectors as inputs to standard classification methods such as logistic regression, support vector machines, and classical feedforward neural networks [11]. Although this approach can be successful, it runs the risk of missing correlations between sequence elements that fall into two different windows.
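The windowing trick described above can be sketched in a few lines (an illustrative helper, not code from any of the cited systems):

```python
def sliding_windows(seq, width, step=1):
    """Cut a sequence into overlapping fixed-width windows.

    Each window becomes one finite-dimensional feature vector that can be
    fed to a standard classifier (logistic regression, SVM, MLP, ...).
    Note that correlations between elements falling into different
    windows are invisible to the classifier.
    """
    return [seq[i:i + width] for i in range(0, len(seq) - width + 1, step)]

windows = sliding_windows([1, 2, 3, 4, 5], width=3)
# three overlapping windows: [1,2,3], [2,3,4], [3,4,5]
```

A dependency between, say, the first and last element of the sequence never appears inside any single window here, which is exactly the limitation noted above.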

In general, the standard approaches to modeling robot systems that evolve over time are either continuous-time parametric approaches such as Kalman filters [12], or discrete-time models such as Markov chains [13] or hidden Markov models [14]. In social settings, the assumptions made by Kalman filters and similar approaches (linear state dynamics, normally-distributed noise, continuous state space) are either too strong or do not map well onto models of social interaction. Although it is possible to abandon most of the assumptions of a Kalman filter by using an extended Kalman filter, building a dynamical system that reliably models social interaction is still challenging. Hidden Markov models, being discrete-time models with discrete state spaces, are much better suited to modeling such systems over time, and have been used successfully in a number of settings [14][15]. However, even HMMs make conditional independence assumptions that limit their usefulness in many applied settings. The limitations encountered in practice motivated my search for other approaches.

2.2 Gesture Recognition

A wide range of approaches have been developed for gesture recognition, from image processing and computer vision to statistical modeling and connectionist systems. Statistical modeling techniques used in this context include PCA, hidden Markov models [14][16][17], Kalman filtering [18], particle filtering [19][20] and condensation algorithms [21][22][23][24]. Finite state machines (FSMs) have also been employed in modeling human gestures [24][25][26][27]. Many gesture recognition systems use a variety of computer vision techniques [28], such as feature extraction, object detection, clustering and classification, as well as image-processing techniques [29], such as analysis of shape, texture, color or motion cues, optical flow, segmentation and contour modeling [30]. Examples of connectionist approaches [31] used in gesture recognition include multilayer perceptrons (MLPs), time-delay neural networks (TDNNs) and radial basis function networks (RBFNs). The methods are also distinguished by whether they consider a temporal aspect. For static gesture recognition, techniques such as logistic regression, support vector machines, and classical feedforward neural networks [11] give very good performance on data sets in which the patterns are scaled and centered within a particular window [32]. Approaches that use both the spatial and temporal components of the pattern rely on techniques such as dynamic time warping, hidden Markov models [33] and time-delay neural networks [34], but they are also sensitive to the scale of the pattern.

2.3 Spatio-Temporal Patterns

Spatio-temporal patterns occur in numerous application domains; some of the most representative examples which have been extensively addressed by pattern recognition research include handwriting and human actions/gestures. Initial approaches to handwriting recognition involved the use of hidden Markov models (HMMs) [35], cluster generative statistical dynamic time warping [36], support vector machines with a Gaussian dynamic time warping kernel [37] and analytical methods with Delta-Lognormal parameters for recognizing handwriting strokes [38]. More recently, research in recurrent neural networks [39] and deep neural networks has demonstrated improved performance over HMM-based approaches [40]. However, these methods require significant amounts of training data and are also sensitive to the scale of the patterns.

2.4 Biological Neural Networks

Biological neural networks offer an alternative to traditional statistical methods. Such networks pass messages through the timing of a spike train from one neuron to another [41]. Computational models of such systems are easy to specify with ordinary differential equations, and may be computed quickly on a fairly large scale [42]. In biological systems, the connections between these neurons may be modified through experience; many researchers suspect that these modifications constitute the bulk of learning, and some researchers have attempted to use biologically-inspired spike-timing dependent plasticity to do machine learning [43]. Although in general these systems have met with somewhat modest success, there are some indications that careful use of spike timings can facilitate very general forms of computation [8]. Moreover, researchers have shown preliminary work in which polychronization can be used with reservoir computing methods to perform supervised classification tasks [44][45]. My work continues along these lines by exploring the extent to which purely unsupervised learning can exploit polychronization with temporally-dependent data to build time-aware representations that facilitate traditional classification. I differ from that previous work in my emphasis on mostly-unsupervised approaches, and in my emphasis on training from limited sample sizes. Moreover, as early detection is critical for my motivating applications, I show the capability of my approaches for classification from partial patterns.

The above systems typically require large amounts of training data, and for the most part are sensitive to scale and translation.

2.5 Conclusion

To summarize, this chapter presents a literature review of the problem of spatio-temporal pattern recognition, its application in the gesture recognition domain, as well as the existing approaches for solving this problem.

Chapter 3

SNN Approach 1 for Hand-Written

Digits

In this chapter I describe a method that relies on spiking networks with axonal conductance delays, which learn encodings of individual patterns as sets of polychronous neural groups. Classification is performed using a similarity metric between sets, based on a modified version of the Jaccard index. The approach is evaluated on a data set of hand-drawn digits that encode the temporal information on how the digit has been drawn. It brings the following main contributions: i) it learns the patterns in an unsupervised manner, ii) it uses a very small number of training samples, and iii) it enables early classification of the pattern from observing only a small fraction of it. In addition, the method is compared with three other standard pattern classification methods: support vector machines, logistic regression with regularization and ensemble neural networks, all trained with the same data set. The results show that the proposed approach can successfully learn these patterns from a very small number of training samples, can identify patterns before their completion, and performs better than or comparably to the three other supervised methods.

3.1 Approach

3.1.1 Network Structure

Our network consists of 320 neurons that are connected according to a predefined probability distribution function. Each neuron is connected to 10% of the rest of the neurons, meaning that each neuron has 32 synapses connecting it to others. Each synapse has a fixed conduction delay between 1 ms and 20 ms, which stays the same during the network's life cycle. These delays are randomly selected using a uniform distribution. Each neuron in the system can be either excitatory or inhibitory, with a ratio of excitatory to inhibitory neurons of 4:1. Therefore, the network has 256 excitatory and 64 inhibitory neurons. Based on [8], only excitatory neurons can be stimulated and can be considered as potential inputs.

Each synapse has a weight associated with it: initial weight values are +6 if the corresponding neuron is excitatory and -5 if the neuron is inhibitory. These weights determine how strong the connection is and how strongly the firing of one neuron affects the other neuron. In this approach, the maximum weight of a synapse is 10.
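The initialization just described can be sketched as follows. This is a non-authoritative Python sketch: the array names `post`, `delays` and `weights` are mine, and the Gaussian neighborhood connectivity described later in this section is simplified to a uniform random choice for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

N, N_EXC = 320, 256          # total neurons; 4:1 excitatory:inhibitory ratio
SYN_PER_NEURON = 32          # each neuron connects to 10% of the others

# 32 outgoing synapses per neuron, to distinct other neurons (no self-loops).
post = np.array([rng.choice(np.delete(np.arange(N), i),
                            SYN_PER_NEURON, replace=False)
                 for i in range(N)])

# Fixed axonal conduction delays, uniform in 1..20 ms, never changed later.
delays = rng.integers(1, 21, size=(N, SYN_PER_NEURON))

# Initial weights: +6 for synapses of excitatory neurons, -5 for inhibitory.
# During STDP training, weights are clipped to a maximum of 10.
weights = np.where(np.arange(N) < N_EXC, 6.0, -5.0)[:, None] \
            .repeat(SYN_PER_NEURON, axis=1)
W_MAX = 10.0
```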

A spiking neuron model proposed by Izhikevich is used [8], described by the following equations:

v′ = 0.04v² + 5v + 140 − u + I (3.1a)

u′ = a(bv − u) (3.1b)

if v ≥ 30 mV, then v ← c, u ← u + d (3.1c)

In equations 3.1(a-c), v and u are the membrane potential and recovery variable respectively, a is the time scale of the recovery variable, b is the sensitivity of the recovery variable, c is the after-spike reset value for the membrane potential and d is the after-spike reset for the recovery variable. For excitatory neurons the parameters are set as a=0.02, b=0.2, c=-65, d=8, and for inhibitory neurons the parameters are set as a=0.1, b=0.2, c=-65, d=2 [8].
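Assuming the conventional excitatory input term +I, the model can be integrated with a simple forward-Euler scheme. This is a sketch: the helper name `izhikevich_step` is mine, and a single 1 ms Euler step is a coarser integration than Izhikevich's original two half-millisecond substeps.

```python
import numpy as np

# Regular-spiking (excitatory) parameters from the text.
a, b, c, d = 0.02, 0.2, -65.0, 8.0

def izhikevich_step(v, u, I, dt=1.0):
    """One forward-Euler step of equations 3.1(a-c); returns (v, u, fired)."""
    fired = v >= 30.0
    v = np.where(fired, c, v)          # after-spike reset (3.1c)
    u = np.where(fired, u + d, u)
    v = v + dt * (0.04 * v**2 + 5.0 * v + 140.0 - u + I)   # (3.1a)
    u = u + dt * a * (b * v - u)                            # (3.1b)
    return v, u, fired

# Drive a single neuron with the 20 mA stimulation current used in training.
v, u, spikes = np.array(-65.0), np.array(-13.0), 0
for _ in range(200):
    v, u, fired = izhikevich_step(v, u, I=20.0)
    spikes += int(fired)
```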

In this work a two-dimensional Gaussian distribution function, with a standard deviation of 3, is used to establish the connectivity between neurons. With this distribution, physically adjacent neurons in the 2D arrangement (as described in Section 3.1.2) have higher probabilities of connection than neurons that are farther away; with a standard deviation of 3, the probability of connection between each neuron and its 9 adjacent neurons in the 2D arrangement is higher than for farther neurons. This means that neurons that form a neighborhood in the corresponding input have higher probabilities of being connected to each other.

Figure 3.1: Spatio-temporal pattern representation; Left: a sample pattern; Middle: pixel intensity values; Right: corresponding spiking pattern

3.1.2 Temporal Structure of Data

To illustrate how the learning approach can classify spatio-temporal patterns, a test domain is chosen that has been used extensively before in similar approaches [44]. The dataset consists of handwritten digits from zero to nine, stored as grey-level images with width and height of 16 pixels (created by hand using Gimp). In addition to the spatial information inherent in the patterns, temporal information regarding how the pattern was drawn is also encoded: from the first pixel (beginning of pattern) to the last (end of pattern), the intensity of the pixels decreases from highest to lowest (fade tapering) (Fig 3.1). With this approach an intensity-based image is used to encode a relative temporal relation in the pixel intensity values. To feed the input to the network, neurons are stimulated in a manner that uses the temporal structure present in the data. Toward this goal, first an association between data elements in the input and the (excitatory) neurons in the network must be established; variable-length inputs can be easily handled in this way. This association tells us which neuron should be stimulated when a particular data element is present in the input. In this work a separate neuron is assigned to every pixel of the image. This one-to-one association between input pixels and network neurons is used to stimulate the neurons corresponding to image pixels with a grey-level value other than 255 (white), which is the background color in the data. A second important factor in providing input to the network is determining when to stimulate a neuron. The timing of neuron firings is determined by the temporal information contained in the input signal. From the pixel intensity values a time-firing pattern is generated, consisting of a list of the neurons sorted in decreasing order of intensity value. This order will be used to fire the neurons in the training phase, as described in Section 3.1.3.
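The encoding described above (one neuron per pixel, with non-background pixels firing in decreasing order of intensity, one per millisecond) can be sketched as follows; the helper name `firing_order` is mine.

```python
import numpy as np

def firing_order(image, background=255):
    """Map a grey-level image to a spike schedule: one neuron per pixel,
    non-background pixels fire in decreasing order of intensity, 1 ms apart.
    Returns a list of (neuron_index, fire_time_ms) pairs."""
    flat = image.flatten()
    active = np.flatnonzero(flat != background)
    # highest intensity fires first (time 0), then one neuron per millisecond
    order = active[np.argsort(-flat[active], kind="stable")]
    return [(int(n), t) for t, n in enumerate(order)]
```

For example, a 2x2 image with pixel intensities 255, 200, 100, 150 yields the schedule [(1, 0), (3, 1), (2, 2)]: pixel 1 (intensity 200) fires first, then pixel 3, then pixel 2, while the white pixel 0 never fires.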

3.1.3 Network Training Approach

During training, each pattern is presented to the network at 1-second intervals as follows: based on the time-firing pattern that encodes the ordering of neuron firings (Section 3.1.2), the neuron that corresponds to the highest intensity value fires first, followed by the lower intensity value neurons in decreasing sorted order, with each neuron firing at 1 ms intervals, as shown in Figure 3.1. Thus, each pattern will last a number of milliseconds equal to its temporal length.


Figure 3.2: Spike timing dependent plasticity (STDP).

After each millisecond, the Spike-Timing Dependent Plasticity (STDP) rule is used to update the synaptic weights [8][46]. According to STDP, the relative timing of spikes between a pre- and a postsynaptic neuron determines the amount of synaptic weight change. If a presynaptic spike arrives at the postsynaptic neuron before the latter fires, the strength of the connection is increased by A+ e^(−t/τ+). On the other hand, the synaptic weight is decreased by A− e^(−t/τ−) if the presynaptic spike arrives at the postsynaptic neuron just after the postsynaptic neuron fired (Figure 3.2). In this work I use A+ = 0.1, A− = 0.12 and τ+ = τ− = 20 ms, the same values as in [8].
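With the constants above, the per-synapse update can be sketched as follows. This is a simplified pairwise version (the network applies the rule continuously over ongoing spike trains); the function name is mine, and the clipping to the maximum weight of 10 follows Section 3.1.1.

```python
import math

A_PLUS, A_MINUS = 0.1, 0.12
TAU_PLUS = TAU_MINUS = 20.0   # ms
W_MAX = 10.0

def stdp_update(w, t_pre, t_post):
    """Return the updated weight for one synapse given the presynaptic
    spike arrival time and the postsynaptic firing time (both in ms)."""
    dt = t_post - t_pre
    if dt >= 0:   # pre arrived before (or with) the postsynaptic spike
        w += A_PLUS * math.exp(-dt / TAU_PLUS)     # potentiation
    else:         # pre arrived just after the postsynaptic spike
        w -= A_MINUS * math.exp(dt / TAU_MINUS)    # depression
    return max(0.0, min(w, W_MAX))
```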

Since the pattern lengths are smaller than 1000 ms (equivalent to 1 second), after the end of the stimulus the network is allowed to run without any stimulation, while still updating the synaptic weights using STDP. Stimulating a neuron is achieved by providing it with an input current of 20 mA. During training, the input patterns are presented to the network by rotating through all five instances of each digit (zero through nine). The network is trained until the PNGs formed in the network (Section 3.1.4) become stable, i.e., the set of PNGs active in the network does not change.

Figure 3.3 illustrates this process for a network trained with only two different input patterns. After stimulating the network with the input patterns 1450 times, the firing patterns in the network, representative of the underlying PNGs, become stable and no longer change.

Figure 3.3: Training algorithm

3.1.4 Finding Polychronous Neuronal Groups (PNGs)

Polychronous neuronal groups are sets of strongly connected neurons capable of firing time-locked, although not synchronous, spikes [8]. Neurons exhibiting time-locked firing patterns will fire at different times, but always with the same delays between the individual firings. Such groups emerge as a result of stimulating a network with particular spike-timing patterns during a training phase.

Figure 3.4: Possible combinations of 3 anchor neurons for a pattern

Using the trained network, a model of each class is built, consisting of all the persistent PNGs that are activated by a pattern from that class. PNGs typically require a set of 3 neurons, called anchor neurons, to fire in a particular time-locked pattern in order to activate the entire chain [47]. In order to find all the PNGs corresponding to a pattern, all possible combinations of 3 neurons are taken from the corresponding timed pattern, and these subsets of neurons are stimulated in the network in the same order and with the same timing as in the pattern. For example, for the pattern in Fig 3.4, consisting of 20 neurons, all the subgroups of three neurons are selected: (1, 2, 3), (1, 2, 4), . . . , (1, 2, 20), (2, 3, 4), . . . , (2, 3, 20), . . . , (18, 19, 20). Only three neurons at a time are stimulated, instead of the entire pattern, in order to identify the PNGs that emerge for each triplet. The entire model for a pattern, however, consists of the union of all the PNGs emerging from all possible combinations of anchor neurons.

After each set of three neurons is stimulated, the network is allowed to run without any input until 500 ms have elapsed from the beginning of the stimulus, and all the neuron firings (Neuron_i, firedTime_i) within this time are recorded. A neuron is considered to fire if its voltage is greater than 30 mV. A PNG has a path length of k if the longest chain of firing connections from the first to the last firing is k. If a PNG of length greater than 3 emerges as a result of one such stimulus, it is added to the pattern model.

A PNG is identified by its group of three anchor neurons. To build a unique identifier, a number is constructed consisting of the three anchor neuron numbers separated by zeros. For example, the identifier 23077021 represents a PNG with anchor neurons 23, 77 and 21. To determine whether a PNG has emerged for a particular triplet of neurons, the following process is followed: at each time step of the stimulus, for all Neuron_i that have fired in that step, if any of their presynaptic Neuron_j have fired within a time less than or equal to 10 ms of neuron i's firing time firedTime_i, the following tuple is saved: < Neuron_j, firedTime_j, Neuron_i, firedTime_i >. This provides a list of "edges" connecting neurons whose firing patterns are linked to each other. Using this set of edges and performing a depth-first search in the resulting graph, the longest possible path is found that connects any of the neurons that fired during the 500 ms interval. The length of this path is the length of the PNG associated with a particular triplet of anchor neurons.
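The identifier scheme and the path-length computation can be sketched as follows. The function names `png_identifier` and `longest_chain` are mine, and the sketch assumes that "length" counts the firings along the longest chain; nodes are keyed by (neuron, fire time) so that repeated firings of the same neuron stay distinct.

```python
def png_identifier(anchors):
    """Unique identifier: anchor neuron numbers joined by zeros,
    e.g. (23, 77, 21) -> 23077021."""
    return int("0".join(str(a) for a in anchors))

def longest_chain(edges):
    """Length of the longest firing chain, via depth-first search.
    edges: list of (pre_neuron, pre_time, post_neuron, post_time) tuples."""
    adj = {}
    for nj, tj, ni, ti in edges:
        adj.setdefault((nj, tj), []).append((ni, ti))

    def dfs(node, seen):
        best = 1
        for nxt in adj.get(node, []):
            if nxt not in seen:
                best = max(best, 1 + dfs(nxt, seen | {nxt}))
        return best

    nodes = set(adj) | {(ni, ti) for _, _, ni, ti in edges}
    return max((dfs(n, {n}) for n in nodes), default=0)
```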

For each pattern class, the result will be a group of PNG sets obtained from all the samples belonging to that class. Each PNG set contains all PNGs activated by one of the individual training samples of that particular class. These sets will be used for classification as described in Section 3.1.5. During training the synaptic weights are continuously updated, meaning that some new PNGs could emerge while others disappear. Overall, after sufficient training, a set of persistent PNGs is encoded in the network. To find the persistent PNGs, all the synaptic weights are saved every 10 seconds during the last 1000 seconds of training time (resulting in 100 instances of synaptic weights). A PNG is considered to be persistent if it occurs more than 20 times within these 100 instances. Only persistent PNGs are added to the class model.

Figure 3.5: Algorithm for building the models for the training patterns

Fig 3.5 shows the algorithm for building the models for the training patterns.

3.1.5 Classification Algorithm

In the testing phase, all testing samples are presented to the trained network one by one (similarly to training) and the PNG set is found for each testing input. To obtain these PNGs for a particular testing pattern, all possible combinations of 3 anchor neurons are stimulated, as described in Section 3.1.3. Then the PNGs that have a minimum path length of 3 are found, as in Section 3.1.4. Next, the similarity between the PNG set of the testing sample and all training models is computed. A similarity measure for sets, the Jaccard index, is adapted to define a similarity measure between the input PNGs and the models of the classes [48]. If A and B are two PNG sets corresponding to two patterns, the similarity measure between A and B is:

Figure 3.6: Classification algorithm

sim(A, B) = 1 − |A xor B| / |A ∪ B| (3.2)

The notation | | in equation (3.2) refers to the number of elements of the enclosed set. In this manner a similarity measure between the test sample and each of the training samples is obtained. For classification, the class of the training sample that yields the greatest similarity measure is chosen (Fig 3.6).
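The modified Jaccard similarity of equation (3.2) and the resulting nearest-model classification, including the rejection cases described later (all-zero similarities or a tie), can be sketched as follows; the function names are mine.

```python
def similarity(a, b):
    """sim(A, B) = 1 - |A xor B| / |A u B| (equation 3.2), which equals
    the Jaccard index |A n B| / |A u B| for finite sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a ^ b) / len(union)

def classify(test_pngs, training_models):
    """training_models: list of (class_label, png_set) pairs.
    Returns the label of the most similar training sample, or None when
    the pattern is rejected (all similarities zero, or a tie)."""
    scores = [(similarity(test_pngs, pngs), label)
              for label, pngs in training_models]
    best = max(s for s, _ in scores)
    winners = {label for s, label in scores if s == best}
    if best == 0.0 or len(winners) > 1:
        return None
    return winners.pop()
```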

3.1.6 Early Detection

As mentioned above, one of the major motivations for this work is the ability to recognize temporal patterns as early as possible, well before the entire pattern has been observed. Each pattern is encoded by a set of PNGs that are activated when the pattern is presented to the network. To measure how early this system can detect such patterns, the following approach is taken.

Figure 3.7: Early detection approach

As each pattern (digit) is drawn, all the PNGs are found that are activated by all different combinations of three anchor neurons, taken from the neurons corresponding to the part of the pattern presented up to that point. After each such step, the classification algorithm (Section 3.1.5) is applied to this set of PNGs. Fig 3.7 shows different iterations of the early detection algorithm. At the first iteration, the neurons corresponding to the first three pixels of the pattern are stimulated, after which the classification algorithm is applied. At the second iteration, the first four neurons of the pattern are stimulated and the PNGs based on all combinations of three out of the four neurons are found, after which the classification algorithm is applied again. The same approach is used for the rest of the input pattern, stimulating one more neuron at a time and observing the PNGs that emerge from the pattern observed up to that point.
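The iterative procedure can be sketched as follows; `find_pngs` and `classify` are stand-ins for the PNG extraction of Section 3.1.4 and the classifier of Section 3.1.5, injected here as arguments.

```python
from itertools import combinations

def early_detection(neurons_in_firing_order, find_pngs, classify):
    """After each new neuron of the partial pattern, re-extract PNGs over
    all 3-anchor combinations seen so far and classify from them.
    Returns the per-step predictions, starting once 3 neurons are seen."""
    predictions = []
    for k in range(3, len(neurons_in_firing_order) + 1):
        prefix = neurons_in_firing_order[:k]
        pngs = set()
        for triplet in combinations(prefix, 3):
            pngs |= find_pngs(triplet)       # stimulate anchors, collect PNGs
        predictions.append(classify(pngs))   # prediction after k inputs
    return predictions
```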

3.2 Experimental Results

3.2.1 Training Stage

To validate the proposed approach, a dataset consisting of handwritten digits is created. This dataset contains 5 training samples and 50 testing samples for each digit. These grey-level images contain spatio-temporal data of handwritten decimal digits (Fig 3.8). The network is trained with a Gaussian connectivity model (Section 3.1.1), using a standard deviation of 3. Each training sample is used to stimulate the network as described in Section 3.1.3 and is presented to the network 6000 times for multi-class classification. In this case all digit classes, from zero to nine, are considered.

Figure 3.8: Digits used for training

3.2.2 Generalization Results

To validate the performance of the proposed approach the following measures are computed: i) the success rate (the percentage of correctly classified test samples), ii) the error rate (the percentage of misclassified test samples), and iii) the rejection rate (the percentage of instances for which no classification can be made). Rejected patterns occur when the similarity measures for the pattern are all zero or there is a tie between multiple classes. Table 3.1 and Table 3.2 show the results for the multi-class problem, for each digit individually and for the entire dataset combined, respectively.

Table 3.1: Multi-class classification results for a network with Gaussian distribution for each individual digit

Digit            0     1     2     3     4     5     6     7     8     9
Success rate    72%   84%   78%   70%   76%   94%   92%   98%   54%   96%
Error rate      14%    8%   10%   20%    4%    0%    8%    2%   36%    2%
Rejection rate  14%    8%   12%   10%   20%    6%    0%    0%   10%    2%

Fig 3.9 shows the confusion matrix of the classification results. The columns are the actual classes and the rows are the predicted classes. The eleventh row is for ties or unpredicted classes. For example, in this confusion matrix, of the 50 actual samples of digit four, the system predicted that one was digit six, another one was digit two, and 10 were unpredicted or tied.

Based on the matrix, digit eight is the hardest to classify, mainly because some parts of its spatio-temporal pattern have significant overlap with other digits such as two, seven and nine. However, the remaining digits are distinguished very well. Digit seven has the best results because it has the minimum overlap with other patterns.

Table 3.2: Multi-class classification results for a network with Gaussian distribution for all digits combined

                All Digits
Success rate    81.4%
Error rate      10.4%
Rejection rate   8.2%

Figure 3.9: Confusion matrix for multi-class problem

Fig 3.10 shows a subset of the testing data that is classified correctly and Fig 3.11 shows a subset of misclassified samples.

3.2.3 Comparison Results

In order to better assess the performance of the proposed unsupervised approach, it has been compared with three state-of-the-art supervised recognition methods: support vector machines [49], regularized logistic regression [50] and ensemble neural networks [51].

Figure 3.10: A subset of correctly classified testing samples

Figure 3.11: A subset of misclassified testing samples

Libsvm [52] is used for the SVM implementation, with a polynomial kernel. The gamma parameter is set to 1024 and the penalty parameter is set to 8. For regularized logistic regression, the Matlab implementation is used, with the regularization parameter set to 0.1. Since the number of training samples in the dataset is very small in comparison to the size of the feature set, an ensemble neural network (implemented in Matlab) is used instead of a single neural network. Rekabdar et al. [51] show that the generalization ability of an ensemble neural network is greater than that of a single neural network. In the ensemble, 15 neural networks are used. All of them have one hidden layer, but the number of hidden neurons differs across networks: the number of hidden nodes varies from 4 to 32 in steps of 2 (i.e., even numbers in the range 4 to 32). The learning rate and threshold are set to 0.02 and 0.005 respectively. The same training dataset is used to train all the networks. In the testing phase, 15 decisions are obtained for each testing sample, and a simple majority voting approach produces the final output. To train the feed-forward networks, the Levenberg-Marquardt backpropagation algorithm is used [53]. The tansig transfer function is used for the hidden layer and the purelin activation function for the output layer. To set the parameters, the model was tested while changing parameter values gradually, and the parameters yielding the best performance were chosen. For the ensemble neural network, the average results of 10 trials of model generation are reported.

Table 3.3: Result of comparison experiment (unit: %)

            SVM    Logistic Regression    ENN    Proposed method
Accuracy    86%    76.6%                  55%    81.4%

Table 3.3 shows the performance of these three methods along with the proposed method. The proposed unsupervised method performs significantly better than logistic regression and the ensemble neural networks. SVMs perform only slightly better than the proposed method. It is important to note that the training parts of SVM, logistic regression and the ensemble neural network are all supervised, as opposed to the unsupervised training approach of the proposed method. These results show that the proposed spike timing method can compete with and even outperform state-of-the-art supervised methods.

3.2.4 Early Detection

To provide a quantitative evaluation of the proposed method's performance on early pattern classification, two metrics are analyzed, as introduced in [1]: the early detection rate and the correct duration. These metrics rely on two important timings, shown generically in Fig. 3.12.

Figure 3.12: Generic concepts for point of first detection and point of early confident detection

In Fig. 3.12, for simplicity of explanation, a pattern of 10 ms duration is assumed. Each 10% interval on the X-axis represents the timing between stimulating two subsequent neurons. In this analysis percentages are used rather than milliseconds, because different patterns of the same class may have different durations; we are therefore interested in the percentage of the input that the system requires to correctly identify the pattern. The green line represents the similarity computed by the system with respect to the correct class (as shown, this may go up and down as more of the pattern is observed). The other lines represent similarity measures with respect to other, incorrect classes (these also go up or down and may overtake the correct class at various times). With this, the point of first detection is defined as the time (as a percentage of the total pattern duration) when the true class is first detected as having the highest similarity measure (at 40% in Fig. 3.12). The point of early confident detection is the time (as a percentage of the total pattern duration) after which the true class is correctly predicted until the end of the pattern (at 60% in Fig. 3.12).

Based on these timings, the early detection rate and the correct duration are computed as follows:

• Early detection rate = 100 × the ratio of (i) the interval of time between the start of the pattern and the point of early confident detection, and (ii) the entire duration of the input pattern.

• Correct duration = 100 × the ratio of (i) the total time during which the correct class is predicted, and (ii) the total length of the pattern.
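Given the per-step predictions produced while a pattern unfolds, both metrics can be computed as follows. This is a sketch: the function name is mine, and the convention of reporting 100% for the early detection rate when no confident detection occurs is my assumption.

```python
def early_detection_metrics(predictions, true_class):
    """predictions: the predicted class after each step of the pattern.
    Returns (early_detection_rate, correct_duration), both in percent."""
    n = len(predictions)
    correct = [p == true_class for p in predictions]
    # Point of early confident detection: first step after which the
    # prediction stays correct until the end of the pattern.
    confident = n  # sentinel: never confidently detected
    for i in range(n - 1, -1, -1):
        if not correct[i]:
            break
        confident = i
    early_rate = 100.0 if confident == n else 100.0 * (confident + 1) / n
    correct_duration = 100.0 * sum(correct) / n
    return early_rate, correct_duration
```

For the Fig. 3.12 example (correct prediction first at 40%, a dip, then correct from 60% onward), both metrics come out as 60%.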

Figure 3.13 shows the average early detection rates (percentages) for each digit individually. The average input percentage for all digits except eight is less than 60%, and it is lower than or equal to 40% for digits zero, four, six and seven. This shows that the proposed system is able to detect the correct class from a very small part of the input pattern. Since the pattern of digit eight has significant overlap with the patterns of digits two, seven and nine, the system needs to observe a higher percentage of the pattern to predict it correctly.


Figure 3.13: Average of early detection rates (percentages)

Figure 3.14 shows the average correct duration percentages for all digits individually. The average amount of time during which the system predicted the correct answer is more than 40% for all digits except eight. These numbers reflect the amount of time the system requires to predict the correct class as the temporal pattern is being observed. In the early stages, the system is uncertain about the pattern, possibly predicting incorrect classes. However, as the early detection rate shows, after a small percentage of the pattern is observed, the system converges onto the correct class.


Figure 3.14: Average of correct duration (percentages)


Figure 3.15: Comparison between averages of point of first detection (%) and point of early confident detection (%)

Figure 3.15 presents the comparison between the averages of the point of first detection (%) and the point of early confident detection (%). Figure 3.16 (left and right) shows the system's predictions for two of the digits (one and five), as increasingly larger portions of the input sequence are observed.

Figure 3.16: Left: Early recognition results for a digit one; Right: Early recognition results for a digit five.

In these images, the Y axis shows the similarity measure between an input digit and the models of all digits, and the X axis shows the percentage of the input that was presented to the system. For simplicity, the similarity measures between the input and a digit are not shown when they are zero. Figure 3.16 (left) shows how the system's predictions change as more of the input pattern becomes available: when the system observes only up to 40% of the pattern, an incorrect class (three instead of one) is predicted. However, after observing more than 40% of the pattern, the system predicts the correct class. Figure 3.17 shows an instance of digit eight which has been wrongly classified. It is interesting to note that although the highest-similarity class is not the correct one, the correct class is still the second most similar in most cases.


Figure 3.17: Early recognition results for a digit eight (incorrectly classified).

3.3 Conclusion

In this chapter a new unsupervised approach is introduced for learning spatio-temporal patterns based on SNNs. This method utilizes a network of spiking neurons to learn from a very small training set and uses the trained network to classify testing samples. The general idea behind this approach is that a spiking network with axonal conductance delays learns an encoding of each spatio-temporal pattern as a set of polychronous neural groups (PNGs). Models are then created for each class, consisting of the set of PNGs that were formed for the individual training samples of that class. The classification itself uses a similarity measure based on the Jaccard index to capture the similarity between two sets of PNGs: the model of a class and the PNG set obtained by feeding the test sample to the network. Furthermore, the capability of the proposed approach to detect the correct class before seeing the complete pattern is presented, which confirms that the approach is well suited for early detection. The results also show that despite the very small number of training samples, the method has successfully encapsulated the representative features of each class, providing high classification accuracy for previously unseen patterns. Finally, to compare the method with other well-known approaches, three supervised methods (support vector machines, logistic regression and ensemble neural networks) were applied to the same data, showing that the proposed approach is better than or at least comparable to these supervised methods.

Chapter 4

SNN Approach 2 for Hand-Written

Digits

In this chapter an unsupervised spike timing neural network approach for spatio-temporal pattern classification is proposed. This approach has been tested on a hand-written digit dataset. In the training phase, a spike timing neural network is trained, and then training models (training model strings) are created for each class. Training models in this approach consist of strings of "characters", where each "character" represents a set of neurons that fire at a particular time step in response to the pattern. In the classification stage, the Longest Common Subsequence (LCS) is computed between the model string of the given sample and the model strings of all the training samples, after which the closest model string is chosen. The given sample is then labeled with the class label of the closest model string. The results show that this method requires a small set of training samples and is capable of detecting the correct class early on. A comparison between this approach and one unsupervised and eight supervised approaches shows that the proposed approach competes with, and in some cases outperforms, state-of-the-art methods. Further analysis shows that even for misclassified samples, the system can detect the correct class among the top three class labels.

4.1 Approach

4.1.1 Spike Timing Neural Network Structure

The structure of the neural network employed in this approach is similar to the network used in the approach presented in Chapter 3. The details of the network structure are presented in Section 3.1.1.

4.1.2 Temporal Structure of Data

The temporal structure of the data employed in this approach is similar to the structure used in the approach presented in Chapter 3. The details of the temporal structure are presented in Section 3.1.2.

4.1.3 Training Spike Timing Neural Networks

The training step in this approach is similar to the training phase used in the approach presented in Chapter 3. The details of the training phase are presented in Section 3.1.3.

4.1.4 Modeling Data with Temporal Patterns of Firing Neurons

This section explains the procedure for creating model strings of the input data from the trained spiking neural network. Since spiking neural networks do not have an output layer similar to traditional neural networks, in order to use spiking networks to perform classification, the overall behavior of the network (i.e., the pattern of neuronal firings) is considered as the network's response to an input. A particular way of modeling this behavior needs to be chosen, and it will then be shown how the models can be used for classification. The proposed approach is based on the concept of polychronization, which occurs as a result of STDP learning and results in sets of neurons firing together in a time-locked pattern. Therefore, to build the models, for each input presented to the network, the timing of neuronal firings is captured.

Using the trained network, a model string is created for each training sample in order to be used for classification. This model string indicates the timing of neuron firings after a particular training sample is presented to the network. Similar to the training process, based on the firing pattern corresponding to each input sample, the neurons in the network corresponding to the pixels of the input are stimulated. The neuron corresponding to the pixel with the highest intensity value fires at time 0 (millisecond 0), the neuron corresponding to the second highest intensity pixel fires at time 1 (millisecond 1), and so on, until the entire pattern is presented. Then the network is allowed to run for 500 ms from the starting time without any further stimulation, in order to propagate the existing activations.

To create the model string, the fired neurons are saved in each 1 ms interval. If no neurons fire in a particular interval, that time interval is skipped in the model string. Each character in a string is the set of neurons that fired at the same time in the network (excluding the empty set). The ordering of the characters reflects the timing of the fired neurons. For instance, a model string such as {3,5}{4,6}{7,8,6} indicates that neurons 3 and 5 fired at the same time; neurons 4 and 6 fired at the same time as well, but after neurons 3 and 5; and neurons 7, 8 and 6 fired together after all the aforementioned neurons. The absolute timing of the fired neurons is not important, as only the ordering of the fired neurons is considered.

Algorithm 4.1 presents the algorithm for building the training and testing model strings.

    for each training sample (input) x of the training dataset:
        let sorted(x) be a list of pixel numbers in x sorted
            by their intensity values in descending order
        let model(x) = empty string
        for t = 1 to 500 ms step 1 ms:
            let n = sorted(x)[t] be a neuron in the input that
                should be stimulated at time t
            stimulate n if n exists
            let char(t) = {all neurons in the network that fired
                at interval t}
            if char(t) is not empty:
                concat char(t) to the end of model(x)
            end-if
        end-for
        add model(x) to the set of training models
    end-for

Algorithm 4.1: The process of creating model strings for the training and testing steps.

The maximum length of a model string in this method is 500 characters. In practice, as the images in the dataset are 16 by 16 (256 pixels) and most of the pixels are white (not included in the spatio-temporal pattern), it is unlikely to obtain a model string of size 500. The ending time (500 ms) depends on the application, the size of the network, and the type and size of the data samples, and should therefore be chosen according to the application specifics.
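The model-string construction above can be sketched in a few lines of Python. This is a minimal illustration, not the dissertation's implementation: the network simulation itself is omitted, and the recorded firings are assumed to be given as a list of (neuron id, firing time in ms) events.

```python
from collections import defaultdict

def build_model_string(firings, horizon_ms=500):
    # Group recorded (neuron_id, fire_time_ms) events by 1 ms interval.
    by_time = defaultdict(set)
    for neuron, t in firings:
        if 0 <= t < horizon_ms:
            by_time[t].add(neuron)
    # Characters ordered by firing time; empty intervals are skipped,
    # so only the ordering of firings is kept, not absolute timing.
    return [frozenset(by_time[t]) for t in sorted(by_time)]

# Toy recording: neurons 3 and 5 fire together, then 4 and 6, then 7, 8, 6.
events = [(3, 0), (5, 0), (4, 2), (6, 2), (7, 9), (8, 9), (6, 9)]
model = build_model_string(events)
# model == [frozenset({3, 5}), frozenset({4, 6}), frozenset({6, 7, 8})]
```

Representing characters as `frozenset`s makes them hashable and directly comparable with set operations, which the Jaccard-based matching below relies on.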

4.1.5 Classification Algorithm

To classify a new, unseen pattern, a model string is computed with the approach described in Section 4.1.4. This model contains characters that represent the neurons that fired as a result of stimulating the network with the unseen input data. Intuitively, the model strings of two similar spatio-temporal patterns should be similar to each other.

For classification, the model string of the new data sample is compared with the model strings of all the training samples, and the closest model string is chosen. Next, the label of the closest model string is assigned as the class of the new input. Due to the specific nature of the model strings, in which characters encode sets of firing neurons, a suitable and efficient approach for comparing two model strings must be defined.

For typical (alphabet) characters, any comparison has two outcomes: the characters are either equal or not. However, each character in a model string is a set of neurons. Therefore, there can be different levels of similarity between characters: for example, a character such as {4, 5, 8} is more similar to {8, 3, 4} than to {9, 2, 8}. The matching procedure and corresponding similarity values for two model strings a and b are shown in Figure 4.1.

Figure 4.1: Two model strings a = (a1 a2 a3) and b = (b1 b2 b3). Each character contains a set of neuron numbers that fired at the same time step. The values shown on the edges describe the Jaccard similarity between two characters; e.g., the Jaccard similarity of a1 and b1 equals 0.33. The overall LCS similarity for a and b is 0.66.

In order to formally define the similarity measure between two model strings, the similarity measure for comparing two characters (sets of fired neurons) must first be defined. One approach to comparing two sets is the Jaccard index [48]. The Jaccard similarity between two sets is defined in Equation 3.2. The range of this measure is 0 to 1, and it is normalized according to the sizes of the two sets. If two sets are identical or one set is a subset of the other, this measure is equal to 1 (these two cases cannot be distinguished). Otherwise, if two sets have no intersection, this measure is equal to 0.

Under the Jaccard measure defined in Equation 3.2, two sets are more similar the more members they share.
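As a concrete illustration, a set-similarity function can be sketched as follows. Equation 3.2 itself is defined in Chapter 3 and is not reproduced here, so the standard Jaccard form |A ∩ B| / |A ∪ B| is assumed:

```python
def jaccard(a, b):
    # Standard Jaccard index between two sets: |A ∩ B| / |A ∪ B|.
    # Two empty sets are treated as identical (similarity 1).
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Characters from the example above: {4, 5, 8} is closer to {8, 3, 4}
# (two shared neurons) than to {9, 2, 8} (one shared neuron).
print(jaccard({4, 5, 8}, {8, 3, 4}))  # 0.5
print(jaccard({4, 5, 8}, {9, 2, 8}))  # 0.2
```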

Given the metric for comparing characters of the models, a similarity measure must be defined between two model strings that uses the temporal information encoded in the ordering of the characters. For this, the longest common subsequence (LCS), computed via dynamic programming, is used. The length of the longest common subsequence is a good measure of how similar two model strings are to each other while preserving the order of character occurrences. To handle strings of different sizes, a normalization factor is required. Equation 4.1 shows the definition of the LCS similarity measure. A modified version of LCS is used, in which two characters are considered equal when the Jaccard similarity between them is above a predefined threshold. The LCS similarity measure between A and B, where both are model strings corresponding to two different patterns, is as follows:

simLCS(A, B) = |LCS(A, B)| / min(|A|, |B|),    (4.1)

where |A| represents the length of string A.

If the LCS similarity is equal to zero, the two model strings are completely different and have nothing in common. If this value is 1, the two model strings are identical or one of them is a subsequence of the other. As with the Jaccard measure, these two cases cannot be distinguished from each other.
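The thresholded LCS of Equation 4.1 can be sketched with the standard dynamic-programming recurrence, swapping exact character equality for a Jaccard test. This is an illustrative sketch; the threshold default of 0.1 follows the best value reported later in Table 4.3.

```python
def lcs_similarity(a, b, threshold=0.1):
    # Modified LCS over model strings: two characters (sets of fired
    # neurons) count as equal when their Jaccard similarity exceeds
    # the threshold.
    def jaccard(x, y):
        return len(x & y) / len(x | y) if (x or y) else 1.0

    n, m = len(a), len(b)
    if n == 0 or m == 0:
        return 0.0
    # Standard dynamic-programming LCS table.
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if jaccard(a[i - 1], b[j - 1]) > threshold:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    # Normalize by the length of the shorter string (Equation 4.1).
    return dp[n][m] / min(n, m)

s1 = [{3, 5}, {4, 6}, {7, 8}]
s2 = [{3, 5}, {9, 1}, {7, 8}]
# First and last characters match above the threshold -> 2/3.
```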

In this way, a similarity measure is obtained between the test sample and each of the training samples. Next, in order to classify a new test sample, the class of the training sample with the highest LCS similarity value is chosen.

Algorithm 4.2 presents the classification procedure. In this phase, a model string representing the response of the spiking neural network to the current testing sample is computed. This model string is then compared with the model strings of all the samples in the training set, and the testing sample is classified into the same class as the most similar training sample.

    for each test sample (input) x of the testing dataset:
        let model(x) be the model string obtained for input x
        let currentMax = 0
        for each training sample c of the training dataset:
            let model(c) be the model string obtained for
                input c
            if LCS(model(x), model(c)) > currentMax:
                currentMax = LCS(model(x), model(c))
                currentCandidate = c
        end-for
        x is classified in the same class as currentCandidate
    end-for

Algorithm 4.2: Classification procedure.
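The nearest-model classification can be sketched as below. This combines Equation 4.1 with a simple argmax over labeled training models; tie handling (which the evaluation later counts separately) is deliberately omitted, and the compact `lcs_similarity` is repeated here only so the sketch is self-contained.

```python
def lcs_similarity(a, b, threshold=0.1):
    # Thresholded-LCS similarity of Equation 4.1 (compact form).
    jac = lambda x, y: len(x & y) / len(x | y) if (x or y) else 1.0
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            dp[i][j] = (dp[i - 1][j - 1] + 1 if jac(ca, cb) > threshold
                        else max(dp[i - 1][j], dp[i][j - 1]))
    return dp[len(a)][len(b)] / min(len(a), len(b)) if a and b else 0.0

def classify(test_model, training_models, threshold=0.1):
    # Assign the label of the most similar training model string
    # (ties resolved arbitrarily by max; the dissertation counts them
    # separately).
    return max(training_models,
               key=lambda lm: lcs_similarity(lm[1], test_model, threshold))[0]

train = [("three", [{3, 5}, {4, 6}, {7, 8}]),
         ("five",  [{1, 2}, {9, 0}])]
print(classify([{3, 5}, {4, 6}, {7, 8}], train))  # three
```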

4.1.6 Parallelizing the Classification Algorithm

The classification approach explained in Section 4.1.5 offers a number of opportunities for speedup. Different ways to parallelize the algorithm are explored, in order to deploy it for human-robot interaction in real time. The GPU-LCS algorithm in [54] is used for the LCS-based classification process explained in Section 4.1.5. The overall classifier still does not quite operate in real time, but by using the GPU version of LCS, the classification phase is sped up by a factor of seven (88 milliseconds for each test sample). This algorithm was run on an NVIDIA GeForce GTX 960 device.

4.1.7 Early Classification

In addition to the ability to classify new, unseen spatio-temporal patterns, the proposed approach allows for identification of these patterns well before their completion, when only partial information about the pattern is available. The early classification algorithm has been explained in Section 3.1.6.

Table 4.1: Classification results for each individual digit. SR: success rate, ER: error rate, RR: rejection rate.

    Digit    0    1    2    3    4    5    6    7    8    9
    SR (%)  74   98   84   80   84   98   82   78   80   70
    ER (%)  22    2   16   20   10    2   16   12   16   30
    RR (%)   1    0    0    0    6    0    2   10   14    0

4.2 Experimental Results

For experimental evaluation, the problem of recognizing handwritten digits is selected. The dataset used for evaluation purposes has been presented in Section 3.2.1. The quantitative measures defined in Section 3.2.2 are used for comparing the performance of the proposed method with other baseline approaches.

Table 4.1 presents the classification accuracy for the multi-class problem for the individual digits. Table 4.2 presents the multi-class classification accuracy for the entire dataset combined. Figure 4.2 shows the confusion matrix; actual classes are shown in rows and predicted classes in columns. The eleventh column represents the ties.

Table 4.2: Classification results for all digits combined.

    Success rate (%)     82.5
    Error rate (%)       14.6
    Rejection rate (%)    3.3

As Table 4.1 shows, the performance of the proposed system on recognizing handwritten digits is very good, with an overall classification accuracy of 82.5%. This is even more significant considering that this approach relies on only 5 training samples per digit. Based on Table 4.2 and Figure 4.2, the performance of the proposed system is slightly lower on digits 0 and 9 than on the other digits. This is mostly due to the high similarity and overlap between the spatio-temporal patterns of these two digits and those of other digits. Based on Figure 4.2, the pattern of digit 0 is very similar to the patterns of digits 1 and 3, and the patterns of digits 9 and 1 have a lot in common.

Figure 4.2: Confusion matrix of classification results.

The top-3 accuracy of the system is also computed. It indicates whether the correct class is among the three classes most similar to the input pattern. As shown in Table 4.4, the top-3 accuracy of the system is 93%, which is 10.5% better than the top-1 classification. This shows that the system is able to detect the correct class among the top three classes. Figure 4.3 presents the confusion matrix of the top-3 accuracy.

Figure 4.3: Confusion matrix of top-3 accuracy.

As discussed in Section 4.1.5, the proposed approach relies on a threshold value for classification, in order to decide whether a sequence of fired neurons matches another sequence. Table 4.3 shows the effect of changing this threshold: a threshold of 0.1 for the Jaccard similarity gives the best results. However, based on this table, the performance does not change significantly for other values of this parameter. A high threshold would require the characters (the sets of firing neurons) to be highly similar in order to be considered a match. This limits the LCS measurement and results in shorter common subsequences being detected by the classifier.

Table 4.3: Classification results with different threshold values. Thr: Threshold, Acc: Accuracy.

    Thr      0.1   0.2   0.3   0.4   0.5   0.6   0.7   0.8   0.9
    Acc (%)  82.5  78.6  76.4  74.4  71.6  71.4  70.4  70    70

Table 4.4: Classification results for all digits combined for top-3 accuracy.

    Success rate (%)     93
    Error rate (%)        6.6
    Rejection rate (%)    0.4

4.2.1 Comparison with Other Approaches

The results of the proposed unsupervised approach are compared with one unsupervised spike timing neural network approach based on PNGs (SNN-PNG) [55] and with supervised machine learning approaches including support vector machines (SVM), regularized logistic regression (LR), ensemble neural networks (ENN), Bayes network (BN), stacked denoising autoencoder (SDAE), naïve Bayes (NB), multilayer perceptron (MLP, a single neural network), radial basis function network (RBF) and random forest (RF).

4.2.1.1 Unsupervised Spike Timing Neural Network with Jaccard Index

The method proposed in Chapter 3 uses a spike timing neural network in an unsupervised manner, relying on polychronous neuronal groups (PNGs) for classification purposes.

4.2.1.2 Support Vector Machines

LIBSVM [52] has been used as a tool for implementing multi-class classification with SVM [49]. Its kernel, gamma and penalty parameters are set to polynomial, 1020 and 7, respectively.

4.2.1.3 Regularized Logistic Regression

Matlab is used for implementing regularized logistic regression [50]. The regularization parameter is set to 0.1.

4.2.1.4 Ensemble Neural Networks

An ensemble of neural networks is used for comparison. It has been shown in [51] that the generalization ability of an ensemble of neural networks is better than that of a single neural network. In the dataset used in this approach, the number of features is much larger than the number of samples (5 samples per digit), so an ensemble works better than a single network, because it provides access to a set of networks whose average response can be computed. The ensemble of neural networks is implemented in Matlab. The ensemble consists of 15 three-layer neural networks, each differing from the others in the number of nodes in the hidden layer: the first network has 4 hidden nodes, the second has 6, the third has 8, and so on, up to the last network with 32 hidden nodes (the range of hidden nodes is 4 to 32 with a step of 2). The learning rate and threshold parameters are the same for all networks, equal to 0.02 and 0.005 respectively. The training data and the training algorithm are the same for all networks. The training algorithm is the Levenberg-Marquardt backpropagation algorithm [53]. The activation function for the hidden layer is the tansig transfer function and for the output layer it is purelin.

The parameters are selected by changing them gradually and computing the final accuracy; the optimal parameter value is the one that results in the best overall accuracy. For the neural networks, the average of 10 trials has been computed for model generation. A majority voting approach is used to select the final decision among the 15 networks.

4.2.1.5 Bayes Network

Weka [56] is used for implementing and running the Bayes network [57], which is a supervised learning approach. Table 4.5 presents the parameters used in Weka for the Bayes network.

Table 4.5: Specification of Bayes Network.

    Estimator          Alpha   Search Algorithm    Score Type   Accuracy
    SimpleEstimator    0.5     Hill Climber (K2)   BAYES        —

Naïve Bayes

Naïve Bayes [58] is based on the probabilistic Bayes theorem and can be used for classification purposes. If the attributes are numeric, a normal distribution can be used instead of kernel estimation, so a supervised discretization is not required for converting the numeric attributes (features) to nominal ones. This approach is supervised. In this dissertation a Weka implementation of naïve Bayes is used. Table 4.6 presents the values of the parameters used for this classifier.

Table 4.6: Specification of na¨ıve Bayes.

    Use kernel estimator    Use supervised discretization    Accuracy
    False                   False                            —

4.2.1.6 Multilayer Perceptron

For the multilayer perceptron [59], 15 hidden nodes are used. The learning rate, momentum and threshold are fixed to 0.3, 0.2 and 0.005, respectively.

4.2.1.7 Radial Basis Function Network

This network is a normalized Gaussian radial basis function network [60], with K-means clustering to determine the basis functions. The network learns a logistic or linear regression for discrete or numeric class problems, respectively. A symmetric multivariate Gaussian is fit to the data in each cluster. For numeric class problems, the RBF network normalizes all the numeric features to have zero mean and unit variance. This approach is supervised. The RBF network is implemented in Weka and its parameters are shown in Table 4.7.

Table 4.7: Specification of RBF network.

    minStdDev    numClusters    ridge     Accuracy
    0.1          2              1.0E-8    —

4.2.1.8 Random Forest

Weka is used for implementing a random forest [61], a supervised approach that constructs a forest of random trees. The parameters of the random forest are presented in Table 4.8.

Table 4.8: Specification of random forest.

    Number of trees    maxDepth    numFeatures    Accuracy
    3                  0           0              —

4.2.1.9 Stacked Denoising Auto Encoder

A three-layer stacked denoising autoencoder [62] is constructed. Each layer is a denoising autoencoder (DAE). The training algorithm is a greedy layer-wise method. After pre-training and obtaining the weights and biases, the resulting values are used to initialize a three-layer feedforward neural network. Table 4.9 and Table 4.10 show the SDAE and feedforward neural network parameters.

Table 4.9: Parameters of SDAE. LR: Learning Rate, HL: Hidden Layers, HN: Hidden Nodes.

    LR    Epochs    Batch Size    No. of HL    No. of HN             Activation Function
    0.4   200       1             3            [256, 128, 64, 32]    Sigmoid

Table 4.10: Parameters of feedforward neural network (initialized by SDAE weights). LR: Learning Rate, HL: Hidden Layers, HN: Hidden Nodes.

    LR     Epochs    Batch Size    No. of HL    No. of HN                 Activation Function
    0.45   200       2             3            [256, 128, 64, 32, 10]    Sigmoid

The learning rate is an important factor for SDAE. Therefore, for selecting the best learning rate, the accuracy of SDAE for different learning rates is computed and the result is shown as a graph in Figure 4.4. Based on this figure, the best value of the learning rate is 0.4.

Table 4.11: Accuracy result of comparison experiment (unit : %).

    Method   SNN-PNG   SVM   LR     ENN   BN     NB   MLP    RF   SDAE   Proposed method
    Acc (%)  81.4      86    76.6   55    77.6   69   49.8   42   81.4   82.5

The comparison results between the proposed method and all the approaches described above are shown in Table 4.11. Based on this table, the proposed unsupervised approach performs better than all the other methods except SVM, to which its performance is slightly inferior. However, 8 out of the 9 state-of-the-art methods selected for this comparison are supervised learning approaches, whereas in the proposed approach the feature extraction is fully unsupervised. Overall, based on these results, the proposed approach can compete with and even outperform other state-of-the-art unsupervised and supervised learning approaches.

Figure 4.4: Accuracy vs. learning rate for SDAE.

4.2.2 Early Detection Results

As proposed in [55] and [63], two metrics are available for quantitatively evaluating the early classification performance of the proposed approach: the early detection rate and the correct duration. These two metrics are defined in Section 3.2.4.

The average early detection rates for each digit are presented in Figure 4.5. The average input percentage for all digits except three and seven is less than 52%. This indicates that the proposed system is able to recognize the correct class by observing only a small portion of the input pattern. Since the patterns of digits three and seven have significant overlap with the patterns of digits two and one, correctly predicting those patterns requires the system to observe a higher percentage of the input.

Figure 4.5: Average of early detection rates (percentages).

The average correct duration percentages for all individual digits are presented in Figure 4.6. Except for digit three, the average correct duration percentage (the duration during which the digit is predicted correctly by the system, i.e., has the highest similarity measurement among all digits) is more than 55%. These numbers show the duration for which the correct class is predicted by the system as the input pattern is being presented. At the early stages the system is not able to predict the correct class; however, after a small percentage of the input is presented, it converges onto the correct class.

Figure 4.6: Average of correct duration (percentages).

Figure 4.7: Comparison between averages of point of first detection (%) and point of early confident detection (%).

Figure 4.7 shows the comparison between the averages of the point of early confident detection (%) and the point of first detection (%).

Figures 4.8 through 4.12 present the predictions of the proposed system as increasingly larger portions of the input patterns are observed. In these figures, the X axis indicates the input percentage presented to the network, and the Y axis shows the similarity measurement between an input digit and the models of all other digits. When the similarity measurement is equal to zero it is not shown, for simplicity. Figure 4.8 and Figure 4.9 show instances of digits eight and nine that were correctly classified. For digit eight, the system could predict it correctly after 37% of the input pattern was presented. For digit nine, the system converges onto the correct class after observing only 40% of the input pattern.

Figure 4.8: Early recognition results for a digit eight (correctly classified).
Figure 4.9: Early recognition results for a digit nine (correctly classified).

Figure 4.10 and Figure 4.11 show instances of digits nine and two that the system could not predict correctly. However, in these figures the correct class is among the top three most similar classes.

Figure 4.10: Early recognition results for a digit nine (incorrectly classified).
Figure 4.11: Early recognition results for a digit two (incorrectly classified).

Figure 4.12 shows an example of a tie. As shown in this figure, the system could detect the correct class right after seeing 3% of the data and continuously detected the correct class up to observing 35% of the pattern. Once the system observed 80% of the input, a tie occurred between digits three and five (the similarity values of both digits are the same).

4.3 Conclusion

In this chapter a new unsupervised approach is proposed to classify spatio-temporal patterns based on fired neurons in spike timing neural networks. The network uses the STDP learning rule to update the weights in the training phase. After the network is trained, model strings are created for each class. The similarity measure proposed in this approach is a combination of the longest common subsequence and the Jaccard index. This approach is capable of early classification of the input patterns. Based on the comparison results, the proposed unsupervised approach achieved higher performance than the supervised state-of-the-art methods.

Figure 4.12: Early recognition results for a digit three (a tie).

Chapter 5

SNN Approach 1 for Human Hand Gestures

In this chapter a spiking neural network method is presented. This approach is based on PNGs and the Jaccard index similarity measure for learning and early classification of human hand gestures. The proposed method has the following main contributions: i) it requires a very small number of training examples, ii) it accepts variable-sized input patterns, iii) it is invariant to scale and translation, and iv) it enables early recognition from only partial information of the pattern. It is validated on a set of gestures representing the digits from 0 to 9, extracted from video data of a human drawing the corresponding digits. In this chapter a comparison is made with several other standard pattern recognition approaches. The results show that the proposed approach significantly outperforms these methods, is invariant to scale and translation, and has the ability to recognize patterns from only partial information.

5.1 Approach

5.1.1 Mapping of Spatio-Temporal Patterns to Spike Trains

A spiking neural network with axonal conductance delays is used, with 225 total neurons, in which the ratio between excitatory and inhibitory neurons is 4:1. The model of the spiking neurons and their parameters are similar to those defined by Izhikevich [8]. Equations 3.1(a-c) present the model of the neurons in this network. Each neuron is connected to 10% of the remaining neurons based on a Gaussian distribution, as described below. Each synapse has a fixed conduction delay, assigned based on a random distribution; the range of conduction delays is [1..20] ms. The synaptic weights are initialized to +6 for excitatory neurons and to -5 for inhibitory neurons. These weights are updated using the spike-timing dependent plasticity (STDP) rule and can have a maximum possible value of 10.

The test domain for hand gesture recognition is the set of digits from zero to nine. The raw testing data are recorded videos of a person drawing each digit by hand. To obtain the data necessary for training (the positions of the human's hand), each video is converted to a sequence of images. Then a template matching approach is used to obtain the position of the hand in each frame. This results in a set of hand positions (ti, xi, yi), along with temporal information regarding how the pattern is drawn (i.e., from the beginning to the end of the pattern).

Figure 5.1: Representation of spatio-temporal patterns.

The positions of the hand from this pre-processing stage contain absolute information (coordinates in the camera frame). Since this information does not allow handling variations in translation and scaling, relative information, such as the difference in hand position between frames, is used. In particular, for each pair of consecutive hand positions (frames), a vector that points from one position to the next is computed. The orientation of that vector gives an angle, as shown in Figure 5.1. Each angle is next assigned to one of 36 groups, spanning the range from 0 to 360 degrees in 10-degree intervals. To build the network, a set of 5 excitatory neurons is assigned to each group, resulting in a network that has 225 neurons: 180 excitatory and 45 inhibitory neurons. This is equivalent to having a group of neurons that "respond" to a particular type of orientation. Using this representation, the spatio-temporal pattern can now be encoded as a sequence of angles a1, a2, ..., an. To establish the connectivity between neurons, a Gaussian probability distribution is used. This means that neurons have higher probabilities of having connections to neurons corresponding to angles closer in value to their own.

To transform the above pattern into a spike-timing pattern that can be used to train the network, for each angle ai, going in order from 1 to n, the corresponding neurons are stimulated, one after another, at 1 ms intervals. Thus, for an input pattern with n samples, there will be a spike-timing pattern of 5 * n ms. After each millisecond the standard STDP rule is used to update the synaptic weights [46]. To stimulate a neuron, an input current of 20 mA is provided to it. The input patterns are presented to the network in a rotating order, going through all samples of each individual digit, 290 times for each sample. This number of iterations is used to ensure the stability of PNG formation in the network (i.e., the set of PNGs active in the network does not change).
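The position-to-spike-train mapping can be sketched as below. This is an illustrative sketch only: the neuron-id layout (bin index times 5 plus an offset) and the `encode_gesture` helper are assumptions for illustration, not the dissertation's implementation.

```python
import math

def encode_gesture(positions, bins=36, neurons_per_bin=5):
    # Map consecutive hand positions to direction vectors, quantize
    # each vector's orientation into one of 36 ten-degree bins, and
    # stimulate that bin's 5 excitatory neurons one after another at
    # 1 ms intervals, so an n-angle pattern spans 5 * n ms.
    schedule = []  # (time_ms, neuron_id) stimulation events
    t = 0
    for (x0, y0), (x1, y1) in zip(positions, positions[1:]):
        angle = math.degrees(math.atan2(y1 - y0, x1 - x0)) % 360.0
        b = int(angle // (360 // bins))  # 10-degree orientation bin
        for k in range(neurons_per_bin):
            schedule.append((t, b * neurons_per_bin + k))
            t += 1
    return schedule

# A single rightward move (angle 0) stimulates neurons 0..4 at t = 0..4 ms.
print(encode_gesture([(0, 0), (1, 0)]))
```

Using relative direction vectors rather than raw coordinates is what makes the encoding translation- and scale-invariant, as the text notes.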

5.1.2 Finding Polychronous Neuronal Groups (PNGs)

PNGs are groups of neurons that consistently fire together, with the same time delay between firings. These patterns are reproducible and occur with millisecond precision, indicating a strong connection between the neurons in a group. PNGs emerge during the training process, when the network is stimulated with trains of spike patterns.

This is due to a matching between the firing of neurons during training and the axonal conductance delays [8]: given a delay δij for spikes to propagate between two neurons, if the spike from the pre-synaptic neuron i reaches the post-synaptic neuron j just before j fires, the synapse between neurons i and j is strengthened by the STDP learning rule.

PNGs have three features: size (the number of neurons in the group), length (the longest path between neurons in the group) and time span (the difference between the firing times of the first and last neuron, in ms). After training, the hypothesis is that each training pattern gives rise to a set of PNGs that uniquely represents it. These PNGs constitute a model of that pattern, which can later be used for classification. To build a model for a pattern, the trained network is stimulated with all possible combinations of three angles, taken in the order in which they appear in the pattern. For instance, for the pattern in Figure 5.1, consisting of 8 angles, all the subgroups of three angles are selected: (a1, a2, a3), (a1, a2, a4), ..., (a1, a2, a8), (a2, a3, a4), ..., (a2, a3, a8), ..., (a6, a7, a8). The reason groups of three angles are selected is that PNGs typically require a set of 3 neurons (called anchor neurons) to fire in order to activate the entire chain [64]. Since there are no connections between neurons corresponding to the same angle ai, the only possibility for a PNG is to have anchor neurons corresponding to three different angles. Therefore, for each group of 3 angles, their 15 assigned neurons (5 per angle group) are stimulated based on their firing time patterns. Three angles are stimulated at a time, instead of the entire pattern, in order to identify the emerging PNGs for each triplet, as described below. However, the entire model for a pattern consists of the union of all the PNGs emerging from all possible combinations of anchor neurons. After the stimulus, the entire network is allowed to run for 300 ms and all the neuron firings (Neuron_i, firedTime_i) are recorded within this time. A neuron is considered to fire if its voltage is greater than 30 mV. A PNG has a path length of k if the longest chain of firing connections from the first to the last firing along at least one path is k. If a PNG of length greater than 3 emerges as a result of one such stimulus, it is added to the pattern model as the PNG associated with that group of three angles. The set of all PNGs that emerge through this process represents the model for the corresponding input pattern (corresponding class). These models are used for classification, as described in Section 5.1.3.
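The order-preserving triplet enumeration is exactly what `itertools.combinations` produces; this short sketch (with a hypothetical `anchor_triplets` helper) shows the count for the 8-angle example above.

```python
from itertools import combinations
from math import comb

def anchor_triplets(n_angles):
    # All order-preserving triplets of angle indices; each triplet's
    # 15 neurons (5 per angle) are then stimulated as candidate
    # anchor neurons for a PNG.
    return list(combinations(range(n_angles), 3))

triplets = anchor_triplets(8)
print(len(triplets), comb(8, 3))  # an 8-angle pattern yields 56 triplets
```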

To find whether a PNG has emerged for a particular triplet of angles, the following process is used: at each time step of the stimulus, for all neurons Neuron_i that have fired in that step, if any of their presynaptic neurons Neuron_j have fired within a time less than or equal to 10 ms before neuron i's firing time firedTime_i, the following tuple is saved: <Neuron_j, firedTime_j, Neuron_i, firedTime_i>. This provides a list of "edges" that connect neurons whose firing patterns are linked to each other. Using this set of edges, the longest possible path connecting any of the neurons that have fired during the 300 ms interval is found, by performing a depth-first search in the above graph. The length of this path is the length of the PNG associated with a particular triplet of angles.

5.1.3 Classification Algorithm

In the testing phase, all testing samples are presented to the trained network one after another, and the PNG set corresponding to each testing input is found, similar to the process for finding PNGs described before. For classification, the similarity between the PNG set of the testing sample and all training models is computed. To define a similarity measure between the input PNGs and the models of all classes, a Jaccard Index similarity measure [48] for sets (Equation 3.2) is used.

In this manner, a similarity measure is obtained between the test sample and each of the training samples. For classification, the class of the training sample with the greatest similarity measure is chosen.
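The Jaccard-based nearest-model classification described above can be sketched as follows (a simplified illustration; the function names and the (label, PNG set) pairing are hypothetical, not the actual implementation):

```python
def jaccard(a, b):
    """Jaccard index between two PNG sets: |A ∩ B| / |A ∪ B|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def classify(test_pngs, training_models):
    """Return the label of the training sample whose PNG set is most
    similar to the test sample's PNG set.
    `training_models` is a list of (label, png_set) pairs."""
    return max(training_models, key=lambda m: jaccard(test_pngs, m[1]))[0]
```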

5.1.4 Early Detection

Early detection is the system's ability to correctly classify patterns from incomplete information, i.e., from only a small fraction of the entire pattern. Therefore, it should be evaluated how early the proposed system is able to recognize a particular pattern, or equivalently, what percentage of the entire pattern needs to be observed before the pattern is correctly identified. To evaluate the early detection capabilities of the proposed approach, the following process is used: for each individual pattern of length [a1, a2, ..., an], the classification algorithm is applied to sub-patterns of increasing length k, going from 3 to n (a minimum of 3 observations is required to find PNGs that are activated by the input). This is equivalent to asking what class the pattern consisting only of observations a1 through ak belongs to. The assumption is that initially, for sub-patterns of small length, the predicted class will vary. However, at some point a sufficient part of the pattern has been observed, such that the system stabilizes to the correct class label. In addition to the predicted class label, for each classification step k, the input pattern's similarity to all classes is recorded, not just the one that gives the highest value. This information is used for the quantitative evaluation of early detection. To quantitatively evaluate the method's performance on early pattern classification, the early detection rate and the correct duration are used (Section 3.2.4).
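The prefix-sweep procedure above can be sketched as follows (a simplified illustration; `classify_prefix` stands in for the full PNG-based classifier applied to a sub-pattern, and the helper names are hypothetical):

```python
def early_detection_sweep(pattern, classify_prefix, min_len=3):
    """Classify growing prefixes [a1..ak] of a pattern, for k = 3..n,
    and return (k, predicted_label) pairs."""
    return [(k, classify_prefix(pattern[:k]))
            for k in range(min_len, len(pattern) + 1)]

def point_of_first_detection(history, true_label):
    """Smallest prefix length at which the correct class is predicted."""
    for k, label in history:
        if label == true_label:
            return k
    return None
```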

5.2 Experimental Results

5.2.1 Classification Results

The hand gesture dataset consists of 7 training samples and 21 test samples for each digit. The first and third rows of Figure 5.2 show all the training samples for each digit. To validate the performance of the proposed approach, the following measures are computed: i) the success rate, ii) the error rate, and iii) the rejection rate, as defined in Section 3.2.2.

Table 5.1 shows the results for each digit individually and for the entire dataset.

Figure 5.2: Training data: first and third row; correctly classified testing data: second and fourth row.

Table 5.1: Classification results for each individual digit and all digits combined. SR: success rate, ER: error rate, RR: rejection rate

     0      1      2      3      4     5      6     7      8      9      All Digits
SR   95.2%  57.1%  90.4%  85.7%  100%  85.7%  100%  85.7%  85.7%  66.6%  85.2%
ER   4.76%  42.8%  9.5%   14.2%  0%    14.2%  0%    14.2%  14.2%  33.3%  14.8%
RR   0%     0%     0%     0%     0%    0%     0%    0%     0%     0%     0%

The second and fourth rows of Figure 5.2 show a subset of correctly classified testing data for each digit. The results indicate that although there was relatively low variation in both scale and translation in the training data, the proposed approach encapsulated the main features of each pattern and was able to identify patterns correctly at different scales and locations in the field of view. Figure 5.3 shows examples of incorrectly classified digits, along with the top 3 classes predicted by the system.

Figure 5.3: The top-3 candidates for similarity.

Although the correct class is not the highest, it is in most cases one of the three highest choices. One reason these examples are wrongly classified is that parts of their spatio-temporal patterns (the way the user draws the digits by moving her hand) have significant overlap with other classes, as shown in Figure 5.4. For example, drawing digit 8 involves a lot of hand movement that overlaps with digits 6 or 0. These overlaps may result in misclassification for a few examples. In particular, if the end of the pattern is highly similar to patterns from other classes, an incorrect class will be predicted (this is due to the fact that the classification results are reported after observing the entire pattern).

Figure 5.5 shows the confusion matrix of the classification results. The rows and columns correspond to the actual and predicted classes, respectively. For example, in this confusion matrix, of the 21 actual samples of digit two, the system predicted one as digit three and another as digit five, and the rest were correctly classified as digit two. Based on the matrix, digit one is the hardest to classify, mainly because its spatio-temporal pattern is very simple and contains only a few angles. Furthermore, most parts of its spatio-temporal pattern have significant overlap with other digits such as two and four.

Figure 5.4: Overlap (in number of PNGs) between all pairs of digit models.

To see whether there is any relation between the amount of overlap between models and the misclassification of samples, the following analysis was made: for each class, the union of all PNGs for the patterns belonging to that class is created. Then the intersection between all pairs of classes is computed, to measure the amount of overlap between the different digits. Figure 5.4 shows these results, which for the most part support the data represented in the confusion matrix: the system incorrectly assigns patterns to classes with which there is a large overlap. For example, the three samples of digit seven that are misclassified are incorrectly assigned to classes two and three, which have the highest amount of overlap with the model for seven. Although there are some exceptions to this rule, it is important to note that each sample is evaluated and classified independently, and therefore a particular pattern could have more in common with a class that has a smaller overall overlap. Figure 5.6 shows the average number of PNGs for each model, with standard deviations.
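The per-class union and pairwise-intersection analysis can be sketched as follows (an illustrative sketch; the data layout and function names are assumptions):

```python
from itertools import combinations

def class_unions(models_by_class):
    """Union of all PNG sets belonging to each class.
    `models_by_class` maps class label -> list of PNG sets."""
    return {c: set().union(*sets) for c, sets in models_by_class.items()}

def pairwise_overlap(models_by_class):
    """Number of shared PNGs between every pair of class models."""
    unions = class_unions(models_by_class)
    return {(a, b): len(unions[a] & unions[b])
            for a, b in combinations(sorted(unions), 2)}
```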

Figure 5.5: Confusion matrix

5.2.2 Early Detection Results

For the test dataset, the early detection rate and the average correct duration are computed, as defined in Section 5.1.4. Figure 5.7 shows the average early detection rates (percentages) for each digit individually. The average input percentage for all digits is less than 70%, and it is lower than or equal to 50% for several digits: one, two, four, five, six and seven. Furthermore, for some test patterns belonging to classes one, two, four and seven, less than 40% of the pattern is necessary for correct identification. This indicates the system's capability of classifying patterns correctly from only a very small fragment of the input.

Figure 5.6: Average of the number of PNGs for each model, with standard deviations

Figure 5.7: Average of early detection rates (percentages)

Figure 5.8 shows the average correct duration percentages for all digits individually. The average value for all digits is more than 30%. These numbers reflect the amount of time the system predicted the correct class as the temporal pattern was being observed. In the early stages, the system is uncertain about the pattern, possibly predicting incorrect classes, but it later stabilizes to the correct class, as shown by the early detection rates. The comparison between the averages of the point of first detection (%) and the point of early confident detection (%) is presented in Figure 5.9. The graph shows that these two points are very close to each other for the majority of classes. This means that in most cases, as soon as the correct class is identified (i.e., it has the highest similarity with the test pattern), it maintains the highest similarity with the pattern over all other classes.

Figure 5.8: Average of correct duration (percentages)

Figure 5.10 and Figure 5.11 show two examples of the system's predictions for two input digits, as a larger amount of the input temporal pattern is observed. The Y axis shows the similarity measure between an input digit and the models of all ten digits, and the X axis shows the percentage of the input on which the classification is based. For simplicity, the similarity measures between the input and a digit are not shown when they are zero.

Figure 5.9: Comparison between averages of point of first detection (%) and point of early confident detection (%)

The graph in Figure 5.10 shows an example for which the correct class is identified very early: the similarity with class one is higher than for all other digits, and it remains the highest as more of the pattern is observed. Figure 5.11 shows an instance of digit eight for which the correct class is predicted only after 30% of the input is taken into account. In the beginning, when the system considers only 10% to 30% of the pattern, the similarities with all classes are roughly the same and the system cannot yet decide. However, as more of the data becomes available, the correct class is identified. It is interesting to note that when the most similar class is not the correct one, the correct class is still the second most similar in most cases. If top-3 accuracy is considered (i.e., the correct class is among the three highest-ranked classes), the overall accuracy increases from 85.2% to 95.2%. Table 5.2 shows the top-3 accuracy results for each digit individually and for all digits combined.
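Top-3 accuracy as used here can be sketched as follows (a generic illustration, not the dissertation's code; the function names are assumptions):

```python
def top_k_hit(similarities, true_label, k=3):
    """True if `true_label` is among the k classes with the highest
    similarity.  `similarities` maps class label -> similarity score."""
    ranked = sorted(similarities, key=similarities.get, reverse=True)
    return true_label in ranked[:k]

def top_k_accuracy(results, k=3):
    """Fraction of (similarities, true_label) pairs that are top-k hits."""
    hits = sum(top_k_hit(s, y, k) for s, y in results)
    return hits / len(results)
```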

Figure 5.10: Similarity metric - a digit one sample (similarity measure vs. input percentage, one curve per digit class).

Table 5.2: Classification results of top-3 accuracy for each individual digit and for all digits combined. SR: success rate, ER: error rate, RR: rejection rate

     0     1      2     3      4     5     6     7      8     9      All Digits
SR   100%  80.9%  100%  95.2%  100%  100%  100%  95.2%  100%  85.7%  95.2%
ER   0%    19.1%  0%    4.76%  0%    0%    0%    4.76%  0%    14.2%  4.75%
RR   0%    0%     0%    0%     0%    0%    0%    0%     0%    0%     0%

5.2.3 Comparison with Other Approaches

Figure 5.11: Similarity metric - a digit eight sample (similarity measure vs. input percentage, one curve per digit class).

To validate the performance of the approach, its results are compared with the following state-of-the-art approaches: support vector machine (SVM), logistic regression (LR), ensemble neural networks (ENN), convolutional neural networks (CNN), deep belief networks (DBN) and stacked denoising autoencoder (SDAE). Two approaches are used for the type of input data provided to these methods. In the first approach, each training sample consists of a sequence of angles, identical to the one used by the proposed approach. However, because these algorithms require all input samples to have the same size, all samples in the dataset must be normalized to the same length. For this, the size of the longest training sample (130) is used, and the ends of shorter patterns are padded with zeros. In the second approach, using the (x, y) coordinates of the hand from the video streams, images are generated at the size of the high-resolution video frames (900 by 1200), by placing ones at coordinates that are present in the pattern and zeros everywhere else. The images are then downsampled to a smaller size (90 by 120) to make them better suited for the training process.
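The two data preparations just described (zero-padding angle sequences to length 130, and rasterizing hand coordinates into a 900 by 1200 binary image downsampled to 90 by 120) can be sketched as follows (an illustrative sketch; function names are assumptions):

```python
def pad_angles(seq, target_len=130):
    """Zero-pad an angle sequence to a fixed length (the longest
    training sample in the text has 130 angles)."""
    return list(seq) + [0] * (target_len - len(seq))

def rasterize(points, h=900, w=1200, step=10):
    """Render (x, y) hand positions as a binary image downsampled by
    `step` (900x1200 -> 90x120), as described for the image data format."""
    small = [[0] * (w // step) for _ in range(h // step)]
    for x, y in points:
        row = min(y // step, h // step - 1)
        col = min(x // step, w // step - 1)
        small[row][col] = 1
    return small
```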

Table 5.3: Parameters of SVM

        Kernel Type  Gamma  Degree  Penalty  Cross Validation
Angles  Polynomial   1024   3       8        2-Fold
Images  Polynomial   1024   3       5        2-Fold

For SVM, Libsvm [52] is used, with the parameters shown in Table 5.3. A Matlab implementation is used for regularized logistic regression, with the regularization parameter set to 0.1. An ensemble neural network is used instead of a single neural network, since the number of training samples is very small compared to the size of the feature set. Table 5.4 presents the details of each of the neural networks in the ensemble. The ENN is implemented in Matlab. The number of hidden nodes differs among the networks in the ENN (7 networks in total for the angle data format): the first has 8 hidden nodes, and each subsequent network has 2 additional nodes. The final classification for the ensemble is made by majority voting among the 7 networks.

The Levenberg-Marquardt backpropagation algorithm [53] is used to train the feedforward networks. The tansig transfer function is used for the hidden layers and the purelin activation function for the output layers. The same training and testing datasets are used for all neural networks in the ENN. To determine the final result for the ENN, the average over 10 model-generation trials is computed.
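The majority-voting step for the ensemble can be sketched as follows (a minimal illustration of the voting rule only):

```python
from collections import Counter

def majority_vote(predictions):
    """Final ensemble label: the class predicted by the most networks
    (7 networks in the text); ties broken by first-seen order."""
    return Counter(predictions).most_common(1)[0][0]
```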

Table 5.4: Parameters of ENN. (LR: Learning Rate, N: Nodes)

        Input N  Hidden N  Output N  Hidden Layers  LR   Threshold
Angles  130      8 to 20   10        1              0.2  1e-100
Images  10800    20 to 32  10        1              1    1e-150

A three-layer Deep Belief Network (DBN), containing three Restricted Boltzmann Machines (RBMs) [65], is constructed. Each RBM has 100 hidden neurons and was trained in a layer-wise greedy manner with contrastive divergence [66]. The initial values for both weights and biases are zero. After pre-training each RBM, the weights and biases are used to initialize a two-layer feedforward neural network. The feedforward neural network was trained with backpropagation using mini-batches of size 10 for 150 epochs and a learning rate of 0.1, for both the angle and image data formats. Table 5.5 shows the DBN parameters.

Table 5.5: Parameters of DBN

        Learning rate  Epochs  Batch Size
Angles  0.01           100     10
Images  1              200     10

A three-layer stacked denoising autoencoder (SDAE) was created. Each layer is a denoising autoencoder (DAE) with 100 hidden neurons. The layers are trained in a greedy layer-wise fashion. After pre-training the SDAE, the upward weights and biases are used to train a two-layer feedforward neural network. Table 5.6 and Table 5.7 show the SDAE and feedforward neural network parameters.

A five-layer convolutional neural network (CNN) was trained with stochastic gradient descent, with a learning rate of 0.2 and 100 epochs.

Table 5.6: Parameters of SDAE

        Learning rate  Epochs  Batch Size
Angles  1.5            200     10
Images  1              100     10

Table 5.7: Parameters of feedforward neural network (initialized by SDAE weights, LR: Learning Rate, N: Nodes)

        LR   Epochs  Batch Size  Input N  Hidden N  Output N
Angles  1.5  200     10          130      100       10
Images  1    300     10          10800    100       10

The network has the following structure: the first layer has 6 feature maps, connected to the single input layer through 6 kernels of size 5x5. The next layer is mean-pooling of size 2x2. The third layer has 12 feature maps, all connected to the 6 mean-pooling maps below through 72 kernels of size 5x5. The fourth layer is again a mean-pooling layer, of size 3x3. In training, the feature maps of the fourth layer are concatenated into a single feature vector which feeds into the last layer; this layer has 10 output neurons corresponding to the 10 class labels. For the CNN, only results on the image data format can be reported, as the angle data format cannot be employed. To implement the DBN, SDAE and CNN, a Matlab toolbox developed by Palm [67] is used. To find the parameters that give the best performance, each model is tested by gradually changing the parameter values.

Table 5.8 shows the comparison between the proposed approach and the above algorithms. While all of the tested methods typically perform very well when provided with large training sets and normalized data, they performed very poorly given the small number of training samples and the scale and translation variance in the data. This further emphasizes the significance of the proposed approach, which performs very well under these conditions.

Table 5.8: Comparison Results of the proposed approach with other approaches.

        SVM     LR      ENN     CNN  SDAE    DBN     Proposed Approach
Angles  47.36%  31.1%   19.62%  -    40.19%  33.49%  85.2%
Images  29.66%  27.27%  8%      10%  10%     0%      -

5.3 Discussion and Future Work

The approach for building a model for a particular pattern is to select only the groups of three angles that lead to PNG activations. To verify that this approach helps encapsulate the main features of the patterns, the following analysis was done: for each training digit, a model containing all possible combinations of 3 angles (not just those that lead to PNG activation) was created. Similar models were also constructed for the testing data, and classification was performed as described in Section 5.1.3. Using these models, which do not take neuronal activations into account, the accuracy decreased to 60%. This shows that the network activation effectively encapsulates the individual patterns.

5.4 Conclusion

This chapter presented a new unsupervised approach for encoding hand gesture data using spike-timing neural networks. A novel way of mapping such patterns into spike trains, which are presented to the network during training, was proposed. This mapping encapsulates the important features of each pattern, which further enables a scale- and translation-invariant classification approach. For each pattern, a model is created that contains all triplets of angles that lead to PNG activations. The Jaccard Index similarity metric for sets is used to compare a new pattern's model against the models from the training data. The method was validated on a dataset of hand movement gestures representing the drawing of digits 0 to 9 in front of a camera. The results show that the method achieves high recognition performance and significantly outperforms a set of standard pattern recognition techniques. The method successfully classifies patterns using a very small training dataset and from a small fraction of the input. Furthermore, the results show that the approach implicitly handles inputs of varying lengths and is invariant to scale and translation.

Chapter 6

LCS based SNN Approach 2 for

Human Hand Gestures

This chapter presents an approach based on spike-timing networks and LCS for learning, recognizing and early classification of human hand gestures. The proposed method brings the following contributions: first, it enables the encoding of patterns in an unsupervised manner. Second, it allows the creation of models of specific patterns from a very small set of training samples, in contrast with standard pattern recognition approaches that typically require large amounts of training data. Based on these models, the method further enables classification of new patterns using a longest common subsequence approach for matching between patterns of activated neurons. Third, the approach is invariant to scale and translation, and thus enables generalization across multiple scales and positions. Fourth, the approach enables early recognition of patterns from only partial information about the pattern. The proposed method is validated on a set of gestures representing the digits from 0 to 9, extracted from video data of a human drawing the corresponding digits. The results are also compared with other state-of-the-art pattern recognition algorithms.

6.1 Approach

6.1.1 Mapping of Spatio-Temporal Patterns to Spike Trains

The mapping of spatio-temporal patterns to spike trains employed in this approach is similar to the mapping used in the approach presented in Chapter 5. The details of the mapping are presented in Section 5.1.1.

6.1.2 Finding Polychronous Neuronal Groups (PNGs)

The process of finding PNGs employed in this approach is similar to the process used in the approach presented in Chapter 5. The details of finding PNGs are presented in Section 5.1.2.

6.1.3 Classification Algorithm

In the testing phase, all testing samples are presented to the trained network one by one, and the model corresponding to each testing input is found, similar to the process for finding PNGs described above. This model encapsulates the network's response to that pattern, encoded as a sequence of PNGs. For classification, the similarity between the PNG model of the testing sample and all training models is obtained using the longest common subsequence (LCS) algorithm, typically used for finding similarities between strings [68]. Then the class of the training sample with the greatest similarity measure is chosen. Equation 4.1 presents the LCS definition.
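A minimal sketch of the LCS-based matching (standard dynamic-programming LCS; Equation 4.1 itself is defined elsewhere in the dissertation, so this is a generic illustration with hypothetical names):

```python
def lcs_length(a, b):
    """Length of the longest common subsequence between two PNG
    sequences, via standard dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def classify_lcs(test_seq, training_models):
    """Pick the training sample whose PNG sequence shares the longest
    common subsequence with the test sequence."""
    return max(training_models, key=lambda m: lcs_length(test_seq, m[1]))[0]
```

Unlike the set-based Jaccard comparison of the previous chapter, LCS respects the order in which PNGs are activated.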

6.2 Early Detection

The metrics for the quantitative evaluation of the method's performance on early pattern classification are the early detection rate and the correct duration, which are defined in Section 3.2.4.

6.2.1 Experimental Results

The human hand gesture dataset introduced in Section 5.2.1 is used to evaluate my proposed method. To validate the performance of my approach, I compute the following measures: i) the success rate, ii) the error rate, and iii) the rejection rate, as defined in Section 3.2.2.

Table 6.1: Classification results for each individual digit. SR: success rate (%), ER: error rate (%), RR: rejection rate (%)

    0    1    2     3     4     5    6     7    8     9
SR  100  100  80.9  52.3  80.9  100  95.2  100  66.6  52.3
ER  0    0    19.1  47.7  19.1  0    4.8   0    33.3  47.7
RR  0    0    0     0     0     0    0     0    0     0

Table 6.1 and Table 6.2 show the results for the multi-class problem, for each digit individually and for the entire dataset combined, respectively. Figure 6.1 shows the confusion matrix of the classification results. The rows and columns correspond to the actual and predicted classes, respectively. For example, in this confusion matrix, of the 21 actual samples of digit four, the system predicted four as digit one, and the rest were correctly classified as digit four.

Table 6.2: Classification results for all digits

                    All Digits
Success rate (%)    83
Error rate (%)      17
Rejection rate (%)  0

Table 6.3 and Table 6.4 show the top-3 accuracy results for the multi-class problem, for each digit individually and for the entire dataset combined, respectively. Figure 6.2 shows the confusion matrix of the top-3 classification results. Based on the matrix, digits three and nine are the hardest to classify, mainly because most parts of their spatio-temporal patterns have significant overlap with other digits: one and seven for digit three, and zero and one for digit nine.

Table 6.3: Classification results for each individual digit for top-3 accuracy. SR: success rate, ER: error rate, RR: rejection rate

        0    1    2     3     4    5    6     7    8    9
SR (%)  100  100  90.4  80.9  100  100  95.2  100  100  90.4
ER (%)  0    0    9.6   19.1  0    0    4.8   0    0    9.6
RR (%)  0    0    0     0     0    0    0     0    0    0

Table 6.4: Classification results for all digits combined for top-3 accuracy

                    All Digits
Success rate (%)    96
Error rate (%)      4
Rejection rate (%)  0

Figure 6.3 shows the average early detection rates. Based on this figure, the proposed system can confidently recognize all digits after being presented with at most 55% of the pattern. In particular, digits one, five, six and seven can be correctly classified from less than 10% of the pattern. The average correct duration is shown in Figure 6.4. This figure indicates that the correct duration of the proposed method is at least 58% for all digits. The comparison between the averages of the point of first detection (%) and the point of early confident detection (%) is shown in Figure 6.5. Based on this figure, for most of the digits the point of confident detection is very close to the point of first detection.

Table 6.5 shows the comparison between the proposed method and support vector machine (SVM), logistic regression (LR), ensemble neural network (ENN), deep belief network (DBN) and stacked denoising autoencoder (SDAE). The results show that the proposed approach significantly outperforms these methods, and it is invariant to scale and translation.

Figure 6.1: Confusion matrix of classification results

For SVM, Libsvm [52] is used, with the parameters shown in Table 6.6. A Matlab implementation is used for regularized logistic regression, with the regularization parameter set to 0.1.

Table 6.5: Comparison Results of the proposed approach with other approaches (PR: Proposed Approach)

          SVM     LR     ENN     SDAE    DBN     PR
Accuracy  47.36%  31.1%  19.62%  40.19%  33.49%  83%

Table 6.6: Parameters of SVM

Kernel Type  Gamma  Degree  Penalty  Cross Validation
Polynomial   1024   3       8        2-Fold

Figure 6.2: Confusion matrix of top-3 accuracy

Figure 6.3: Average of early detection rates (percentages) 93

Figure 6.4: Average of correct duration (percentages)

Figure 6.5: Comparison between averages of point of first detection(%) and point of early confident detection(%) 94

The ensemble neural network approach [51] contains 7 neural networks. The networks differ in the number of hidden nodes, which ranges from 8 to 20 in steps of 2. Table 6.7 presents the details of each of the feedforward neural networks in the ensemble. A majority voting approach among the 7 networks is used for the final classification.

Table 6.7: Parameters of ENN. (LR: Learning Rate)

Input Nodes  Hidden Nodes  Hidden Layers  LR   Threshold
130          8 to 20       1              0.2  1e-100

A three-layer Deep Belief Network (DBN), containing three Restricted Boltzmann Machines (RBMs) [65], is constructed. Each RBM has 100 hidden neurons and was trained in a layer-wise greedy manner with contrastive divergence [66]. After pre-training each RBM, the weights and biases are used to initialize a two-layer feedforward neural network. Table 6.8 and Table 6.9 show the DBN and feedforward neural network parameters.

Table 6.8: Parameters of DBN

Learning rate  Epochs  Batch Size
0.01           100     10

Table 6.9: Parameters of feedforward neural network (initialized by DBN weights)

Learning rate  Epochs  Batch Size
0.1            150     10

The stacked denoising autoencoder (SDAE) used for comparison has three layers, each a denoising autoencoder (DAE) with 100 hidden neurons. After pre-training the SDAE, the upward weights and biases are used to train a two-layer feedforward neural network. Table 6.10 and Table 6.11 show the SDAE and feedforward neural network parameters. To implement the DBN and SDAE, a Matlab toolbox developed by Palm [67] is used.

Table 6.10: Parameters of SDAE

Learning rate  Epochs  Batch Size  Noise Type          Noise value
1.5            200     10          Zero-masking noise  0.8

Table 6.11: Parameters of feedforward neural network (initialized by SDAE weights)

LR   Epochs  Batch Size  Input Nodes  Hidden Nodes  Output Nodes
1.5  200     10          130          100           10

6.3 Conclusion

In this chapter, a new unsupervised approach was presented for the classification and early recognition of human hand gestures. This approach is based on ordered sets of PNGs for creating training and testing models. The LCS algorithm is used to compare a new pattern's model against the models from the training data. The method was validated on a dataset of hand movement gestures representing the drawing of digits 0 to 9 in front of a camera. The results show successful classification of patterns from only a very small training dataset, early detection capabilities, as well as invariance to scale and translation.

Chapter 7

Real-time SNN Approach 3 for

Human Hand Gestures

This chapter proposes a novel real-time unsupervised spike timing neural network for the recognition and early detection of spatio-temporal human gestures. This approach is a CUDA implementation of the method presented in Chapter 3. The spiking network classifier has been implemented in CUDA, which allows the classification to be performed in real-time. To evaluate the performance of this approach, I test the case of a physical robot observing air-handwriting of human counterparts as a benchmark for human gesture patterns. The presented results confirm the following contributions: the proposed approach runs in real-time, and thus is suitable for human-robot applications; it is unsupervised and capable of early classification of human gestures and actions; it requires a very small number of training samples; and it is scale and translation invariant. As an additional feature, the approach can process variable-length patterns, which are typical for human gestures. Compared to other prominent techniques, the proposed approach demonstrates superior accuracy and is suitable for early classification of different types of human actions in time-sensitive mobile applications such as robotics.

7.1 General Approach

The key idea of the proposed real-time approach is that both the spatial and the temporal information in an input pattern play a very important role in both learning and early recognition of the patterns. This real-time approach consists of two phases: training and testing. The first step of the training phase is to map spatio-temporal patterns (such as gestures) to neural spike trains, which can be used for training the network (Section 7.1.1). No labels for training samples are provided to the network, as the proposed approach is unsupervised. The spike trains are used to stimulate the neural network. During stimulation, the synaptic weights in the spiking network are updated using the spike-timing dependent plasticity (STDP) rule [69]. After the network is fully trained with all the training patterns and the synaptic weights are updated according to STDP, each training pattern is presented to the network and the network's response is used to create a model. The model consists of all polychronous neuronal groups (PNGs), i.e., time-locked asynchronous firing patterns (Section 7.1.2), that arise after the network is trained and that correspond to that pattern. In this way, one model is obtained for each training sample. These models are used later, in the testing phase. In the testing phase, to classify an unseen spatio-temporal pattern, the pattern is mapped into a spike train using the same process as in training, and a model representing the network's response to that pattern is created. The classification decision is made by comparing the testing model with all the training models and selecting the training sample model closest to the input model (Section 7.1.3). These steps are explained in more detail in the next sections.
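The STDP rule [69] referenced above is commonly written in an exponential form; the following sketch uses illustrative parameter values, not those of the dissertation's network:

```python
import math

def stdp_dw(dt, a_plus=0.1, a_minus=0.12, tau=20.0):
    """Spike-timing dependent plasticity weight change, where
    dt = t_post - t_pre (ms).  Pre-before-post (dt > 0) potentiates
    the synapse; post-before-pre depresses it.  The parameter values
    here are illustrative assumptions."""
    if dt > 0:
        return a_plus * math.exp(-dt / tau)
    return -a_minus * math.exp(dt / tau)
```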

7.1.1 Mapping of Spatio-Temporal Patterns to Spike Trains

The mapping of spatio-temporal patterns to spike trains employed in this approach is similar to the mapping used in the approach presented in Chapter 5. The details of the mapping are presented in Section 5.1.1.

7.1.2 Finding Polychronous Neuronal Groups (PNGs)

The method for finding PNGs is similar to that of Section 5.1.2 in Chapter 5. The overall process of finding PNGs is shown in Figure 7.1. The process is the same as in the training phase, the only difference being that in training it runs on the CPU, so the triplets of angles must be presented to the network sequentially, whereas in the testing phase it is performed simultaneously for all triplets of angles.

Figure 7.1: The process of finding PNGs

7.1.3 Classification Algorithm

In the classification algorithm, a testing pattern is presented to the trained network, and a corresponding model is created for that pattern, similar to the process described for training. This results in a new model for the testing sample, which consists of a set of activated PNGs. For classification, the testing model is compared with all training models, and its similarity value against all the training models is computed.

A Jaccard Index [48] is used as the similarity metric. If A and B are two models associated with the PNG sets of two patterns, then the similarity between them is computed as follows:

sim(A, B) = 1 − |A xor B| / |A ∪ B|    (7.1)

In this way, a similarity value is computed between the test sample and each of the training set models. For classification, the class corresponding to the training sample with the highest similarity value is chosen.
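Since |A ∪ B| = |A ∩ B| + |A xor B|, Equation 7.1 is exactly the Jaccard index written via the symmetric difference. A quick check of the equivalence (illustrative function names):

```python
def sim_xor(a, b):
    """Equation 7.1: 1 - |A xor B| / |A ∪ B| (symmetric difference form)."""
    return 1 - len(a ^ b) / len(a | b)

def sim_jaccard(a, b):
    """Classical Jaccard index: |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b)
```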

7.1.4 Real-Time Recognition

A real-time version of the spike timing neural classifier is implemented using a parallelized approach based on CUDA C++, on an NVIDIA GeForce GTX 960 device. This method achieves real-time performance with a speedup of 60 compared to a simulation executed on a Core i7 285, 2.13-GHz device with a Matlab implementation.

To implement the computation described in Section 7.1.2, the number of blocks is defined based on the number of angle triplets to be processed when searching for potential PNGs. Each block has 32 threads, so the number of blocks equals the number of angle triplets divided by 32. Each thread is responsible for stimulating the trained network with one triplet of angles and determining whether any potential PNGs emerge. The steps of the real-time recognition phase are as follows: (1) acquire input data (hand positions), (2) map the input data to a spike train, (3) find all possible angle triplets, (4) find the corresponding PNGs, (5) create models, and (6) classify. Step (4) is the only part implemented on the GPU. The process executed by each thread is presented in Figure 7.1, and the overall procedure of the early recognition process is shown in Figure 7.2.

The real-time classification proceeds as follows. Since the proposed method requires at least three angles to find potential PNGs, the system starts performing pattern recognition once it has observed four hand positions (three angles). In the first step there is only one triplet of angles, (a1, a2, a3); one block with one thread is used to predict the class label with the procedures described in Section 7.1.2 and Section 7.1.3. In the second step, the input pattern has five hand positions (four angles), which yields four triplets of angles: (a1, a2, a3), (a1, a2, a4), (a1, a3, a4) and (a2, a3, a4). Since the previous step already determined whether any PNGs emerge for the triplet (a1, a2, a3), in the second step the network is stimulated only with the remaining triplets (a1, a2, a4), (a1, a3, a4) and (a2, a3, a4), using three threads at the same time. The final model at this step is the union of the PNGs formed from these three triplets of angles together with the model from the first step. In step k, the input pattern consists of the angles [a1, a2, ..., am], and the new triplets are all combinations of three angles that include am: (a1, a2, am), (a1, a3, am), (a1, a4, am), ..., (ai, aj, am), ..., (am−2, am−1, am). The model obtained at the end of this step is the union of the PNGs that emerge from stimulating these triplets of angles, combined with the model created in step k−1 for the input pattern [a1, a2, ..., am−1]. In other words, in step k the trained network is stimulated only with triplets of angles that were not used in previous steps.
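The incremental enumeration described above, where step k stimulates the network only with triplets that include the newest angle, can be sketched as:

```python
from itertools import combinations

def new_triplets(angles):
    """Triplets introduced when the last angle in `angles` arrives.

    Every triplet not containing the newest angle was already processed at an
    earlier step, so only combinations that include it are returned.
    """
    *old, newest = angles
    return [(a, b, newest) for a, b in combinations(old, 2)]
```

Across all steps, the union of these per-step lists covers every one of the C(m, 3) possible triplets exactly once, which is why each thread's PNG result never has to be recomputed.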

7.1.5 Early Classification (Intent Recognition)

Early classification is the ability of a system or robot to correctly identify patterns or human activities from incomplete observations of those patterns or actions. One of the main goals of this work is to demonstrate the early classification ability of the proposed system.

For this, the percentage of the input pattern that must be observed in order to identify the correct pattern is calculated. The early detection approach is explained in detail in Section 7.1.4. In the early steps, while the human is drawing a pattern, it is not obvious to the system which pattern is being observed, and the predicted class typically varies. However, after a certain (usually small) portion of the pattern has been observed, the system predicts the correct class. For a quantitative evaluation of this capability, the recognition results are recorded after each step, and this information is then used to compute both the recognition and the early classification results.

Figure 7.2: Real-time recognition process

7.1.6 Early Classification Measurements

Two criteria are used to evaluate the early classification performance of the proposed system. These metrics are the early detection rate and the correct duration, which are defined in Section 3.2.4.
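The exact metric definitions live in Section 3.2.4 and are not reproduced here, so the sketch below assumes one plausible reading: the early detection rate is the fraction of the pattern observed before the prediction locks onto the correct class, and the correct duration is the fraction of steps with a correct prediction. Both function names and definitions are assumptions for illustration:

```python
def early_detection_rate(step_predictions, true_label):
    """Fraction of the pattern observed before the prediction settles on the
    correct class and stays there; 1.0 if it never stabilizes."""
    n = len(step_predictions)
    for i in range(n):
        if all(p == true_label for p in step_predictions[i:]):
            return i / n
    return 1.0

def correct_duration(step_predictions, true_label):
    """Fraction of steps during which the correct class was predicted."""
    return sum(p == true_label for p in step_predictions) / len(step_predictions)
```

For a pattern whose per-step predictions are ["1", "7", "4", "4", "4"] with true class "4", this reading gives an early detection rate of 0.4 and a correct duration of 0.6.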

7.2 Results

7.2.1 In-house Dataset

The in-house hand motion dataset contains 70 training samples (7 samples per digit) and 21 testing samples. The first and third rows of Figure 7.4 illustrate all of the training samples in this dataset. Each sample was extracted from a video recorded with a high-resolution Sony camera of a human drawing a digit in front of it. To obtain the human hand position in the video, each recording was converted to a sequence of images, and the human hand position was then extracted from each image via template matching. Thus, each training sample is a sequence of 2D human hand positions (x, y).

Table 7.1: Tracked and stored information for each gesture in 6dmg dataset

Time-step (ms):             x
Position (meter):           x, y, z
Quaternion:                 w, x, y, z
Acceleration (g):           x, y, z
Angular speed (radian/s):   yaw, pitch, roll

7.2.2 6dmg Dataset

In addition to the in-house dataset, the 6dmg dataset [70] is also used to evaluate the performance of the proposed approach on an alternative dataset. 6dmg contains patterns for uppercase letters A to Z and digits 0 to 9. Letters were air-drawn by 22 right-handed subjects (17 male and 5 female), each of whom drew every letter (A to Z) 10 times, for a total of 5720 samples. Digits were air-drawn by 6 subjects, each of whom drew every digit 10 times, for a total of 600 samples. Each pattern in this dataset contains the information shown in Table 7.1.

7.2.3 Real-Time Classifier

To evaluate the efficacy of the proposed method in robotic applications, its accuracy in classifying human gestures is measured in real time. In this experiment, the human subjects drew each digit by holding a red ball in their hand and gesturing the writing of the digit for the Kinect camera onboard a PR2 robot. The red ball allows the tracking of the hand movements to be reduced to a color detection problem, thereby providing a simple, fast and sufficiently precise method to obtain the hand's position in each frame. More sophisticated hand-tracking methods developed in the area of machine vision eliminate the need to hold colored objects, but as this topic is outside the scope of this work, the experiment was not complicated with more advanced hand-tracking techniques.
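The color-detection step can be sketched as a red-pixel centroid over an RGB frame. The thresholds and function name below are illustrative; the dissertation does not specify the actual detector beyond "color detection":

```python
import numpy as np

def red_ball_position(rgb_image, r_min=150, gb_max=80):
    """Return the (row, col) centroid of strongly red pixels, or None.

    `rgb_image` is an HxWx3 array; a pixel counts as "red" when its red
    channel is high and its green/blue channels are low (placeholder
    thresholds).
    """
    r, g, b = rgb_image[..., 0], rgb_image[..., 1], rgb_image[..., 2]
    mask = (r >= r_min) & (g <= gb_max) & (b <= gb_max)
    if not mask.any():
        return None  # no ball visible in this frame
    rows, cols = np.nonzero(mask)
    return float(rows.mean()), float(cols.mean())
```

Running this per frame yields the sequence of 2D hand positions that the classifier consumes.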

For this experiment, the classifier was trained with the process described in Section 7.1, once with the in-house digits dataset and once with the 6dmg digit dataset. To train the classifier with the in-house dataset, 7 training samples were used per digit. In the testing phase, the subjects drew each digit 22 times in front of the onboard Kinect camera of a PR2 robot. Training from the 6dmg digits dataset was performed in a similar way: 7 randomly chosen samples per digit were used to train the proposed spiking neural network, and the test again consisted of each subject drawing every digit 22 times in front of the robot's onboard Kinect camera.

Figure 7.3 presents a screenshot of the proposed spike-timing classifier. In this figure, the top-left shows the source image, the top-right shows the result of early classification, the bottom-left shows the detected ball position and the bottom-right shows the color detection tracker (including the past trajectory). Although digit eight is one of the most challenging patterns due to its large overlap with other digits, as Figure 7.3 shows, the system succeeds in accurately detecting this pattern after observing only a small fragment of it. Therefore, the results not only confirm the feasibility of the proposed approach for real-time applications, but also demonstrate its capability for early classification and intent recognition.

Figure 7.3: Screenshot of our real-time classifier

Table 7.2: Classification results for each individual(%). SR: success rate, ER: error rate, RR: rejection rate (In house dataset|6dmg digit dataset).

Digit        0            1            2            3            4            5           6            7            8            9
SR      86.3 | 91    100 | 95.4   95.2 | 90.9   81.8 | 86.3   100 | 95.4    95 | 100   81.8 | 72.7   100 | 95.4   95.4 | 90.9   72.7 | 81.81
ER      13.7 | 9       0 | 4.6     4.8 | 9.1    18.2 | 13.7     0 | 4.6      5 | 0     18.1 | 27.3     0 | 4.6     4.5 | 9.1    27.3 | 18.19
RR         0 | 0       0 | 0       0 | 0        0 | 0        0 | 0       0 | 0       0 | 0        0 | 0       0 | 0        0 | 0

Table 7.3: Classification results for all digits(%). SR: success rate, ER: error rate, RR: rejection rate (In house dataset|6dmg digit dataset).

All digits    SR: 90.82 | 90    ER: 9.17 | 10    RR: 0 | 0

In order to evaluate and validate the proposed approach, three measures are used: the success rate, the rejection rate and the error rate. These measures are defined in Section 3.2.2. The values of these three metrics for both classifiers used in this experiment, for each digit individually and for all digits combined, are presented in Table 7.2 and Table 7.3 respectively.

The top-3 accuracy results (i.e., the correct class is among the three classes with the highest similarity) are also computed for each digit individually and for all digits combined. These results for the 6dmg digit and in-house datasets are shown in Table 7.4 and Table 7.5. They demonstrate that in most cases, if the correct class is not the highest-ranked one, it is either the second or third most similar class. The results show that if the top-3 accuracy is taken as the measure of performance, the overall accuracy increases from 90.82% to 97.24% on the in-house dataset and from 90% to 96.3% on the 6dmg digit dataset.

Figure 7.4: Training data: first and third row; correctly classified testing data: second and fourth row (in-house dataset)
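The top-3 measure can be sketched as follows, with per-class similarity scores as input (the score dictionaries and labels are illustrative):

```python
def top_k_hit(class_scores, true_label, k=3):
    """True if `true_label` is among the k classes with the highest similarity.

    `class_scores` maps class label -> similarity score.
    """
    ranked = sorted(class_scores, key=class_scores.get, reverse=True)
    return true_label in ranked[:k]

def top_k_accuracy(results, k=3):
    """Fraction of samples whose true class lands in the top k.

    `results` is a list of (class_scores, true_label) pairs.
    """
    hits = sum(top_k_hit(scores, label, k) for scores, label in results)
    return hits / len(results)
```

With k = 1 this reduces to the ordinary success rate, so the jump from 90.82% to 97.24% reflects how often the correct class sits in second or third place.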

To facilitate the early classification and recognition of human activities, the proposed real-time classifier records the sequence of hand positions, as well as the predicted class labels for all hand positions. A subset of the correctly classified test samples

Table 7.4: Top-3 classification results for each individual and all digits(%). SR: success rate, ER: error rate, RR: rejection rate (In house dataset|6dmg digit dataset).

Digit        0            1           2           3             4           5           6            7           8            9
SR       100 | 95.4   100 | 100   100 | 100   90.90 | 95.4   100 | 100   100 | 100   100 | 86.3   100 | 100   100 | 100   81.81 | 86.3
ER         0 | 4.6      0 | 0       0 | 0      9.09 | 4.6      0 | 0       0 | 0       0 | 13.7     0 | 0       0 | 0     18.18 | 13.7
RR         0 | 0        0 | 0       0 | 0         0 | 0        0 | 0       0 | 0       0 | 0        0 | 0       0 | 0         0 | 0

Table 7.5: Top-3 classification results for all digits(%). SR: success rate, ER: error rate, RR: rejection rate (In house dataset|6dmg digit dataset).

All digits    SR: 97.24 | 96.3    ER: 2.75 | 3.7    RR: 0 | 0

are illustrated in the second and fourth rows of Figure 7.4a. These figures show that, despite the low variability of the training data in terms of scale and translation, the proposed system can correctly classify test samples at different locations and scales in the camera's field of view, confirming that the system is scale- and translation-invariant and is able to extract the main features of the training samples and use them to classify unseen testing data.

In the confusion matrices, the predicted class is represented on the columns and the actual class on the rows. For example, in the matrix in Figure 7.4b, out of the 22 actual digit eights, the system identified one as digit zero and correctly classified the rest as digit eight. Based on the results presented in Figure 7.4b and Table 7.2, the hardest patterns to classify in the in-house dataset are digits three, six, and nine, while for the 6dmg data, the results in Figure 7.4c and Table 7.2 indicate that the most difficult patterns are digits six and nine. The reason is the significant overlap between the spatio-temporal patterns of these digits and those of other digits; some instances are the similarity of digits two and five to digit three, of digit five to digit nine, and of digit eight to digit six.

7.2.4 Comparison with Other Approaches

The performance of the proposed approach is compared with that of a Recurrent Neural Network (RNN), a Continuous Hidden Markov Model (C-HMM) and a Discrete Hidden Markov Model (D-HMM). For this comparison, the RNN is implemented using the TensorFlow library in Python [71], while the D-HMM is implemented in Matlab using the HMM toolbox for Matlab [72] and the C-HMM in Python using the hmmlearn toolbox [73]. The benchmark data selected for this comparison are the 6dmg digit and character datasets, as well as the in-house digit dataset. It is noteworthy that the offline nature of these datasets does not allow a comparison of the real-time performance of the selected classification techniques; hence, the comparison is limited to offline classification. The number of classes in the digit and character datasets is 10 and 26, respectively. Two sets of features are considered in these comparisons: those introduced for the 6dmg dataset by Chen et al. [74], and the 1D angle representation introduced by the proposed approach. The features introduced in [74] are the normalized 3D position, normalized 3D linear velocity, normalized 3D angular velocity, normalized 3D acceleration and normalized 4D quaternion. In this work, the aforementioned two sets of features have been used

Table 7.6: Specification of recurrent neural network. (LR: Learning rate, BS: Batch size, LF: Loss function, NE: Number of epochs, NN: Number of neurons, OS: Output size, CrEn: Cross entropy)

                              LR     BS   LF     NE    NN    OS   Type
RNN, 6dmg-d (1D angle)        6e-3   1    CrEn   300   150   10   LSTM
RNN, 6dmg-d ([74])            3e-3   1    CrEn   400   150   10   LSTM
RNN, In-house-d (1D angle)    2e-2   70   CrEn   300   150   10   LSTM
RNN, 6dmg-c (1D angle)        6e-3   1    CrEn   300   150   26   LSTM
RNN, 6dmg-c ([74])            3e-3   1    CrEn   400   150   26   LSTM

on the digit and character 6dmg datasets. However, only the 1D angle feature has been extracted from the in-house dataset, because it contains only 2D positions.

The results of SNN (1D angle) are compared with RNN (1D angle), RNN (Chen et al. [74]), C-HMM (1D angle), C-HMM (Chen et al. [74]), D-HMM (1D angle), and D-HMM (Chen et al. [74]) on the 6dmg dataset, while on the in-house dataset the results of SNN (1D angle) are compared against RNN (1D angle), C-HMM (1D angle) and D-HMM (1D angle). Table 7.6, Table 7.7, Table 7.8 and Table 7.9 present the parameters of the RNN, D-HMM, C-HMM and SNN, respectively. The parameters were selected by changing them gradually and computing the final accuracy; the optimal parameter value is the one that yields the best overall accuracy. The comparison results between the proposed method and the approaches described above on the 6dmg character, 6dmg digit and in-house datasets are shown in Table 7.12, Table 7.10 and Table 7.11, respectively.
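The gradual parameter tuning described above can be sketched as a greedy one-parameter-at-a-time sweep. The `evaluate` callback and the candidate grids are placeholders for the actual training-and-scoring loop:

```python
def tune(param_grids, evaluate):
    """Greedy coordinate search: vary one parameter over its grid while
    holding the others fixed, keeping the value with the best accuracy.

    `param_grids` maps parameter name -> list of candidate values;
    `evaluate` maps a {name: value} dict -> overall accuracy.
    """
    params = {name: grid[0] for name, grid in param_grids.items()}
    for name, grid in param_grids.items():
        best_val, best_acc = params[name], evaluate(params)
        for val in grid[1:]:
            trial = dict(params, **{name: val})
            acc = evaluate(trial)
            if acc > best_acc:
                best_val, best_acc = val, acc
        params[name] = best_val   # lock in the best value before moving on
    return params
```

This matches the described procedure in spirit; note that, unlike an exhaustive grid search, a greedy sweep can miss interactions between parameters.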

Either 12 or 5 training samples per digit are used for training the SNN, RNN, C-HMM and D-HMM on the 6dmg digit dataset. The comparison results are shown in Table 7.10. As the results indicate, the performance of the SNN is slightly

Table 7.7: Specification of continuous hidden markov model. (NE: Number of epochs, NC: Number of components, EF: Estimating function, Cov-t: covariance type, 6dmg-d: 6dmg digit dataset, 6dmg-c: 6dmg character dataset, EM: Expec- tation maximization algorithm, BW: Baum Welch)

                               NE    NC   EF        Cov-t
C-HMM, 6dmg-d (1D angle)       250   15   EM (BW)   diag
C-HMM, 6dmg-d ([74])           250   10   EM (BW)   diag
C-HMM, In-house-d (1D angle)   300   10   EM (BW)   diag
C-HMM, 6dmg-c (1D angle)       500   10   EM (BW)   diag
C-HMM, 6dmg-c ([74])           500   15   EM (BW)   diag

Table 7.8: Specification of discrete hidden Markov model. (NS: Number of states, NO: Number of outputs, EF: Estimating function, 6dmg-d: 6dmg digit dataset, 6dmg-c: 6dmg character dataset, EM: Expectation maximization algorithm, BW: Baum Welch)

                               NS   NO   EF
D-HMM, 6dmg-d (1D angle)       10   36   EM (BW)
D-HMM, 6dmg-d ([74])           15   36   EM (BW)
D-HMM, In-house-d (1D angle)   12   36   EM (BW)
D-HMM, 6dmg-c (1D angle)       16   36   EM (BW)
D-HMM, 6dmg-c ([74])           26   36   EM (BW)

Table 7.9: Specification of spiking neural network. (NN: Number of neurons)

                             NN    Sigma
SNN, 6dmg-d (1D angle)       269   10
SNN, In-house-d (1D angle)   225   10
SNN(1), 6dmg-c (1D angle)    325   13
SNN(2), 6dmg-c (1D angle)    538   14
SNN(3), 6dmg-c (1D angle)    718   15

better than all approaches except RNN ([74]) when the number of training samples per digit is 12. However, when the number of training samples per digit is decreased to 5, the drop in performance of the SNN is smaller than that of the other approaches. This shows that the SNN is a robust approach when only a small training dataset is available.

Table 7.10: Comparison results, 6dmg-digit dataset (%)

Classifier          Accuracy (12)   Accuracy (5)
SNN (1D angle)      87              80
RNN (1D angle)      86              71
C-HMM (1D angle)    85              74
D-HMM (1D angle)    60              25
RNN ([74])          89              75
C-HMM ([74])        84              70
D-HMM ([74])        65              38

Table 7.11: Comparison results, In house dataset, digits, angles features (%)

Classifier          Accuracy
SNN (1D angle)      85.2
RNN (1D angle)      84
C-HMM (1D angle)    80
D-HMM (1D angle)    60

In order to evaluate the performance of the proposed approach when the number of classes is greater than 10, the 6dmg character dataset (26 classes) is considered. The overall performance of three variations of the SNN, along with the other approaches, on this dataset is presented in Table 7.12. Based on these results, SNN(3) obtains better performance than SNN(1) and SNN(2). As the parameter values of SNN(3) in Table 7.9 indicate, a larger value of sigma and a bigger spiking network in terms of the number of neurons result in better performance.

Further inspection of the presented results reveals that the proposed SNN-based approach performs consistently better than the C-HMM and D-HMM, while its comparison with the RNN indicates that the latter performs either slightly better or at the same level of accuracy as the proposed approach.

Table 7.12: Comparison results, 6dmg character dataset (%)

Classifier            Accuracy
SNN (1) (1D angle)    72
SNN (2) (1D angle)    77
SNN (3) (1D angle)    80
RNN (1D angle)        82
C-HMM (1D angle)      40
D-HMM (1D angle)      55
RNN ([74])            85
C-HMM ([74])          78
D-HMM ([74])          35

7.2.5 Early Detection Results

The early detection ability of the proposed system on the in-house dataset is evaluated by analyzing the class labels from each frame, which are saved during the on-line classification stage. The early detection rate and the average correct duration are computed. The average early detection rate, as a percentage of the entire pattern, is shown for each digit in Figure 7.5. This average is around 50% for digits one and seven and less than 60% for digit four. However, for digits whose spatio-temporal patterns overlap significantly with other digits, the system needs to observe a larger portion of the pattern.

The average of correct duration percentages for all digits is presented in Figure 7.6.

This value is greater than 25% for all digits except digit three. The figure shows, as a percentage of the time during which the pattern is presented to the system, the duration for which the correct class is predicted.

Figure 7.5: Average of early detection rates (percentages)

Figure 7.6: Average of correct duration (percentages)

7.3 Conclusion

This chapter presented an online, real-time approach for early detection and classification of spatio-temporal human gestures. The classifier consists of a parallelized CUDA implementation of a spike-timing neural network with axonal conductance delays. This work brings the following key contributions: i) it provides on-line, real-time classification of spatio-temporal patterns, with the ability to provide early predictions of the patterns before their completion, ii) it requires only a small set of training data, iii) it is scale- and translation-invariant, and iv) it seamlessly handles variable-length patterns. The approach has been validated on a set of digits (0 through 9) drawn by a human hand in front of a Kinect camera mounted on a PR2 robot. In comparison with other prominent techniques, the proposed approach demonstrates superior or comparable accuracy and is suitable for both recognition and early classification of different types of human actions in time-sensitive mobile applications. Together with the demonstrated real-time performance, these features show the feasibility of this approach for human-robot applications.

Chapter 8

Conclusion and Future Work

In this dissertation I proposed five different approaches based on spiking neural networks with conductance delays for recognizing and early detection of spatio-temporal patterns. Two different datasets of handwritten digits and human hand gestures have been used to evaluate the proposed approaches. The proposed approaches offer several of the following contributions: i) they require a very small number of training examples, ii) they enable early recognition from only partial information of the pattern, iii) they learn patterns in an unsupervised manner, iv) they accept variable-sized input patterns, v) they are invariant to scale and translation, vi) they can recognize patterns in real time and, vii) they are suitable for human-robot interaction applications and have been successfully tested on a PR2 robot. The proposed methods are compared with well-known supervised approaches including but not limited to support vector machines, logistic regression and ensembles of neural networks. While the proposed methods are unsupervised, they provide better results in most cases and comparable results in a few. Based on these results, the spiking neural network is very suitable for modeling, learning, recognizing and early classifying spatio-temporal patterns.

The positive results obtained in this dissertation motivate further studies on both the theoretical and practical aspects of the approach. Some avenues of future work are as follows: i) the current setup of this approach is tuned for one sensor and a single input type; considering the multi-modality of cues in human actions, the application of the proposed approach in HRI can benefit from extending the current work to support multiple sensors and input types; ii) the current work only studies the case of 1-dimensional inputs, an extension of which would broaden applicability to real-world problems; iii) the real-time performance of the approach may be further enhanced through design alterations towards online or one-shot training techniques.

Bibliography

[1] Richard Kelley, Christopher King, Alireza Tavakkoli, Mircea Nicolescu, Monica Nicolescu, and George Bebis. An architecture for understanding intent using a novel hidden Markov formulation. International Journal of Humanoid Robotics, 5(02):203–224, 2008.

[2] Stefan Schaal. Is imitation learning the route to humanoid robots? Trends in Cognitive Sciences, 3(6):233–242, 1999.

[3] Banafsheh Rekabdar, Bita Shadgar, and Alireza Osareh. Learning teamwork behaviors approach: learning by observation meets case-based planning. In Artificial Intelligence: Methodology, Systems, and Applications, pages 195–201. Springer, 2012.

[4] Amanda L Woodward, Jessica A Sommerville, and Jose J Guajardo. How infants make sense of intentional action. Intentions and Intentionality: Foundations of Social Cognition, pages 149–169, 2001.

[5] AJung Moon, Daniel M Troniak, Brian Gleeson, Matthew KXJ Pan, Minhua Zheng, Benjamin A Blumer, Karon MacLean, and Elizabeth A Croft. Meet me where i'm gazing: how shared attention gaze affects human-robot handover timing. In Proceedings of the 2014 ACM/IEEE International Conference on Human-Robot Interaction, pages 334–341. ACM, 2014.

[6] Tirthankar Bandyopadhyay, Kok Sung Won, Emilio Frazzoli, David Hsu, Wee Sun Lee, and Daniela Rus. Intention-aware motion planning. In Algorithmic Foundations of Robotics X, pages 475–491. Springer, 2013.

[7] Kester Duncan, Sudeep Sarkar, Redwan Alqasemi, and Rajiv Dubey. Scene-dependent intention recognition for task communication with reduced human-robot interaction. In Computer Vision - ECCV 2014 Workshops, pages 730–745. Springer, 2014.

[8] Eugene M Izhikevich. Polychronization: computation with spikes. Neural Computation, 18(2):245–282, 2006.

[9] Yiannis Demiris. Prediction of intent in robotics and multi-agent systems. Cognitive Processing, 8(3):151–158, 2007.

[10] Eugene Charniak and Robert P Goldman. A Bayesian model of plan recognition. Artificial Intelligence, 64(1):53–79, 1993.

[11] Kevin P Murphy. Machine Learning: A Probabilistic Perspective. MIT Press, 2012.

[12] Sebastian Thrun, Wolfram Burgard, and Dieter Fox. Probabilistic Robotics (Intelligent Robotics and Autonomous Agents series). The MIT Press, 2005.

[13] Sheldon M Ross. Introduction to Probability and Statistics for Engineers and Scientists. Academic Press, 2009.

[14] Lawrence Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286, 1989.

[15] Christopher D Manning and Hinrich Schütze. Foundations of Statistical Natural Language Processing. MIT Press, 1999.

[16] Junji Yamato, Jun Ohya, and Kenichiro Ishii. Recognizing human action in time-sequential images using hidden Markov model. In Proceedings of the 1992 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '92), pages 379–385. IEEE, 1992.

[17] Ferdinando Samaria and Steve Young. HMM-based architecture for face identification. Image and Vision Computing, 12(8):537–543, 1994.

[18] Greg Welch and Gary Bishop. An introduction to the Kalman filter. 1995.

[19] M Sanjeev Arulampalam, Simon Maskell, Neil Gordon, and Tim Clapp. A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking. IEEE Transactions on Signal Processing, 50(2):174–188, 2002.

[20] Cody Kwok, Dieter Fox, and Marina Meila. Real-time particle filters. Proceedings of the IEEE, 92(3):469–484, 2004.

[21] Michael Isard and Andrew Blake. Contour tracking by stochastic propagation of conditional density. In Computer Vision - ECCV '96, pages 343–356. Springer, 1996.

[22] Michael Isard and Andrew Blake. Condensation - conditional density propagation for visual tracking. International Journal of Computer Vision, 29(1):5–28, 1998.

[23] Arnaud Doucet, Nando De Freitas, and Neil Gordon. Sequential Monte Carlo Methods in Practice. Springer, 2001.

[24] James Davis and Mubarak Shah. Visual gesture recognition. In IEE Proceedings - Vision, Image and Signal Processing, volume 141, pages 101–106. IET, 1994.

[25] Aaron F Bobick and Andrew D Wilson. A state-based approach to the representation and recognition of gesture. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(12):1325–1337, 1997.

[26] Mohammed Yeasin and Subhasis Chaudhuri. Visual understanding of dynamic hand gestures. Pattern Recognition, 33(11):1805–1817, 2000.

[27] Pengyu Hong, Matthew Turk, and Thomas S Huang. Gesture modeling and recognition using finite state machines. In Fourth IEEE International Conference on Automatic Face and Gesture Recognition, pages 410–415. IEEE, 2000.

[28] Richard O Duda and Peter E Hart. Pattern Classification and Scene Analysis. 1973.

[29] Rafael C Gonzalez and Richard E Woods. Digital Image Processing. 2002.

[30] Michael Kass, Andrew Witkin, and Demetri Terzopoulos. Snakes: Active contour models. International Journal of Computer Vision, 1(4):321–331, 1988.

[31] Simon Haykin. Neural Networks: A Comprehensive Foundation. 2004.

[32] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

[33] Aditya Ramamoorthy, Namrata Vaswani, Santanu Chaudhury, and Subhashis Banerjee. Recognition of dynamic hand gestures. Pattern Recognition, 36(9):2069–2081, 2003.

[34] Ming-Hsuan Yang and Narendra Ahuja. Recognizing hand gestures using motion trajectories. In Face Detection and Gesture Recognition for Human-Computer Interaction, pages 53–81. Springer, 2001.

[35] Marcus Liwicki, Horst Bunke, et al. HMM-based on-line recognition of handwritten whiteboard notes. In Tenth International Workshop on Frontiers in Handwriting Recognition, 2006.

[36] Jianying Hu, Sok Gek Lim, and Michael K Brown. Writer independent on-line handwriting recognition using an HMM approach. Pattern Recognition, 33(1):133–147, 2000.

[37] Claus Bahlmann and Hans Burkhardt. The writer independent online handwriting recognition system frog on hand and cluster generative statistical dynamic time warping. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(3):299–310, 2004.

[38] Moussa Djioua and Rejean Plamondon. A new algorithm and system for the characterization of handwriting strokes with delta-lognormal parameters. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(11):2060–2072, 2009.

[39] Alex Graves, Marcus Liwicki, Santiago Fernández, Roman Bertolami, Horst Bunke, and Jürgen Schmidhuber. A novel connectionist system for unconstrained handwriting recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(5):855–868, 2009.

[40] Dan Claudiu Ciresan, Ueli Meier, Luca Maria Gambardella, and Jürgen Schmidhuber. Deep, big, simple neural nets for handwritten digit recognition. Neural Computation, 22(12):3207–3220, 2010.

[41] Eric R Kandel, James H Schwartz, Thomas M Jessell, et al. Principles of Neural Science, volume 4. McGraw-Hill, New York, 2000.

[42] Eugene M Izhikevich et al. Simple model of spiking neurons. IEEE Transactions on Neural Networks, 14(6):1569–1572, 2003.

[43] Xiaoli Tao and Howard E Michel. Data clustering via spiking neural networks through spike timing-dependent plasticity. In IC-AI, pages 168–173, 2004.

[44] Hélène Paugam-Moisy, Régis Martinez, and Samy Bengio. Delay learning and polychronization for reservoir computing. Neurocomputing, 71(7):1143–1158, 2008.

[45] Sadegh Karimpouli, Nader Fathianpour, and Jaber Roohi. A new approach to improve neural networks' algorithm in permeability prediction of petroleum reservoirs using supervised committee machine neural network (SCMNN). Journal of Petroleum Science and Engineering, 73(3):227–232, 2010.

[46] Michael Beyeler, Nikil D Dutt, and Jeffrey L Krichmar. Categorization and decision-making in a neurobiologically plausible spiking network using a STDP-like learning rule. Neural Networks, 48:109–124, 2013.

[47] Eugene M Izhikevich, Joseph A Gally, and Gerald M Edelman. Spike-timing dynamics of neuronal groups. Cerebral Cortex, 14(8):933–944, 2004.

[48] Paul Jaccard. The distribution of the flora in the alpine zone. New Phytologist, 11(2):37–50, 1912.

[49] Edgar Osuna, Robert Freund, and Federico Girosi. Support vector machines: Training and applications. Technical Report AIM-1602, MIT Artificial Intelligence Laboratory, 1997.

[50] Su-In Lee, Honglak Lee, Pieter Abbeel, and Andrew Y Ng. Efficient L1 regularized logistic regression. In Proceedings of the National Conference on Artificial Intelligence, volume 21, page 401. AAAI Press / MIT Press, 2006.

[51] Banafsheh Rekabdar, Mahmood Joorabian, and Bita Shadgar. Artificial neural network ensemble approach for creating a negotiation model with ethical artificial agents. In Artificial Intelligence and Soft Computing, pages 493–501. Springer, 2012.

[52] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3):27, 2011.

[53] Donald W Marquardt. An algorithm for least-squares estimation of nonlinear parameters. Journal of the Society for Industrial & Applied Mathematics, 11(2):431–441, 1963.

[54] Katsuya Kawanami and Noriyuki Fujimoto. GPU accelerated computation of the longest common subsequence. In Facing the Multicore-Challenge II, pages 84–95. Springer, 2012.

[55] Banafsheh Rekabdar, Monica Nicolescu, Richard Kelley, and Mircea Nicolescu. An unsupervised approach to learning and early detection of spatio-temporal patterns using spiking neural networks. Journal of Intelligent & Robotic Systems, 80(1):83–97, 2015.

[56] Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H Witten. The WEKA data mining software: an update. ACM SIGKDD Explorations Newsletter, 11(1):10–18, 2009.

[57] Remco R Bouckaert. Bayesian network classifiers in Weka. Department of Computer Science, University of Waikato, 2004.

[58] David D Lewis. Naive (Bayes) at forty: The independence assumption in information retrieval. In Machine Learning: ECML-98, pages 4–15. Springer, 1998.

[59] Dennis W Ruck, Steven K Rogers, Matthew Kabrisky, Mark E Oxley, and Bruce W Suter. The multilayer perceptron as an approximation to a Bayes optimal discriminant function. IEEE Transactions on Neural Networks, 1(4):296–298, 1990.

[60] Mark JL Orr et al. Introduction to radial basis function networks, 1996.

[61] Andy Liaw and Matthew Wiener. Classification and regression by randomForest. R News, 2(3):18–22, 2002.

[62] Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, and Pierre-Antoine Manzagol. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. The Journal of Machine Learning Research, 11:3371–3408, 2010.

[63] Banafsheh Rekabdar, Monica Nicolescu, Mircea Nicolescu, Mohammad Taghi Saffar, and Richard Kelley. A scale and translation invariant approach for early classification of spatio-temporal patterns using spiking neural networks. Neural Processing Letters, pages 1–17, 2015.

[64] Eugene M. Izhikevich, Joseph A. Gally, and Gerald M. Edelman. Spike-timing

dynamics of neuronal groups. Cerebral Cortex, 14(8):933–944, 2004. doi: 10.

1093/cercor/bhh053. URL http://cercor.oxfordjournals.org/content/14/

8/933.abstract.

[65] Geoffrey Hinton, Simon Osindero, and Yee-Whye Teh. A fast learning algorithm

for deep belief nets. Neural computation, 18(7):1527–1554, 2006.

[66] Miguel A Carreira-Perpinan and Geoffrey E Hinton. On contrastive divergence

learning. In Proceedings of the tenth international workshop on artificial intelli-

gence and statistics, pages 33–40. Citeseer, 2005.

[67] Rasmus Berg Palm. Prediction as a candidate for learning deep hierarchical

models of data. Technical University of Denmark, Palm, 2012. Bibliography 129

[68] V´acl´avChvatal and David Sankoff. Longest common subsequences of two random

sequences. Journal of Applied Probability, pages 306–315, 1975.

[69] Jesper Sj¨ostr¨omand Wulfram Gerstner. Spike-timing dependent plasticity. Spike-

timing dependent plasticity, page 35, 2010.

[70] Mingyu Chen, Ghassan AlRegib, and Biing-Hwang Juang. 6dmg: A new 6d mo-

tion gesture database. In Proceedings of the 3rd Multimedia Systems Conference,

pages 83–88. ACM, 2012.

[71] Martın Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen,

Craig Citro, Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al.

Tensorflow: Large-scale machine learning on heterogeneous distributed systems.

arXiv preprint arXiv:1603.04467, 2016.

[72] Kevin Murphy. Hmm toolbox for matlab. Internet: http://www. cs. ubc. ca/˜

murphyk/Software/HMM/hmm. html,[Oct. 29, 2011], 1998.

[73] Implementation of hidden markov models in python. URL http://hmmlearn.

readthedocs.io/en/latest/.

[74] Mingyu Chen, Ghassan AlRegib, and Biing-Hwang Juang. Air-writing recog-

nitionpart i: Modeling and recognition of characters, words, and connecting

motions. IEEE Transactions on Human-Machine Systems, 46(3):403–413, 2016.