UNSUPERVISED IMAGE FEATURE LEARNING FOR CONVOLUTIONAL NEURAL NETWORKS

A THESIS SUBMITTED TO THE UNIVERSITY OF MANCHESTER FOR THE DEGREE OF DOCTOR OF PHILOSOPHY IN THE FACULTY OF SCIENCE AND ENGINEERING

2019

By Richard Hankins
Candidate
Department of Electrical and Electronic Engineering
School of Engineering

Contents

Abstract

Declaration

Copyright

Acknowledgements

List of Publications

1 Introduction
  1.1 Background
  1.2 Aims and Objectives
  1.3 Scope
  1.4 Structure

2 Classical Methodologies
  2.1 Feature Representations and Learning
    2.1.1 Hand-crafted Feature Representations
    2.1.2 Unsupervised Feature Learning
  2.2 Classifiers
    2.2.1 Logistic Regression
    2.2.2 Support Vector Machines
    2.2.3 k-nearest Neighbours and Decision Trees
  2.3 Datasets
    2.3.1 Image Datasets
    2.3.2 Video Datasets

3 Deep Neural Networks
  3.1 Introduction
    3.1.1 Feedforward Networks
  3.2 Related Work
  3.3 Convolutional Neural Networks
    3.3.1 Layers and Architectures
    3.3.2 Other Layers
    3.3.3 Optimisation and the Backpropagation Algorithm
    3.3.4 Pre-processing
    3.3.5 Issues
  3.4 2D Convolutional Neural Networks Case Studies
    3.4.1 Image Classification on the MNIST Dataset
    3.4.2 Action Classification on the Weizmann Dataset
  3.5 3D Convolutional Neural Networks Case Study
    3.5.1 Action Classification on the UCF Sports Dataset

4 Self-Organising Map Network
  4.1 Introduction
  4.2 Related Work
  4.3 Methodology
    4.3.1 Convolutional Self-Organising Map
    4.3.2 Discrete Cosine Transform (DCT)
    4.3.3 Markov Random Field
    4.3.4 Self-Organising Map Network (SOMNet)
    4.3.5 Markov Random Field Self-Organising Map Network (MRF-SOMNet)
    4.3.6 Computational Complexity
  4.4 Experiments and Discussion
    4.4.1 Comparison of Different Features and Encodings
    4.4.2 Evaluation on the MNIST Dataset
    4.4.3 Optimising Parameters on the CIFAR-10 Dataset
    4.4.4 Evaluation on the CIFAR-10 Dataset
  4.5 Conclusions and Future Work

5 SOMNet with Aggregated Channel Connections
  5.1 Introduction

  5.2 Related Work
  5.3 Methodology
    5.3.1 Proposed Method
  5.4 Experiment and Discussion
    5.4.1 Whitening
    5.4.2 Digit Classification on the MNIST Dataset
    5.4.3 Object Classification on the CIFAR-10 Dataset
  5.5 Conclusion and Future Work

6 Filter Replacement in Convolutional Networks using Self-Organising Maps
  6.1 Introduction
  6.2 Related Work
  6.3 Methodology
    6.3.1 Proposed Method
    6.3.2 Self-Organising Maps
    6.3.3 Convolutional Neural Networks
    6.3.4 Filter Replacement with Self-Organising Maps
  6.4 Object Classification Experiments and Discussion on the CIFAR-10 and CIFAR-100 Datasets
    6.4.1 Convolutional Neural Networks
    6.4.2 Filter Replacement with Self-Organising Maps
  6.5 Action Classification Experiments and Discussion on the UCF-50 Dataset
    6.5.1 Convolutional Neural Networks
    6.5.2 Filter Replacement with Self-Organising Maps
  6.6 Conclusion and Future Work

7 Conclusions and Future Work
  7.1 Conclusions
  7.2 Future Work
    7.2.1 SOMNet
    7.2.2 Supervised Channel Pooling
    7.2.3 Combining Supervised and Unsupervised Learning
    7.2.4 Temporal Models

Bibliography

Word Count: 40269

List of Tables

3.1 Validation and test error as well as intra-class error using different subjects as the test set on Weizmann.
3.2 Absolute misclassification for each class using different subjects as the test set on Weizmann.
3.3 3D CNN architecture for UCF Sports
3.4 3D baseline: accuracy on UCF Sports
3.5 3D bounding box: accuracy on UCF Sports
3.6 Accuracy on UCF Sports

4.1 Computational Complexity
4.2 Comparing features and encodings
4.3 Error rate on MNIST
4.4 Variations in block size and overlapping ratio of SOMNet on CIFAR-10
4.5 Variations in feature numbers of SOMNet on CIFAR-10
4.6 Accuracy on CIFAR-10

5.1 FAC layer: error rate on MNIST (p-values represent the comparison between each method with the method labelled not applicable (N/A) for each section).
5.2 SAC layer: error rate on MNIST (p-values represent the comparison between each method with the method labelled not applicable (N/A) for each section).
5.3 GAC layer: error rate on MNIST (p-values represent the comparison between each method with the method labelled not applicable (N/A) for each section).
5.4 Error rate on MNIST (p-values represent the comparison between each method with the method labelled not applicable (N/A) for each section).

5.5 FAC layer: accuracy on CIFAR-10 (the p-values represent the comparison between each method with the method labelled not applicable (N/A)).
5.6 SAC layer: accuracy on CIFAR-10 (p-values represent the comparison between each method with the method labelled not applicable (N/A)).
5.7 GAC layer: accuracy on CIFAR-10 (p-values represent the comparison between each method with the method labelled not applicable (N/A)).
5.8 Accuracy on CIFAR-10 (p-values represent the comparison between each method with the method labelled not applicable (N/A) for each section).

6.1 Baseline 2D CNN architecture for CIFAR-10/CIFAR-100
6.2 Baseline 3D CNN architecture for UCF-50

6.3 Proposed 2D CNN_{NIN}+SOM architecture for CIFAR-10/100
6.4 Baseline 2D CNN accuracy on CIFAR-10 (p-values represent the comparison between each method with the method labelled not applicable (N/A)).
6.5 Baseline 2D CNN accuracy on CIFAR-100 (p-values represent the comparison between each method with the method labelled not applicable (N/A)).
6.6 2D CNN_{NIN}+SOM accuracy on CIFAR-10
6.7 2D CNN_{NIN}+SOM accuracy on CIFAR-10
6.8 2D baseline CNN vs CNN+SOM subset accuracy on CIFAR-10
6.9 2D baseline CNN vs CNN+SOM_{TI} subset accuracy on CIFAR-100
6.10 2D CNN_{NIN}+SOM_{TI} accuracy on CIFAR-10 (p-values represent the comparison between each method with the method labelled not applicable (N/A) in each column).
6.11 2D CNN_{NIN}+SOM_{TI} using 3 × 3 filters accuracy on CIFAR-100 (p-values represent the comparison between each method with the method labelled not applicable (N/A) in each column).
6.12 2D CNN_{NIN}+SOM_{TI} using 5 × 5 filters accuracy on CIFAR-100 (p-values represent the comparison between each method with the method labelled not applicable (N/A) in each column).
6.13 2D CNN_{NIN}+SOM_{TI} using 7 × 7 filters accuracy on CIFAR-100 (p-values represent the comparison between each method with the method labelled not applicable (N/A) in each column).
6.14 Accuracy on CIFAR-10 and CIFAR-100 (p-values represent the comparison between each method with the method labelled not applicable (N/A) for each section).
6.15 Baseline 3D CNN accuracy on UCF-50 (p-values represent the comparison between each method with the method labelled not applicable (N/A) for each section in each column).
6.16 3D CNN+SOM_{TI} accuracy on UCF-50
6.17 3D CNN+SOM_{TI} accuracy on UCF-50 (p-values represent the comparison between each method with the method labelled not applicable (N/A) for each section).
6.18 Accuracy on UCF-50

List of Figures

2.1 Logistic sigmoid function
2.2 A selection of examples from the MNIST dataset
2.3 Frames from example videos from each class of the Weizmann dataset.
2.4 Frames from example videos from each class of the UCF Sports dataset.
2.5 Frames from example videos from each class of the UCF-50 dataset.
2.6 Frames from example videos from each class of the UCF-101 dataset.

3.1 Perceptron model (adapted from Rosenblatt 1958 [151])
3.2 Multi-layer perceptron with a single hidden layer
3.3 Non-linear sigmoidal functions
3.4 Simple cells (adapted from Hubel 1995 [83])
3.5 Complex cells (adapted from Hubel 1995 [83])
3.6 Non-linear activation functions
3.7 LeNet-5 architecture (adapted from LeCun et al. 1998 [112]).
3.8 First layer convolutional filters at different stages of training on MNIST.
3.9 Confusion matrix for the MNIST experiment
3.10 Comparison of 2D (a) and 3D (b) convolutions (the temporal depth of the 3D filter is equal to 3). The colours indicate shared weights (adapted from Ji et al. 2013 [89])
3.11 Confusion matrix for the 3D baseline experiment on UCF Sports.
3.12 Confusion matrix for the 3D bounding box experiment on UCF Sports.

4.1 Block diagram of SOMNet. SOM was used to derive the convolutional layer filter banks, as depicted in Figure 4.2 [68].
4.2 Training the SOM-based filter banks [68].
4.3 Learned SOMNet filters on MNIST. Top row: the first layer filter bank. Bottom row: the second layer filter bank.

4.4 Learned PCANet filters on MNIST. Top row: the first layer filter bank. Bottom row: the second layer filter bank.
4.5 Generated DCTNet filters for MNIST. Replicated across both filter banks.
4.6 Clustered MRF filters for MNIST. Replicated across both filter banks.
4.7 Learned SOMNet filters on CIFAR-10. Top five rows: the first layer filter bank. Bottom four rows: the second layer filter bank.
4.8 Learned PCANet filters on CIFAR-10. Top five rows: the first layer filter bank. Bottom row: the second layer filter bank.

5.1 Application of proposed aggregation layers to a two layer SOMNet. The SOM-based filter banks correspond to the convolutional layers in Fig. 4.1. Each SOM layer is trained and frozen prior to training any subsequent SOM layer [69].
5.2 Learned second layer SOMNet + FAC filters on MNIST. Top four rows: SOMNet_{8-32} architecture. Bottom four rows: SOMNet_{8-32} + FAC architecture.
5.3 Learned second layer SOMNet + FAC filters on MNIST. Top four rows: SOMNet_{16-32} architecture. Bottom four rows: SOMNet_{16-32} + FAC architecture.
5.4 Learned second layer SOMNet + SAC filters on MNIST. Top four rows: SOMNet_{8-32} architecture. Bottom four rows: SOMNet_{8-32} + SAC_6 architecture.
5.5 Learned second layer SOMNet + SAC filters on MNIST. Top four rows: SOMNet_{16-32} architecture. Bottom four rows: SOMNet_{16-32} + SAC_8 architecture.
5.6 Learned second layer SOMNet + GAC filters on MNIST. Top four rows: SOMNet_{8-32} architecture. Bottom four rows: SOMNet_{8-32} + GAC_4 architecture.
5.7 Learned second layer SOMNet + GAC filters on MNIST. Top four rows: SOMNet_{16-32} architecture. Bottom four rows: SOMNet_{16-32} + GAC_4 architecture.
5.8 Learned second layer SOMNet + FAC filters on CIFAR-10. Top four rows: SOMNet_{40-32} architecture. Bottom four rows: SOMNet_{40-32} + FAC architecture.
5.9 Learned second layer SOMNet + SAC filters on CIFAR-10. Top four rows: SOMNet_{40-32} architecture. Bottom four rows: SOMNet_{40-32} + SAC_2 architecture.
5.10 Learned second layer SOMNet + GAC filters on CIFAR-10. Top four rows: SOMNet_{40-32} architecture. Bottom four rows: SOMNet_{40-32} + GAC_2 architecture.

6.1 Learned SOM_{TI} filters on CIFAR-10 using M = 20 × 20 and s = 3.
6.2 Learned SOM_{TI} filters on CIFAR-10 using M = 30 × 30 and s = 5.
6.3 Learned SOM_{TI} filters on CIFAR-10 using M = 30 × 30 and s = 7.
6.4 Accuracy of different SOM sizes (M) for CNN_{NIN}+SOM_{TI} on CIFAR-100

6.5 Learned SOM_{TI} filters on UCF-50 using M = 8 × 8 and t_d = s = 3 (therefore each filter is of size 3 × 3 × 3). (a)–(c) represent each slice of t_d, where t_d is the temporal depth or the number of frames.

Abstract

Robust image classification is a challenging task, since approaches must successfully discriminate between the different classes whilst being able to generalise across a large amount of intra-class variation. In an extension of image classification to the temporal domain, video classification aims to assign accurate human action labels to video sequences. Recently, deep learning and in particular the convolutional neural network (CNN) has made great strides in many computer vision and machine learning tasks. CNNs implicitly learn data-specific hierarchies of salient features with multiple levels of abstraction. However, the increased capacity of these convolutional networks requires vast labelled datasets in order to optimise their parameters. Unsupervised learning offers potential solutions to this problem as it does not require labels and can simply learn the structure of the data. In this work, the current state-of-the-art convolutional networks for both image and video classification, and alternative strategies for feature learning using unsupervised learning, are investigated. In particular, the use of the self-organising map (SOM) to learn unsupervised features, to be used independently or in conjunction with other supervised feature learning methods, in the application of image and video classification, is explored. Firstly, the versatile nature of SOMs is exploited to extend and improve a simple multi-layer unsupervised architecture inspired by PCANet, named SOMNet; secondly, SOMNet is further extended via the proposal of novel unsupervised feature aggregation layers; thirdly, SOMs are used as fixed lower layer weights of CNNs in a novel approach to deep learning pre-training. Comprehensive experiments are conducted on a wide range of datasets, and SOM-based filters are found to maintain or improve classification performance in the majority of cases, even when labelled data is scarce. The wide variety of uses and applications explored in this work demonstrates the robust and versatile nature of simple unsupervised SOM-based approaches and warrants their continued relevance in feature learning, even in the age of deep learning.

Declaration

No portion of the work referred to in this thesis has been submitted in support of an application for another degree or qualification of this or any other university or other institute of learning.

Copyright

i. The author of this thesis (including any appendices and/or schedules to this thesis) owns certain copyright or related rights in it (the “Copyright”) and s/he has given The University of Manchester certain rights to use such Copyright, including for administrative purposes.

ii. Copies of this thesis, either in full or in extracts and whether in hard or electronic copy, may be made only in accordance with the Copyright, Designs and Patents Act 1988 (as amended) and regulations issued under it or, where appropriate, in accordance with licensing agreements which the University has from time to time. This page must form part of any such copies made.

iii. The ownership of certain Copyright, patents, designs, trade marks and other intellectual property (the “Intellectual Property”) and any reproductions of copyright works in the thesis, for example graphs and tables (“Reproductions”), which may be described in this thesis, may not be owned by the author and may be owned by third parties. Such Intellectual Property and Reproductions cannot and must not be made available for use without the prior written permission of the owner(s) of the relevant Intellectual Property and/or Reproductions.

iv. Further information on the conditions under which disclosure, publication and commercialisation of this thesis, the Copyright and any Intellectual Property and/or Reproductions described in it may take place is available in the University IP Policy (see http://documents.manchester.ac.uk/DocuInfo.aspx?DocID=487), in any relevant Thesis restriction declarations deposited in the University Library, The University Library's regulations (see http://www.manchester.ac.uk/library/aboutus/regulations) and in The University's policy on presentation of Theses.

Acknowledgements

Firstly, I would like to thank my supervisor Dr Hujun Yin for his continued support and patience, and Dr Roelof van Silfhout for his useful comments during the first and second year vivas. I would like to thank my colleagues Ali AlSuwaidi, Shireen Zaki, Jing Huo and especially Yao Peng, for their invaluable input and counsel. I would also like to thank my family for all their love and encouragement. Last, but by no means least, I would like to thank my partner Rebecca, for without her love and support I do not think this would have been possible.

List of Publications

[1] Richard Hankins, Yao Peng, and Hujun Yin. SOMNet: Unsupervised feature learning networks for image classification. In International Joint Conference on Neural Networks (IJCNN), pages 1221–1228. IEEE, 2018.

[2] Richard Hankins, Yao Peng, and Hujun Yin. Towards complex features: Competitive receptive fields in unsupervised deep networks. In International Conference on Intelligent Data Engineering and Automated Learning (IDEAL), pages 838–848. Springer, 2018.

[3] Yao Peng, Richard Hankins, and Hujun Yin. Data-independent feature learning with Markov random fields in convolutional neural networks. Neurocomputing, In press 2019.

[4] Corneliu T.C. Arsene, Richard Hankins, and Hujun Yin. Deep learning models for denoising of ECG signals. In European Signal Processing Conference (EUSIPCO). IEEE, In press 2019.

Chapter 1

Introduction

1.1 Background

The classification of images has been a long-standing task of the machine learning and computer vision communities. There are numerous applications for successful implementations, such as medical image analysis, visual geolocation, image retrieval, autonomous systems and augmented reality. For example, medical image analysis could assist in diagnoses and potentially suggest suitable treatment options. With regard to image retrieval, images could be searched for by content instead of their associated title. Or, for a given image, it could be possible to locate particular key regions, such as the location of a particular object. For the successful implementation of autonomous vehicles, the detection of objects in the road, such as cars, people and road signs, will be necessary. Robust object location will also be useful in other autonomous applications such as robotics or manufacturing.

In recent years, recognition applied to 3D data in the form of videos has gained considerable traction, in part due to its many applications, both online and offline, in a wide range of areas. For instance, the classification of human actions has potential applications including surveillance and security, video retrieval, healthcare and entertainment, among others [133, 143]. In particular, the automatic recognition of

abnormal behaviour would be very advantageous in surveillance applications and in the monitoring of patients in healthcare environments. Similarly, action recognition could help in the rehabilitation of patients by examining rehabilitation exercises to ensure they are performed correctly. From a biological perspective, action recognition, which can also encompass other related areas such as gesture recognition and facial recognition, is highly important. Historically, the ability to recognise such movements was necessary for detecting predators and selecting prey [55]. Whilst in modern times these skills are less useful, recognition still plays a vital role in areas such as social communication [55].

The task of vision-based data classification is to assign a named label to an input. Classical approaches to vision-based data classification vary greatly; however, most methods involve two main steps: data representation and classification. Many challenges arise regarding these two steps which make data classification a difficult proposition. The main difficulties for vision-based classification stem from the large amount of intra-class variability attributed to differences in appearance, scale, viewpoint, illumination, deformations, misalignments, occlusions and backgrounds. There are also very specific variations concerning action classification from videos. For example, actions can deviate greatly due to style and duration, and the backgrounds in which they are filmed can be dynamic. Moreover, the classification must allow for the many sub-actions that do not belong to any one action class.

Considerable effort has been made in the area of data representations, where features are often manually designed; however, this requires considerable knowledge and time. In addition, for some tasks it can be difficult to know what features should be extracted. Consider the amount of knowledge required to complete everyday tasks; much of this is intuitive, subjective and difficult to describe [60]. Examples of good hand-crafted approaches include Gabor features [123], local binary patterns (LBP)

[137], histogram of oriented gradients (HOG) [38] and scale invariant feature transform (SIFT) [121]. The job of the classifier is then to learn how each of the features correlates with the different labels; however, it cannot influence how the features are defined, placing great importance on the representation of the data it is presented with [60]. Whilst more complex classifiers have been proposed [16, 31], they still rely heavily on the representations to adequately separate factors of variation.

In recent years, the field of deep learning has made many advances in computer vision and machine learning tasks [11]. Deep learning has previously been known under many names, including connectionism and artificial neural networks (ANN). Its recent rebranding can be attributed to the work of Hinton et al. on a type of neural network called the deep belief network (DBN) [75]. This focused the research community's attention on the concept of depth and coined the term “deep learning”. Deep learning offers a plausible way of jointly extracting features at multiple levels of abstraction and learning a classifier based on these multiple levels of features. This allows the learning of not only the mapping from representation to output, but also the representations themselves. Deep learning is responsible for the current state-of-the-art in areas such as object detection and localisation, recognition and image segmentation [32], as well as other related problems such as speech recognition [37]. It has also been successfully used to assist pharmaceutical companies in the design of new drugs [37], to beat the current world champion at the game of Go [160] and to search for subatomic particles [7]. Its main advantages over previous shallow feed-forward networks are the multiple layers of representations made possible by the deeper structure. Different layers of the network are responsible for different levels of data abstraction, where representations are expressed as the combination of other, more primitive representations [110]. The convolutional neural network (CNN) [111, 112] is a type of deep learning model that uses shared weights to learn local features globally, and it is currently achieving the state-of-the-art in image classification [101], even demonstrating recognition rates that exceed human performance in some cases [25]. However, the cost of such performance is a significant increase in the complexity of the networks, and subsequently the number of parameters, and the need for vast labelled datasets. Yet, the inherent parallel nature of neural networks and convolutional operations can be taken advantage of in order to increase throughput. In particular, the availability of general-purpose computing solutions such as CUDA and OpenCL, which allow GPUs to be used for general-purpose processing, has made them a popular option in the pursuit of faster training times [24]. In fact, much of deep learning's recent success can be attributed to improvements in hardware and data availability. Given these recent improvements, algorithms that have existed since the 1980s are now known to work reasonably well [60]. Initially, there was a focus on unsupervised methods which could generalise well from small datasets, as well as using pre-training to overcome the problems induced by increased complexity.
However, as larger labelled datasets have become standard, these methods have fallen out of favour, and tasks for which data is limited are often tackled by training the network on a larger, related dataset first, using a process called “transfer learning”. Yet, larger datasets present problems with annotation; automatic approaches prove difficult [60, 105] and manually verifying labels is labour intensive.

Despite the popularity of these convolutional networks, the feature learning mechanism and optimal configurations are not well understood [18], and they are often used as a black box. In addition, it has been found that deep networks can confidently classify unrecognisable images [136] and have difficulties recognising adversarial examples which are modified in ways imperceptible to human vision [171]. This research aims to help understand deep learning more fully, particularly the convolutional neural network, as well as to investigate alternative unsupervised representation learning techniques. Whilst unsupervised learning has become unpopular in recent years, it appears that supervised deep learning can only progress with ever-increasing complexity and datasets, which is unsustainable. In contrast, unsupervised learning can take advantage of the plentiful unlabelled data in the age of “big data”.

1.2 Aims and Objectives

Over the course of this PhD research it is intended that I will develop a thorough understanding of the field of data classification, its current practices and their limitations. Considering the advances made by deep learning in the area of data classification, the main aim of this work is to continue investigating deep learning and, in particular, the convolutional neural network for both image and video classification. However, given the drawbacks of the current supervised approach, as discussed in Section 1.1, I will be focusing on alternative unsupervised strategies for representation learning. This aim will be accomplished by completing the following objectives:

• Investigate the principles of feature learning in deep learning and, in particular, the convolutional neural network

• Investigate alternative unsupervised representation learning techniques while maintaining a convolutional structure

• Develop a deep unsupervised convolutional structure for image classification

• Combine unsupervised representation learning with a deep convolutional network for image and video classification

1.3 Scope

Firstly, in reference to data classification, this research is mainly concerned with image (2D) and video (3D) examples. Whilst other types of data will be mentioned in this research, they will only be discussed in reference to the main areas of focus. In addition, whilst there are other related lines of enquiry, such as detection and segmentation, these will not be discussed in any great detail. Specifically, in terms of image classification, only digit and object classification are considered, and therefore other tasks such as facial recognition are not included. Whereas for video classification, only human action classification is considered, and other similar tasks, such as gesture recognition, are not included. Furthermore, whilst data classification could be conducted using a variety of sensory systems, this report will only consider visual-based approaches, from images and videos. However, methods that use multiple cameras will not be explored.

With particular regard to the classification of actions from video, there are a wide range of taxonomies used in the literature. In [147], the author uses a hierarchy of action primitive, action and activity. Action primitives describe limb-level motions, such as moving a leg forward. Actions describe whole-body motions which contain action primitives, such as walking or running, and activities describe events which encompass multiple actions. Therefore, playing football would be considered an activity because it contains actions such as running, kicking and jumping. However, there are some instances where the differences are not always clear; the environment and the interactions between persons or objects can alter perceptions. For example, lifting an object such as a cup could be considered an action primitive; however, the sport of weightlifting could be considered an activity. In addition, the datasets used for evaluating differing approaches do not always make a distinction, often combining examples of action primitives, actions and activities. In consideration of this, this report will use the word action in a generic way to refer to action primitives, actions and activities.

Lastly, whilst this research is conducted in the fields of computer vision and machine learning, no attempt will be made here to describe the entirety of these vast subjects, and only areas directly relevant to the project as a whole will be covered.

1.4 Structure

This thesis is arranged in the following order:

In Chapter 2 an overview of classical approaches to image and video classification is provided. Classical methodologies usually split the task of data classification into two distinct stages. Firstly, inputs are generally represented as higher-level abstractions which can be hand-crafted or learned. Secondly, the resultant descriptors are used to train a classifier. Various methods for both stages are discussed and reviewed, along with video-specific representations and classifiers. In addition, an overview of the datasets used during this research is presented along with an examination of their usage.

In Chapter 3 deep learning and its ability to learn multiple levels of features is discussed. Specifically, convolutional neural networks (CNNs) are explained in detail and some pre-existing approaches to both image and video classification using CNNs are examined. A brief history of progress in neural networks from the early perceptron to the current state-of-the-art CNN is presented. Along the way, the benefits of the multi-layer perceptron, which enabled the joint learning of features and classifier via the backpropagation algorithm, are highlighted. A more thorough discussion of CNNs, their application and current constraints, especially with regard to supervised learning, is given. Lastly, a selection of case study experiments on image and video classification is presented.

In Chapter 4 a multi-level unsupervised SOM-based convolutional architecture for image classification is proposed and examined on two datasets. The architecture provides a simple, transparent approach to multi-level unsupervised feature learning which can take advantage of numerous unlabelled data, in comparison to more complex supervised deep learning models. An examination of different unsupervised and generated feature types is presented, along with their application to a multi-level architecture. Furthermore, various experiments are performed in order to fine-tune the proposed architecture. A modified encoding technique is also proposed.

In Chapter 5 the connections between layers for unsupervised multi-layer architectures are explored, using the proposed method from Chapter 4. Various channel aggregation techniques are proposed in order to facilitate the efficient learning of high-level representations based on the features from previous layers, which requires no additional parameter learning. A thorough investigation of the proposed aggregation layers under different conditions is undertaken.

In Chapter 6 a CNN is combined with filters trained in an unsupervised manner, in order to investigate whether improvements in accuracy can be achieved when replacing low-level features with efficiently trained unsupervised alternatives. Various experiments are performed, including examinations of the filter size and number, on both image and video datasets. In addition, the proposed method is further explored when labelled training data is scarce and using transfer learning.

The report ends with a conclusion which summarises the main findings of this thesis, followed by a thorough discussion on future work.

Chapter 2

Classical Methodologies

This section reviews and discusses the conventional approaches to data classification. Whilst this report attempts to cover a wide variety of methods, it should not be considered an exhaustive list of techniques, only a review of what the author considers to be the most relevant avenues of past investigation.

A task in machine learning is usually determined by how it processes data. A collection of examples is defined as a dataset. Prior to the explosion in the popularity of supervised deep learning, the task of classification was traditionally split into two parts. Firstly, an example that the system is tasked with processing is represented in some measurable way. For instance, for an image or video, the example input or set of features could simply be raw pixel values or some further hand-crafted or learned abstraction. Typically, an input example is described by a vector x ∈ R^n where each x_i represents a separate feature. Common representations and feature learning methodologies are discussed in Section 2.1. Secondly, once an example is sufficiently described, it is the job of the classifier to determine a label for the input example. Examples of commonly employed classifiers are discussed in Section 2.2. The task of classification is generally considered as supervised learning. Feature representations

on the other hand are not considered to be fixed to any one paradigm and many approaches are actually considered learning-free. In addition, datasets that have been used during this research are described and discussed in Section 2.3. Whilst the author has endeavoured to include the most relevant information, for further information on some of the areas discussed here, please consider the following review articles [147, 188]. For more information on machine learning in general, please see [14, 132].

2.1 Feature Representations and Learning

Features are extracted such that they can be used to robustly discriminate between classes, whilst maintaining the ability to generalise over intra-class variations. Methods for extracting features vary greatly. In addition, there are very specific methods for action classification from videos. Representations are generally hand-crafted or learned unsupervisedly. These two main areas are discussed in this section, as well as methods which have been specifically developed for action classification.

Features can be considered as global or local representations. Global features encode the input or a region-of-interest (ROI) as a whole in a dense manner, and tend to deliver robust, computationally inexpensive results [188]. Examples of simple global features are shape, contours [119] or silhouettes [188]. Yet these methods can overly rely on accurate localisation for success, with many applications assuming that an input contains a single example, or the segmentation of an example [119]. Although automatic detectors exist for certain objects, such as faces [179], they also suffer from occlusions, clutter [119], viewpoint and appearance variations [147], and sensitivity to noise due to poor localisation [147]. In order to partly overcome some of the problems associated with global representation, grid-based approaches are often used. Whilst the input assumption still applies, the input is split into different spatial and/or temporal regions, which provide invariance to local variations [38].

When using local feature representations, images are described as a collection of independent patches which are sampled at generally sparse interest points or dense locations. Local features are more robust to changes in viewpoint and appearance, background clutter and occlusion, and do not necessarily require localisation [119]. The location of local features can be determined by interest point detectors, which should locate regions that are composed of high local information. In addition, interest points should be robust to global and local variations, such that interest points can be accurately reproduced [38, 119]. Generally, corner detectors and blob detectors are used to detect interest points. Corner detectors, such as Harris [104], locate local regions which have large variations in all directions, whereas blob detectors, such as the Difference of Gaussians (DoG) [121], locate local extrema of transform responses. Other options include edge detectors or wavelet operators, such as Gabor filters [40]. Space-time interest point detectors can generally be considered adaptations of 2D interest point detectors to 3D [40, 104, 183]. In this way, videos are considered as volumes, and the interest points indicate the location in space and time where significant local spatial and temporal variations in the videos occur.

Due to the often different numbers of unordered descriptors created by interest point detectors, final representations are often made using bag-of-features (BoF) or bag-of-words (BoW) methods [34]. BoF or BoW methods describe a set of high-dimensional keypoint descriptors by a histogram of visual words [177]. In order to first construct the visual vocabulary, clustering techniques are used and the visual word for a given descriptor is assigned based on its proximity to clustering prototypes or other descriptors. However, some work has shown that dense sampling of regularly placed regions outperforms sparse interest point methods in realistic scenarios [183].
In this way, the number of descriptors is known a priori and therefore BoF or BoW methods are not necessary. Yet, it is noted that dense sampling produces a large number of features compared with other methods, and therefore methods to decrease the number of descriptors, such as BoF or BoW, can still be applied [107].
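To make the BoF/BoW encoding concrete, the following is a minimal sketch, assuming scikit-learn and NumPy are available; the vocabulary size and function names are illustrative choices rather than values taken from the cited works. A vocabulary is clustered from a pool of local descriptors, and each image is then encoded as a normalised histogram of its nearest visual words.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(descriptor_pool, k=200, seed=0):
    """Cluster a pool of local descriptors (n x d array) into k visual words."""
    return KMeans(n_clusters=k, random_state=seed, n_init=10).fit(descriptor_pool)

def bow_histogram(vocab, image_descriptors):
    """Encode one image's descriptors as an L1-normalised histogram of visual words."""
    words = vocab.predict(image_descriptors)                   # nearest prototype per descriptor
    hist = np.bincount(words, minlength=vocab.n_clusters).astype(float)
    return hist / (hist.sum() + 1e-12)                         # normalise so image size does not matter
```

The resulting fixed-length histograms can then be fed to any of the classifiers discussed in Section 2.2, regardless of how many local descriptors each image produced.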

2.1.1 Hand-crafted Feature Representations

Hand-crafted representations, as their name implies, are generally designed for the target problem, and therefore can be very problem dependent and can rely overly on domain-specific knowledge. Whilst some methods have been demonstrated to work for many different tasks, their success depends on some degree of adaptation or fine-tuning. In addition, hand-crafted methods can be quite complex and thus expert knowledge is sometimes required to employ them.

2.1.1.1 Histogram of Oriented Gradients (HOG)

One of the most popular grid-based methods is the histogram of oriented gradients (HOG). HOG [38] is a feature descriptor, primarily used for object detection, that counts occurrences of gradient orientations within localised regions of an image. The gradient is computed for each pixel location, as well as its corresponding magnitude and direction, using suitable filter kernels. For example, the following derivative kernels can be applied in the x and y directions of the image via convolution [38]:

  −1       Dx = −1 0 1 ,Dy =  0  (2.1)     1

Ix = I ∗ Dx,Iy = I ∗ Dy (2.2) where I is the input image, Ix and Iy are the derivative images in the x and y directions, and ∗ represents the convolution operator. The magnitude |G| and the orientation θ can CHAPTER 2. CLASSICAL METHODOLOGIES 29 be calculated using the following:

q 2 2 Ix |G| = Ix + Iy ,θ = arctan (2.3) Iy

The image is then divided into either rectangular or radial cells. Within each cell a histogram is constructed by cumulatively adding the magnitude of each pixel's gradient vector into quantised bins of signed or unsigned direction. If the direction of the gradient vector is between two bins, then the magnitude is proportionally split between them. Lastly, blocks are formed by sliding rectangular or radial windows and the corresponding histograms are concatenated into a one-dimensional vector. The vector is normalised to have unit length, which provides local contrast normalisation and better invariance to illumination [38]. The final HOG feature vector is the concatenation of all block vectors.

The use of gradients or edge detectors has been shown to capture object appearance and shape well [38]. However, gradients are sensitive to variations in textures, material properties and illumination [188]. Furthermore, HOG is generally invariant to local affine transformations due to spatial pooling of the histograms [38]. Although HOG was primarily designed for object detection it has also been used for face [39], object [50] and action [183] recognition. Felzenszwalb et al. [50] used HOG to extract both global and local information for deformable parts-based object recognition. The authors captured a structural global root filter and several local part-based appearance filters to better handle deformable objects.
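The following is a minimal NumPy sketch of the gradient and cell-histogram stages described above (Eqs. 2.1–2.3), omitting the block normalisation step; the cell size, number of bins and orientation convention are illustrative assumptions rather than the exact configuration of [38].

```python
import numpy as np

def hog_cells(image, cell=8, bins=9):
    """Unsigned-orientation HOG cell histograms for a greyscale image (no block normalisation)."""
    img = image.astype(float)
    # Central differences approximate convolution with D_x = [-1 0 1] and its transpose D_y.
    gx = np.zeros_like(img); gy = np.zeros_like(img)
    gx[:, 1:-1] = img[:, 2:] - img[:, :-2]
    gy[1:-1, :] = img[2:, :] - img[:-2, :]
    mag = np.hypot(gx, gy)                                   # gradient magnitude |G|
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0             # unsigned orientation in [0, 180)

    h, w = img.shape
    ch, cw = h // cell, w // cell
    bin_width = 180.0 / bins
    hist = np.zeros((ch, cw, bins))
    for i in range(ch):
        for j in range(cw):
            m = mag[i * cell:(i + 1) * cell, j * cell:(j + 1) * cell].ravel()
            a = ang[i * cell:(i + 1) * cell, j * cell:(j + 1) * cell].ravel()
            b = np.minimum((a // bin_width).astype(int), bins - 1)
            # Accumulate gradient magnitudes into quantised orientation bins.
            hist[i, j] = np.bincount(b, weights=m, minlength=bins)
    return hist
```

In a full descriptor these cell histograms would then be grouped into overlapping blocks, normalised to unit length and concatenated, as described in the text.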

2.1.1.2 Gabor Filters

Gabor filters are linear filters which have been claimed to model simple cells in the visual cortex of mammalian brains [123], and have been found useful for applications such as texture classification and face recognition [92]. The filters can be tuned to various frequencies and orientations. In 2D, Gabor filters are a Gaussian modulated by a complex plane wave [92]:

$$ \Re[\psi(x,y)] = \frac{f^2}{\pi\gamma\eta}\, e^{-\left(\frac{f^2}{\gamma^2}x'^2 + \frac{f^2}{\eta^2}y'^2\right)} \cos(2\pi f x') \tag{2.4} $$

$$ \Im[\psi(x,y)] = \frac{f^2}{\pi\gamma\eta}\, e^{-\left(\frac{f^2}{\gamma^2}x'^2 + \frac{f^2}{\eta^2}y'^2\right)} \sin(2\pi f x') \tag{2.5} $$

where x' = x cos θ + y sin θ and y' = −x sin θ + y cos θ, and f and θ define the central frequency and orientation of the filter, respectively. γ and η determine the spread or bandwidth of the filter in the x and y axes, respectively, and γ/η the aspect ratio of the Gaussian. Whilst the filters have a complex form, the real or imaginary components can be used individually. The Gabor response is achieved by the convolution of Gabor filters of different orientations and frequencies with an image, which results in the extraction of useful feature responses which can be combined to form a final descriptor [184]. The standard Gabor filter bank has five frequencies and eight orientations [184], although other configurations can be used [92].
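As an illustration, the following NumPy sketch samples the real part of Eq. 2.4 on a discrete grid and builds a small bank; the kernel size and the particular frequency values are assumptions for demonstration, not parameters taken from [92] or [184].

```python
import numpy as np

def gabor_kernel(f, theta, gamma=1.0, eta=1.0, size=31):
    """Real part of the 2D Gabor filter of Eq. 2.4, sampled on a size x size grid."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    xp = x * np.cos(theta) + y * np.sin(theta)       # rotated coordinate x'
    yp = -x * np.sin(theta) + y * np.cos(theta)      # rotated coordinate y'
    envelope = np.exp(-(f**2 / gamma**2 * xp**2 + f**2 / eta**2 * yp**2))
    return (f**2 / (np.pi * gamma * eta)) * envelope * np.cos(2 * np.pi * f * xp)

# An illustrative bank: five frequencies and eight orientations, as noted in the text.
bank = [gabor_kernel(f, t)
        for f in (0.25, 0.177, 0.125, 0.088, 0.0625)
        for t in np.arange(8) * np.pi / 8]
```

Convolving an image with each kernel in the bank and pooling the responses yields the kind of multi-frequency, multi-orientation descriptor described above.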

2.1.1.3 Local Binary Patterns (LBP)

Originally designed for texture description, local binary patterns [137] evaluate each pixel by considering its value with respect to its neighbours. Specifically, for a given central pixel, a comparison is made by thresholding it against each of its eight neighbours in a 3 × 3 neighbourhood. If the value of the central pixel is greater than its neighbour, the neighbour is denoted 0, otherwise 1. This results in an eight-digit binary readout called a local binary pattern or LBP code, with 2^8 possible combinations. Each neighbourhood is considered in the same order to produce consistent codes, which are generally converted to decimal so that a histogram can be computed. Formally, the

LBP code is given as:

$$ \mathrm{LBP}(x,y) = \sum_{p=0}^{P-1} 2^{p}\, s(i_p - i) \tag{2.6} $$

where (x, y) are the coordinates of the central pixel with value i, i_p is the value of neighbouring pixel p and s is the sign function:

$$ s(x) = \begin{cases} 1, & \text{if } x \geq 0 \\ 0, & \text{otherwise} \end{cases} \tag{2.7} $$

An extension of LBP [138] used a variable neighbourhood size and an arbitrary number of neighbours to allow for variations in scale and feature type. To facilitate this, circular neighbourhoods and bi-linear interpolation of pixel values were used. Another extension is the use of uniform patterns, for which patterns are considered uniform if they contain at most two bit-wise transitions, which demonstrated improved performance on texture classification [138]. In addition, LBP's ability to handle variations in illumination has made it useful for a number of other applications, such as face recognition [3]. Ahonen et al. [3] combined histograms of local binary patterns constructed over separate regions to create a global descriptor which encodes both local textures and shape to represent face images.
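A minimal NumPy sketch of the basic 3 × 3 operator of Eqs. 2.6–2.7 and its histogram descriptor is given below; the particular neighbour ordering is an arbitrary but consistent choice, as the text only requires that it be the same for every pixel.

```python
import numpy as np

def lbp_image(img):
    """Basic 3x3 LBP codes (Eqs. 2.6-2.7) for the interior pixels of a greyscale image."""
    img = img.astype(int)
    c = img[1:-1, 1:-1]                                   # central pixels
    # Eight neighbours in a fixed clockwise order; each contributes one bit 2^p.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros_like(c)
    for p, (dy, dx) in enumerate(offsets):
        neighbour = img[1 + dy:img.shape[0] - 1 + dy, 1 + dx:img.shape[1] - 1 + dx]
        codes += (neighbour >= c) * (1 << p)              # s(i_p - i) = 1 when neighbour >= centre
    return codes

def lbp_histogram(img):
    """256-bin histogram of LBP codes, the usual texture descriptor."""
    return np.bincount(lbp_image(img).ravel(), minlength=256)
```

As in [3], such histograms can be computed over separate image regions and concatenated to combine local texture with coarse spatial layout.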

2.1.1.4 Scale Invariant Feature Transform (SIFT)

SIFT [121] combines an interest point detector and descriptor to provide local features for object detection. Firstly, it finds interest points at local extrema using the Difference-of-Gaussian (DoG) function on scale-space (multi-scale) representations. Local extrema are found by comparing each pixel with its neighbours over multiple scales to find local maxima and minima of the DoG volume function. Keypoints with low contrast, or those close to an edge response, are discarded. The remaining keypoints are then oriented by the dominant orientation and local descriptors are formed at each site. Descriptors are formed over 16 × 16 regions around each keypoint. Within each region, histograms of gradients are formed for each 4 × 4 subregion, using eight bins of orientation and linear interpolation. The final descriptor is formed by the concatenation of each histogram within the region around the keypoint, and therefore has a dimensionality of 16 × 8 = 128. The magnitudes of the gradients are weighted by a Gaussian function with the standard deviation set to half the region size. Since the descriptor incorporates multiple scales and is represented relative to its dominant orientation, it provides both scale and rotation invariance [121]. In addition, SIFT is also partially robust to illumination variations and local affine distortions [121].

SIFT descriptors are generally computed at sparse, scale-invariant key locations which are rotated for orientation alignment. However, there are other cases, such as dense SIFT [184], where the SIFT descriptor is used at dense locations, which can improve efficiency and performance [49, 107]. Lazebnik [107] employs a global dense SIFT, based on spatial pyramids, for object recognition, and shows improvements against orderless bag-of-features approaches. However, the dataset used contained objects which are centred and occupy the majority of the image, and is therefore more suitable for global statistics.

Another approach similar to SIFT is Speeded Up Robust Features (SURF) [8], which uses an Integral Image representation and employs a Determinant of the Hessian blob detector to locate interest points. Haar wavelet responses are used for orientation and keypoint description. SURF has been shown to be faster than SIFT, and can lead to superior accuracy in some applications [8].
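For reference, the detector/descriptor pipeline outlined above is available off the shelf; the sketch below assumes the opencv-python package (cv2) with SIFT support and simply extracts DoG keypoints and 128-dimensional descriptors from a greyscale image. These descriptors could, for example, feed the BoF/BoW encoding sketched earlier in this section.

```python
import cv2

def sift_descriptors(grey_image):
    """Detect DoG keypoints and compute 128-D SIFT descriptors for a greyscale image."""
    sift = cv2.SIFT_create()
    keypoints, descriptors = sift.detectAndCompute(grey_image, None)
    # descriptors has shape (num_keypoints, 128), or is None if no keypoints were found.
    return keypoints, descriptors
```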

2.1.1.5 Action Representations

This section discusses some of the approaches to representations for action classification. Although not an exhaustive review, it gives an overview of methods specifically used for action classification from video. Whilst some methods explicitly extract temporal information, others choose to extract frame-wise spatial information only. In this case, changes in the temporal domain are generally handled during the classification stage; however, there are some methods which completely disregard temporal information, relying solely on spatial representations.

Chen et al. [21] used a star skeleton which was extracted by first finding the posture contours of a background-subtracted agent. However, both silhouettes and contours can be sensitive to viewpoint variations and occlusions. Efros et al. [46] used optical flow in agent-bounded boxes to recognise actions in sport footage. Optical flow is used to describe relative motion by tracking features between frames. However, the very nature of optical flow means that it considers all image differences to be a result of motion and not caused by other variations between frames, such as illumination [188]. Whilst optical flow does not use background subtraction, it can benefit from accurate localisation through tracking, in order to reduce noise from dynamic backgrounds and motion caused by camera movement [147]. Polana and Nelson [146] applied a grid-based approach by accumulating the optical flow in separate non-overlapping regions. Subramania and Suresh [168] did similar work by applying several different grid designs to a human-centred bounding box. The feature vector was derived by calculating the mean optical flow within each cell on each grid (a minimal sketch of such a descriptor is given at the end of this section).

Other methods attempt to encode both spatial and temporal information together. Bobick and Davis [15] used motion energy images (MEI) and motion history images (MHI), which incorporate both spatial and temporal information into a single static 2D representation by aggregating a sequence of successive background-subtracted frames. This was extended further by Gorelick et al. [62], who concatenate multiple frames to achieve spatio-temporal volumes or so-called Space-Time Volumes (STV), to capture agent pose and dynamics.
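To illustrate the grid-based mean optical flow descriptors discussed above (in the spirit of [146, 168]), the following sketch assumes OpenCV's dense Farnebäck flow and greyscale 8-bit frames; the grid size and flow parameters are illustrative assumptions rather than the configurations used in those works.

```python
import cv2
import numpy as np

def grid_flow_descriptor(prev_frame, next_frame, grid=(4, 4)):
    """Mean dense optical flow per grid cell, concatenated into one feature vector."""
    flow = cv2.calcOpticalFlowFarneback(prev_frame, next_frame, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)  # (H, W, 2) flow field
    h, w = flow.shape[:2]
    gh, gw = grid
    feats = []
    for i in range(gh):
        for j in range(gw):
            cell = flow[i * h // gh:(i + 1) * h // gh,
                        j * w // gw:(j + 1) * w // gw]
            feats.append(cell.reshape(-1, 2).mean(axis=0))          # mean (dx, dy) in the cell
    return np.concatenate(feats)
```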

2.1.2 Unsupervised Feature Learning

Unsupervised algorithms attempt to learn useful properties about the structure of a dataset without any additional information. Specifically, they attempt to either explicitly or implicitly learn the probability distribution p(x). Unsupervised algorithms attempt to represent the input in a way that preserves as much information as possible, whilst at the same time being a simpler representation of the input. Since unsupervised algorithms do not require a label during the learning process, they can take advantage of unlabelled or weakly labelled data. However, they rely on the assumption that unsupervised representations can successfully discriminate between classes, which may not be the case. A subset of unsupervised algorithms are clustering algorithms, which attempt to group similar examples.

2.1.2.1 Principal Component Analysis

Principal component analysis (PCA) [80] is an unsupervised algorithm that effectively reduces the dimensionality of an input by identifying the principal directions in which the data varies. PCA learns a representation that has a lower dimensionality than the input and whose elements have no linear correlation. It achieves this by projecting from M to N (N < M) dimensions, for which PCA will define a matrix W with N vectors. PCA finds principal directions in the projected space which maximise the variance. In particular, the first principal direction or component accounts for the most variance, with each succeeding principal component accounting for the next greatest variance whilst also being orthogonal to all preceding principal components. Consider a set of input examples X = [x_1, ..., x_n] which is mean-normalised by subtracting the mean from each example. Projecting X along the proposed axis W is given by Y = W^T X and therefore:

$$
\begin{aligned}
\mathrm{var}[Y] &= E\big[(W^T X - E[W^T X])^2\big] \\
&= E\big[(W^T (X - E[X]))^2\big] \\
&= W^T E\big[(X - E[X])(X - E[X])^T\big] W \\
&= W^T C W
\end{aligned} \tag{2.8}
$$

where C is the sample covariance matrix defined as:

$$ C = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})(x_i - \bar{x})^T = \frac{1}{n} X X^T \tag{2.9} $$

In order to maximise the variance, W is chosen to maximise W^T C W. This is achieved by using eigendecomposition, which decomposes a matrix into eigenvectors and eigenvalues. The principal components are given by the eigenvectors of C. The eigenvector with the highest eigenvalue corresponds to the first principal component. W can also be factorised using singular value decomposition (SVD). PCA is one of the simplest and earliest forms of dimensionality reduction.
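A minimal NumPy sketch of PCA via eigendecomposition of the sample covariance (Eqs. 2.8–2.9) is given below; it assumes the rows of X are examples and the columns are features, which is the transpose of the column-example convention used in the equations above.

```python
import numpy as np

def pca(X, n_components):
    """PCA via eigendecomposition of the sample covariance matrix.
    X is (n_samples, n_features); returns the projection matrix W and the projected data."""
    Xc = X - X.mean(axis=0)                      # mean-normalise each feature
    C = Xc.T @ Xc / Xc.shape[0]                  # sample covariance matrix (Eq. 2.9)
    eigvals, eigvecs = np.linalg.eigh(C)         # eigh: symmetric matrix, eigenvalues ascending
    order = np.argsort(eigvals)[::-1]            # sort components by decreasing variance
    W = eigvecs[:, order[:n_components]]
    return W, Xc @ W
```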

2.1.2.2 k-means Clustering

k-means clustering is a form of vector quantisation which aims to learn k centroids, whose means each represent a cluster of closest examples. This results in the data being partitioned into Voronoi cells or regions, whereby each cluster consists of a region which is occupied by data that are closer to their centroid than any other. k-means is a popular tool for cluster analysis and feature learning.

Specifically, given a set of input examples X = [x_1, ..., x_m]^T, k-means aims to cluster the examples into k (≤ m) centroids {s_1, ..., s_k} by minimising the variance at each centroid:

$$ \underset{s}{\arg\min} \sum_{i=1}^{k} \sum_{x \in s_i} \lVert x - \mu_i \rVert^2 \tag{2.10} $$

It achieves this by iterating between two steps until convergence. Firstly, each example is assigned to the randomly initialised centroid whose mean has the least squared Euclidean distance. Secondly, each centroid's mean is updated to reflect the new assignment of examples. The algorithm has converged when the assignments no longer change. Each centroid is referred to as a prototype for its respective cluster. The simplicity and efficiency of k-means have made it attractive for feature learning, and competitive results have been demonstrated on image classification with patch-based k-means trained filters [27], although the correct pre-processing (PCA-whitened input) and encoding are necessary [27, 117].
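The two alternating steps can be written down directly; the following is a minimal NumPy sketch of Lloyd's algorithm for Eq. 2.10, with random initialisation from the data and a fixed iteration cap as illustrative choices.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Lloyd's algorithm: alternate assignment and centroid update until convergence."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # initialise from random examples
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned examples.
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):              # assignments have stabilised
            break
        centroids = new_centroids
    return centroids, labels
```

When used for patch-based feature learning as in [27], X would hold whitened image patches and the learned centroids would act as convolution-style filters.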

2.1.2.3 Self-Organising Maps

First introduced by Kohonen [98], the self-organising map (SOM) uses a competitive Hebbian learning-based approach in order to quantise an input space, whilst maintaining the input's topological structure. As with k-means, the input space is represented by a reduced set of discrete prototypes.

Specifically, let each unit i in the SOM denote a reference vector w_i = [w_1, w_2, ..., w_z]^T ∈ R^z with equal dimension to the input vector. Prototype vectors or neurons are commonly arranged in a 2D rectangular or hexagonal grid. Iteratively, at each time t, the winner or best matching unit is found by minimising the distance between the input x(t) and all the neurons of the map:

$$ bmu(t) = \underset{i \in \Omega}{\arg\min}\, \lVert x(t) - w_i \rVert \tag{2.11} $$

and the weights of the winner and its neighbours are updated according to

$$ \Delta w_i(t) = LR(t)\, \eta(bmu, i, t)\, [x(t) - w_i(t)] \tag{2.12} $$

where Ω is the set of neuron indices, LR is the monotonically decreasing learning rate, $\eta(bmu, i, t) = \exp\left[-\frac{\lVert r_{bmu} - r_i \rVert^2}{2\sigma(t)^2}\right]$ is the Gaussian neighbourhood function with r_i being the location of neuron i on the map and σ the effective range of the neighbourhood, which decreases with time t. The neighbourhood is important to the function of the SOM and helps avoid the sensitivity to initialisation and outliers that other clustering algorithms, such as k-means, can suffer from [4]. However, the initialisation and topology are still important factors regarding the convergence of the map [131]. Alternative neighbourhood functions, such as the “bubble”, can be used; however, the Gaussian is the most popular [192]. In addition to being monotonically decreasing, LR also satisfies the following [192]:

1. $0 < LR(t) < 1$
2. $\lim_{t \to \infty} \sum LR(t) \to \infty$
3. $\lim_{t \to \infty} LR(t) \to 0$

Considered as a non-linear version of PCA [192], SOM is a well-known data mining tool and has been used extensively for computer vision tasks [128, 131]. In recent years, more attention has been given to applying SOM to the task of local feature learning [4, 30]. There have been many extensions to SOM, such as: neural gas (NG) [125], which defines the neighbourhood based on the input space instead of a pre-defined output space; growing neural gas (GNG) [51], which, given the absence of any topological constraint, allows nodes to be created and destroyed; and grow when required (GWR) [124], which enables nodes to be created and destroyed at each iteration, unlike GNG which only allows this at iterations which are a multiple of a pre-defined constant.
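Returning to the standard SOM, the following is a minimal NumPy sketch of online training with Eqs. 2.11–2.12 on a rectangular grid; the map size and the exponential decay schedules for the learning rate and neighbourhood width are illustrative assumptions, not the settings used later in this thesis.

```python
import numpy as np

def train_som(X, rows=8, cols=8, n_iters=10_000, lr0=0.5, sigma0=3.0, seed=0):
    """Online SOM training: find the BMU (Eq. 2.11) and apply the update rule (Eq. 2.12)."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(rows * cols, X.shape[1]))              # one weight vector per map unit
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    for t in range(n_iters):
        x = X[rng.integers(len(X))]                             # present one random example
        bmu = np.argmin(((W - x) ** 2).sum(axis=1))             # best matching unit (Eq. 2.11)
        lr = lr0 * np.exp(-t / n_iters)                         # decaying learning rate LR(t)
        sigma = sigma0 * np.exp(-t / n_iters)                   # shrinking neighbourhood range
        dist2 = ((grid - grid[bmu]) ** 2).sum(axis=1)           # squared grid distance to the BMU
        h = np.exp(-dist2 / (2 * sigma ** 2))                   # Gaussian neighbourhood function
        W += lr * h[:, None] * (x - W)                          # update winner and neighbours (Eq. 2.12)
    return W.reshape(rows, cols, -1)
```

Trained on image patches, the reshaped weight grid can be visualised directly as a topologically ordered filter bank, which is the use made of SOMs in later chapters.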

2.2 Classifiers

Once appropriate features have been captured, the next stage is to perform classification. The task of classification is to assign one of k category labels to an input example.

Specifically, the algorithm learns a function f : R^n → {1, ..., k}. In some cases f outputs a probability distribution over the classes. The following section details some of the many classification techniques employed. Classification is generally considered to be supervised, for which additional information in the form of an associated value or label y is afforded to the task, and classifiers typically learn to predict y given x by estimating the distribution p(y|x). The label y acts as a teacher signal, instructing the algorithm what to learn. The labels can be collected automatically or by a human, although in many cases automatic annotation can be difficult [60, 105].

For the classification of videos there are some particular models that use sequences of inputs in order to make a prediction. Unlike direct classification methods, where the representation would be required to encode temporal information, if at all, these methods place the emphasis on the classifier to capture motion information. Chen et al. [21] used hidden Markov models (HMMs) to model the probabilistic dependence between states and observations in order to recognise actions from star skeleton posture sequences. Yamato et al. [189] also used HMMs to represent different tennis strokes. Mici et al. [127] used a recurrent echo state network for the task of daily action recognition. They integrate pose information, in the form of three-dimensional articulated body joint positions from RGB-D videos, with ground-truth labels for the presence of manipulated objects.

Some methods for classification combine multiple classifiers. Boosting frameworks, such as AdaBoost, can be used to combine multiple weak learners to create a single strong classifier [179]. Another popular method for combining classifiers is the cascaded classifier [70] introduced by Viola and Jones [179] for object and face detection. This method decomposes a strong classifier into a pipeline of several classifiers which discard irrelevant regions in stages, only outputting a non-negative detection if all stages of the cascade are passed. By filtering out less promising regions, more focus can be given to regions which contain more relevant features.

Figure 2.1: Logistic sigmoid function

2.2.1 Logistic Regression

Logistic regression generalises linear regression for the purposes of classification by using a binomial probability distribution. Unlike linear regression, which uses a linear function, logistic regression uses a logistic sigmoid function φ (Fig. 2.1) to model binary values, as opposed to numeric ones. The logistic function effectively squashes the output of a linear function so that the values lie in the interval (0,1). These values can then be interpreted as a probability (with the probability of one class determining the probability of the other):

p(y = 1 | x; θ) = φ(θᵀx)   (2.13)

where y is the output, x is the input and θ represents the parameters. Unlike linear regression, logistic regression has no closed-form solution and therefore a solution can be found by maximising the log-likelihood. The maximum likelihood estimator of θ is defined as:

θ_ML = argmax_θ ∏_{i=1}^{m} p_model(x_i; θ)   (2.14)

where m is the number of input examples and p_model(x; θ) is a parametric family of probability distributions. For convenience, the product can be converted into a sum by taking logarithms:

θ_ML = argmax_θ ∑_{i=1}^{m} log p_model(x_i; θ)   (2.15)

Multinomial logistic regression, also known as softmax, is an extension of logistic regression to multi-class problems (where there are more than two states of the dependent variable), which produces a probability distribution over the classes. Softmax is calculated using equation 2.16.

p(y = j | x; θ) = softmax(θᵀx)_j = exp(θ_jᵀx) / ∑_{k=1}^{K} exp(θ_kᵀx)   (2.16)

where y is the output, x is the input, θ_j are the parameters for the output neuron that represents the jth class and K is the number of classes.
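As a small illustration of equation 2.16, the NumPy sketch below computes the softmax probabilities for a single input; subtracting the maximum logit is a standard numerical-stability trick and an implementation detail rather than part of the definition, and the parameter values are arbitrary.

```python
import numpy as np

def softmax(logits):
    """Turn the class scores theta_j^T x into a probability distribution."""
    z = logits - np.max(logits)       # stabilise the exponentials
    e = np.exp(z)
    return e / np.sum(e)

theta = np.array([[0.2, -0.5], [1.0, 0.3], [-0.4, 0.8]])   # one parameter row per class
x = np.array([1.5, -0.7])
probs = softmax(theta @ x)            # p(y = j | x; theta) for each class j
print(probs, probs.sum())             # the probabilities sum to one
```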

2.2.2 Support Vector Machines

Support vector machines (SVMs) [16, 31] are one of the most popular classification tools in machine learning [27, 70, 105, 177, 183]. The goal of the SVM is to maximise the margin between classes in a feature space. It achieves this by selecting a hyperplane which maximises the distance to the closest training examples. These points are called the support vectors. This separating hyperplane then categorises new examples. This model is similar to logistic regression in that it uses a linear function; however, it is non-probabilistic and therefore does not output probabilities, but provides a class identity instead [60].

Consider data that has two classes and is linearly separable in two dimensions. Two parallel hyperplanes are selected which separate the classes whilst maximising the distance between themselves. The region bounded by these two hyperplanes is termed the “margin” and the hyperplane that resides halfway between them is the maximum-margin hyperplane. Given this maximum-margin hyperplane, a positive output predicts the positive class and, conversely, a negative output the negative class. The introduction of slack variables, which control the violation of the maximum-margin hyperplane by the training data, allows for a soft-margin classifier. The SVM can be extended to multiclass problems by reducing them to multiple binary classifiers (one-vs-all [183] or one-vs-one). SVMs can also be applied to non-linear problems by the application of a kernel [70, 81, 105, 177], which maps inputs to a high-dimensional feature space in which classes become linearly separable. In fact, the application of a kernel SVM is equivalent to applying a kernel to the input and then using a linear model in the transformed space. It can be shown that the linear function can be rewritten as [60]:

wᵀx + b = b + ∑_{i=1}^{m} α_i xᵀx_i = b + ∑_{i=1}^{m} α_i K(x, x_i)   (2.17)

where x_i is a training example, α is a vector of coefficients and K(x, x_i) is the kernel function. Since the kernel function is fixed, only α needs optimising, and therefore, whilst the kernel may transform the input to a non-linear space, the decision function remains linear in the transformed space [60].

SVMs learn an α vector which is sparse, and therefore only non-zero terms are evaluated at validation time [60]. Yet, non-linear support vector machines can be expensive to train when the training set is large [81]. However, given a sufficiently large number of features, linear methods can work just as well, and also limit the number of parameters to optimise [81]. Training of SVMs is usually performed by quadratic programming [16]. All experiments using SVMs in this thesis were conducted using the LIBLINEAR toolbox [48].
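As a brief, hedged illustration (not the experimental setup used in this thesis), the snippet below trains a linear soft-margin SVM on a toy two-class problem using scikit-learn's LinearSVC, which is itself built on the LIBLINEAR library; the data and the value of C are purely illustrative.

```python
import numpy as np
from sklearn.svm import LinearSVC   # linear SVM backed by LIBLINEAR

# Two roughly separable 2D clusters as toy training data.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 1.0, (50, 2)), rng.normal(2.0, 1.0, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

clf = LinearSVC(C=1.0)              # C controls the soft-margin (slack) penalty
clf.fit(X, y)
print(clf.predict([[-1.5, -1.0], [2.5, 1.0]]))   # expected classes: 0 and 1
```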

2.2.3 k-nearest Neighbours and Decision Trees

Non-parametric nearest neighbour approaches, such as k-nearest neighbours, are simple tools for classification. A representation is classified based on the majority vote of its k nearest neighbours in the feature space formed by a training set. Therefore, there is no explicit learning process, and only the choice of k and the neighbourhood definition must be considered. By varying k, the model can be made more (small k) or less (large k) flexible [43]. Many different distance metrics can be employed, which affect the results of the classification. Gorelick et al. [62] used a 1-NN classifier combined with a Euclidean distance metric. Dollár et al. [40] used the χ2 distance. Often, inverse distance weightings are applied to the neighbours so that the contribution from neighbours of greater proximity is higher than that of those further away. However, the accuracy is dependent on a large training set, which results in a higher computational cost [147]. Thus, nearest neighbour algorithms are sometimes used in conjunction with clustering or dimensionality reduction methods [146]. Due to the simplicity of this algorithm, k-nearest neighbours can provide uncomplicated explanations of results [43].

Decision trees [114] also separate the input space into specific class regions by using a tree structure, where each node effectively represents a conditional statement. Each node represents a different region of the input, with each child node representing further subregions. The end of a branch or edge is called a leaf node, where each leaf represents a non-overlapping region of the input. Leaf nodes are usually mapped to specific output classes. Greedy algorithms are often used to split the input space at the nodes at various points, until the best split is found [43]. For classification, a cost function, usually information gain, measures how well each node splits the input data into its respective classes, maximising the decrease in entropy [43]. The algorithm can be constrained by choosing stopping criteria, such as setting a maximum tree depth or a minimum number of training examples per node. Since decision boundaries are axis-aligned, even simple linear decision functions along an arbitrary direction would need to be estimated with multiple axis-aligned splits which step back and forth across the true decision function [60]. Whilst conditional decisions can be adapted to encompass linear combinations of features, this can still result in complex trees when combined to solve non-trivial cases. However, decision trees can be simply expressed as a set of rules, enabling transparency of results [43].

Whilst nearest neighbour classifiers and decision trees have their limitations, they are extremely efficient and generally easy to understand.
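The following NumPy sketch illustrates the k-nearest neighbour voting scheme described above with a Euclidean distance metric; the toy data and the choice of k are illustrative only.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by majority vote among its k nearest training examples
    under the Euclidean distance."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.1], [0.9, 1.0]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.95, 1.05]), k=3))   # -> 1
```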

2.3 Datasets

Publicly available datasets make it straightforward to evaluate and compare differing approaches. Datasets consist of domain-specific examples, which can be used for training and testing algorithms. Approaches are generally compared quantitatively, based on their respective recognition rates. However, it can sometimes be difficult to make true comparisons when different evaluation methodologies are used, or indeed when the procedure is unclear. Thus, it can be difficult to draw strong conclusions, and as such these comparisons can be somewhat misleading. To mitigate this, authors sometimes recreate the work of others in order to provide a more reliable comparison. Yet, this can be time consuming and is not without criticism, since the works that are chosen for comparison can be limited in number and scope. Moreover, due to the inherent nature of datasets, whilst examples can be of real-world situations, they are prepared in a way that cannot always be assumed when applying algorithms to practical scenarios.

Whilst there are problems with datasets, they currently provide the best mechanism for studying different methodologies, and, as the field of deep learning has grown in recent years, there has been an increase in the scale and complexity of those available. In Sections 2.3.1 and 2.3.2, respectively, the image and video datasets that are used during this research are discussed. There are a large number of image datasets, indicating the maturity of the research area. Whilst some are very limited in variation and captured under controlled conditions, such as MNIST (Section 2.3.1.1), others, such as ImageNet (Section 2.3.1.4), contain over 14 million examples. In contrast, video classification is still quite a new area and, as such, up until recently many of the datasets were captured in controlled environments, such as Weizmann (Section 2.3.2.1), and were limited in size. However, in the last few years, realistic larger datasets, which contain more natural sequences from YouTube and other sources, have been made available, such as UCF-50 (Section 2.3.2.3) and UCF-101 (Section 2.3.2.4). However, these datasets are still modest in size compared to ImageNet. More recently, Sports-1M (Section 2.3.2.5) was released, which contains over 1 million automatically labelled videos.

2.3.1 Image Datasets

2.3.1.1 MNIST

Formed from the larger NIST dataset, the MNIST (modified NIST) image dataset [112] is a collection of 70000 grayscale 28×28 images of the handwritten digits 0–9 (Fig. 2.2). The dataset is split into 60000 training images and 10000 test images. There is considerable variation in the shape of the digits; however, all have been size-normalised and centred in the images.

Figure 2.2: A selection of examples from the MNIST dataset

2.3.1.2 CIFAR-10

The CIFAR-10 image dataset [100] is a labelled collection of 60000 32×32 colour images taken from the 80 million tiny images dataset. The dataset is formed of ten classes: airplane; automobile; bird; cat; deer; dog; frog; horse; ship; and truck. Each class contains 6000 images. The dataset is divided into 50000 training images and 10000 test images. The training set consists of 5 batches of 10000 images each. Whilst the training and test sets contain an equal number of examples of each class, the five batches that make up the training set may contain more examples of one class than another. Whilst each image contains one notable instance of its respective class object, there are variations in viewpoint and scale.

2.3.1.3 CIFAR-100

The CIFAR-100 image dataset [100] is similar to the CIFAR-10 dataset (Section 2.3.1.2) and features the same total number of images, except that it has 100 classes containing only 600 images each. Each class features 500 training images and 100 test images. The 100 classes are further grouped into 20 coarse superclasses: aquatic animals; fish; flowers; food containers; fruit and vegetables; household electrical devices; household furniture; insects; large carnivores; large man-made outdoor things; large natural outdoor scenes; large omnivores and herbivores; medium-sized mammals; non-insect invertebrates; people; reptiles; small animals; trees; vehicles 1; and vehicles 2.

2.3.1.4 ImageNet

ImageNet [153] is currently the largest image dataset. It contains over 14 million labelled images. Each year the ImageNet project hosts an object detection and image classification challenge. The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) uses a subset of the main dataset to evaluate different algorithms.

Figure 2.3: Frames from example videos from each class of the Weizmann dataset.

2.3.2 Video Datasets

2.3.2.1 Weizmann

The Weizmann human action dataset [62] consists of nine agents performing ten actions (bend, jumping jack, jump forward, jump in place, run, gallop sideways, skip, walk, wave one-handed, wave two-handed) against a static background (Fig. 2.3). As well as providing the 90 original 180 × 144 video sequences, the authors also provide aligned and non-aligned extracted masks, which are obtained through background subtraction. All sequences differ in duration and the viewpoint is constant; however, there is variation in action performance and agent appearance. In addition, there are two robustness datasets, which are included for further evaluation.

2.3.2.2 UCF Sports

The UCF Sports Action dataset [150, 164] consists of 150 action sequences at a resolution of 720 × 480, collected from stock footage of broadcast television. The dataset contains ten actions (diving, golf swinging, kicking, weight lifting, riding horse, running, skateboarding, swinging-bench, swinging-side, walking) performed against natural, task-specific backgrounds (Fig. 2.4). The dataset contains variations in action performance, agent appearance, camera movement, viewpoint and illumination. Bounding box annotations of the agents are provided.

Figure 2.4: Frames from example videos from each class of the UCF Sports dataset.

2.3.2.3 UCF-50

UCF-50 is an action recognition dataset [149] featuring 50 categories of realistic videos taken from YouTube. The dataset is very challenging to accurately categorise, due to large variations in viewpoint, appearance and pose, scale, illumination, camera motion and background. For each category, videos are further grouped into 25 groups, each consisting of more than four action clips which share common features.

2.3.2.4 UCF-101

UCF-101 [165] is an extension of the UCF-50 (Section 2.3.2.3) action recognition dataset, incorporating a further 51 categories. As with UCF-50, this is a very challenging dataset, featuring much variation. The videos are further grouped into 25 groups, each consisting of more than four clips which share common features.

2.3.2.5 Sports-1M

The Sports-1M dataset [94] is one of the largest video datasets, containing over one million video URLs which have been automatically labelled into 487 sports categories using the YouTube Topics API. The dataset is very challenging due to the large number of classes and the wide variety of examples. In addition, due to the use of automated labelling, the labels are considered weak, as opposed to the strong labels present in manually annotated datasets.

Figure 2.5: Frames from example videos from each class of the UCF-50 dataset.

Figure 2.6: Frames from example videos from each class of the UCF-101 dataset.

Chapter 3

Deep Neural Networks

3.1 Introduction

Traditionally, in machine learning, features are extracted using often complex hand-crafted techniques which require a certain amount of prior problem-dependent knowledge. Instead of making assumptions about the problem, a better approach would be to learn the task-specific salient features in a more intuitive way, akin to biological neural networks. The algorithms underpinning deep learning have been around for a long time; however, for many years it was deemed too difficult to train deep architectures. The linear perceptron [151] from the 1950s, based on the McCulloch-Pitts neuron model [126], was the first model that could learn a decision function, given examples from two categories. Back-propagation [152] allowed multilayer models, such as the feedforward multi-layer perceptron (Section 3.1.1), to jointly learn both features and classifier. By placing hidden layers between the input and the output layer, the classifier could operate on features found in the hidden layers instead of the raw input. In addition, by training both the features and the classifier, the classifier could aid in the definition of the filters, leading to more discriminative features.

The introduction of non-linear activation functions enabled these multilayer networks to learn more complicated decision functions. Whilst there are other types of algorithms for training deep models, such as contrastive divergence [76], backpropagation is currently the dominant approach given its relative ease of training.

However, despite these advances, progress in neural networks and deep learning was held up principally for two main reasons: there was not enough data to train them, nor sufficiently advanced hardware to run them. Neural networks fell out of vogue until 2006, when a paper by Hinton [76] encouraged renewed interest in the field by demonstrating the importance of depth through the training of a deep belief network consisting of stacked restricted Boltzmann machines. The authors performed greedy layer-wise unsupervised pre-training of the network before performing fine-tuning with backpropagation. This pre-training was key to solving the vanishing gradient problem which prevented deep networks from training, as the error signal could not propagate far enough. With access to larger datasets and more powerful dedicated hardware, research in deep learning has exploded in recent years. The breakthrough in large scale image classification can be attributed to Krizhevsky et al. with the convolutional AlexNet [101] architecture, which achieved the best performance in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2012. The authors used a new form of neuron model called the rectified linear unit, which negates the vanishing gradient problem without the use of pre-training. This present day model [58] is a simplified version of a model inspired by brain function [52], which replaced more traditional probabilistic models such as the logistic function (Fig. 3.3).

Convolutional networks can be traced back to the work of Fukushima in the 1980s. His Neocognitron [53] was inspired by the mammalian visual system and featured multiple convolutional and pooling layers trained using a reinforcement scheme for the processing of images. By taking advantage of the natural local correlations present, convolution-based architectures are well suited to the classification of images. The first proper incarnation of the convolutional neural network was made by LeCun in 1989, in which he combined a multilayer convolutional network with backpropagation to classify handwritten digits. More recently, CNNs have been extended to three dimensions and are being applied to the problem of action classification from videos. Like images, videos have high local correlations, which makes the problem suitable for convolution-based methods.

This chapter gives an overview of the field of deep learning and pays particular attention to deep convolutional neural networks. Section 3.2 summarises some of the current deep learning based approaches, and Section 3.3 discusses CNNs in detail. In Sections 3.4 and 3.5 the results of preliminary experiments on images and videos are discussed.

3.1.1 Feedforward Networks

The feedforward neural network, or multilayer perceptron (MLP), is a universal function approximator [36] composed of three or more layers of neurons (Fig. 3.2). Each network features at least an input layer L1, a hidden layer L2 and an output layer L3; the input of which is typically a set of features. Deeper structures are possible with additional hidden layers. MLPs are fully connected networks and therefore each neuron in a given layer connects to every neuron in the following layer. Each neuron is based on Rosenblatt's perceptron [151] (Fig. 3.1), which, given an input x, outputs:

h(θ; x) = φ(θᵀx) = φ( ∑_{i=1}^{n} w_i x_i + b )   (3.1)

Figure 3.1: Perceptron model (adapted from Rosenblatt 1958 [151])

where φ is an activation function, w is the weight vector, b is the bias, and n is the number of input neurons. Let a(l) be the activations of layer l; then the activations for layer l + 1, with a(l) providing the input, are given as:

z(l+1) = (W(l+1))T a(l) + b(l+1) (3.2)

a(l+1) = φ(z(l+1)) (3.3)

Whilst a perceptron performs binary classification using a Heaviside step function, an MLP can perform either regression or classification depending on its activation function. Typically, for classification, a non-linear sigmoidal function is used, in the form of a logistic (equation 3.4) or hyperbolic tangent (tanh) (equation 3.5) function (Fig. 3.3).

Figure 3.2: Multilayer perceptron with a single hidden layer

When used for classification, the MLP typically learns a non-linear function to map the input to the desired output class. Unlike the perceptron, which famously could not learn the XOR function [129], MLPs, given their non-linear activation functions, can differentiate data that is not linearly separable.

φ(z) = logistic(z) = 1 / (1 + e^{−z})   (3.4)

φ(z) = tanh(z) = (e^{2z} − 1) / (e^{2z} + 1)   (3.5)
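To tie equations 3.1 to 3.5 together, a minimal NumPy sketch of a forward pass through a small MLP with logistic activations is given below; the layer sizes and random weights are illustrative assumptions.

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))          # equation 3.4

def mlp_forward(x, params):
    """Forward pass through a fully connected network (equations 3.2 and 3.3)."""
    a = x
    for W, b in params:
        z = W.T @ a + b                      # linear pre-activation
        a = logistic(z)                      # non-linear activation
    return a

rng = np.random.default_rng(0)
params = [(rng.normal(size=(4, 8)), np.zeros(8)),   # input layer -> hidden layer
          (rng.normal(size=(8, 3)), np.zeros(3))]   # hidden layer -> output layer
print(mlp_forward(rng.normal(size=4), params))      # three output activations in (0, 1)
```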

Training MLPs is typically performed with stochastic (mini-batch) gradient descent via supervised backpropagation (Section 3.3.3). MLPs require no a priori assumptions regarding the underlying distribution of the input data, and can be trained directly from the data.

Figure 3.3: Non-linear sigmoidal functions

Yet, MLPs can suffer from overfitting due to the full connections between layers [112].

3.2 Related Work

Deep architectures simply consist of multiple layers of feature detectors. Layers learn features which represent different levels of abstraction of the input data: lower layers learn simple features, whereas higher layers are responsible for more complex ones. This style of neural network first gained attention when Hinton et al. [76] proposed an efficient algorithm for unsupervised training of deep belief networks (DBNs) by greedily training each layer based on the previous layer's activations. DBNs are formed by stacking restricted Boltzmann machines (RBMs), which are generative models that encode statistical dependencies between two groups of units forming a bipartite graph. In the case of DBNs, each layer or group of units would be trained to capture the main variations of its input (the output from the previous layer). Once trained, a

DBN can undergo further gradient-based training. For example, Hinton and Salakhutdinov [77] used this technique to initialise a deep autoencoder, which was then further optimised using supervised criteria. The rationale being that unsupervised pre-training initialised the parameters close to a region of good local minima and therefore assisted optimisation [12]. The initialisation of deep networks using unsupervised techniques was also explored in several other works [12, 47, 148].

Interest in deep convolutional neural networks accelerated with the introduction of AlexNet by Krizhevsky et al. [101]. AlexNet achieved the winning top-5 error rate on the image classification task at ILSVRC 2012. The overall structure of the proposed architecture was similar to that of LeNet-5 introduced by LeCun et al. [112]; however, it featured more layers and demonstrated major advancements to training by employing non-saturating ReLU non-linearities for its neurons. AlexNet also employed the recently developed DropOut to aid regularisation of the network [101]. ZFNet [196] by Zeiler and Fergus used knowledge gained from a deconvolutional network in order to improve the architecture of AlexNet. The visualisation of features through deconvolution enabled certain hyperparameters to be investigated in greater detail, which in turn led to improvements in design, such as a reduced stride to reduce aliasing effects and smaller filters to prevent extremes in filter frequency. Sermanet et al. introduced an integrated framework for classification, localisation and detection named OverFeat [157]. A sliding window multi-scale approach was adopted for this framework, which incorporated an additional regressor for predicting bounding-box locations for localisation and detection. GoogLeNet [170] proposed the multi-scale inception module, which features parallel convolutions, using different receptive field sizes, and pooling. They also incorporated techniques borrowed from Network in Network (NIN) [116], such as 1 × 1 convolutions and global average pooling in place of fully connected layers, to reduce dimensions and enable deeper and wider architectures without increased computational overhead. Whereas AlexNet and ZFNet used relatively large receptive fields in the first convolution layer (11 × 11 and 7 × 7 respectively), VGGNet [162] proposed using 3 × 3 convolutions throughout the whole network. Through stacking convolutional layers without subsampling, they achieved larger effective receptive fields whilst introducing additional non-linearities to increase the discrimination of the decision function. The use of small filters also reduced parameters, allowing for deeper networks (up to 19 weight layers), similar to GoogLeNet. ResNet introduced skip connections to allow very deep networks of up to 152 layers [73]. By using skip connections, very deep networks do not suffer from the vanishing or exploding gradients which can present optimisation problems for more traditional plain networks, which, as such, do not display performance improvements beyond a certain depth. By incorporating shortcut connections the network effectively eases the learning process by approximating residual functions. DenseNet [82] advanced this idea further by connecting each layer to every other layer. Specifically, the feature maps of a given layer are used as inputs to all downstream layers. Feature maps are aggregated using depth-concatenation.
By providing dense connections between all layers, features become more diversified and both low and high level features affect the final decision boundary.

Regarding video classification, since CNNs have primarily been used for 2D image classification, some techniques have simply used 2D representations of actions as inputs. Valle and Starostenko [176] used this approach to classify between walking and running from images which were extracted from consecutive frames using background subtraction. Simonyan and Zisserman [161] simply average multiple frame-level predictions to get video-level predictions. Others attempt to aggregate features in a more sophisticated manner, utilising more of the information source. Chéron et al. realise a final video descriptor by aggregating deep representations over time [22]. They integrate the static frame descriptors by combining features from the last fully connected layer of a CNN, using four different min and max schemes.

For most of the state-of-the-art approaches to action recognition, the embedding of temporal information is key to their success. Indeed, when considered individually, temporal representations typically outperform their spatial counterparts [22, 161]. Many works also show the complementary nature of spatial and temporal information [22, 56, 161]. However, the best way of incorporating temporal information is still an open question.

To create more dedicated video models, some methods have added a third temporal dimension to create 3D CNNs, in order to implicitly capture the evolution of actions over multiple frames of input. 3D CNNs have been shown to be better suited to modelling temporal information, due to spatio-temporal convolution and pooling operators [173]. In [173], the temporal depth of 3D filters was extensively evaluated and compared to 2D filters. The authors found that 3D filters perform significantly better than 2D convolutions for video classification. Baccouche et al. [6] used inputs consisting of nine consecutive frames, which were extracted using an agent-centred bounding box. Apart from down-sampling the videos by a factor of two horizontally and vertically, the only pre-processing undertaken was to perform 3D local contrast normalisation on the extracted images. Many types of architectures that implicitly capture spatio-temporal features were investigated by Karpathy et al. [94] and compared against a single frame-based baseline. A model that employed slow fusion, where each ten-frame clip is separated and processed individually and slowly fused together (and which is the only network that uses 3D convolutions), proved to provide the best level of performance. Although, interestingly, the difference between the slow fusion model and the single frame model was minimal; potentially suggesting that 3D CNNs do not implicitly capture key motion information. However, [173] have more recently reported impressive results. Their network is first trained on Sports-1M and then used as a generic video feature extractor on a much smaller dataset. Activations from the first fully connected layer of multiple clips are averaged and then normalised to form a final descriptor, which is then used to train a multi-class SVM. Yet a deeper analysis of their results shows their model only improves upon the multi-stream approach of [161] when combined with state-of-the-art hand-crafted features.
In comparison to feed-forward networks for which each prediction is a new, Re- current Neural Networks (RNNs) exhibit memory which allow them to exploit the temporal evolution of data present in videos. RNNs contain feedback which allows a neuron in a hidden layer to receive both activation from another neuron in a lower layer and its own activation from the previous step. This internal state captures the dynam- ics of the input over time. RNNs have been successfully applied to many sequential problems [64, 65]. One of the most popular models of RNNs is the Long Short-Term Memory (LSTM) developed by Hochreiter and Schmidhuber [79] in which neurons are replaced with memory cells. The memory cells are able to remember inputs for longer than the activation of a simple artificial neuron, and thus it has been shown that LSTMs are more practical than traditional RNNs, markedly when they feature CHAPTER 3. DEEP NEURAL NETWORKS 62 multiple layers per time step [110]. Ng et al. [194] aggregate deep appearance and motion representations over time to form a global video descriptor for video classifi- cation. They utilise a multi-stream architecture and explicitly model the video as an ordered sequence of frames using LSTMs. However, the introduction of the LSTMs only slightly increases on the performance of the multi-stream method of [161] and the authors own baseline. Wang et al. [186] integrate LSTMs with implicitly learned spatiotemporal features from 3D CNNs, which demonstrated significant performance improvement against a 3D baseline. Liu et al. [120] propose a LSTM based model with attention mechanism, which can focus in on the most informative joints from a skeletal sequence. Whilst there was previously much attention on unsupervised pre-training tech- niques and their ability to generalise well from small datasets, these methods have generally been replaced by a process called transfer learning [142,172]. Transfer learn- ing leverages the knowledge learnt from large datasets and transfers this to a new task for which there may be limited data. The pre-trained features can be used as an ini- tialisation and fine-tuned for the new tasks [161] or the network can simply be used as a feature extractor for which only a new classifier will need to be trained [41, 158]. Transfer learning has demonstrated state-of-the-art performances through its ability to assist in the generalisation of image representations especially in the lower layers of CNNs [193]. Low-level features have been observed to be similar across tasks with more specialist features occurring in higher layers. This shows that generic low-level features can be easily transferred and the rest of the network may only require little ad- justment. In the spirit of this, some approaches only fine-tune the higher layers, leaving the lower layers fixed [94, 193]. However, the transferring of knowledge relies on the initial availability of large labelled datasets from which to learn transferable features, as well as the computation and time resources to train them. Whilst some approaches have transferred knowledge from similar tasks [173], others use networks pre-trained CHAPTER 3. DEEP NEURAL NETWORKS 63 on very large datasets, such as ImageNet, which has proven to provide features that can generalise well to many other domains [161, 193]. 
The resemblance of low level features to Gabor filters across different domains has led to much work on hybrid deep networks, which combine learned and hand-crafted features that make use of prior feature knowledge. Gabor filters have been widely used in filter design due to their useful decomposition properties. Yao et al. [191] used three different orientations of Gabor filters to produce a single descriptor, which is used as an input to a traditional CNN. Luan et al. [122] enhanced the learned convolutional filters through modulation with Gabor filters. Furthermore, Sarwar et al. [156] replaced certain trainable filters with fixed-weight Gabor filters in convolutional layers, which increased training efficiency whilst sacrificing a tolerable decrease in accuracy. They found that a network consisting of two layers, where the entire first layer featured Gabor filters and the second layer featured a mixture of Gabor and trainable filters, produced the best results in terms of accuracy and efficiency. Other hand-crafted features have also been combined with CNNs. Kim et al. [96] used background subtraction to first segment actions from a small sub-sequence of frames. The extracted images are then used to create view-invariant outer-boundary STVs, which are used as the input to a 3D CNN. Ji et al. [89] used hand-wired kernels to generate multiple channels of information from each video input. Channels which represented the grey-scale values, gradients along the horizontal and vertical axes, and optical flow information along the horizontal and vertical axes were used. Head-based agent detection was used to extract bounding boxes from the frames, which were used as input to the network. Whilst only a small number of stacked frames were input into the 3D CNN, auxiliary outputs composed of features which encoded the action information over a higher number of frames were used to regularise the network, by encouraging the 3D CNN to minimise the difference between the extracted features and the auxiliary outputs. The auxiliary outputs were formed using BOF and SIFT descriptors. Regularising the network demonstrated an improved performance over the un-regularised version [89].

Optical flow is often combined with deep networks for action recognition. However, it has also been shown that optical flow is only helpful on high quality datasets [194]. Datasets which feature videos in the wild create problems for optical flow, due to excessive noise caused by dynamic backgrounds and camera motion. Inspired by the what-where pathways, Simonyan and Zisserman [161] introduced separate 2D CNNs for spatial and temporal information. Current understanding of biological movement recognition is based on the two-streams hypothesis [59], which states that the brain uses both spatial and temporal information from the ventral and dorsal pathways. The ventral pathway (or what pathway) is responsive to shape, colour, and texture, whilst the dorsal pathway (or where pathway) is responsive to spatial transformations and movement. The aim of the network was to accumulate complementary information from both individual frames and the evolution of motion between frames. The spatial stream was trained on individual frames' RGB values, whilst the temporal stream was trained on multi-frame dense optical flow to explicitly represent motion. The two streams were then combined using late fusion.
Whilst both streams showed competitive performance individually, by combining both, accuracy improved significantly (6% over the temporal and 14% over the spatial). Chéron et al. [22] also used a multi-stream CNN to realise deep appearance and motion based features. Their results showed that an optical flow-based descriptor consistently outperforms an appearance-based one, and demonstrated that a combination of appearance and optical-flow based features improves performance over the individual cases. Gkioxari and Malik [56] used a similar approach for action localisation and classification. They also demonstrated that appearance and motion are complementary sources of information, and that using both leads to a significant improvement in performance.

Whilst deep learners are undoubtedly responsible for much progress in machine learning and computer vision in recent years, there are concerns regarding their inner workings. The inherent complexity of multiple non-linear layers makes them far from transparent [155], leading to their usage as a ‘black box’. This has therefore fostered new interest in the features they learn, and how similar features can be achieved using alternative, more widely understood methods. Wavelet scattering networks (ScatNet) [18] used wavelet operators as data-independent filters. In keeping with CNNs, convolution, non-linear and averaging operators were used in a cascaded architecture, which demonstrated state-of-the-art performance in handwritten digit and texture classification. PCANet [20] and DCTNet [134] used principal component analysis and discrete cosine transform derived filters, respectively, to process input images via convolution. The resulting activations were binarised and further processed using block-wise histograms, before classification by a linear SVM. Despite the simple architecture of these approaches, they achieved comparable performance to other, more complex methodologies in handwritten digit and object classification.

3.3 Convolutional Neural Networks

Neural networks, like the feedforward network, automatically learn features and, as such, it is possible that no pre-processing steps will be needed and that the network can be fed directly with raw input images. However, there are certain limitations with the standard fully connected feedforward model. To begin with, for high dimensional inputs (the curse of dimensionality), the number of learnable parameters would be extremely large, and therefore the learning would be computationally expensive and would require a larger dataset, due to the increased capacity [112], in order to prevent overfitting. In addition, feedforward networks do not acknowledge the topology of the input [112]. In certain circumstances this is not an issue; however, for inputs such as images and video, which have a definite topological structure, exploiting it is very beneficial. Furthermore, traditional feedforward networks have limited ability when it comes to providing invariance to scale, translation or distortion of the inputs [112]. Although this is possible in theory, in practice it would involve similarly weighted neurons in multiple spatial locations in order to detect the same features appearing in different locations, which again is impractical. By implementing shared weights, local connectivity and subsampling, CNNs address the aforementioned issues.

• Local connectivity: Each neuron in the network is only connected to a local receptive field on the previous layer. Therefore the connectivity between adjacent layers is spatially restricted. The extent to which they are restricted is called the receptive field of the neuron, which is equivalent to the filter size. These local filters make the features invariant to translations and minor deformations and reduce the number of parameters.

• Shared weights: Multiple distinct filters are mapped to different output feature maps, which thus represent the activations for a particular feature over the entire input; if a feature is useful in one region of an image it is likely to be useful in another [111]. This form of distributed representation [74], where each input is represented by multiple features and each feature is responsible for many inputs, allows the network to have greater expressive ability whilst simultaneously reducing parameter overhead.

• Spatial subsampling: The input is aggregated over overlapping or non-overlapping regions which provides better invariance to scale, translations and deformations in addition to further reducing the number of parameters.

When these attributes are combined in a deep architecture, rich global representations consisting of multiple layers of abstraction are formed, which are invariant to multiple transformations. In addition, this is achieved with a reduced number of parameters without affecting the network's computational power.

CNNs take their inspiration from the visual cortex; in particular, from experiments on cats and monkeys performed by Hubel and Wiesel [84, 85], who showed that the striate cortex (V1) of a monkey contains two types of cells: simple and complex. In addition, they found that the system is hierarchical, with geniculate cells converging on simple cells, simple cells converging on complex cells and complex cells converging on hypercomplex cells. In simple cells the receptive field is split into spatially distinct subregions separated by parallel straight lines. One of these inhibits the response whilst the other excites the response. Therefore, when a stimulus of the right size and shape is orientated correctly (above the excitatory sub-region), the response is optimal. Simple cells respond maximally to a slit, dark bar or edge within their receptive field. In contrast to simple cells, complex cells have no separation in their receptive field. Therefore, whilst they still respond maximally to stimuli of a certain orientation, they do this over the entirety of their receptive field, giving a certain spatial invariance. Fig. 3.4 and Fig. 3.5 show the response of simple and complex cells to various stimulus geometries, respectively. The line at the bottom of both figures indicates when the stimulus (white bar) is turned on and off. As can be seen in Fig. 3.4, when a stimulus of optimum size and shape is orientated above the excitatory region (first stimulus geometry) the response is optimal, whereas, in Fig. 3.5, the optimal response is achieved when a stimulus of the correct orientation is placed anywhere above the receptive field (first three stimulus geometries).

3.3.1 Layers and Architectures

3.3.1.1 Convolution

Equivalent to simple cells, convolution layers extract local features via the convolution of an input with a learned filter kernel. The input can be the original input images or the previous layer's feature maps. Each output feature map can incorporate convolutions with multiple input feature maps. The number of output feature maps produced is chosen arbitrarily, although generally the number increases with each layer in the overall architecture. The general form is defined in equation 3.6.

Figure 3.4: Simple cells (adapted from Hubel 1995 [83])

Figure 3.5: Complex cells (adapted from Hubel 1995 [83])

a_j^l = φ( b_j^l + ∑_{i=1}^{n} a_i^{l−1} ∗ k_ij^l )   (3.6)

where a_j^l represents the jth output feature map in layer l, a_i^{l−1} is the ith output feature map from layer l − 1 and also the ith input feature map to layer l, k_ij^l is the filter kernel connecting the respective input and output feature maps, b_j^l is the bias for the jth output feature map, n is the number of feature maps from layer l − 1, and ∗ represents the convolution operator. The non-linear activation function φ was traditionally sigmoidal, either a logistic function (equation 3.4) or a hyperbolic tangent (equation 3.5); however, the rectified linear unit (ReLU) (equation 3.7) and its variants are now standard (Fig. 3.6). Traditional non-linear sigmoidal activation functions are bounded by maximum and minimum values. Therefore, a neuron using a logistic activation function, whilst good for representing probability, will be in saturation when nearing a value of zero or one. Once saturated, the back-propagated error signal is diminished, since it is proportional to the gradient of the current layer's activation function, which approximates to zero at saturation. Therefore, as the size of the network increases, the back-propagated error signal approaches zero, making it harder to train the lower layers. The ReLU is a piece-wise linear function that can represent any non-negative real number and is therefore only bounded by its minimum value. Gradients for the ReLU can only take two values: 0 if z < 0 and 1 if z > 0. Therefore the gradient will not vanish as z is increased. Furthermore, since it has a true zero activation value, the ReLU can provide sparsity in the network [178]. Moreover, rectified linear units have been shown to speed up training. Krizhevsky et al. [101] demonstrated a six-fold decrease in the time taken for training a four layer CNN, using rectified linear units to achieve the same error rate as an equivalent network using tanh activation functions. A similar activation function was proposed by Jarrett [88].

Figure 3.6: Non-linear activation functions

φ(z) = rect(z) = max(0,z) (3.7)
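The sketch below is a naive NumPy illustration of equations 3.6 and 3.7: each output feature map is the ReLU of a bias plus the sum of 2D convolutions over all input maps. As is common in CNN implementations, cross-correlation (no kernel flipping) is used, and the map counts and kernel size are illustrative assumptions.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)                          # equation 3.7

def conv2d_valid(a, k):
    """'Valid' 2D cross-correlation of one feature map with one kernel."""
    H, W = a.shape
    kh, kw = k.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.sum(a[y:y + kh, x:x + kw] * k)
    return out

def conv_layer(inputs, kernels, biases):
    """Equation 3.6: each output map sums convolutions over all input maps."""
    return [relu(b + sum(conv2d_valid(a, k) for a, k in zip(inputs, ks)))
            for ks, b in zip(kernels, biases)]

rng = np.random.default_rng(0)
inputs = [rng.normal(size=(8, 8)) for _ in range(3)]                        # 3 input maps
kernels = [[rng.normal(size=(3, 3)) for _ in range(3)] for _ in range(4)]   # 4 output maps
out = conv_layer(inputs, kernels, biases=np.zeros(4))
print(len(out), out[0].shape)                                               # 4 maps of 6x6
```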

3.3.1.2 Pooling

Equivalent to complex cells, pooling or subsampling layers combine the activations of local receptive fields into a single activation, thus increasing the feature maps' invariance to translations and distortions. Once the features are located, their exact spatial position becomes less important than their relative position [112]. In fact, the exact location could potentially be harmful to classification, due to the variation of inputs for a particular class. Inherent in this process is a reduction in the resolution of the feature maps, which reduces computational complexity and helps to prevent overfitting. Unlike convolution layers, for n input feature maps there will be exactly n output feature maps. The general form of sub-sampling is described in equation 3.8.

a_j^l = φ( β_j^l down(a_j^{l−1}) + b_j^l )   (3.8)

where β_j^l and b_j^l are the multiplicative and additive biases for the jth output feature map in layer l and down is the pooling function.

There are many different types of pooling function. Typical examples include mean pooling and max pooling. Mean pooling simply outputs the average over each local receptive field, whilst max pooling selects the maximum value in each. Max pooling is often used in state-of-the-art networks [101, 162]. Recently, other types of pooling, such as stochastic pooling, have been proposed, which demonstrate improved performance [195].
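As a small illustration of the pooling operation, the following NumPy sketch performs non-overlapping max pooling over 2 × 2 regions; the pooling size and input map are illustrative.

```python
import numpy as np

def max_pool(a, size=2):
    """Non-overlapping max pooling: keep the largest activation in each
    size x size region, halving the spatial resolution for size = 2."""
    H, W = a.shape
    H, W = H - H % size, W - W % size                 # crop so the map divides evenly
    a = a[:H, :W].reshape(H // size, size, W // size, size)
    return a.max(axis=(1, 3))

fmap = np.arange(36, dtype=float).reshape(6, 6)
print(max_pool(fmap))                                 # 3x3 map of local maxima
```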

3.3.1.3 Fully Connected and Output

Fully connected layers are the same as hidden layers from traditional feedforward networks, where, as the name suggests, neurons are connected to all activations in the previous layer. As with convolutional layers, an activation function is applied to the output of each neuron. Once connected to the output layer, which is generally a softmax layer (Section 2.2.1), these form a standard feedforward classifier.

3.3.1.4 Architecture

CNNs are usually comprised of multiple alternating convolution and pooling layers. As more layers are added, the feature abstraction increases, automatically creating more complex, higher-level representations of the input. The low layers contain generic features, whilst features which are more related to the input data can be found in the higher layers [94]. In general, the number of feature maps increases as their spatial resolution decreases with each layer. Other considerations which affect the output size are stride and zero-padding. During convolution and pooling, the stride determines the distance between each operation. For example, for a stride of one, after each inner product of the convolution of a filter and an input, the filter is moved a single pixel. Pooling is generally non-overlapping and therefore the stride is often equal to the pooling size. In certain circumstances it can be convenient to pad the input with zeros. This allows the preservation of spatial size during convolution operations and also ensures that salient features that occupy edge regions are not discarded. The convolution and pooling layers are then generally followed by fully connected layers and an output layer. Fig. 3.7 shows the LeNet-5 CNN architecture that was originally used by LeCun [111] for handwritten digit recognition. The basic algorithm for CNNs is as follows:

1. Convolve different filter kernels with the input to form output feature maps
2. Apply pooling to each output feature map to form sub-sampled output feature maps
3. Repeat 1 and 2 to achieve high-level features
4. Use the resultant features as input to a classifier

3.3.2 Other Layers

Over recent years there have been many additions to the basic layers, as described in Sections 3.3.2.1, 3.3.2.2 and 3.3.2.3.

3.3.2.1 Batch Normalisation

Batch normalisation [86] normalises the input data distribution so that it has a fixed mean and variance. Each activation of an input mini-batch, X = [x1, x2, ..., xn], is normalised to zero mean and unit variance and then linearly transformed using:

Figure 3.7: LeNet-5 architecture (adapted from LeCun et al. 1998 [112]).

x′ = ((x − µ) / √(σ² + ε)) γ + E   (3.9)

where µ and σ² are the mean and variance of the current mini-batch X and ε is a small number to prevent numerical instability. Both µ and σ² are calculated across the batch dimension for each activation. Optimised during training, γ and E provide scale and shift respectively and prevent the normalisation from constraining the representational capacity of the layer (by setting γ = √(σ² + ε) and E = µ the original activations can be recovered). At test time the average mean and variance, calculated as exponential moving averages during training, are used. Batch normalisation layers are generally inserted between a linear transform and its respective non-linearity and can improve performance and stability, thus allowing for faster training times and the use of larger learning rates. Batch normalisation layers can also improve a network's ability to generalise to new data distributions.
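A minimal NumPy sketch of the training-time transformation in equation 3.9 is shown below; the scale and shift parameters are named gamma and shift (the latter standing in for E), and the randomly generated mini-batch is purely illustrative.

```python
import numpy as np

def batch_norm_train(X, gamma, shift, eps=1e-5):
    """Equation 3.9 at training time: normalise each activation over the
    mini-batch dimension, then apply the learned scale and shift."""
    mu = X.mean(axis=0)
    var = X.var(axis=0)
    X_hat = (X - mu) / np.sqrt(var + eps)
    return gamma * X_hat + shift, mu, var   # mu and var also feed the running averages

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=3.0, size=(32, 10))      # mini-batch of 32, 10 activations
out, mu, var = batch_norm_train(X, gamma=np.ones(10), shift=np.zeros(10))
print(out.mean(axis=0).round(3), out.std(axis=0).round(3))   # approximately 0 and 1
```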

3.3.2.2 DropOut

DropOut [167] is a regularisation technique that can mitigate the overfitting problem by injecting noise into the network structure. Effectively, it artificially creates new models by imposing restrictions on certain neurons within the existing architecture, but only increases training time by about a factor of two [78]. It achieves this by randomly removing or dropping neurons from the network. Specifically, given a constant probability p, neurons are temporarily removed from the network during training. Therefore, since these neurons do not contribute to forward or back propagation, the architecture of the model has effectively changed. This prevents neurons from relying on the output of others or co-adapting, forcing the network to learn more robust features which are less reactive to noise. In order to prevent scaling issues, the weights are scaled by 1/p. Previously, connection strategies that employed subsets were generally used to decrease the number of parameters and break up the symmetry of the network, thus forcing it to extract different features [112]. In addition to DropOut there are now other similar layers, such as DropConnect, which works by removing the weights or edges between neurons instead of the neurons themselves, such that the neurons can remain partially active [181].
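The sketch below illustrates the widely used “inverted” formulation of DropOut, in which p denotes the keep probability and the surviving activations are rescaled by 1/p during training so that no rescaling is required at test time; this is one common way of handling the scaling issue mentioned above, not necessarily the exact variant used elsewhere in this thesis.

```python
import numpy as np

def dropout(a, p=0.5, training=True, rng=None):
    """Inverted DropOut: keep each activation with probability p and rescale
    the survivors by 1/p during training; do nothing at test time."""
    if not training:
        return a
    rng = rng or np.random.default_rng()
    mask = rng.random(a.shape) < p
    return a * mask / p

activations = np.ones((4, 6))
print(dropout(activations, p=0.5))   # roughly half the units zeroed, survivors doubled
```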

3.3.2.3 1 × 1 Convolutions or Network in Network

1×1 layers are a special case of convolutional layer that effectively allow the depth, or number of channels, of an input volume to be altered whilst keeping the width and height constant.

For example, a given input of size h × w × d_a could be reduced in size to h × w × d_b by a linear projection via a 1 × 1 × d_a × d_b convolutional layer. This is effectively a full connection via shared weights that results in the parametric pooling of channels or feature maps via a weighted summation at each spatial location, which enables the learning of cross-channel interactions [116]. 1 × 1 convolutional layers are often used

for dimensionality reduction, where d_b < d_a, which can also help to reduce overfitting. However, they can also be used to increase non-linearity, by introducing a ReLU whilst keeping the input and output dimensions the same. This technique was originally introduced in Network in Network [116] to enhance local modelling and discriminability; however, its dimensionality reduction ability has since been adopted by many others, allowing for considerably deeper architectures, notably [73, 82, 170].
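The following NumPy sketch shows the channel-mixing view of a 1 × 1 convolution: at every spatial location the d_a input channels are linearly projected to d_b output channels; the volume sizes are illustrative.

```python
import numpy as np

def conv_1x1(volume, weights):
    """1x1 convolution: a weighted sum across channels at every spatial
    location, changing the depth from d_a to d_b."""
    # volume: (h, w, d_a); weights: (d_a, d_b)
    return np.tensordot(volume, weights, axes=([2], [0]))   # -> (h, w, d_b)

rng = np.random.default_rng(0)
volume = rng.normal(size=(16, 16, 64))                 # h x w x d_a input
weights = rng.normal(size=(64, 8))                     # project 64 channels down to 8
reduced = np.maximum(0.0, conv_1x1(volume, weights))   # optional ReLU non-linearity
print(reduced.shape)                                   # (16, 16, 8)
```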

3.3.3 Optimisation and the Backpropagation Algorithm

Optimisation of the CNN is usually performed by stochastic gradient descent. Whilst more advanced (second order) methods could be used, this method generally finds a satisfactory minimum. Whilst the objective is to find the global minimum of the error function, given the dimensionality of the space, a local minimum is often adequate. Choromanska et al. [23] demonstrated that in larger networks local minima are concentrated and their respective test errors are similar. In addition, it was found that a global minimum could lead to overfitting. The CNN is first trained on a training set of known example input-output pairs, before validation on an unseen test set is performed. The predicted output of the CNN is the class j with maximal probability under the categorical distribution given by the output layer, which is defined as:

y_pred = argmax_j p(y = j | x; θ)   (3.10)

Naturally, the objective of the CNN is to reduce the number of errors between the predicted labels and the ground truth (true labels). The zero-one loss calculates these errors as:

loss_{0,1} = ∑_{j=0}^{n−1} Φ_{h(x_j) ≠ y_j}   (3.11)

where n is the number of examples to be tested, x_j and y_j are the input and the ground truth of the jth example respectively, h(x) is the predicted output y as given in equation

3.10, and Φ is the indicator function, defined as:

Φ_i = { 1 if i is true; 0 otherwise }   (3.12)

Since the zero-one loss is not differentiable, it would be intractable to optimise for models which contain many parameters; therefore, the maximum likelihood estimator is used instead (Section 2.2.1). For mathematical convenience, the negative logarithm of the maximum likelihood estimator is used, which is known as the negative log likelihood (NLL) or, for multi-class problems, the cross-entropy loss. The objective of training the model is to minimise the NLL of the predictions. If p(y = j|x;θ) is the softmax probability of label j, given example x, then the loss over a batch of training examples of size n is defined by the equation:

NLL(θ; X) = −(1/n) ∑_{j=0}^{n−1} log p(y = y_j | x_j; θ)   (3.13)
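As a small worked illustration of equation 3.13, the snippet below computes the mean negative log-likelihood of the true labels from a batch of softmax outputs; the probabilities and labels are made up for the example.

```python
import numpy as np

def nll_loss(probs, labels):
    """Equation 3.13: mean negative log-likelihood of the true labels,
    given softmax probabilities for a mini-batch."""
    n = probs.shape[0]
    return -np.mean(np.log(probs[np.arange(n), labels] + 1e-12))

probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.3, 0.3, 0.4]])
labels = np.array([0, 1, 2])
print(nll_loss(probs, labels))   # low, since each true class has the highest probability
```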

To achieve this objective, the CNN's parameters θ are updated using the following equation:

∆θ_iter = −LR (∂E / ∂θ_iter)   (3.14)

where LR is the learning rate, iter is the iteration index, and E is the mean of the cost function for a single mini-batch. The backpropagation algorithm is used to find the partial derivatives of the cost function with respect to the parameters. The backpropagation algorithm performs the following steps:

1. For a given input, set the input activation a^{(0)} to the input itself and perform a forward

pass for each layer l = 1,2,3,...,L

z^{(l)} = W^{(l)} a^{(l-1)} + b^{(l)}, \quad a^{(l)} = φ(z^{(l)})    (3.15)

z^{(l)} = W^{(l)} φ(z^{(l-1)}) + b^{(l)}    (3.16)

2. Calculate the output error δ^{(L)} for the cost function E(θ).

3. Backpropagate the errors from the output layer back to the inputs. To this end,

derive an equation for the error δ^{(l)} in terms of the error δ^{(l+1)} using the chain rule:

δ^{(l)} = \frac{∂E(θ)}{∂z^{(l)}} = \frac{∂E}{∂z^{(l+1)}} \frac{∂z^{(l+1)}}{∂z^{(l)}} = δ^{(l+1)} \frac{∂z^{(l+1)}}{∂z^{(l)}}    (3.17)

δ^{(l)} = \left( W^{(l+1)T} δ^{(l+1)} \right) ⊙ φ'(z^{(l)})    (3.18)

4. Finally compute the gradients for each layer with respect to the parameters:

\frac{∂E(θ)}{∂W^{(l)}} = δ^{(l)} a^{(l-1)T}    (3.19)

\frac{∂E(θ)}{∂b^{(l)}} = δ^{(l)}    (3.20)

where ⊙ indicates the element-wise product and W and b are the weights and biases respectively. Stochastic gradient descent can be extended by the implementation of a concept called momentum. Momentum can help prevent a network getting stuck in local minima and generally leads to better convergence and reduced oscillation. It achieves this by making large steps in the direction of the velocity, which accumulates in any direction that has a consistent gradient. Stochastic gradient descent with momentum uses the following equations:

v_{iter} = ϖ \, v_{iter-1} - LR \, ∇_θ E(θ_{iter})    (3.21)

Δθ_{iter} = v_{iter}    (3.22)

where ϖ is the momentum coefficient (typically 0.9) and v is the velocity. In addition, there is also Nesterov [10, 169] momentum, which, unlike classical momentum, first makes a step in the direction of the previous velocity and then makes a correction based on the evaluation of the gradient at the new location. This foresight prevents it from making too much progress in an unfavourable direction and, in practice, makes it work better than classical momentum. Nesterov momentum is performed using the following equations:

v_{iter} = ϖ \, v_{iter-1} - LR \, ∇_θ E(θ_{iter} + ϖ \, v_{iter-1})    (3.23)

Δθ_{iter} = v_{iter}    (3.24)
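A minimal NumPy sketch of the update rules in equations 3.21-3.24 is given below; the toy cost function, parameter vector and hyper-parameter values are illustrative assumptions rather than settings used in any experiment here.

```python
import numpy as np

def grad(theta):
    # Hypothetical gradient of the cost E at theta (here: E = ||theta||^2 / 2).
    return theta

theta = np.array([1.0, -2.0, 0.5])
v = np.zeros_like(theta)
LR, momentum = 0.01, 0.9          # learning rate and momentum coefficient

for it in range(100):
    # Classical momentum (equations 3.21 and 3.22):
    #   v = momentum * v - LR * grad(theta); theta += v
    # Nesterov momentum (equations 3.23 and 3.24) evaluates the gradient
    # at the "look-ahead" point theta + momentum * v instead:
    v = momentum * v - LR * grad(theta + momentum * v)
    theta = theta + v

print(theta)  # converges towards the minimum of E
```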

Decaying the learning rate during training generally yields a better solution. However, finding the optimal rate of annealing can be difficult. Decay the learning rate too quickly and the solution is more likely to get stuck in local minima; conversely, decay it too slowly and unnecessary training iterations waste computation. Many different annealing strategies are employed, such as stepped and exponential decay. Care must be taken when setting the initial learning rate, as training is not guaranteed to make progress on the loss function if it is set too high. In contrast, momentum can be increased in the later stages of training.

In order to regularise the weight update, weight decay can be used. This penalises the magnitude of the parameters by introducing a second term into the objective function to prevent overfitting. Generally there are two types of weight decay: L1 and L2. L2 regularisation effectively prevents large values in the parameter space, instead preferring a more even distribution. This prevents the network relying on a few strong features and encourages it to learn features that generalise better. L2 regularisation uses the squared L2 norm and the loss becomes NLL + λ \sum_i θ_i^2, where λ is the regularisation strength. L1 regularisation, on the other hand, has the effect of sparsifying the weight vectors, which helps when the inputs are “noisy”. For L1 regularisation the loss becomes NLL + λ \sum_i |θ_i|. In practice, L2 generally leads to better performance. Lastly, it is important to note that there are other more advanced optimisation algorithms such as Adagrad [44] and Adam [97].
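As a simple illustration of the two weight decay variants, the following sketch adds an L1 or L2 penalty to a data loss; the function, parameter values and regularisation strength shown are hypothetical, for illustration only.

```python
import numpy as np

def regularised_loss(nll, theta, lam, kind="L2"):
    """Add a weight-decay penalty to the data loss (NLL).

    nll   : scalar negative log likelihood over a mini-batch
    theta : flattened parameter vector
    lam   : regularisation strength lambda
    """
    if kind == "L2":
        penalty = lam * np.sum(theta ** 2)      # discourages large weights
    else:  # "L1"
        penalty = lam * np.sum(np.abs(theta))   # encourages sparse weights
    return nll + penalty

theta = np.array([0.5, -1.2, 3.0])
print(regularised_loss(2.3, theta, lam=1e-4, kind="L2"))
print(regularised_loss(2.3, theta, lam=1e-4, kind="L1"))
```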

3.3.4 Pre-processing

Whilst CNNs have the ability to learn from raw inputs, in practice, methods generally perform better when combined with pre-processing techniques. This section will briefly summarise some of the most popular techniques. Data normalisation is perhaps the most commonly used form of pre-processing. There are many different types depending on the application:

1. Rescaling simply scales the data so that it lies within a particular range, usually either [0,1] or [−1,1]. This can be achieved using the following:

x' = \frac{(b - a)(x - \min(x))}{\max(x) - \min(x)} + a    (3.25)

where x is the original value, x' is the rescaled value, and a and b are the target range [a,b]. Another option is to rescale data so that it has unit length, by dividing the vector by its Euclidean or L2 norm ‖x‖. However, in some applications it may be preferable to use the L1 norm.

2. Feature standardisation independently normalises each feature of the data to have

zero mean and unit variance. This is achieved by using the following:

x'_i = \frac{x_i - \bar{x}}{ε}    (3.26)

where \bar{x} = \frac{1}{n} \sum_{i=0}^{n-1} x_i and ε = \sqrt{\frac{1}{n} \sum_{i=0}^{n-1} (x_i - \bar{x})^2} are the mean and standard deviation of each feature.

3. Per-example standardisation uses the same equation as feature standardisation,

however, x̄ and ε are the mean and standard deviation of each example. Thus each example of the data is normalised separately. More often it is performed by only subtracting the mean, although examples can also be divided by the standard deviation. Removing the mean has the effect of removing the average intensity or brightness of each example. A short code sketch of these normalisations is given below.
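The following NumPy sketch illustrates the three normalisation schemes above; the array shape and the small epsilon added for numerical stability are illustrative assumptions.

```python
import numpy as np

X = np.random.rand(100, 32 * 32) * 255.0   # assumed: 100 examples, flattened images

# 1. Rescaling to a target range [a, b] (equation 3.25).
a, b = 0.0, 1.0
X_rescaled = (b - a) * (X - X.min()) / (X.max() - X.min()) + a

# 2. Feature standardisation: zero mean, unit variance per feature (column).
X_feat = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-8)

# 3. Per-example standardisation: each row normalised by its own statistics.
X_ex = (X - X.mean(axis=1, keepdims=True)) / (X.std(axis=1, keepdims=True) + 1e-8)
```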

Whitening is a natural step to follow normalisation. Due to the highly correlated nature of images and videos, much of the raw input is redundant. Whitening removes this redundancy by linearly transforming the input data using PCA (Section 2.1.2.1), so that its features have a covariance equal to the identity matrix (i.e. uncorrelated) and each feature has equal variance. Therefore, for an input vector x:

x_{white} = W \, \mathrm{diag}\!\left( \frac{1}{\sqrt{\mathrm{diag}(S) + ε}} \right) W^T x    (3.27)

where the columns of W are the eigenvectors, S holds the eigenvalues, and ε is a small constant to prevent numerical instability.

CNNs are complex models which require sufficiently large datasets to train. One method, which artificially increases the size of a dataset, is called augmentation. In general, augmentation is the application of noise to a training set in order to increase its variability. In practice, this is often achieved by applying affine combinations of transformations (translation, reflection, scaling, rotation and shear) and elastic distortions [106]. Krizhevsky et al. [101] create additional training images by randomly cropping patches and applying horizontal flipping. These simple transformations have also been successfully applied to action recognition [6, 94, 161]. Other techniques directly alter the intensity values of pixels. For example, Krizhevsky et al. [101] altered RGB values of training images by adding multiples of the principal components. In addition, standard CNNs require a fixed input and therefore inputs usually need to be resized, although there are adaptations that can allow for variable resolution inputs, such as spatial pyramid pooling [72].
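Relating back to equation 3.27, the following is a minimal NumPy sketch of whitening; the data matrix and epsilon value are assumptions for illustration, and this is not the pre-processing code used in the experiments.

```python
import numpy as np

X = np.random.rand(1000, 64)                 # assumed: 1000 examples, 64 features
X = X - X.mean(axis=0)                       # centre the data first

cov = X.T @ X / X.shape[0]                   # feature covariance matrix
S, W = np.linalg.eigh(cov)                   # eigenvalues S, eigenvectors W (columns)

eps = 1e-5                                   # small constant for numerical stability
whitener = W @ np.diag(1.0 / np.sqrt(S + eps)) @ W.T   # equation 3.27

X_white = X @ whitener.T                     # whitened data: ~identity covariance
print(np.round(np.cov(X_white, rowvar=False)[:3, :3], 2))
```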

3.3.5 Issues

Whilst highly competitive given a large dataset, CNNs, like other neural networks, suffer performance problems when the amount and quality of training data is insufficient, and unfortunately, as previously mentioned, acquiring a reliable dataset can be laborious and time-consuming [180]. In addition, due to the capacity of the architectures used, limited datasets can cause overfitting [101].

Many strategies to overcome the problems associated with limited datasets have been attempted. Some of the most common strategies employed are DropOut and augmentation, as described in Sections 3.3.2.2 and 3.3.4, respectively. Other techniques provide an optimised starting point, which is beneficial to convergence. Whilst random weight initialisation works well when the datasets are of sufficient size [180], previous work on object classification has demonstrated that transfer learning and unsupervised pre-training have also helped to improve performance. Transfer learning involves the use of previously obtained knowledge. For example, Simonyan and Zisserman [161] successfully used the ImageNet (Section 2.3.1.4) database to pre-train the spatial stream of their action recognition model. Karpathy et al. [94] showed that by fine-tuning the last few layers of a CNN, pre-trained on a related dataset, it is possible to achieve better performance (65.4%, increased from 41.3%) than when trained from scratch on the new dataset. Instead of using knowledge gained from domain-related datasets, it is also possible to use more generic knowledge such as edges and colour blobs, which are common visual patterns. Transferring knowledge from unrelated datasets was demonstrated by [180], although it was also found that the direction of transfer learning can prove significant. In addition, the use of transfer learning can constrain the target architecture. The aim of unsupervised pre-training is to achieve an internal representation of the input that can reproduce the original data without significant loss. It was demonstrated that using unsupervised pre-training for weight initialisation, instead of assigning weights randomly, improved the test error by 41% on average when using only 3% of the MNIST dataset [180]. However, unsupervised pre-training has fallen out of favour in recent years, with transfer learning gaining more traction in the literature.

Whilst many methods have been proposed to increase the performance of CNNs and prevent overfitting, it is still acknowledged that the main increases in the performance of CNNs are attributed to increases in complexity, which therefore require more labelled data for optimisation. However, given the problems acquiring labelled data, this is not a sustainable pursuit.

3.4 2D Convolutional Neural Networks Case Studies

This section details the implementation and outcome of experiments performed using 2D CNNs on two separate datasets. It was thought prudent to first perform some simple initial experiments to confirm the validity of CNNs and the author’s implementation. Section 3.4.1 details an experiment on 2D image classification using the MNIST dataset. In Section 3.4.2, an experiment on action classification using frame-wise inputs is performed. Information about the datasets used can be found in Section 2.3.1. All experiments were carried out using Theano and an NVIDIA GeForce GT620 GPU.

3.4.1 Image Classification on the MNIST Dataset

The first experiment was conducted on the MNIST image dataset. Since this is an image dataset, the classifier was simply trained on the raw pixel values of each image. The only pre-processing undertaken on the images was to scale them to the range [0,1]. The 60000 training examples from the original MNIST dataset are split, so that 10000 are used as a validation set. By monitoring the performance of the model on the validation set, it is possible to select hyper-parameters which prevent overfitting. Therefore, the final dataset consisted of 50000, 10000 and 10000 images in the training, validation and test sets, respectively.

3.4.1.1 Network Architecture

The network architecture used in these experiments contains six layers, excluding the input, which roughly corresponds to the architecture used by LeCun [112]. The first four layers were two alternating convolution and pooling layers, which were followed by one fully connected layer, which was then connected to a softmax output layer. The hyperbolic tanh function was used as the activation function. The network is described as 1 × 28 × 28 − 20C5 − MP2 − 50C5 − MP2 − 500FC − 10Soft, where C, MP, FC and Soft represent convolutional, max-pooling, fully connected and softmax layers, respectively. This shorthand represents a network architecture with an input image whose spatial dimensions are equal to 28 pixels with a single channel, two convolutional layers featuring 20 and 50 feature maps using 5 × 5 filters, two max-pooling layers of size 2 × 2, a single fully connected hidden layer of 500 neurons and a

Softmax output layer of 10 neurons. The following describes each layer in more detail where the layers are labelled Cl, Pl and Fl, where l represents the layer index:

Layer C1 was composed of 20 feature maps (C1.1,C1.2,C1.3,...,C1.20) of size 24 × 24. Each feature map contains 576 nodes. Each node in each feature map was con- nected to an over-lapping 5×5 neighbourhood in the input plane. Since the connection weights between the 5×5 neighbourhood and each feature map node were constrained to be equal, there were 25 trainable weights per feature map and therefore 500 trainable parameters in total.

Layer P1 was composed of 20 feature maps of size 12 × 12. Each feature map contained 144 nodes. Each node took its input from a non-overlapping 2 × 2 neigh- bourhood in the corresponding C1 feature map and performed a max-pooling operation.

P1 had 20 trainable parameters.

Layer C2 was composed of 50 feature maps of size 8 × 8. Each feature map contained 64 nodes. This performed a similar function to that of C1; however, each node in each feature map was connected to overlapping 5 × 5 neighbourhoods in identical positions in all of the P1 maps. Once again, the connection weights for each feature map were constrained to be equal; however, since every C2 map was formed from convolutions between each P1 map and a distinct filter kernel, there must be separate connecting weights for each one. Thus, C2 had 25000 trainable parameters.

Layer P2 performed the same role as P1 and is composed of 50 feature maps of size

4 × 4. P2 had 50 trainable parameters.

Layer F3 was a standard hidden layer composed of 500 hidden nodes and was fully connected to P2. F3 had 400500 trainable parameters.

Finally, the output layer S4 was a softmax layer composed of 10 output nodes; one for each class. The softmax layer had 5010 trainable parameters. In total, there were 431080 trainable parameters.
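As a quick sanity check, the layer-wise counts quoted above can be tallied with a few lines of Python; this simply reproduces the totals stated in the text and is not part of the original implementation.

```python
# Trainable parameter counts per layer, as quoted in the text above.
params = {
    "C1": 20 * 5 * 5,               # 20 shared 5x5 kernels               = 500
    "P1": 20,                       # one trainable coefficient per map   = 20
    "C2": 50 * 20 * 5 * 5,          # 50 maps, each connected to 20 maps  = 25000
    "P2": 50,                       # one trainable coefficient per map   = 50
    "F3": 50 * 4 * 4 * 500 + 500,   # fully connected layer + biases      = 400500
    "S4": 500 * 10 + 10,            # softmax layer + biases              = 5010
}
print(sum(params.values()))         # 431080, matching the stated total
```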

3.4.1.2 Training

The model was trained using mini-batch stochastic gradient descent, with a constant learning rate of 0.1 and a batch size of 500. For each epoch, the model was first trained using the training set, then tested on the validation set to calculate the zero-one loss. The model was only tested on the test set if the validation error for the current epoch was the lowest error of all previous epochs. At this point, the model parameters were also saved as the best model. Therefore, the best model parameters refer to the best performing model on the validation set. The algorithm utilises early-stopping parameters to avoid overfitting, so that if a relative improvement on the validation error is not made after a certain number of epochs, the algorithm ceases further optimisation.

As recommended by Bengio [9], the weights for each layer apart from the softmax layer were initialised from a uniform distribution between −a and a, where a = \sqrt{\frac{6}{fan_{in} + fan_{out}}}, and fan_{in} and fan_{out} are the number of inputs and outputs for a hidden neuron in that layer, respectively. The weights for the softmax layer and all the biases were initialised to zero.
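A minimal sketch of the weight initialisation described above is given below; the layer sizes used are hypothetical.

```python
import numpy as np

def uniform_init(fan_in, fan_out, rng=np.random):
    """Sample a weight matrix from U[-a, a] with a = sqrt(6 / (fan_in + fan_out))."""
    a = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-a, a, size=(fan_in, fan_out))

W_hidden = uniform_init(800, 500)    # e.g. a fully connected hidden layer
b_hidden = np.zeros(500)             # biases initialised to zero
```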

3.4.1.3 Results and Discussion

Over three runs, the model achieved an average test error of 0.97 ± 0.04%, which compares favourably to the 0.95% test error reported in [112]. However, it should be noted that, whilst the architecture implemented in this experiment used fewer layers, it utilised more feature maps, as well as a denser connection between layers P1 and C2. As such, it has many more trainable parameters.

(a) 0 iterations (b) 1000 iterations (c) 16999 iterations

Figure 3.8: First layer convolutional filters at different stages of training on MNIST.

Fig. 3.8 shows the learned filters for the first convolutional layer at 0 (Fig. 3.8a), 1000 (Fig. 3.8b) and 16999 (Fig. 3.8c) iterations, respectively. The filters at iteration 16999 represent the best filters for the model, since with these filters the model achieved the best results on the validation set. As can be seen, the filters for the best model exhibited various frequencies and orientations, although most are low frequency. Comparing the filters showed that as the iterations increase and the model converges towards a minimum, the filters change until some eventually appear to resemble Gabor-like edge detectors. This is to be expected, since in the lower layers we should see more generic feature extractors. Edges are elementary visual features that are present in all images and, from observation, are representative of the input distribution.

By inspecting the confusion matrix, which can be seen in Fig. 3.9, the most common mistakes were made on digits three and nine. Threes were sometimes misclassified as twos or nines, whilst nines were sometimes misclassified as zeros or sevens. However, these mistakes are justifiable, considering the similarity of the shapes of the misclassified digits.

Figure 3.9: Confusion matrix for the MNIST experiment.

3.4.2 Action Classification on the Weizmann Dataset

The next experiment was conducted on the Weizmann (Section 2.3.2) action dataset. To simplify classification, it was performed per frame; thus the experiment is effectively the same as that in Section 3.4.1, using the frames of the videos as images. To further simplify the problem, only three action classes were used (bend, run, wave1). These classes were purposely chosen because they exhibit very different image representations. The experiment was evaluated using leave-one-out cross-validation. Therefore, the model was run nine times and each time a different agent was used as the test set. From the remaining eight agents, one was randomly selected as the validation set and the rest were used for training.

3.4.2.1 Pre-processing

Instead of using the original video files as inputs, the aligned background-subtracted video files were used, thus providing a global image representation. However, these were of variable spatial resolution, and since the CNN requires a fixed-size input, some pre-processing steps were undertaken. Firstly, MATLAB's blob analysis function was used to detect the centroid of each frame. A 58 × 80 crop of each of the first 30 frames of each video was then taken, centred at the centroid. This crop resolution was chosen based on a compromise between the minimum spatial resolution of all the source videos and the objective of extracting the complete anthropometry of each agent, using the smallest resolution possible. The duration of the shortest video was 31 frames, therefore crops were only taken from the first 30 frames, in order to ensure that there were the same number of examples per class in the dataset, so that it was not biased towards a particular class. All images were then resized to 70 × 70 to provide the CNN with a square input. In order to augment the dataset, each frame was flipped horizontally to generate horizontal reflections, then ten random 64 × 64 crops were taken from each. It is these crops that formed the final dataset. This therefore increased the size of the dataset by a factor of 20, to help combat overfitting. However, it should be noted that the resulting images exhibited strong similarities. Each image was then reduced in scale by 0.5, so that the CNN could be implemented with a similar architecture to that used in Section 3.4.1. A larger input would require a deeper network and thus increase the complexity and therefore require a larger dataset for training. In total, the dataset consisted of 12600 images in the training set, and 1800 images in both the validation and test sets.
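The flip-and-crop augmentation described above can be sketched as follows; NumPy is used here purely for illustration (the actual pre-processing used MATLAB), and the frame array and random-number handling are assumptions.

```python
import numpy as np

def augment_frame(frame, rng=np.random, crop=64, n_crops=10):
    """Horizontal reflection plus random crops, as described above.

    frame : 70 x 70 greyscale frame already cropped about the centroid
    Returns 2 * n_crops cropped images (original + mirrored versions).
    """
    out = []
    for img in (frame, frame[:, ::-1]):            # original and horizontal flip
        for _ in range(n_crops):
            y = rng.randint(0, img.shape[0] - crop + 1)
            x = rng.randint(0, img.shape[1] - crop + 1)
            out.append(img[y:y + crop, x:x + crop])
    return np.stack(out)

frame = np.random.rand(70, 70)                     # placeholder for a resized frame
print(augment_frame(frame).shape)                  # (20, 64, 64): a 20x augmentation
```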

3.4.2.2 Architecture

The architecture used in this experiment was very similar to that used in Section 3.4.1, apart from a few details. Firstly, the input size was increased to 32 × 32, therefore the size of the feature maps in each layer increased. Thus, layer P2 was now composed of 50 feature maps of size 5 × 5. In addition, since only three classes were used, it was decided that only 100 hidden neurons should be used in F3. Therefore, there were

125100 and 303 trainable parameters in F3 and the softmax layer, respectively. In total, the model had 150973 trainable parameters, 280107 fewer than the model in Section 3.4.1.

3.4.2.3 Training and Testing

The model for each sub-experiment was trained using mini-batch stochastic gradient descent with the learning rate initially set to 0.01, a learning rate decrease constant of 0.0001 and a batch size of 200. The learning rate was annealed each epoch using the following equation:

LR_{epoch+1} = \frac{LR_{epoch}}{1 + Λ \times epoch}    (3.28)

where epoch is the epoch index and Λ is the learning rate decrease constant.
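A minimal sketch of the annealing schedule of equation 3.28, using the initial values stated above:

```python
def anneal(lr, epoch, decrease_constant=0.0001):
    """Equation 3.28: LR_{epoch+1} = LR_epoch / (1 + lambda * epoch)."""
    return lr / (1.0 + decrease_constant * epoch)

lr = 0.01
for epoch in range(5):
    print(epoch, lr)
    lr = anneal(lr, epoch)
```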

3.4.2.4 Results and Discussion

Averaging the errors across each of the nine sub-experiments gave a validation error of 0.90% and a test error of 0.88%. The results for each individual sub-experiment can be seen in Table 3.1. Columns one and two show the errors for the validation and test sets. Columns three, four and five show the intra-class percentage error for the test set. The intra-class error is the number of incorrect classifications within each class. For example, for the experiment in the first row, the result in the fourth column indicates that 4.3% of the run examples contained in the test set were misclassified as either bend or wave1. It is interesting to note that, apart from when Denis is used as the test subject, the misclassifications are not really spread out among the classes for each sub-experiment, indicating that the test subjects used for particular sub-experiments were prone to the same misclassifications by the CNN.

Table 3.2 shows the absolute errors for each class across all sub-experiments. For example, the first two columns show the instances when bend was misclassified as run and wave1, and in which sub-experiments they occurred. Columns 3-4 and 5-6 show the same for run and wave1, respectively. Overall, the run class was the most misclassified, with 79 out of a total of 142 misclassifications. This was followed by 45 wave1 misclassifications and 18 bend misclassifications. In general, the majority of bend misclassifications were classified as wave1, with 14 misclassifications out of a total of 18. This is mirrored by wave1, for which 44 of its 45 misclassifications were classified as bend. In fact, the only time that each class was misclassified as run was during the same sub-experiment. The run class was misclassified as bend and wave1 in almost equal measure, with 40 classified as bend and 39 classified as wave1.

Whilst the results for this experiment showed that the accuracy for frame-level recognition is high, they also highlighted distinct problems when classifying frames of actions in isolation, even when using three highly different classes. Whilst bend, run and wave1 appear very different to humans - they were explicitly chosen for this purpose - at frame level, high inter-class similarities occur. Yet, it should be stressed that no strong conclusions can be drawn from this set of experiments due to the reduced dataset. Moreover, due to this fact, the results cannot be compared against other reports in the literature.

Table 3.1: Validation and test error as well as intra-class error using different subjects as the test set on Weizmann.

Test Subject   Val. Error (%)   Test Error (%)   Bend Error (%)   Run Error (%)   Wave1 Error (%)
Shahar         0.11             1.44             0.00             4.30            0.00
Moshe          1.39             0.06             0.20             0.00            0.00
Lyoya          0.00             0.67             2.00             0.00            0.00
Lena           0.00             1.72             0.00             0.00            5.20
Ira            0.00             1.89             0.00             5.60            0.00
Ido            0.00             0.00             0.00             0.00            0.00
Eli            0.94             0.06             0.20             0.00            0.00
Denis          1.06             1.33             0.70             3.20            0.20
Daria          4.61             0.72             0.00             0.00            2.20
Avg            0.90             0.88             0.34             1.46            0.84

Table 3.2: Absolute misclassification for each class using different subjects as the test set on Weizmann.

                 True class = Bend      True class = Run       True class = Wave1
Test Subject     Run       Wave1        Bend      Wave1        Bend      Run
Shahar           0         0            0         26           0         0
Moshe            0         1            0         0            0         0
Lyoya            0         12           0         0            0         0
Lena             0         0            0         0            31        0
Ira              0         0            21        13           0         0
Ido              0         0            0         0            0         0
Eli              0         1            0         0            0         0
Denis            4         0            19        0            0         1
Daria            0         0            0         0            13        0
Total            4         14           40        39           44        1

3.5 3D Convolutional Neural Networks Case Study

In 2D convolutions, an image input results in an image output (the same happens for multiple input images, which are treated as separate channels). Therefore, features are learned over the spatial dimensions only. In contrast, 3D convolutions encode motion information over multiple contiguous frames to produce volumetric outputs that preserve the temporal as well as the spatial dimensions. To perform 3D convolution, the network is presented with a volumetric input, formed of multiple stacked frames of a video or similar, which is convolved with a 3D kernel. The difference between 2D and 3D convolutions is illustrated in Fig. 3.10. 3D pooling is an extension of 2D pooling to the temporal dimension.

3.5.1 Action Classification on the UCF Sports Dataset

For this set of experiments, the UCF Sports dataset (Section 2.3.2.2) was used. The authors of the dataset recommend using Leave-One-Out (LOO) cross-validation. However, there are problems with this method. There are strong correlations between videos in multiple classes. Many video clips are created from the same root video and display high similarities. Some feature the same actor performing highly similar actions and are captured from the same viewpoint. Thus, these similarities can be exploited in training. Furthermore, it would take too long to train the proposed deep networks for each round of validation.

Lan et al. [103] suggested splitting the data in order to mitigate the correlation issues, by taking one third of the videos from each class to form a test set. They do this by simply selecting the first third of the videos for each class from their alphabetical arrangement. However, certain classes (golf and kicking) are split into subclasses and therefore this method means certain subclasses are disproportionately represented in the splits. For example, golf has 18 videos in total, divided into three subclasses (golf swing-back, golf swing-front, golf swing-side) featuring five, eight and five videos, respectively. By selecting the first third to form the test set, there are no golf swing-back videos present in the training set. Instead, for these experiments a third of the videos were randomly selected from each class to form the test set. Therefore, whilst the total number of videos in each split is the same as Lan et al. [103], the splits differ in the actual videos that make up each split. The same randomly selected split was used throughout the experiments, in order to compare performances.

Section 3.5.1.5 includes an experiment conducted using a standard 3D CNN in order to classify the video in a global sense. In Section 3.5.1.6, an experiment was performed using figure-centric bounding box annotations and compared against the baseline 3D experiment. Both experiments used the same architecture and all experiments were carried out using Theano/Lasagne and NVIDIA GeForce GTX 980 and GTX TITAN X GPUs.

(a) 2D convolution

(b) 3D convolution

Figure 3.10: Comparison of 2D (a) and 3D (b) convolutions (the temporal depth of the 3D filter is equal to 3). The colours indicate shared weights (adapted from Ji et al. 2013 [89]).

3.5.1.1 Architecture

The network used was based on that of Tran et al. [173]. They took inspiration from Simonyan and Zisserman's VGG [162], which proposed using very deep networks of 16-19 weight layers and very small convolution filters (3 × 3). Spatial pooling is not used after every convolution layer, since a stack of two 3 × 3 filters, which achieves the effective receptive field of a single 5 × 5 filter, has been shown to be more discriminative (two non-linearities are used instead of one), in addition to using fewer parameters than the single-filter counterpart. The network architecture is presented in Table 3.3 and is described as 3 × 8 × 112 × 112 − 64C3 − MP − 128C3 − MP − 256C3 − 256C3 − MP2 − 512C3 − 512C3 − MP2 − 512C3 − 512C3 − MP2 − 4096FC − 4096FC − 10Soft. The network featured five convolutional and pooling layers, two fully connected layers and a softmax output layer. The last three convolutional and pooling layers featured two stacked convolutional layers. The number of convolutional filters was relatively small, with 64 in the first layer, and with the number doubling in each layer, apart from the fifth layer, which had the same number (512) as the fourth. All convolutional filters were 3 × 3 × 3, with stride 1 × 1 × 1, and all pooling layers were 2 × 2 × 2, with stride 2 × 2 × 2, except for the first and second layers, which used a size of 1 × 2 × 2 and stride 1 × 2 × 2 in order to preserve temporal information early on, as in [173]. Both fully connected layers had 4096 units. The ReLU non-linear activation function was applied to the output of all convolutional and fully connected layers. In spite of the depth, the number of parameters was 78M with an input size of 112 × 112, which, because of the number and size of the filters, is not much bigger than equivalent 2D versions [162].

Table 3.3: 3D CNN architecture for UCF Sports

Input: 3 × 8 × 112 × 112 RGB video clip
64 3 × 3 × 3 conv, ReLU, stride 1 × 1 × 1
1 × 2 × 2 max-pooling, stride 1 × 2 × 2
128 3 × 3 × 3 conv, ReLU, stride 1 × 1 × 1
1 × 2 × 2 max-pooling, stride 1 × 2 × 2
256 3 × 3 × 3 conv, ReLU, stride 1 × 1 × 1
256 3 × 3 × 3 conv, ReLU, stride 1 × 1 × 1
2 × 2 × 2 max-pooling, stride 2 × 2 × 2
512 3 × 3 × 3 conv, ReLU, stride 1 × 1 × 1
512 3 × 3 × 3 conv, ReLU, stride 1 × 1 × 1
2 × 2 × 2 max-pooling, stride 2 × 2 × 2
512 3 × 3 × 3 conv, ReLU, stride 1 × 1 × 1
512 3 × 3 × 3 conv, ReLU, stride 1 × 1 × 1
2 × 2 × 2 max-pooling, stride 2 × 2 × 2
4096 fully connected
4096 fully connected
10-class softmax

3.5.1.2 Pre-training

Considering the dataset used was limited in size, and the large number of parameters that needed to be trained, it would be problematic to train from scratch. Thus, because of the well-reported advantages [140, 162, 187] of using pre-trained weights, the network was initialised with the publicly available weights of [173]. These weights were trained on Sports-1M (Section 2.3.2.5), which is presently one of the largest video classification benchmarks, featuring over one million videos split into 487 classes. Using pre-trained weights from a dataset with such a large number of classes and videos should give additional performance benefits [140]. Since the pre-trained dataset has many more classes, the provided parameters were only used up to the last fully connected layer. The weights in the last layer were randomly initialised so that W ∼ U[−a, a], where a = \sqrt{\frac{12}{fan_{in} + fan_{out}}}, and the biases were set to zero [57].

3.5.1.3 Training

Since the dataset contains videos of varying length, clips consisting of eight consecutive frames were randomly selected to form the input to the network. Once selected, each frame was resized to 128 × 171, then the central 112 × 112 crop was taken so that the final input dimensions were 3 × 8 × 112 × 112. To form a mini-batch, each video was sampled with replacement uniformly across the entire training set. Therefore, for each clip in the batch, first a random video was selected, from which a clip was randomly sampled. To use the pre-trained features without fine-tuning would be inappropriate, since they were trained on a dataset with many more classes and, although of a similar nature, the dataset statistics will inevitably be different. The whole network was fine-tuned using mini-batch stochastic gradient descent with Nesterov momentum (NAG). The batch size was 32, with the learning rate and momentum set to 0.001 and 0.9, respectively. Training ran for 50 epochs. No augmentation was performed on the dataset since the network had been pre-trained on a similar dataset. Furthermore, since each input was randomly selected, some form of augmentation was performed as a result. In addition, this process also mitigated any problems with temporal correlations associated with the dataset.

Table 3.4: 3D baseline: accuracy on UCF Sports

Run   Clip Loss   Clip Acc. (%)   Vid. Acc. (%)
1     0.6299      86.59           87.94
2     0.4033      88.44           88.65
3     0.4883      84.61           85.11
Avg   -           86.55±1.56      87.23±1.53
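The clip sampling and cropping procedure described in Section 3.5.1.3 can be sketched as follows; this is NumPy pseudocode with assumed array shapes, whereas the real data pipeline was built with Theano/Lasagne.

```python
import numpy as np

def sample_clip(video, rng=np.random, clip_len=8, crop=112):
    """video: array of shape (n_frames, 128, 171, 3), i.e. already resized frames."""
    start = rng.randint(0, video.shape[0] - clip_len + 1)
    clip = video[start:start + clip_len]                 # 8 consecutive frames
    # Central 112 x 112 crop of each frame.
    y0 = (clip.shape[1] - crop) // 2
    x0 = (clip.shape[2] - crop) // 2
    clip = clip[:, y0:y0 + crop, x0:x0 + crop, :]
    return clip.transpose(3, 0, 1, 2)                    # -> (3, 8, 112, 112)

video = np.random.rand(90, 128, 171, 3)                  # placeholder video
print(sample_clip(video).shape)                          # (3, 8, 112, 112)
```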

3.5.1.4 Testing

At test time, ten clips were randomly selected from each video in the test set. Each clip underwent the same treatment as during training. Loss was measured on a clip-wise basis, averaged across all examples. Accuracy was measured on a clip-wise and video-wise basis. To calculate the video-wise accuracy, the softmax outputs of all ten clips for each video were averaged.
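A minimal sketch of the video-wise aggregation described above, assuming the ten clip-wise softmax outputs for a single video are already available:

```python
import numpy as np

def video_prediction(clip_softmax):
    """clip_softmax: (10, n_classes) softmax outputs for the ten clips of one video."""
    video_probs = clip_softmax.mean(axis=0)   # average the clip-wise distributions
    return int(np.argmax(video_probs))        # video-level predicted class

clip_softmax = np.random.dirichlet(np.ones(10), size=10)  # placeholder outputs
print(video_prediction(clip_softmax))
```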

3.5.1.5 Baseline Results and Discussion

The network was trained three times and then each model was independently evaluated on the test set a total of three times, using the model parameters from the last epoch of training, and the results were then averaged. This yielded a mean clip and video accuracy of 86.55% and 87.23% over the three separately trained networks, respectively (Table 3.4). Since the video accuracy is found by averaging the softmax output of all ten clips of a video, a misclassification with a high softmax weight can have a considerable impact on the value.

Fig. 3.11 shows the confusion matrix for the clip-wise predictions. Only the confusion matrix for the first test of the first run is shown; however, this is fairly representative of the other evaluations. As shown, the most common mistakes are on running and skateboarding videos. There are also mistakes for kicking and walking. Some of these mistakes seem reasonable. However, further investigation of the errors highlights some shortcomings, specifically to do with the holistic approach. Since the model is using the central crop as its input, it assumes that things will be canonical in scale and position. When these assumptions do not hold, it can lead to misclassifications. If the actor performing the action is small in scale compared to the resolution of the input, the actor will present a weak signal to the system. A particular example that highlights the problems with position is a video from the walking class. In the video the actor occupies the extreme right of the frame and thus the target is missing from the input. The video was taken on a golf course and was misclassified as golf, which also highlights the background sensitivity of holistic methods. While not present in the test set, the dataset contains some videos which feature multiple actors performing actions, not all of which are consistent with the ground-truth label. Since this approach has no prior knowledge of the target instance, features that correspond to actors that do not represent the ground-truth action could be harnessed, which could present problems during training. Whilst deep models are somewhat robust to noise, the dataset is relatively small, so this could lead to overfitting. Another problem that is evident from some of the errors observed is that eight frames are not always long enough to capture the most discriminative part of the action.

Figure 3.11: Confusion matrix for the 3D baseline experiment on UCF Sports.

3.5.1.6 Figure-centric Bounding Box Results and Discussion

Since in some instances the holistic approach of the previous method led to problems with position and scale, figure-centric bounding box annotations were used. Bounding boxes are one of the most popular ways of extracting global representations which also indicate the target instance. Since they are annotated about the target, they implicitly provide scale and positional information. In addition, selecting the target instance will eliminate the possibility of extracting features associated with the wrong actor in a video containing multiple actors performing different actions. The use of bounding boxes also mitigates the effects of scene correlations associated with the dataset.

During training and testing, for each randomly selected clip, figure-centric bounding box annotations were used to extract the final cropped input. The bounding box annotations contain x and y coordinates, as well as height and width measurements. To ensure the input to the network was square, the smallest side of the bounding box was made equal to its largest side. For each clip, only the coordinates corresponding to the first frame were used, and the remaining seven frames were cropped according to the same coordinates. Once cropped, the input was resized to 112 × 112, so that the final input dimensions were consistent with 3 × 8 × 112 × 112.

The network was trained three times. Each model was then independently evaluated on the test set a total of three times, using the model parameters from the last epoch of training, and the results were averaged. This yielded a mean clip and video accuracy of 92.10% and 93.85% for the three trained networks, respectively (Table 3.5). This is a 6% mean improvement over the holistic approach used in Section 3.5.1.5.

Table 3.5: 3D bounding box: accuracy on UCF Sports

Test   Clip Loss   Clip Acc. (%)   Vid. Acc. (%)
1      0.2314      92.82           93.62
2      0.2341      90.14           92.91
3      0.2158      93.33           95.03
Avg    -           92.10±1.40      93.85±0.88

The addition of bounding boxes clearly increased performance. However, there were some instances where the model was still making errors. Fig. 3.12 shows the confusion matrix for the clip-wise predictions. Again, only the predictions from the first test of the first run are shown, as the results for the remaining tests were similar. As with the holistic approach in Section 3.5.1.5, the main misclassifications are associated with the running and skateboarding classes. There are still minor misclassifications related to the kicking actions, although the performance has improved. The majority of these improvements were probably due to the limitations of the holistic approach, as discussed above. Notably, the misclassifications appeared more spread out than previously. The results in Section 3.5.1.5 show that the misclassifications were more consistent. Studying the results in more detail suggests that the consistency was perhaps due to correlations in the background. Since the background noise had been reduced, the network was less able to exploit these correlations; instead, the network is encouraged to find more figure-centric spatio-temporal features. Although, it should be noted that the background was still causing problems in some cases.

Figure 3.12: Confusion matrix for the 3D bounding box experiment on UCF Sports.
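Returning to the cropping procedure described at the start of Section 3.5.1.6, a minimal sketch of making a bounding box square about the actor, cropping and resizing is given below; the frame, the nearest-neighbour resize and the coordinate convention are simplifying assumptions rather than the exact implementation used here.

```python
import numpy as np

def square_bbox_crop(frame, x, y, w, h, out_size=112):
    """Grow the bounding box's smaller side to match the larger, then crop and resize.

    frame : (H, W, 3) frame; x, y, w, h : bounding box (top-left corner and size)
    """
    side = max(w, h)                                  # smaller side grown to the larger
    cx, cy = x + w / 2.0, y + h / 2.0                 # keep the box centred on the actor
    x0 = int(max(0, cx - side / 2.0))
    y0 = int(max(0, cy - side / 2.0))
    crop = frame[y0:y0 + side, x0:x0 + side]
    # Nearest-neighbour resize to out_size x out_size (a library resize would be used in practice).
    ys = (np.arange(out_size) * crop.shape[0] / out_size).astype(int)
    xs = (np.arange(out_size) * crop.shape[1] / out_size).astype(int)
    return crop[ys][:, xs]

frame = np.random.rand(240, 320, 3)
print(square_bbox_crop(frame, x=100, y=60, w=60, h=120).shape)   # (112, 112, 3)
```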

3.5.1.7 Overall Discussion and Comparison to State-of-the-art

Table 3.6 shows the results for the baseline experiment as well as the figure-centric bounding box experiment (Baseline + BB), along with other state-of-the-art results. Results are included for the recommended Leave-One-Out (LOO) evaluation strategy, as well as the train/test split used by Lan et al. [103] (1/3 split (first)), in addition to the train/test split used here (1/3 split (rand.)). Firstly, the results for the 3D CNNs indicate that the use of bounding box annotations can significantly increase performance (93.85% vs 87.23%). In addition, the results are better than the other state-of-the-art approaches, which all use hand-crafted feature descriptors. Lan et al. [103] and Wang et al. [183] both used dense HOG3D, and Action MACH [150] employed a template-based method using a Maximum Average Correlation Height (MACH) filter. The results for the 3D CNNs even show improved performance over the results which use LOO, which should demonstrate improved performance over the other two evaluation strategies due to its reduced test set and increased training set sizes (evidenced by the 10% increase for Lan et al. [103]). However, the 3D CNNs utilised pre-trained features which were fine-tuned on the target dataset, and thus leveraged the inbuilt knowledge of a large labelled dataset. The importance of well-annotated datasets was further highlighted by the increased performance when bounding box annotations were used.

Table 3.6: Accuracy on UCF Sports

Method               Evaluation strategy   Accuracy (%)
Action MACH [150]    LOO                   69.2
Lan et al. [103]     LOO                   83.7
Wang et al. [183]    LOO                   85.6
Lan et al. [103]     1/3 split (first)     73.1
Baseline             1/3 split (rand.)     87.23±1.53
Baseline + BB        1/3 split (rand.)     93.85±0.88

Chapter 4

Self-Organising Map Network

4.1 Introduction

The task of vision-based image classification has garnered much attention in recent years. Robust solutions to this task have many wide-ranging applications. However, the task is extremely challenging due to well-known intra-class variations, such as lighting, deformations, occlusions and misalignments. Up until recently, low-level hand-crafted features such as Gabor [92], local binary patterns (LBP) [137] and the scale invariant feature transform (SIFT) [121] were used extensively to successfully overcome such problems. Yet, manually designed features are very problem-dependent, overly relying on domain-specific knowledge; consequently, they may not generalise well to other related data. In addition, hand-crafted features can often be complex, and thus expert knowledge is required to apply them to new conditions. Using data-driven representation learning techniques to learn task-specific salient features has been proposed to obviate the need for complicated and costly engineered features. Deep learning has recently made numerous advances in many applications, including object detection and localisation [157], image recognition [73], as well as human action classification [89, 161]. Of these methods, the convolutional neural network


(CNN) [111] has become ubiquitous in the field of image classification [101] due to its superior performance, demonstrating above human-level accuracy in some cases [25], and its relative ease of use compared to more traditional hand-crafted approaches. The cornerstone of its performance lies in its ability to learn increasingly complex data-specific features through a multi-stage non-linear architecture, providing greater robustness to intra-class variations.

Despite the CNN's widespread use, there is limited understanding of its complex learning mechanism and it is often used as a “black box”. In addition, the supervised backpropagation algorithm used to train CNNs can be costly in terms of time and the need for vast amounts of labelled data. Furthermore, obtaining the best performance often requires manually fine-tuning its numerous parameters and the use of various expedient add-ons [78]. Scattering convolutional networks (SCATNet) [18] shone some much-needed light on the underlying processes governing CNNs by using generated wavelet operators as learning-free filters. Used in a similar cascaded architecture, SCATNet outperformed both CNN and deep neural networks on two challenging datasets [18, 159].

Inspired by SCATNet, PCANet [20] was introduced as a simple deep-learning framework for image classification. It utilises multi-stage principal component analysis (PCA) derived filter banks, followed by binarisation and local histogramming for indexing and pooling. Despite its simple structure, it performed surprisingly well on many image classification tasks, outperforming many more complex methodologies. This simple approach was further developed into a learning-free, data-independent version called DCTNet [134]. DCTNet uses discrete cosine transform (DCT)-derived filter banks, as it has been demonstrated that the DCT is a good approximation for the most important principal components. However, there are a number of inherent constraints that limit the architecture of both networks.

This chapter presents a self-organising map (SOM)-based alternative to PCANet and DCTNet. Both unsupervised data-driven and generative approaches are demonstrated, and a simple trick in the binarisation process is introduced which greatly reduces the final feature vector size without significantly affecting performance. In addition, SOM offers many benefits over the use of PCA and DCT, especially that the number of possible filters is not affected by the filter size.

4.2 Related Work

Chan et al. [20] proposed PCANet, which uses a filter bank derived by PCA. The resultant filters are convolved with the input to form activations. This is repeated for each subsequent layer. No activation function is used between layers; instead, a non-linearity is introduced in the final layer. Their experiments with different architectures found that a two-layer network achieves the best results. The output layer produces the final feature vector by first binarising the activation maps of the previous layer and then local histogramming. Binarisation involves converting the activation maps to binary using a Heaviside function, then transforming each pixel's bit string to decimal, to produce a single activation map. Binarised Statistical Image Features (BSIF) [93] used this binarisation technique prior to PCANet; however, it was only used as an image descriptor. Due to the size of the resultant output feature vectors, further reduction techniques are sometimes necessary. Whilst PCANet produces excellent results for such a compact network, there are some limitations. Specifically, the filter number and network depth are limited by the aforementioned processes used to achieve the final feature vector (all architectures experimented with restrict the number of filters in the second layer to eight). In addition, since PCA is used to learn the filters, the maximum number achievable is limited by the covariance matrix. The size of the filters also governs the mixture of low- and high-frequency components in the filter bank.

In DCTNet [2], DCT basis functions are chosen as an alternative to PCA, due to their data independence and low complexity. DCT functions are cosine functions which oscillate at different frequencies in both horizontal and vertical directions, and which can be summed together to express data. Similar to the characteristics used in DCT-based JPEG compression, DCT filters are selected in a zig-zag fashion, excluding the mean component. In addition to using a different basis, DCTNet also employs a further tied-rank (TR) normalisation step, which regulates the histogram and leads to improved performance over PCANet on certain data. However, DCTNet suffers from similar drawbacks to PCANet, namely, the maximum number of filters is limited by the filter size. In addition, the learning-free approach relies on the input distribution following a local high correlation assumption, which can be considered as a 2D first-order Markov process.

The self-organising map (SOM) [98] is an unsupervised algorithm, first introduced by Kohonen. It uses competitive learning in order to quantise an input space, whilst preserving the topological structure of the data on the map. A set of input vectors is reduced or mapped to a smaller set of prototype vectors, which are representative of the data distribution. In this respect, SOM is similar to PCA, except SOM is not restricted to finding orthogonal principal directions. In fact, SOM can be considered as a non-linear version of PCA [192]. Recently, SOMs have been used to learn features [5] for tasks including face recognition [4] and handwritten digit recognition [4, 30]. SOMs have also recently been used for tasks such as the recognition of human-object interactions [128], and word segmentation [17].

As an extension of the simple Markov process of DCT, Markov random fields (MRFs) have a long history in computer vision and have found applications in texture synthesis [33] and image classification [145].
Generated MRF features are also explored in this chapter.

4.3 Methodology

4.3.1 Convolutional Self-Organising Map

In the case that convolution is used as the similarity measure, the winner or best matching unit at each time step is found by maximising the activation between the input x(t) and all the neurons of the map:

bmu(t) = \arg\max_{i \in Ω} \left[ w_i ∗ x(t) \right]    (4.1)

and the weights of the winner and its neighbours are updated according to

w_i(t+1) = \frac{w_i(t) + LR(t)\,η(bmu, i, t)\,[x(t) - w_i(t)]}{\left\| w_i(t) + LR(t)\,η(bmu, i, t)\,[x(t) - w_i(t)] \right\|}    (4.2)

where ∗ denotes convolution, Ω is the set of neuron indices, LR (0 < LR(t) < 1) is the monotonically decreasing learning rate, η(bmu, i, t) = \exp\!\left[ -\frac{\| r_{bmu} - r_i \|^2}{2σ(t)^2} \right] is the Gaussian neighbourhood function, with r_i being the location of neuron i on the map and σ the effective range of the neighbourhood, which decreases with time t. The learning rate LR and the neighbourhood σ are annealed using the following:

LR(t) = LR(t-1)\,\frac{LR_b}{LR_b + t}    (4.3)

σ(t) = σ(t-1)\,\frac{σ_b}{σ_b + t}    (4.4)

where LR_b and σ_b are the decrease constants for the learning rate and neighbourhood, respectively.
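A minimal NumPy sketch of one training pass of the convolutional SOM (equations 4.1-4.4) follows; the map size, patch size and hyper-parameter values are illustrative assumptions, and the convolution at a single patch location is treated as a dot product between the flattened patch and each filter.

```python
import numpy as np

H, patch_dim = 8, 5 * 5              # assumed: a 1 x 8 map of flattened 5x5 filters
W = np.random.randn(H, patch_dim)
W /= np.linalg.norm(W, axis=1, keepdims=True)
positions = np.arange(H)             # neuron locations on the 1D map
lr, sigma = 0.1, 2.0                 # assumed initial learning rate and neighbourhood
lr_b, sigma_b = 1000.0, 1000.0       # decrease constants LR_b and sigma_b

for t in range(1, 1001):
    x = np.random.randn(patch_dim)   # placeholder for a randomly sampled image patch
    x -= x.mean()                    # mean-patch removal

    bmu = int(np.argmax(W @ x))                                   # eq. 4.1
    eta = np.exp(-((positions - bmu) ** 2) / (2.0 * sigma ** 2))  # Gaussian neighbourhood
    W_new = W + lr * eta[:, None] * (x - W)
    W = W_new / np.linalg.norm(W_new, axis=1, keepdims=True)      # eq. 4.2

    lr = lr * lr_b / (lr_b + t)              # eq. 4.3
    sigma = sigma * sigma_b / (sigma_b + t)  # eq. 4.4
```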

4.3.2 Discrete Cosine Transform (DCT)

The discrete cosine transform, first introduced by Ahmed et al. [2], decomposes a signal into a sum of weighted sinusoids of varying frequency. The DCT is a popular method for lossy compression and is used in the JPEG and MPEG standards. Specifically, an M × N input (or block of an image) can be represented as a sum of MN 2D basis functions of the form [66]:

α_p α_q \cos\!\left( \frac{π(2m+1)p}{2M} \right) \cos\!\left( \frac{π(2n+1)q}{2N} \right)    (4.5)

where

p = 0, 1, \ldots, M-1 \quad \text{and} \quad q = 0, 1, \ldots, N-1    (4.6)

α_p = \begin{cases} 1/\sqrt{M}, & \text{if } p = 0 \\ \sqrt{2/M}, & \text{otherwise} \end{cases}    (4.7)

α_q = \begin{cases} 1/\sqrt{N}, & \text{if } q = 0 \\ \sqrt{2/N}, & \text{otherwise} \end{cases}    (4.8)

The DCT coefficients B_{pq} can then be considered as the weights applied to each basis function. As shown by equation 4.5, the total number of basis functions is restricted by their size.
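For illustration, equation 4.5 can be evaluated directly to generate a bank of DCT basis functions; the block size chosen here is an assumption.

```python
import numpy as np

def dct_basis(M, N, p, q):
    """2D DCT basis function of equation 4.5 for frequencies (p, q) on an M x N block."""
    alpha_p = np.sqrt(1.0 / M) if p == 0 else np.sqrt(2.0 / M)
    alpha_q = np.sqrt(1.0 / N) if q == 0 else np.sqrt(2.0 / N)
    m = np.arange(M)[:, None]
    n = np.arange(N)[None, :]
    return alpha_p * alpha_q * np.cos(np.pi * (2 * m + 1) * p / (2 * M)) \
                             * np.cos(np.pi * (2 * n + 1) * q / (2 * N))

# e.g. the full bank of 7x7 basis functions (49 in total); the mean (p = q = 0)
# component would be excluded when forming a DCTNet-style filter bank.
bank = [dct_basis(7, 7, p, q) for p in range(7) for q in range(7)]
print(len(bank), bank[0].shape)   # 49 (7, 7)
```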

4.3.3 Markov Random Field

A Markov random field is a set of random variables which can be described by an undirected graph having Markovian properties. Given a random field F = {F1, F2, ··· , Fk} indexed by the set S, and an undirected graph G = (S, N), where N is the neighbourhood system, each random variable takes a value fi from the label set L. F forms an MRF with respect to G providing the properties below are satisfied:

1. Positivity: Any configuration has positive probability, and is therefore possible to realise.

P( f ) > 0, ∀ f ∈ F, (4.9)

where f = { f1, f2,··· , fk} is one possible configuration and F denotes the set of all possible configurations of the field F.

2. Markovianity: The conditional probability that a random variable takes a value depends upon the random variables in its neighbourhood only.

P( fi | fS−{i}) = P( fi | f j, j ∈ Ni) = P( fi | Ni), (4.10)

where S − {i} denotes all sites excluding site i, fS−{i} is the set of labels at the sites in S − {i}.

The Hammersley-Clifford theorem was introduced to overcome the difficulty of specifying an MRF from conditional dependencies. Proposed by Hammersley and Clifford and developed by Besag [13, 26, 67], it demonstrates the equivalence of MRFs and Gibbs distributions for the same graph: the joint probability of any MRF can be written as a Gibbs distribution, and for any Gibbs distribution there exists an MRF model. While MRFs only specify the conditional dependencies, Gibbs distributions provide a way of expressing the conditional dependencies of the random variables as functions of the cliques in a graph, thus providing a set of functions which represent an MRF. A clique is any fully connected subset or configuration of the graph. Let P(f) denote a Gibbs distribution on the set S. Then the joint probability P(f) takes the form

P(f) = \frac{1}{Z}\, e^{-\frac{1}{T} U(f)}    (4.11)

where

Z = \sum_{f \in F} e^{-\frac{1}{T} U(f)}    (4.12)

Z is a normalising constant, also called the partition function, T is a constant called the temperature and U(f) is the energy function. The energy

U(f) = \sum_{c \in C} V_c(f)    (4.13)

sums clique potentials V_c(f) over all possible cliques C. The value of V_c(f) depends upon the clique configuration and remains positive for all possible configurations.

Natural images are highly locally correlated: the intensity of a single pixel depends on the intensity of its immediate neighbours. Therefore MRFs are a natural choice for modelling image features. Given an image, MRFs treat pixel values as random variables and model the conditional probabilities between them. Auto-models and multi-level logistic models are typical models for establishing neighbourhood functions and conditional probability distributions. Given the contextual constraints, simulated annealing [54] is a commonly used method for approximating the global minimum.

4.3.4 Self-Organising Map Network (SOMNet)

The architecture of SOMNet closely resembles PCANet and DCTNet, except the encoding is adapted during the binarisation process. The structure is depicted in Fig. 4.1. The following sections detail the behaviour of each layer. Features were learned from randomly sampled s × s × d patches from the input, which were mean normalised by subtracting the patch mean from each pixel (Fig. 4.2). Using a patch-based convolutional SOM reduces the number of parameters, effectively training local receptive fields through shared weights. When sampling from activations, a single s × s patch is sampled from a single activation map uniformly. To avoid too much redundancy in the feature space, a 1D SOM map was used. Therefore, for a filter bank of size 8, a 1 × 8 SOM was used.

Figure 4.1: Block diagram of SOMNet. SOM was used to derive the convolutional layer filter banks, as depicted in Figure 4.2 [68].

Each SOM was trained until convergence, with the second SOM being trained for twice as long as the first. Given that the activation data space was larger than the input, it was thought best to scale the training time of the second SOM by the same ratio. However, preliminary results showed that this was unnecessary (data not shown). Whilst the first SOM was able to reach a global minimum, due to the bootstrap sampling process and the increased input space of the activations, the second layer SOM only converged to local minima. Yet, this does not appear to be detrimental to the performance.


Figure 4.2: Training the SOM-based filter banks [68].

4.3.4.1 Convolution Layer

Given an h × w × d input image I_d, where h and w are the height and width of the input respectively and d is the number of channels, the input is convolved with a filter bank of size s^2 × d × H_l, which remains the same for each layer. Zero padding was applied with pad size (s − 1)/2 so that the output was the same size as the input. Convolution of the input I_d with the filter bank W_l = [w_l^1, \ldots, w_l^{H_l}] ∈ R^{s^2 × d × H_l}, where H_l is the number of filters in layer l, provides O_l = [o_l^1, \ldots, o_l^{H_l}], where

o_l^i = I_d ∗ w_l^i, \quad i = 1, 2, \ldots, H_l    (4.14)

Each o_l^i was taken as input to subsequent layers.

4.3.4.2 Binarisation and Histogram Layer

The final layer has an input of size $h \times w \times H_1 \times H_2$. By applying the Heaviside step function, $He(\cdot)$, each set of $H_1$ real-valued outputs is binarised by thresholding the values at zero as $He(O)$. Each resultant binary string is grouped into four-bit nibbles (hexadecimal), $G_o$, and encoded using

$$\sum_{\kappa=1}^{4} 2^{\kappa-1} G_o^{\kappa}, \quad o = 1, 2, \ldots, \frac{H_2}{4} \tag{4.15}$$

which produced $\frac{H_2}{4} \times H_1$ encoded output images where each pixel had the range $[0, 2^4 - 1]$. Each of these images is split into $B_{size} \times B_{size}$ blocks with overlapping ratio $\tau$. The final output feature vector $\Xi \in \mathbb{R}^{2^4 \frac{H_2}{4} H_1 B}$ is formed by concatenating each block's histogram together, where $B$ is the number of local histograms. Splitting the images into local histograms encodes spatial information and also provides some degree of invariance to translations. In comparison, PCANet has $\hat{\Xi} \in \mathbb{R}^{2^{H_2} H_1 B}$, giving an output feature vector ratio

$$\frac{\hat{\Xi}}{\Xi} = \frac{2^{H_2}}{4 H_2} \tag{4.16}$$

which is exponentially increasing. Therefore $\Xi$ will always be smaller than $\hat{\Xi}$ when $H_2 > 4$. The same is also true for DCTNet and other PCANet derivatives which employ the same encoding strategies. Given that using only four filters is uncommon in state-of-the-art architectures, it is safe to assume that this binarisation process offers a more compact representation. The final output feature vector is then classified using a linear support vector machine (SVM) [48] classifier.
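To make the encoding concrete, the sketch below binarises the H2 second-layer maps belonging to one first-layer branch, packs them into nibble-valued images following Eq. (4.15), and builds overlapping block histograms of the kind concatenated into Ξ. It is a hedged NumPy sketch; the function names, the block-stepping strategy and the handling of image borders are assumptions rather than the thesis implementation.

```python
import numpy as np

def binarise_and_encode(second_layer_maps):
    """second_layer_maps: (H2, h, w) real-valued outputs for one first-layer map.
    Returns (H2 // 4, h, w) integer images with pixel values in [0, 2**4 - 1]."""
    binary = (second_layer_maps > 0).astype(np.int64)       # Heaviside step at zero
    weights = 2 ** np.arange(4)                              # 2^(kappa - 1), Eq. (4.15)
    encoded = [np.tensordot(weights, binary[4 * o: 4 * o + 4], axes=1)
               for o in range(binary.shape[0] // 4)]         # one image per nibble G_o
    return np.stack(encoded, axis=0)

def block_histograms(encoded_image, block_size=7, overlap=0.5, n_bins=16):
    """Histogram overlapping B_size x B_size blocks of one encoded image."""
    step = max(1, int(round(block_size * (1 - overlap))))
    h, w = encoded_image.shape
    hists = [np.bincount(encoded_image[y:y + block_size, x:x + block_size].ravel(),
                         minlength=n_bins)
             for y in range(0, h - block_size + 1, step)
             for x in range(0, w - block_size + 1, step)]
    return np.concatenate(hists)                             # one slice of Xi
```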

4.3.5 Markov Random Field Self-Organising Map Network (MRF-SOMNet)

When the generated MRF filters were reduced to a subset using SOM (named MRF-SOM) and used as filter banks, the resulting network is referred to as MRF-SOMNet. MRF-SOM is trained in a similar way to SOM to ensure convergence. Apart from the filters, which are replicated across both layers, all other parts of MRF-SOMNet remain the same as SOMNet. To produce the original set of MRF filters, a multi-level logistic model with a fifth order neighbourhood system and label set L = {0,1} was used. Specifically, the values of β were randomly varied to model different distributions using an anisotropic neighbourhood described as,

$$\beta = (\beta_{11}, \beta_{12}; \beta_{21}, \beta_{22}; \beta_{31}, \beta_{32}; \beta_{41}, \beta_{42}; \beta_{51}, \beta_{52})$$

$$= \begin{pmatrix}
\beta_{51} & \beta_{41} & \beta_{32} & \beta_{42} & \beta_{52} \\
\beta_{41} & \beta_{21} & \beta_{12} & \beta_{22} & \beta_{42} \\
\beta_{31} & \beta_{11} & 0 & \beta_{11} & \beta_{31} \\
\beta_{42} & \beta_{22} & \beta_{12} & \beta_{21} & \beta_{41} \\
\beta_{52} & \beta_{42} & \beta_{32} & \beta_{41} & \beta_{51}
\end{pmatrix} \tag{4.17}$$

A total of 20000 100 × 100 MRF images were simulated using 500 iterations of a Metropolis sampler with simulated annealing, and were then sampled to produce the filters. The temperature was initialised at T = 1 and was annealed by a factor of 0.999 each iteration. For colour filters, each image was sampled three times (once for each channel).
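For illustration, the sketch below simulates one binary MRF texture with a Metropolis sampler and the annealing schedule described above. It is only a rough, slow sketch under simplifying assumptions: a generic pairwise agreement energy over the 5 × 5 neighbourhood stands in for the exact multi-level logistic clique potentials, and the random choice of β weights is illustrative only.

```python
import numpy as np

def simulate_mrf(size=100, n_sweeps=500, t0=1.0, anneal=0.999, beta=None, rng=None):
    """Simulate one binary MRF image with Metropolis sampling and annealing.

    beta is a 5 x 5 weight matrix over the fifth-order neighbourhood (centre set
    to zero); a positive weight rewards agreement with that neighbour.
    """
    rng = np.random.default_rng() if rng is None else rng
    if beta is None:
        beta = rng.uniform(-1.0, 1.0, size=(5, 5))     # random anisotropic weights
    beta[2, 2] = 0.0
    labels = rng.integers(0, 2, size=(size, size))     # label set L = {0, 1}
    temp = t0
    for _ in range(n_sweeps):
        for y in range(2, size - 2):                   # visit every interior site once
            for x in range(2, size - 2):
                patch = labels[y - 2:y + 3, x - 2:x + 3]
                current, proposal = labels[y, x], 1 - labels[y, x]
                e_cur = -np.sum(beta * (patch == current))   # agreement energy
                e_new = -np.sum(beta * (patch == proposal))
                delta = e_new - e_cur
                if delta <= 0 or rng.random() < np.exp(-delta / temp):
                    labels[y, x] = proposal            # Metropolis acceptance rule
        temp *= anneal                                 # simulated annealing
    return labels
```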

4.3.6 Computational Complexity

The computational complexity of the proposed methods is now considered. SOMNet is considered first, for which the complexity includes mean patch removal and convolution for both convolution stages, as well as binary hashing and histogramming for the final stage. In addition to this, there is the complexity of the SOM itself, which is $s^2(H_1 + H_2)$. The overall complexity can therefore be calculated as $O\left(hws^2(H_1 + H_2)\right)$. Similarly, PCANet, DCTNet and MRF-SOMNet have the same network complexities; however, PCANet has the added operations involved in eigendecomposition, as well as the increased dimensionality in the second convolutional stage. Table 4.1 summarises the final complexity results. For DCTNet and MRF-SOMNet, the operations taken to generate the filters were ignored, since this occurs only once.

Table 4.1: Computational Complexity

Method | Complexity
SOMNet | $O\left(hws^2(H_1 + H_2)\right)$
PCANet | $O\left(hws^2(H_1 + H_1H_2) + hw(H_1H_2)^2\right)$
DCTNet | $O\left(hws^2(H_1 + H_2)\right)$
MRF-SOMNet | $O\left(hws^2(H_1 + H_2)\right)$

4.4 Experiments and Discussion

In this section the performance of SOMNet and MRF-SOMNet is compared against PCANet, DCTNet and other state-of-the-art methods on benchmark data. Due to the high dimensionality of the output feature vector, all experiments used linear SVM for classification.

4.4.1 Comparison of Different Features and Encodings

To investigate the use of different feature types and encoding strategies, experiments were conducted to compare three types of unsupervised features, namely SOM, PCA and k-means, and two different styles of encoding: SOMNet and PCANet. These experiments were conducted using the first 10000 examples of the CIFAR-10 dataset. The CIFAR-10 [100] object recognition dataset consists of 60000 32 × 32 colour images of 10 classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship and truck. The class distribution is uniform and the training and testing sets contain 50000 and 10000 examples, respectively. In the experiments conducted here, a double convolution layer architecture was used, where H1, H2, s1, s2, Bsize, and τ were kept constant. The respective filters were trained once and then evaluated using different encodings for a fair comparison. For the k-means and PCA filters, the procedures in [42] and [20] were followed, respectively. The first layer SOM was trained for 600000 iterations, using an initial LR and σ of 0.1 and 20 respectively, which were annealed using LRb = 50000 and σb = 5263. The second layer SOM was trained for 2400000 iterations using LR = 0.1, σ = 4, LRb = 200000 and σb = 133333. In addition, both k-means and SOM were trained using whitened data. Although the results were not statistically tested, Table 4.2 indicates that using different features has little impact on the overall accuracy, whereas using the PCANet encoding does show a small increase in accuracy. However, the SOMNet encoding produces a code that is 8 times smaller than that of PCANet.

Table 4.2: Comparing features and encodings

Feature type | Encoding | Accuracy (%)
SOM | SOMNet | 68.56
PCA | SOMNet | 68.68
k-means | SOMNet | 68.23
SOM | PCANet | 71.31
PCA | PCANet | 72.44
k-means | PCANet | 71.25
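The SOM training schedules above only give the initial LR and σ together with the annealing constants LRb and σb. The sketch below shows a plain 1D SOM training loop of the kind described, assuming exponential decay of both the learning rate and neighbourhood width; the decay form, the weight initialisation and the function name are assumptions, since the exact schedule is defined elsewhere in the thesis.

```python
import numpy as np

def train_som_1d(patches, n_nodes=8, n_iter=600000,
                 lr0=0.1, sigma0=20.0, lr_b=50000.0, sigma_b=5263.0, rng=None):
    """Train a 1 x n_nodes SOM on mean-normalised (and whitened) patch vectors.

    patches: (N, D) array of flattened s*s*d patches.
    Returns the (n_nodes, D) weight matrix whose rows form the filter bank.
    """
    rng = np.random.default_rng() if rng is None else rng
    weights = 0.01 * rng.standard_normal((n_nodes, patches.shape[1]))
    positions = np.arange(n_nodes)                   # node coordinates on the 1D map
    for t in range(n_iter):
        x = patches[rng.integers(len(patches))]      # bootstrap-sample one patch
        bmu = np.argmin(np.linalg.norm(weights - x, axis=1))
        lr = lr0 * np.exp(-t / lr_b)                 # assumed exponential annealing
        sigma = sigma0 * np.exp(-t / sigma_b)
        h = np.exp(-((positions - bmu) ** 2) / (2.0 * sigma ** 2))
        weights += lr * h[:, None] * (x - weights)   # pull nodes towards the patch
    return weights
```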

4.4.2 Evaluation on the MNIST Dataset

Formed from the larger NIST dataset, the MNIST image dataset [112] is a collection of 70000 grayscale 28 × 28 images of handwritten digits 0-9. For the purposes of these experiments, 60000 were used for training and 10000 for testing. With regard to the parameters of SOMNet, a double convolution layer architecture with H1 = H2 = 8, s1 = s2 = 7, Bsize = 7, and τ = 0.5 was used, which is in line with PCANet. The first layer SOM was trained for 300000 iterations, using an initial LR and σ of 0.1 and 4 respectively, which were annealed using LRb = 50000 and σb = 33333. The second layer SOM was trained for 600000 iterations using the same settings apart from LRb = 100000 and σb = 66666. The MRF-SOM was trained for 5000000 iterations with an initial LR and σ of 0.1 and 4 respectively, which were both annealed using LRb = σb = 33333. To demonstrate the benefits of the new binarisation process, results for a SOMNet with H1 = 16 and H2 = 32 (all other parameters remain the same as above) are also given, referred to as SOMNet16−32. For SOMNet16−32, the first layer SOM was trained for 300000 iterations, using an initial LR and σ of 0.1 and 8 respectively, which were annealed using LRb = 50000 and σb = 14286, whereas the second layer SOM was trained for 600000 iterations using LR = 0.1, σ = 16, LRb = 100000 and σb = 13333. For PCANet, both the published results and the results of the experiments conducted using the provided code are shown. DCTNet does not provide published results on MNIST, so again the provided code was used. In addition, DCTNet does not use overlapping histograms, so results are provided for the default version as well as a version which does, to match SOMNet and PCANet. These are referred to as DCTNet and DCTNetoverlap, respectively. For both of these methods, an SVM with default parameters was used for a fair comparison with SOMNet (DCTNet originally uses the nearest neighbour for classification). Results for SOMNet, after fine-tuning of the SVM using five fold cross-validation, are also shown. Given that SOMNet's second layer filters converge to local minima, the results are the average of three runs.

For all other methods (PCANet-2 (mine), DCTNet, DCTNetoverlap, SOMNetrep and MRF-SOMNet) only a single result is necessary. In addition, to explore whether it is essential to learn filters in the second layer, a result for when the first layer filters of the SOMNet are replicated over both filter banks (referred to as SOMNetrep) is given.

Whilst SOMNet16−32 uses more filters than the PCANet and DCTNet methods, the output feature vector is equal in size to those of PCANet and DCTNetoverlap. Table 4.3 shows the results along with other state-of-the-art methods. The results for DCTNet are surprising given their simplicity, and are the best in the second part of the table (excluding the published PCANet result, which could not be replicated here). DCTNet has already produced credible results on face recognition [134], where the equivalence between PCA and DCT is clear. Therefore, the results shown here suggest that DCT filters are actually quite flexible in their application. It is noted that the TR normalisation does not seem to have much effect here, and that the use of overlapping histograms improves the accuracy. Whilst SOMNet provides slightly inferior results compared to PCANet (mine) and DCTNet (0.86% vs 0.77% and 0.74% respectively), it provides similar results to other unsupervised methods in the first part of the table, such as CDBN (more complex) and CSOM (larger dictionary). Furthermore, SOMNet achieves these results using an output feature vector that is eight times smaller than those of PCANet and DCTNetoverlap and four times smaller than that of DCTNet. The benefits of this more compact representation are further highlighted by the results of SOMNet16−32 (0.65%), which utilises additional features for a more over-complete representation, providing, alongside the published PCANet and DCTNetoverlap results, the best performance in the second and third parts of the table. However, this result is still inferior to other state-of-the-art approaches such as ConvNet (trained supervisedly), ScatNet (which used an RBF SVM) and MRF-CNN (which used a deeper architecture with many more parameters). The results for MRF-SOMNet are the worst in the table.

Comparing SOMNet to SOMNetrep clearly shows that learning filters based on the activations of the previous layer improves performance over using replicated ones. Figs. 4.3 - 4.6 show the filter banks for SOMNet, PCANet, DCTNet and MRF-SOMNet, respectively. Since the DCT filters are generated, they appear quite generic, as expected. However, they share some similarities with the PCANet filters. The most obvious similarities are the more complex filters (second two from the right). Given their stated equivalence in some applications and similar accuracy, this is not surprising. The PCANet and SOMNet filters also share some similarities; however, as both are data-driven, the filters appear more specific to the dataset. In addition, the complex features present in DCTNet and PCANet are absent in SOMNet, which could explain why SOMNet requires a larger number of features to achieve similar accuracy. The MRF-SOMNet filters contain a lot more high frequency components, which are absent in the input space (handwritten digits), and this explains the poorer result.

Table 4.3: Error rate on MNIST

Method | Error rate (%)
CDBN [113] | 0.82
CSOM (linear SVM) [4] | 0.82
ConvNet [88] | 0.53
ScatNet-2 (RBF SVM) [18] | 0.43
MRF-CNN [145] | 0.38
PCANet-2 [20] | 0.66
PCANet-2 (mine) | 0.77
DCTNet | 0.74
DCTNet (TR Norm) | 0.76
DCTNetoverlap | 0.68
DCTNetoverlap (TR Norm) | 0.68
SOMNet | 0.86±0.05
SOMNet (fine-tune) | 0.83±0.07
SOMNet16−32 | 0.65±0.02
SOMNetrep | 1.01
MRF-SOMNet | 1.23

Figure 4.3: Learned SOMNet filters on MNIST. Top row: the first layer filter bank. Bottom row: the second layer filter bank.

4.4.3 Optimising Parameters on the CIFAR-10 Dataset

To determine the optimal block size and overlapping ratio, experiments were conducted using SOM features and SOMNet encoding on the first 10000 examples of CIFAR-10. In the experiments conducted here, a double convolution layer architecture was used, where H1, H2, s1 and s2 were kept constant. The SOMNet filters were trained once and then evaluated using different block sizes Bsize and overlapping ratios τ for a fair comparison. Whilst not statistically tested, Table 4.4 indicates that a block size of 8×8 and an overlapping ratio of 0.5 are optimal. This is aligned with the parameters used by Chan et al. [20].

In order to further optimise the CIFAR-10 parameters, experiments were also carried out using different numbers of filters in both the first and second layers of a double convolutional architecture. As with the previous experiments, the first 10000 examples of CIFAR-10 were used, but here the block size and overlapping ratio were kept constant at 8×8 and 0.5, respectively. Each result is for a single run only. Whilst not statistically tested, the preliminary results shown in Table 4.5 indicate that an increased number of total features is optimal, as shown by the corresponding output feature vector size Ξ for each combination of H1 and H2. This is aligned with the results for MNIST (0.83% vs 0.65% for SOMNet and SOMNet16−32 respectively). Furthermore, of the feature configurations tested, an architecture of H1 = 40 and H2 = 32 resulted in the highest accuracy.

Figure 4.4: Learned PCANet filters on MNIST. Top row: the first layer filter bank. Bottom row: the second layer filter bank.

Figure 4.5: Generated DCTNet filters for MNIST. Replicated across both filter banks.

Figure 4.6: Clustered MRF filters for MNIST. Replicated across both filter banks.

Table 4.4: Variations in block size and overlapping ratio of SOMNet on CIFAR-10

Block size (Bsize) | Overlapping ratio (τ) | Accuracy (%)
2 × 2 | 0 | 61.54
4 × 4 | 0 | 65.51
8 × 8 | 0 | 66.19
16 × 16 | 0 | 65.01
8 × 8 | 0.25 | 67.51
8 × 8 | 0.5 | 68.60
8 × 8 | 0.75 | 68.53

Table 4.5: Variations in feature numbers of SOMNet on CIFAR-10

Number of filters (H1) | Number of filters (H2) | Output size (Ξ) | Accuracy (%)
40 | 16 | 40960 | 71.59
80 | 8 | 40960 | 71.24
40 | 32 | 81920 | 72.83
80 | 16 | 81920 | 72.43
160 | 8 | 81920 | 72.40

4.4.4 Evaluation on the CIFAR-10 Dataset

As with the previous MNIST experiment, a double convolutional architecture was used. Like PCANet, the SOMNet parameters were set to s1 = s2 = 5, Bsize = 8, and τ = 0.5. However, unlike PCANet, the reduced output feature vector size was taken advantage of to increase the filter numbers to H1 = 40 and H2 = 32. Since SOM is incapable of dealing with correlations, before training the data was first whitened so that it was uncorrelated, exaggerating the high frequency content. The SOMNet layers were then trained using the whitened data. The first layer SOM was trained for 600000 iterations, using an initial LR and σ of 0.1 and 20 respectively, which were annealed using LRb = 50000 and σb = 5263. The second layer SOM was trained for 1200000 iterations using LR = 0.1, σ = 16, LRb = 100000 and σb = 13333. Max-pooling was applied over 4 × 4 sub-regions so that the maximum response in each histogram bin was pooled, producing 16 pooled histograms. PCANet attaches spatial pyramid pooling (SPP) [63] to the output to leverage responses in 4 × 4, 2 × 2, and 1 × 1 sub-regions; however, this was found to be unnecessary here. Once again, due to the different local minima found during training, the results for SOMNet were averaged over three runs, and the SVM was fine-tuned using five fold cross-validation.

The published PCANet result, the fine-tuned SOMNet result, and the results for other state-of-the-art methods are presented in Table 4.6. SOMNet achieves an accuracy of 78.51%, which is 1.34% higher than PCANet. In addition, PCANet performs an extra dimensionality reduction step in producing this result, the effects of which are not examined in this work. In comparison to other state-of-the-art approaches, SOMNet is comparable to other unsupervised approaches such as k-means (which uses 4000 features) but is slightly worse than CUNet (which uses a similar pipeline to PCANet but employs k-means based filters and additional pooling). The other approaches all achieve significantly superior results; however, they all employ supervised training of features.

Table 4.6: Accuracy on CIFAR-10

Method | Accuracy (%)
k-means (Triangle, 4000 features) [27] | 79.60
CUNet + Weighted Pooling | 80.31
Stochastic Pooling ConvNet [195] | 84.87
NIN + Dropout [116] | 89.59
MRF-CNN [145] | 91.25
PCANet-2 [20] | 77.14
SOMNet (fine-tune) | 78.51±0.06

Figure 4.7: Learned SOMNet filters on CIFAR-10. Top five rows: the first layer filter bank. Bottom four rows: the second layer filter bank.

Fig. 4.7 and Fig. 4.8 show the filter banks for SOMNet and PCANet, respectively. Although the PCANet filters are data-driven, they appear quite generic, which could be attributed to the high variance assumption of PCA. On the other hand, the SOMNet filters appear more similar to the whitened input data, which could explain their superior performance on this more complex task, in comparison to the simpler features learned on MNIST.

Figure 4.8: Learned PCANet filters on CIFAR-10. Top five rows: the first layer filter bank. Bottom row: the second layer filter bank.

4.5 Conclusions and Future Work

In this chapter, a simple SOM based unsupervised deep learning framework was proposed. Like its predecessors, PCANet and DCTNet, it offers an uncomplicated alternative to more traditional supervised convolutional deep learning structures. Yet, it differs from the aforementioned approaches by learning features which are not constrained by limitations such as filter size or orthogonality. In addition, SOMNet introduced an alternative binarisation technique, which allowed for the introduction of more filters to produce a more over-complete representation, providing comparable or improved performance. Furthermore, the use of replicated features across both filter banks was explored in SOMNetrep, which demonstrated the importance of learning second layer filters. In summary, SOMNet demonstrated its ability to extract useful information and give competitive results against other similar approaches using few unsupervisedly trained features on both the MNIST and CIFAR-10 datasets.

Whilst in this chapter the parameters have been kept mostly in line with those used for PCANet and DCTNet, in the future, experiments with different filter sizes, as well as incorporating further layers to produce more complex architectures, should be explored. Given more filters of different sizes, and the increased network depth which this method allows for, it is conceivable that SOMNet could improve even further on its current performance. Furthermore, given the advantages of learning second layer filters and the strong performance of the generated DCT filters, it could be worthwhile investigating the effects of learning second layer filters from first layer activations for DCT, as well as MRF, both of which were used in a replicated fashion in this work. Although MRF-SOMNet performed poorly on MNIST, it was not examined on CIFAR-10, where it could demonstrate better performance due to the more complex nature of the input data. In addition, the binarisation technique could be explored in more detail, along with alternative encoding strategies. More generally, an evaluation on further datasets could be undertaken. For further details please see Section 7.2.1.

Lastly, the use of single channel filters, as explored here via the sampling of previous layer activations, could be explored in a supervised context. Currently, in supervised learning, the depth of the filters is typically equal to the number of features in the preceding layer. Thus, it could be interesting to pursue this sampling technique as a way of reducing the number of parameters required for a supervised CNN, and thus assist with overfitting.

Chapter 5

SOMNet with Aggregated Channel Connections

5.1 Introduction

The classification of images is a very challenging task that has received considerable attention over the last few years, due to its many potential applications. Good feature representations must discriminate between independent classes whilst being robust to many intra-class variations. Approaches to this problem generally fall into two categories. Traditionally, hand-crafted methods were prevalent, with models such as SIFT [121] and HOG [38] performing well. However, these methods can present problems when they are applied to other tasks, and therefore each new problem may involve the design of new methodologies. Recently, using deep models to learn representations from both labelled and unlabelled data has made much progress, with models such as the supervised convolutional neural network (CNN) [111] and its many variants achieving the current state-of-the-art on a number of tasks [25, 101]. Key to the success of deep models is their inherent ability to learn higher level data-specific features, in contrast to the low-level features of hand-crafted techniques. However, whilst successful, deep learning models can be difficult to train, as they are burdened by many parameters which can require expert fine-tuning. These parameters are in addition to the architecture parameters, such as receptive field size, feature number and depth. Furthermore, supervised approaches require vast labelled datasets, which present problems with annotations.

Recently, PCANet [20] demonstrated a simple approach to deep learning that used a limited filter bank and could perform as well as some deep learning models, whilst requiring less configuration and parametrisation. SOMNet [68] and CUNet [42] used the self-organising map (SOM) and k-means algorithms, respectively, to train filter bank 'dictionaries' and adopt similarly efficient pipelines to that of PCANet. In addition, SOMNet introduced an alternative encoding strategy which enabled the use of additional filters for a more over-complete solution, and improved performance. These methods demonstrated that unsupervised approaches could compete with more complex supervised methods. DCTNet [134] showed that generated features could also perform well. However, these methods learn similar features replicated over multiple levels, which visually do not always concur with higher level representations of the input. This is because the second layer of these methods does not learn from combinations of first layer features. In comparison, common deep learning approaches build feature hierarchies over many layers, with features from previous layers combining to form more complex features in subsequent ones. Early CNNs used a parsimonious approach when establishing the connections between layers [112]; however, in recent years it has become standard to use full connections [101, 162]. Yet, this requires that the features in a given layer have a depth equal to the number of features in the preceding layers, potentially performing unnecessary operations. In addition, it has been demonstrated that unsupervisedly trained features perform poorly in later layers when using full connections [29].

Recently, in supervised learning, techniques such as DropOut [167] have demonstrated that randomly dropping neurons from a network can actually be beneficial, forcing the network to learn from different combinations of neurons, which results in features that are more robust to noise. Other recent advances have been proposed which combine features more explicitly. Network in Network, or 1 × 1 convolutions [116], were proposed between layers, for which only a linear weighting is learned; this forces the network to learn combinations of features instead of offsets, and can also be used to reduce the dimensionality of the features, reducing redundancy. Maxout [61] also functions in a similar way, combining multiple feature maps through channel pooling. Given the problems with unsupervised approaches learning full connections, and the aforementioned problems with higher level feature learning in PCANet and SOMNet style networks, it is justifiable to investigate the application of reduced connection schemes for unsupervised multi-layer architectures.

Inspired by this, a novel approach is proposed in this chapter, which selects local receptive fields in order to learn data-dependent higher order features from more primitive features in lower layers. To this end, a two layer SOMNet architecture is utilised and a new layer is introduced, which is placed between the two layers of the SOMNet. This layer combines the activations of the previous layer, which encourages SOMNet to learn hierarchical features without increasing their dimensionality. This approach demonstrates competitive results on the handwritten digit recognition task and only uses the simple competitive SOM algorithm throughout. In contrast to the often complex deep learning models, this approach to feature learning avoids tuning millions of parameters for an elegant and time efficient solution, capable of handling vast amounts of data without the need for annotations.

5.2 Related Work

Vector quantisation (VQ) has been used extensively for computer vision tasks [34, 107]. Algorithms such as SOM and k-means are used to learn a visual dictionary of low level features (feature learning), which are then used to map or encode the input into higher level image representations (feature extraction). k-means has proved to be an efficient and competitive algorithm in recent literature when combined with appropriate data pre-processing and encoding [28]. In [27], a single layer network demonstrated competitive performance on a number of datasets, belying its simplicity. The self-organising map [98, 192] can be considered a topology-preserving alternative to k-means. It uses competitive learning to quantise an input distribution while maintaining a topographic structure. This neighbourhood preservation makes SOM more immune to initialisation and outliers than k-means. Feature learning with SOM has been explored for face [4] and handwritten digit [4, 30] recognition tasks. Chan et al. [20] proposed a simple framework named PCANet that used a novel approach to deep learning characterised by minimal filters and less parametrisation. Filter banks for each layer were trained by applying PCA to the input. Once learned, the filter banks were used in a traditional feedforward convolution architecture, as is commonplace in deep learning structures. The authors used a binarisation technique called Binarised Statistical Image Features (BSIF), as introduced by Kannala and Rahtu [93]. Once binarised, the activations were split into sub-regions and local histograms were formed, which were concatenated to form the final feature vector. SOMNet [68] proposed the use of the self-organising map algorithm to learn filter banks, which alleviated the constraints imposed by the use of PCA, namely orthogonality and the number of filters, which are limited by the filter size due to the covariance matrix. Unburdened by these limitations, SOMNet also adapted the binarisation process to enable the use of further filters. CUNet [42] used k-means to learn the filter banks and introduced a new weighted pooling method to further improve classification accuracy. DCTNet [134] used generated discrete cosine transform basis functions as filters.

Feature learning has always garnered a lot of attention. Yet, in recent years work has also focused on the importance of connections between layers. 1 × 1 convolutions provide parametric pooling of feature maps via a weighted summation, enabling the learning of cross-channel interactions [116]. The maxout activation function [61] groups and pools across feature maps, enabling the learning of more complex activation functions. A maxout unit can learn a piecewise linear approximation to a convex function for which the number of pieces is equal to the number of feature maps within each group [61]. There have also been many recent attempts to group features in unsupervised learning. In [35] and [29] it was shown that grouping features randomly or by similarity can aid performance. Dundar et al. [45] suggested learning 1 × 1 convolutions across channels between unsupervised layers using gradient descent. However, they proposed grouping the features randomly first, allowing the network to learn the best combinations. In addition, Coates and Ng [28] showed that the choice of encoding scheme for an unsupervised single layer network was actually more important than the choice of dictionary learning method.

5.3 Methodology

5.3.1 Proposed Method

The proposed method consists of a two-layer SOMNet (Chapter 4) with additional layers as described below. SOMNet (Fig. 4.1) is considered to have full utilisation between the first and second layers, since during second layer feature learning, activations from the first are sampled uniformly. Any of the new layers described below can be placed between the two layers of SOMNet, as shown in Fig. 5.1. These additional layers are used to encourage the learning of more complex features in the second SOMNet layer by promoting only the most competitive channels for a given spatial location. All layers reduce the dimensionality of the activations from the first layer of SOMNet through the use of pooling operators (Section 3.3.1.2), whilst hopefully increasing their complexity. Training involves three steps: feature learning, feature extraction and classification using a linear SVM [48]. Different pooling operators, such as mean and maximum, are investigated. Prior to classification, the output feature vectors are normalised to unit length.

5.3.1.1 Fully Aggregated Connections (FAC) Layer

This is a naive approach which simply pools the activations along the activation dimension to produce a single activation, from which a patch is sampled to learn the second layer filters. Specifically, for a set of activations from the first layer $A = [O^1, \ldots, O^{H_1}]$, where $O^i \in \mathbb{R}^{h \times w}$, pooling is performed at each spatial location $(i, j)$ with $i \in \{1, 2, \ldots, h\}$ and $j \in \{1, 2, \ldots, w\}$. Therefore, only the highest activated channel at each location $(i, j)$ is retained for second layer feature learning.
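As a sketch of this aggregation, assuming the activations for one input are stacked as an (H1, h, w) array, the FAC operation reduces to a pooling along the channel axis; both the max and mean variants investigated in this chapter are shown. The function name and signature are illustrative only and are not from the thesis.

```python
import numpy as np

def fac_layer(activations, pooling="max"):
    """Aggregate all first-layer activation maps into a single map.

    activations: (H1, h, w) array A = [O^1, ..., O^H1].
    Returns an (h, w) map; with max pooling only the highest-responding
    channel at each spatial location (i, j) survives.
    """
    if pooling == "max":
        return activations.max(axis=0)
    return activations.mean(axis=0)
```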

5.3.1.2 Sparse Aggregated Connections (SAC) Layer

This layer selects a subset of activations to be pooled by considering the correlations between each filter and a particular input. Specifically, for a given input, multiple s × s × d patches are sampled from a 2D Gaussian distribution and the winning filter is selected as the one with the highest activation. Majority voting is performed over the patches to select the top x filters, where the notation SACx is used. These filters are used to obtain activations over the whole input, which are then pooled as in Section 5.3.1.1.
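The sketch below illustrates the SAC idea on precomputed activation maps: patch locations are drawn from a 2D Gaussian, each sample votes for the filter with the highest response at that location, and only the majority-voted top-x channels are pooled. The Gaussian spread, the use of activation values at the sampled locations, and the function name are assumptions made to keep the sketch short.

```python
import numpy as np

def sac_layer(activations, top_x=4, n_samples=64, rng=None):
    """Sparse Aggregated Connections: pool only the top-x most competitive channels.

    activations: (H1, h, w) first-layer activation maps for one input.
    Patch locations are sampled from a 2D Gaussian centred on the image; the
    spread (one sixth of each dimension) is an assumption for illustration.
    """
    rng = np.random.default_rng() if rng is None else rng
    H1, h, w = activations.shape
    votes = np.zeros(H1, dtype=int)
    for _ in range(n_samples):
        cy = int(np.clip(rng.normal(h / 2, h / 6), 0, h - 1))
        cx = int(np.clip(rng.normal(w / 2, w / 6), 0, w - 1))
        # The winning filter for this location is the one with the highest activation.
        votes[np.argmax(activations[:, cy, cx])] += 1
    selected = np.argsort(votes)[-top_x:]            # majority-voted top-x channels
    return activations[selected].max(axis=0)         # pooled as in Section 5.3.1.1
```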


Figure 5.1: Application of proposed aggregation layers to a two layer SOMNet. The SOM-based filter banks correspond to the convolutional layers in Fig. 4.1. Each SOM layer is trained and frozen prior to training any subsequent SOM layer [69].

5.3.1.3 Grouped Aggregated Connections (GAC) Layer

This layer selects subsets of activations to be pooled by grouping them according to their respective filter position on the map. Since SOM clusters the filters whilst maintaining topology, neighbourhood structure is implicit in the groupings. Specifically, for a given input, the set of activations are grouped into non-overlapping subsets of twos and fours (named GAC2 and GAC4 respectively) and then pooled. As with the original SOMNet, a patch is sampled from the resultant pooled activations uniformly.
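Because the 1D SOM keeps topologically similar filters adjacent, the grouping can be expressed as a simple reshape followed by pooling within each group. The sketch below is illustrative only; the function name is an assumption.

```python
import numpy as np

def gac_layer(activations, group_size=2, pooling="max"):
    """Grouped Aggregated Connections: pool non-overlapping groups of adjacent maps.

    activations: (H1, h, w) maps ordered by their filter's position on the 1D SOM,
    so neighbouring channels correspond to topologically neighbouring filters.
    Returns (H1 // group_size, h, w) pooled maps (GAC2 or GAC4).
    """
    H1, h, w = activations.shape
    grouped = activations[: (H1 // group_size) * group_size]
    grouped = grouped.reshape(H1 // group_size, group_size, h, w)
    return grouped.max(axis=1) if pooling == "max" else grouped.mean(axis=1)
```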

5.4 Experiment and Discussion

5.4.1 Whitening

For all of the following experiments, whitening was applied to the training set prior to training the SOMNet filters. This de-correlates the input such that the input features have a covariance matrix equal to the identity, and therefore equal variance, enabling the SOM to learn more discriminative features. Whitening was carried out using equation 3.27. Whitening was not used during feature extraction.
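As a reference point, the sketch below is a common ZCA-style whitening of flattened training patches. It may not match equation 3.27 exactly; the regularisation constant and whitening over the whole set of sampled patches (rather than per image) are assumptions.

```python
import numpy as np

def zca_whiten(patches, eps=1e-5):
    """ZCA-whiten a matrix of flattened training patches (N x D).

    After whitening, the features are de-correlated with approximately unit
    variance. Returns the whitened patches plus the mean and transform so the
    same mapping can be reused when whitening new data.
    """
    mean = patches.mean(axis=0)
    centred = patches - mean
    cov = centred.T @ centred / len(centred)
    eigvals, eigvecs = np.linalg.eigh(cov)
    zca = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T
    return centred @ zca, mean, zca
```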

5.4.2 Digit Classification on the MNIST Dataset

Firstly, experiments were conducted using a SOMNet with 8 and 32 filters of size 7 × 7 in the first and second layer, respectively. The remaining parameters were set as Bsize = 7 and τ = 0.5. In later experiments, 16 and 32 filters are used, as in Chapter 4. For SOMNet8−32, the first layer SOM was trained for 300000 iterations, using an initial LR and σ of 0.1 and 4, respectively, which were annealed using LRb = 50000 and σb = 33333. The second layer SOM was trained for 600000 iterations using LR = 0.1, σ = 16, LRb = 100000 and σb = 13333. For SOMNet16−32, the first layer SOM was trained for 300000 iterations, using an initial LR and σ of 0.1 and 8, respectively, which were annealed using LRb = 50000 and σb = 14286. Since the second layer is the same size as SOMNet8−32, the same SOM settings apply. In addition, spatial max pooling was performed over 4 × 4 sub-regions of the output. It is noted that this step improves performance and reduces the size of the final output feature vector. Max pooling was selected as it retains the most competitive features and has been shown to work effectively with linear SVM [190]. The results reported are the average of three runs, and SVM parameters were fine-tuned using five-fold cross-validation for all results.

5.4.2.1 FAC Layer

Firstly, the FAC method was tested. In order to examine whether the proposed method should be employed during both feature learning and extraction, or just feature learning, both schemes were tested. Both average and max pooling strategies were also explored. Table 5.1 shows the results. FE indicates that the proposed layer was used during feature extraction in addition to the feature learning stage.

The results for SOMNet8−32 show that the error rate significantly increased when the FAC layer was used during both feature learning and feature extraction. Focussing on the results for which the FAC layer was only used during feature learning, both results show significant decreases in the error rate. With regards to the pooling strategy, more improvement was observed when max pooling was used, compared with mean pooling (significant at 1% versus 10%, respectively). As previously mentioned, the improvements seen with the addition of the FAC layer could be attributed to a better selection of first layer filters via the channel pooling.

When the same experiments are performed using SOMNet16−32, there is no statistical difference when compared with the baseline. This may be due to the increased number of filters that need to be combined, resulting in the aggregation of too much information into a single channel. Figures 5.2 and 5.3 show the second layer filters for both the SOMNet8−32 and SOMNet16−32 architectures with and without FAC. Both sets of filters appear more complex, with the proposed method producing notably more high frequency filters of different orientations. However, the filters for SOMNet16−32 appear noisier than those for the smaller architecture, perhaps confirming that too much information is being aggregated from the first layer.

When the proposed layer is used during feature extraction, as well as feature learning, the size of the final output feature vector is reduced in proportion to the number of feature maps that are pooled. In this case, all eight feature maps in the first layer are pooled and therefore the final output feature vector is eight times smaller. In order to conduct a fairer comparison, an additional experiment with a SOMNet8−256 architecture and max pooling was conducted (all other settings remained the same); however, an error rate of 0.72±0.09 was achieved, indicating no improvement over the baseline. Since the results indicate that it is best to employ FAC during feature learning only and to utilise max pooling, all the remaining experiments in the chapter use this methodology.

Table 5.1: FAC layer: error rate on MNIST (p-values represent the comparison between each method with the method labelled not applicable (N/A) for each section).

Method | Pooling | FE | Error rate (%) | p-value
SOMNet8−32 | - | - | 0.69±0.02 | N/A
SOMNet8−32 + FAC | Max | No | 0.57±0.01 | 0.0007
SOMNet8−32 + FAC | Max | Yes | 1.33±0.04 | 0.0001
SOMNet8−32 + FAC | Mean | No | 0.64±0.03 | 0.0742
SOMNet8−32 + FAC | Mean | Yes | 1.36±0.22 | 0.0063
SOMNet16−32 | - | - | 0.56±0.07 | N/A
SOMNet16−32 + FAC | Max | No | 0.55±0.03 | 0.8312

Figure 5.2: Learned second layer SOMNet + FAC filters on MNIST. Top four rows: SOMNet8−32 architecture. Bottom four rows: SOMNet8−32 + FAC architecture.

Figure 5.3: Learned second layer SOMNet + FAC filters on MNIST. Top four rows: SOMNet16−32 architecture. Bottom four rows: SOMNet16−32 + FAC architecture.

5.4.2.2 SAC Layer

Table 5.2 shows the results for the SAC layer when employed for feature learning only and using max pooling. The results for SOMNet8−32 suggest that using an increasing number of filters correlates with an improvement in error rate. Specifically, using SAC1 showed no improvement over the baseline; however, using SAC2, SAC4 or SAC6 showed increasing improvement, with the latter two being statistically significant at the 1% level versus the baseline. The worst performances were observed when SAC1 was used, and this may be because, when too few filters are selected, too much information is lost. The improvements observed with more filters are perhaps again attributable to a better selection of filters from the first layer.

Small improvements were observed with the larger SOMNet16−32 architecture; however, the decrease in error rate was not statistically significant. Interestingly, the best performance for both SOMNet sizes was observed when a similar number of combined filters was used.

Observing the filters for both SOMNet8−32 and SOMNet16−32 (Figures 5.4 and 5.5, respectively) again clearly shows that more complex filters are being learned with the proposed method. In addition, for SOMNet16−32, the filters appear less noisy when compared to the FAC case (Figure 5.3), which could be explained by half the number of filters from the first layer being combined. Therefore, perhaps the minor improvements recorded for SOMNet16−32 + SAC are due to the fact that the additional filters in the first layer compensate for the reduced complexity witnessed in the baseline.

Table 5.2: SAC layer: error rate on MNIST (p-values represent the comparison between each method with the method labelled not applicable (N/A) for each section).

Method | Error rate (%) | p-value
SOMNet8−32 | 0.69±0.02 | N/A
SOMNet8−32 + SAC1 | 0.69±0.02 | 1.0000
SOMNet8−32 + SAC2 | 0.62±0.06 | 0.1277
SOMNet8−32 + SAC4 | 0.56±0.02 | 0.0013
SOMNet8−32 + SAC6 | 0.54±0.04 | 0.0044
SOMNet16−32 | 0.56±0.07 | N/A
SOMNet16−32 + SAC2 | 0.55±0.04 | 0.8404
SOMNet16−32 + SAC4 | 0.53±0.02 | 0.5148
SOMNet16−32 + SAC8 | 0.53±0.01 | 0.5032
SOMNet16−32 + SAC12 | 0.54±0.03 | 0.6728

5.4.2.3 GAC Layer

The results of incorporating the GAC layer are shown in Table 5.3. The GAC layer is implemented during feature learning only and with max pooling. The results for SOMNet8−32 demonstrate that using either GAC2 or GAC4 yields an improvement in the error rate versus the baseline; however, only GAC4 showed a statistically significant improvement, at the 5% level. Again, by selecting four rather than two filters, more information is combined to produce the output. The results for SOMNet16−32 do not show an improvement against the baseline, although the difference is not significant.

Figure 5.4: Learned second layer SOMNet + SAC filters on MNIST. Top four rows: SOMNet8−32 architecture. Bottom four rows: SOMNet8−32 + SAC6 architecture.

Figure 5.5: Learned second layer SOMNet + SAC filters on MNIST. Top four rows: SOMNet16−32 architecture. Bottom four rows: SOMNet16−32 + SAC8 architecture.

Table 5.3: GAC layer: error rate on MNIST (p-values represent the comparison between each method with the method labelled not applicable (N/A) for each section).

Method | Error rate (%) | p-value
SOMNet8−32 | 0.69±0.02 | N/A
SOMNet8−32 + GAC2 | 0.66±0.03 | 0.2230
SOMNet8−32 + GAC4 | 0.61±0.03 | 0.0184
SOMNet16−32 | 0.56±0.07 | N/A
SOMNet16−32 + GAC2 | 0.58±0.04 | 0.6896
SOMNet16−32 + GAC4 | 0.58±0.02 | 0.6590

Figure 5.6: Learned second layer SOMNet + GAC filters on MNIST. Top four rows: SOMNet8−32 architecture. Bottom four rows: SOMNet8−32 + GAC4 architecture.

Observing the filters for SOMNet8−32 and SOMNet16−32 (Figures 5.6 and 5.7, respectively), the filters do not appear more complex than the baseline, in contrast to both the FAC and SAC methods. This can perhaps be explained by the grouping mechanism: since groupings are based on the filter's position in the map, fewer variations are captured when the filters are combined.

5.4.2.4 Discussion and Comparison to State-of-the-Art

The best results from the three above methodologies were compared to other similar approaches and the current state-of-the-art in Table 5.4. The proposed methods performed as well as, or better than, other more complex methodologies, apart from ConvNet (trained supervisedly), ScatNet (which used an RBF SVM) and MRF-CNN (which used a deeper architecture with many more parameters). These results are promising considering only 40 filters were used. Indeed, the proposed approach produced improved results over the SOMNet8−32 baseline for all three configurations. The significance of the top two results, when FAC and SAC6 were used, is demonstrated to be at the 1% level. Out of the three methodologies used here, SAC6 produced the largest improvement in error rate of -0.15, with FAC resulting in an improvement of -0.12. This could be explained by the fact that SAC ignores less relevant features, whereas FAC combines all filters and therefore applies no discrimination in its selection. GAC produced the least improvement in error rate (-0.08), suggesting that grouping by map position may not aid performance. This is further illustrated by comparing GAC4 and SAC4, which group the same number of filters but resulted in improved error rates with SAC (0.56% versus 0.61% with GAC).

Figure 5.7: Learned second layer SOMNet + GAC filters on MNIST. Top four rows: SOMNet16−32 architecture. Bottom four rows: SOMNet16−32 + GAC4 architecture.

The same methodologies were used on a SOMNet16−32 architecture. These are also compared against the baseline results from the previous chapter and publication [68]; however, the main improvement for the baseline observed in this chapter can be attributed to the use of whitening and the fine-tuning of parameters (0.56% versus 0.65%). When the layer aggregation experiments were carried out on the 16-32 architecture, only minor improvements were noted; however, a similar trend to that observed with the 8-32 architecture (both FAC and SAC providing the most improvement) is apparent. Yet, no statistical significance can be attributed to these improvements.

Table 5.4: Error rate on MNIST (p-values represent the comparison between each method with the method labelled not applicable (N/A) for each section).

Method | Error rate (%) | p-value
CDBN [113] | 0.82 | -
CSOM (linear SVM) [4] | 0.82 | -
PCANet-2 [20] | 0.66 | -
ConvNet [88] | 0.53 | -
ScatNet-2 (RBF SVM) [18] | 0.43 | -
SOMNet16−32 [68] | 0.65±0.02 | -
MRF-CNN [145] | 0.38 | -
SOMNet8−32 | 0.69±0.02 | N/A
SOMNet8−32 + FAC | 0.57±0.01 | 0.0007
SOMNet8−32 + SAC6 | 0.54±0.04 | 0.0044
SOMNet8−32 + GAC4 | 0.61±0.03 | 0.0184
SOMNet16−32 | 0.56±0.07 | N/A
SOMNet16−32 + FAC | 0.55±0.03 | 0.8312
SOMNet16−32 + SAC8 | 0.53±0.01 | 0.5032
SOMNet16−32 + GAC4 | 0.58±0.02 | 0.6590

5.4.3 Object Classification on the CIFAR-10 Dataset

Through experimentation, the optimal parameters of SOMNet were found to be H1 = 40, H2 = 32, s1 = s2 = 5, Bsize = 8 and τ = 0.5. The first layer SOM was trained for 600000 iterations, using an initial LR and σ of 0.1 and 20, respectively, which were annealed using LRb = 50000 and σb = 5263, whereas the second layer SOM was trained for 1200000 iterations using LR = 0.1, σ = 16, LRb = 100000 and σb = 13333. The output was divided into 4 × 4 sub-regions to which max pooling was applied to give 16 pooled histograms. In order to simplify the spatial pooling of the block histogram bins, zero padding was added to the final activation maps to produce an even sized output. In order to speed up the experimentation process, experiments were first conducted using a subset of data to compare the application of the different layers and fine-tune parameters. Once complete, experiments were conducted on the full dataset. The subset was formed by uniformly sampling 1000 examples from each class. The results reported on the subset are the average of three runs, whereas for the full set the results reported are the average of four runs. SVM parameters were fine-tuned using five-fold cross-validation for all results.

5.4.3.1 FAC Layer

Table 5.5 shows the results for the FAC layer applied during feature extraction only and using max pooling. The results indicate that the FAC layer has a detrimental effect on performance, with accuracy decreasing by 1.06% against the baseline. These results are perhaps due to the input space being more complex and the increased number of filters compared to the MNIST experiments, resulting in a reduced ability to learn useful features in the second layer.

Observing the filters (Figure 5.8), it appears that many more low frequency filters are learned when compared with the baseline SOMNet40−32 architecture. This could be attributed to the high frequency content being lost due to too many filters being combined.

Table 5.5: FAC layer: accuracy on CIFAR-10 (the p-value represents the comparison between each method with the method labelled not applicable (N/A)).

Method | Accuracy (%) | p-value
SOMNet40−32 | 73.25±0.19 | N/A
SOMNet40−32 + FAC | 72.19±0.15 | 0.0016

Figure 5.8: Learned second layer SOMNet + FAC filters on CIFAR-10. Top four rows: SOMNet40−32 architecture. Bottom four rows: SOMNet40−32 + FAC architecture.

Table 5.6: SAC layer: accuracy on CIFAR-10 (p-values represent the comparison between each method with the method labelled not applicable (N/A)).

Method | Accuracy (%) | p-value
SOMNet40−32 | 73.25±0.19 | N/A
SOMNet40−32 + SAC2 | 73.21±0.35 | 0.8703
SOMNet40−32 + SAC5 | 72.81±0.06 | 0.0187
SOMNet40−32 + SAC10 | 72.51±0.24 | 0.0138
SOMNet40−32 + SAC20 | 72.38±0.18 | 0.0045
SOMNet40−32 + SAC30 | 72.13±0.19 | 0.0020

5.4.3.2 SAC Layer

Table 5.6 shows the results for implementing the SAC layer with max pooling during feature learning. Here, the opposite trend to the MNIST results was observed, with a negative correlation between accuracy and the number of features selected in the SAC layer. This could again be attributed to the increased complexity of the input space and the subsequent filters, leading to detrimental effects on the learning of second-layer features. Indeed, all the results showed decreased accuracy versus the baseline, apart from SAC2, where the decrease in performance was not statistically significant.

The filters (Figure 5.9) showed no significant differences compared with the baseline, although this is hard to interpret given their inherent complexity and the fact that each second-layer SOM only converges to a local minimum.

5.4.3.3 GAC Layer

The results for the GAC layer are shown in Table 5.7. The GAC layer is implemented during feature extraction only and used max pooling for channel aggregation. As before, better results were observed when fewer features were combined, with GAC2 exhibiting better performance at 73.33% compared with GAC4 at 72.98%. Whilst the GAC2 result was 0.08% above the baseline, the increase was not statistically significant.

Figure 5.9: Learned second layer SOMNet + SAC filters on CIFAR-10. Top four rows: SOMNet40−32 architecture. Bottom four rows: SOMNet40−32 + SAC2 architecture.

Table 5.7: GAC layer: accuracy on CIFAR-10 (p-values represent the comparison between each method with the method labelled not applicable (N/A)).

Method | Accuracy (%) | p-value
SOMNet40−32 | 73.25±0.19 | N/A
SOMNet40−32 + GAC2 | 73.33±0.29 | 0.7098
SOMNet40−32 + GAC4 | 72.98±0.31 | 0.2678

The filters for both the baseline SOMNet40−32 and SOMNet40−32 + GAC are shown in Figure 5.10. Again, it is hard to compare the filters given their complexity and the fact that the SOMs are only locally optimised. However, it does appear that there was slightly less redundancy with the proposed method, which could explain the minor improvement in accuracy.

5.4.3.4 Discussion and Comparison to State-of-the-Art

This section includes the results for the best methods on both the subset and the full set, and compares them to similar approaches and the current state-of-the-art. The results for SOMNet, as well as PCANet, CUNet and other state-of-the-art methods, are shown in Table 5.8. Results on the subset can be found in the second section of the table and the results on the full set in the final section.

The results indicate that the baseline SOMNet (Table 4.6) is competitive with other similar approaches such as PCANet and CUNet. In terms of other unsupervised approaches, such as k-means, RF Learning and NOMP-20, the SOMNet approach used here is competitive with k-means, but has much lower accuracy than NOMP-20 and RF Learning, although all these approaches use far larger dictionaries. The other superior results shown in the table use supervised learning approaches (Stochastic Pooling ConvNet, NIN + Dropout, MRF-CNN). In this chapter, the objective was to improve the accuracy of SOMNet through the addition of the aggregation layers; however, a non-significant improvement was shown only when the filters were grouped in twos on the subset using GAC2, and this was not seen when the experiment was replicated on the full set. This is in contrast to the MNIST results, which demonstrated improvements when more features were combined, for which FAC, and SAC with greater numbers of filters, were superior. This was possibly due to the CIFAR-10 input space being inherently more complex compared to MNIST, and therefore the combination of too many filters diminished the ability of the resultant filters learned in the second layer to discern useful features. Furthermore, since the input space was far larger, more independent features were learned, and therefore the influence of the neighbourhood on the map was reduced, resulting in more arbitrary GAC groupings. This could explain why the GAC result performed better with CIFAR-10 than with MNIST.

A qualitative analysis revealed that there was no real observable difference between the filters of the baseline and the proposed methodologies, apart from FAC, which produced the worst results. However, perhaps the improvement observed on the subset with GAC2 was due to reduced redundancy or noise in the second layer, induced by the competitive aggregation, which leads to improved generalisation of the filters. Yet, statistically, no strong conclusions can be made.

Figure 5.10: Learned second layer SOMNet + GAC filters on CIFAR-10. Top four rows: SOMNet40−32 architecture. Bottom four rows: SOMNet40−32 + GAC2 architecture.

Table 5.8: Accuracy on CIFAR-10 (p-values represent the comparison between each method with the method labelled not applicable (N/A) for each section).

Method | Accuracy (%) | p-value
Tiled CNN [135] | 73.10 | -
PCANet-2 [20] | 77.14 | -
K-means (Triangle, 4000 features) [27] | 79.60 | -
CUNet + Weighted Pooling [42] | 80.31 | -
RF Learning [29] | 82.0 | -
NOMP-20 [117] | 82.9 | -
Stochastic Pooling ConvNet [195] | 84.87 | -
NIN + Dropout [116] | 89.59 | -
MRF-CNN [145] | 91.25 | -
SOMNet40−32 | 73.25±0.19 | N/A
SOMNet40−32 + FAC | 72.19±0.15 | 0.0016
SOMNet40−32 + SAC2 | 73.21±0.35 | 0.8703
SOMNet40−32 + GAC2 | 73.33±0.29 | 0.7098
SOMNet40−32 | 78.51±0.06 | N/A
SOMNet40−32 + FAC | 77.56±0.12 | 0.0001
SOMNet40−32 + SAC2 | 78.29±0.28 | 0.1753
SOMNet40−32 + GAC2 | 78.13±0.35 | 0.0761

5.5 Conclusion and Future Work

This chapter presents a novel approach to representation learning. The proposed method can build representations based on combinations of previous layer features. This approach to selecting receptive fields requires no additional parameter learning over the previously proposed SOMNet, but can significantly increase the performance on the MNIST dataset. In general, the results here suggest that using techniques that combine more filters may be beneficial for simple datasets such as MNIST. However, the improvements on MNIST were not significant when tried on a larger architecture with additional first-layer filters, indicating that the choice of encoding may be less important than the number of filters employed. In addition, no significant improvements were observed for the more complex CIFAR-10 dataset. In fact, the results indicated that the opposite may be true, i.e. that combining fewer filters may be beneficial, although no strong conclusions can be made due to statistically non-significant results.

Whilst the results were inferior or similar compared with the baseline for CIFAR-10 (GAC2 producing the best results out of the methods tested), more research may provide promising results on this more complex dataset. Specifically, further methods for combining the filters, such as linear weightings, could be explored.

In addition, the possibility of this methodology being used alongside more traditional pooling layers, to reduce redundancy and dimensionality in the feature space, should be investigated. Specifically, experiments with these channel aggregation layers could be conducted using alternative learning strategies, such as the supervised CNN, to assess whether it complements standard spatial pooling. Whilst operators such as the maxout unit have already been proposed, the aggregation layers proposed in this chapter are sufficiently different to warrant further investigation. Furthermore, experiments on more data and further subsets could be investigated. For more details on future work, please see Section 7.2.2.

In general, this study gives further evidence that simple unsupervised algorithms, such as SOM, may have a place alongside more complex deep models. Although some of the results were inferior, this is still a worthwhile line of enquiry, either as a completely unsupervised endeavour, or coupled with supervised learning.

Chapter 6

Filter Replacement in Convolutional Networks using Self-Organising Maps

6.1 Introduction

Deep learning methodologies have made great strides in recent years [73, 82]; however, the main advancements come from supervised models trained on millions of labelled examples. The labelling of data is expensive in terms of time and cost, and the problem is even worse for video data, where each frame may require individual annotations: Section 3.5.1.6 showed that the inclusion of bounding box annotations increased accuracy over the baseline. The existence of sufficiently large labelled data for image classification [118, 153] has made the transfer of knowledge from one task to the next possible, where knowledge exists as human-annotated labels. In fact, it has become almost ubiquitous to pre-train models on ImageNet before fine-tuning on the target task [71, 161]. Video datasets of the same thoroughness are lacking, and thus video models have not seen the same level of progress as their image-based cousins.


Indeed, an initial attempt to recognise human actions using deep-learning saw minimal improvement compared with hand-designed models [94]. Furthermore, training com- plex deep models from scratch can lead to disappointing results [94, 174]. In recent years, the use of pre-training has started to be used in video classification, enabling improvements in accuracy [173]. However, some of the larger gains have been fa- cilitated by a combination of pre-training alongside the availability of a large-scale video dataset [95]. In addition to the data hungry nature of supervised methodologies, they also have well documented problems with overfitting the training data [167,181]. Whilst certain regularisation techniques have been introduced to counteract this phe- nomenon, such as DropOut [78] (Section 3.3.2.2), augmentation [101] (Section 3.3.4), dimensionality reduction and weight decay (Section 3.3.3), the problem still persists and it is often necessary to combine multiple different techniques. Unlike supervised learning which requires both an input and its respective label, unsupervised learning can be trained on just the input. Specifically, supervised learn- ing methodologies model p(y|x) directly. In contrast, unsupervised techniques learn to model only the input p(x) and therefore do not require labelled data. Under the as- sumption that the input will contain information in order to differentiate between cate- gories p(y|x); a model that is capable of reconstructing an input should also be able to tell the difference between examples of different categories. However, this constraint does limit the degrees of freedom when seeking a solution. Yet, these constraints can prevent the model from learning noise and prevent overfitting. For these reasons, the unsupervised learning paradigm is believed to offer the most attractive solution. The combination of unsupervised and supervised methodologies is not new. Un- supervised pre-training has been used previously, however, it is generally used as an initialisation point. Models such as the deep belief networks [12,76] and autoencoders [77] and their convolutional variants have been used for unsupervised pre-training, but CHAPTER 6. FILTER REPLACEMENT IN CONVOLUTIONAL NETWORKS153 these are complex and time consuming to train. Additionally, the assumption that un- supervised features are useful to discriminate abstract classes has been weakened by the emergence of transfer learning [172], for which the transferred knowledge is usu- ally discriminative. Thus, these methods have fallen out of favour, replaced mainly by a resurgence in the interest in supervised convolutional neural networks [101]. This chapter concerns itself with the combination of unsupervised and supervised methodologies in the application of image and video classification. This is being ex- plored in the knowledge that a combination could help alleviate the need for vast labelled data whilst still achieving comparable results and improve generalisation. Specifically, it will investigate the combination of the convolutional neural network (CNN) and the self organising map (SOM). Whilst unsupervised pre-training has al- ready been explored, to my knowledge it has not been done with SOM. In fact, SOM is generally ignored by the deep-learning community. In addition, most previous ap- proaches use unsupervised pre-training as an initialisation point only [12, 76, 77] and subsequently fine tune, unlike the work proposed here which uses fixed unsupervisedly- trained filter banks. 
In contrast to other, more complex unsupervised pre-training techniques, SOM is very efficient. In addition, there is a definite need for alternative training methodologies which can make use of the abundance of unlabelled data. While much current work in deep learning is focused on incremental gains using supervised learning, more focus on the unsupervised paradigm could prove more fruitful in the longer term. Furthermore, is it practical to rely on the time-consuming manual curation of data for future advancements in data understanding?

6.2 Related Work

Over the last few years, deep learning has made tremendous strides in visual data classification [25, 73, 88, 116, 195]. Deep learners use multiple levels of neurons to learn local hierarchical features at multiple levels of abstraction and increasing complexity. Much of this success can be attributed to the progress of supervised convolutional neural networks [101] and their ability to leverage knowledge from large-scale curated datasets, setting major new benchmarks for computer vision tasks. Image classification in particular has reaped the benefits of large and well-labelled datasets such as ImageNet [101, 153] and Places [198], leading to more advanced architectures. AlexNet [101] used non-saturating ReLU non-linearities in order to increase depth and take advantage of the numerous labelled examples provided, leading to dramatically improved error rates over previous hand-crafted solutions. The existence of such large labelled datasets has also led to a resurgence in transfer learning [172] in recent years, where knowledge is transferred from one task to the next. In fact, it has become near ubiquitous to use pre-trained ImageNet weights for many computer vision tasks, from image classification on small datasets [99] and action classification [161] to detection and segmentation [71].

However, video classification has not seen the same level of success, with improvements being more minimal. Training deep spatio-temporal models of sufficient size for video classification is even more reliant on the availability of large labelled datasets, given the considerably larger parameter space. Unfortunately, until recently, most such datasets have been well labelled but small in size [102, 149, 165]. For video, the curation of large-scale datasets is even more challenging, since manual annotation takes far longer for video than for images, resulting in a few large but weakly labelled datasets [1, 94]. Whilst the existence of such datasets has improved the learning of deep spatio-temporal features over baselines trained from scratch, it has not always demonstrated improved performance when compared to some hand-crafted or even single-frame solutions. Karpathy et al. [94] only achieved marginal improvements over a single-frame baseline. Tran et al. [173] only improved on a two-stream RGB and optical flow based approach by combining deep and hand-crafted features. In fact, smaller action datasets are still dominated by hand-crafted solutions [149, 182]. Recently a new large-scale video dataset, Kinetics [95], has been released (600K videos and 600 categories), which has led to more significant improvements in accuracy when used for pre-training [19].

To make the job easier, some approaches rely on knowledge transferred from 2D, in terms of data or model design. Simonyan and Zisserman [161] proposed a two-stream deep network which used 2D convolutions on individual-frame RGB and optical flow inputs. The spatial RGB stream was initialised using ImageNet and fine-tuned on the target action dataset. More recent work by Carreira et al. [19] also initialised a flow stream from ImageNet weights. Karpathy et al. [94] used a large video dataset to train deep networks for action recognition; however, most variants (except one) used 2D convolutions instead of 3D. Carreira et al.
[19] extended pre-trained 2D kernels to 3D by simply replicating the spatial kernel along the temporal dimension and fine-tuning. However, networks of this type constrain 3D models to be identical to their 2D counterparts, which may not be optimal.

Unsupervised pre-training has also been used to take advantage of the abundance of unlabelled or weakly labelled data. Whilst unsupervised pre-training is not new, the majority of existing approaches use the pre-trained state as an initialisation point in order to position the parameters close to a region of local minima [12]. The pre-trained parameters are then further optimised using supervised backpropagation. Much work in this area has been applied to 2D datasets [47, 77, 141, 148]. However, recently transfer learning has become the most popular approach. There has also been previous work using unsupervised learning for video representation. Le et al. [109] learned unsupervised spatio-temporal features using cascaded independent subspace analysis (ISA), but the results were only slightly better than hand-crafted solutions. Other work in this area has produced results which can be inferior to supervised features [130].

The work in this chapter attempts to replace filters with unsupervisedly trained alternatives. Specifically, this chapter explores the use of self-organising maps as an effective and efficient means of replacing the first layer filters of a CNN. Some of this work has already been explored by Peng, Hankins and Yin [144]; however, this chapter explores the application of self-organising maps as replacement filter banks in more detail and applies the methodology to further datasets, including both object and action classification.

6.3 Methodology

6.3.1 Proposed Method

The experiments in this chapter are concerned with the replacement of CNN filters via SOM. Specifically, the proposed methodology is to replace the first layer filters of a standard CNN with SOM-derived filter banks, which are subsequently fixed during training. Experiments with both 2D and 3D CNNs are performed. For the 2D CNN, experiments are performed on two challenging image datasets, namely CIFAR-10 and CIFAR-100 (Section 2.3.1). Both CIFAR-10 and CIFAR-100 have 50000 and 10000 examples for training and testing, respectively. For the 3D CNN experiments, UCF-50 (Section 2.3.2.3) is used. All results reported are the average accuracy on the test set for the last five epochs of training over three independent runs of the CNN. Each SOM was trained until convergence. For simplicity, the notation used throughout will be the same. The size of the data inputs is referred to as d × td × h × w, where d is the number of channels, td is the number of frames, and h and w are the height and width of the frame, respectively. For 2D inputs the number of frames is 1 and therefore the size reduces to d × h × w. With regard to the filters, the size is d × td × s × s, where td is the temporal depth and s is the spatial size of the filter. For 2D filters the temporal depth is 1 and therefore the size of the filters is just s × s. All CNN models are trained using Theano/Lasagne and Nvidia GTX 980, GTX Titan X and Titan V GPUs. SOMs are trained using either MATLAB or Theano/Lasagne.
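To make the pipeline concrete, the following minimal Python/NumPy sketch shows how patches might be gathered from unlabelled images and how the prototypes of a trained SOM could be reshaped into a first layer filter bank; the helper names (extract_patches, build_filter_bank) are illustrative and are not the code used for the experiments.

import numpy as np

def extract_patches(images, s, n_patches):
    # Randomly crop n_patches patches of size d x s x s from images of shape (N, d, h, w).
    d, h, w = images.shape[1:]
    patches = np.empty((n_patches, d, s, s), dtype=images.dtype)
    for i in range(n_patches):
        img = images[np.random.randint(len(images))]
        y, x = np.random.randint(h - s + 1), np.random.randint(w - s + 1)
        patches[i] = img[:, y:y + s, x:x + s]
    return patches

def build_filter_bank(som_prototypes, d, s):
    # Reshape trained SOM prototypes (m x d*s*s) into a convolutional filter bank (m x d x s x s).
    return som_prototypes.reshape(-1, d, s, s)

The filter bank produced in this way is what replaces the first convolutional layer of the CNN and is kept fixed while the remaining layers are trained on the labelled data.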

6.3.2 Self-Organising Maps

Two varieties of SOM are experimented with in this chapter. The first variety, referred to simply as SOM, is the convolutional SOM used previously in Chapters 4 and 5. The second variety, named SOMTI, implements translation-invariant convolutional feature learning using the following methodology (similar to [45]). Each neuron of the SOMTI is convolved with an input whose dimensions are double the size of its nodes. Specifically, given a set of patches Υ = [υ1, υ2, ..., υn] ∈ R^(d(2s)² × n) and a set of nodes W = [w1, w2, ..., wm] ∈ R^(ds² × m), at each time step convolution is performed between each SOM node wi and a given patch υζ. The highest activated neuron wi and the corresponding d × s × s subpatch of υζ, ξ, are then used to update the SOMTI using equation 4.2, where the best matching unit bmu(t) = wi and the input x(t) = ξ. The SOM parameters LR and σ are annealed using equations 4.3 and 4.4 from Section 4.3.1.

For SOMs trained on 3D video data, each d × td × s × s input is simply collapsed such that wi ∈ R^(d td s²).
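A rough NumPy sketch of a single SOMTI update is given below. It treats the convolution between a node and the double-sized patch as exhaustive valid cross-correlation, and it assumes that equation 4.2 is the standard Kohonen update with a Gaussian neighbourhood on the map grid; the function name and argument layout are illustrative only.

import numpy as np

def somti_step(nodes, patch, lr, sigma, grid):
    # nodes: (m, d, s, s) SOM prototypes; patch: (d, 2s, 2s) input patch;
    # grid: (m, 2) map coordinates of each node.
    m, d, s, _ = nodes.shape
    best_resp, bmu, best_sub = -np.inf, 0, None
    for i in range(m):
        for y in range(s + 1):           # (2s - s + 1) valid vertical offsets
            for x in range(s + 1):       # (2s - s + 1) valid horizontal offsets
                sub = patch[:, y:y + s, x:x + s]
                resp = np.sum(nodes[i] * sub)
                if resp > best_resp:
                    best_resp, bmu, best_sub = resp, i, sub
    # Gaussian neighbourhood centred on the best matching unit, then Kohonen update
    dist2 = np.sum((grid - grid[bmu]) ** 2, axis=1)
    h = np.exp(-dist2 / (2.0 * sigma ** 2))
    nodes += lr * h[:, None, None, None] * (best_sub[None] - nodes)
    return nodes

In practice the learning rate lr and neighbourhood width sigma would be annealed over the training iterations, as described by equations 4.3 and 4.4.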

6.3.3 Convolutional Neural Networks

In this section the configurations of the standard CNNs are discussed. These are used as baselines against which the results in Section 6.3.4 are compared.

6.3.3.1 2D Network Architecture

The network employed is based on the VGG network proposed by Simonyan and Zisserman [162]. It uses three stacked convolutional layers followed by two fully connected layers and a softmax output layer. After each stacked convolution there is a max-pooling layer. DropOut is inserted after each max-pooling layer and fully connected layer. Whilst VGG uses 3 × 3 filters throughout, in this work experiments are performed using 3 × 3, 5 × 5 and 7 × 7 filters in the first layer only; the filters for all remaining layers are 3 × 3. This is similar to the ResNet [73] structure, which used different size filters in the first layer: specifically, 7 × 7 filters were used in the first layer and 3 × 3 in all others. The network employed in this work features padding in the first stacked convolutional layer, so that the output size remains the same as the input size and different size filters can therefore be used without altering the dimensions of the final output. Specifically, the network implemented is described as 3 × 32 × 32 − 64C − 64C3 − MP2 − D − 128C3 − 128C3 − MP2 − D − 256C3 − 256C3 − MP2 − D − 128FC − D − 128FC − D − Soft, where D represents a DropOut layer with a probability of 0.5 and the number of output softmax layer neurons is 10 and 100 for CIFAR-10 and CIFAR-100, respectively. The ReLU function is used as the activation function on all layers apart from the output layer. Table 6.1 details the architecture.
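For illustration, the baseline could be expressed in Theano/Lasagne (the framework stated in Section 6.3.1) roughly as follows; options such as weight initialisation and the optional batch normalisation are omitted, and the sketch is not a record of the exact code used.

from lasagne.layers import (InputLayer, Conv2DLayer, MaxPool2DLayer,
                            DropoutLayer, DenseLayer)
from lasagne.nonlinearities import rectify, softmax

def build_baseline_2d(s=3, n_classes=10):
    # VGG-style baseline for CIFAR-10/100; only the first layer uses an s x s filter.
    net = InputLayer((None, 3, 32, 32))
    net = Conv2DLayer(net, 64, (s, s), pad='same', nonlinearity=rectify)
    net = Conv2DLayer(net, 64, (3, 3), pad='same', nonlinearity=rectify)
    net = MaxPool2DLayer(net, (2, 2))
    net = DropoutLayer(net, p=0.5)
    for n_filters in (128, 256):
        net = Conv2DLayer(net, n_filters, (3, 3), nonlinearity=rectify)
        net = Conv2DLayer(net, n_filters, (3, 3), nonlinearity=rectify)
        net = MaxPool2DLayer(net, (2, 2))
        net = DropoutLayer(net, p=0.5)
    net = DenseLayer(net, 128, nonlinearity=rectify)
    net = DropoutLayer(net, p=0.5)
    net = DenseLayer(net, 128, nonlinearity=rectify)
    net = DropoutLayer(net, p=0.5)
    return DenseLayer(net, n_classes, nonlinearity=softmax)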

6.3.3.2 3D Network Architecture

The network implemented is similar to the network used in Section 3.5.1; however, it is altered as follows. The first two convolution and pooling layers are replaced by a stack of convolutional layers, and 1024 neurons are used in the hidden layers as opposed to 4096. Other sizes were experimented with; however, this was found to give a good balance between accuracy and complexity. In addition, the network accepts an input of 3 × 16 × 64 × 64,

Table 6.1: Baseline 2D CNN architecture for CIFAR-10/CIFAR-100

Input 3 × 32 × 32 RGB image
64 s × s conv ReLU with stride 1 × 1 and padding
64 3 × 3 conv ReLU with stride 1 × 1 and padding
2 × 2 max-pooling with stride 2 × 2
DropOut at 0.5
128 3 × 3 conv ReLU with stride 1 × 1
128 3 × 3 conv ReLU with stride 1 × 1
2 × 2 max-pooling with stride 2 × 2
DropOut at 0.5
256 3 × 3 conv ReLU with stride 1 × 1
256 3 × 3 conv ReLU with stride 1 × 1
2 × 2 max-pooling with stride 2 × 2
DropOut at 0.5
128 fully connected
DropOut at 0.5
128 fully connected
DropOut at 0.5
10/100-class softmax

thereby increasing the temporal dimension whilst reducing the spatial dimensions. Temporal information has been shown to be very important in previous work [22, 161], and it was noted in Section 3.5.1 that perhaps 8 frames did not provide enough temporal information. In addition, Karpathy et al. [94] showed that a 3D CNN which slowly fused the temporal information worked best. Therefore this network uses more input frames and maintains the temporal information for longer than the network used in Section 3.5.1. Due to these changes the network parameters have decreased from 78M to 16M. The reduction in parameters is necessary as these experiments are trained from scratch on a relatively small dataset, whereas the previous network was first pre-trained on a large dataset and only fine-tuned on a reduced dataset. With this in mind, DropOut was also added to aid generalisation. The exact structure is as follows: 3 × 16 × 64 × 64 − 64C − 64C3 − MP − 128C3 −

128C3 − MP − 256C3 − 256C3 − MP − 512C3 − 512C3 − MP − D − 1024FC − D − 1024FC − D − Soft, where the DropOut probability was set to 0.5. The convolutional filters are of size 3 × 3 × 3 with a stride of 1 × 1 × 1 throughout, apart from the first layer where, as with 2D, experiments are performed with both 3 × 3 × 3 and 3 × 5 × 5. Tran et al. [173] investigated the size of the temporal dimension of 3D filters and concluded that a size of 3 was optimal and that 3 × 3 × 3 filters produced superior accuracy for 3D CNNs. This is consistent with Simonyan and Zisserman [162] and the same is therefore followed here. Padding is applied to the first two stacked convolutional layers so that the output size remains the same as the input. For all the other convolutional layers no padding is applied. For the first two pooling layers a pool size and stride of one are used for the temporal dimension only, so as to keep the temporal information intact. For the spatial dimensions in the first two pooling layers, as well as in all the other pooling layers, a pool size and stride of two are used. By using this structure, after the second pooling layer both spatial dimensions and the temporal dimension of the input are the same size. ReLUs are used as the activation function on all layers apart from the output layer. Table 6.2 details the architecture.

6.3.4 Filter Replacement with Self-Organising Maps

In this section the configurations of the modified CNNs, which incorporate SOM filter banks, are discussed. These are used for comparison against the baseline CNN models.

6.3.4.1 2D Network Architecture

The network employed was exactly the same as for the 2D baselines apart from the first layer. For the experiments conducted with an M = 8 × 8 SOM, the network remains identical but with the first layer filters replaced with SOM-derived filter banks.

Table 6.2: Baseline 3D CNN architecture for UCF-50

Input 3 × 16 × 64 × 64 RGB video clip
64 3 × s × s conv ReLU with stride 1 × 1 × 1 and padding
64 3 × 3 × 3 conv ReLU with stride 1 × 1 × 1 and padding
1 × 2 × 2 max-pooling with stride 1 × 2 × 2
128 3 × 3 × 3 conv ReLU with stride 1 × 1 × 1 and padding
128 3 × 3 × 3 conv ReLU with stride 1 × 1 × 1 and padding
1 × 2 × 2 max-pooling with stride 1 × 2 × 2
256 3 × 3 × 3 conv ReLU with stride 1 × 1 × 1
256 3 × 3 × 3 conv ReLU with stride 1 × 1 × 1
2 × 2 × 2 max-pooling with stride 2 × 2 × 2
512 3 × 3 × 3 conv ReLU with stride 1 × 1 × 1
512 3 × 3 × 3 conv ReLU with stride 1 × 1 × 1
2 × 2 × 2 max-pooling with stride 2 × 2 × 2
DropOut at 0.5
1024 fully connected
DropOut at 0.5
1024 fully connected
DropOut at 0.5
50-class softmax

Table 6.3: Proposed 2D CNNNIN+SOM architecture for CIFAR-10/100

Input 3 × 32 × 32 RGB image
M s × s conv ReLU with stride 1 × 1 and padding
64 1 × 1 conv ReLU
64 3 × 3 conv ReLU with stride 1 × 1 and padding
2 × 2 max-pooling with stride 2 × 2
DropOut at 0.5
128 3 × 3 conv ReLU with stride 1 × 1
128 3 × 3 conv ReLU with stride 1 × 1
2 × 2 max-pooling with stride 2 × 2
DropOut at 0.5
256 3 × 3 conv ReLU with stride 1 × 1
256 3 × 3 conv ReLU with stride 1 × 1
2 × 2 max-pooling with stride 2 × 2
DropOut at 0.5
128 fully connected
DropOut at 0.5
128 fully connected
DropOut at 0.5
10/100-class softmax

However, where larger SOM maps are used, the first layer filters are replaced by the combination of a SOM filter bank layer and a 1 × 1 convolutional layer. This layer functions in two ways: firstly, it learns combinations of the SOM filter banks in the first layer and, secondly, it facilitates the insertion of larger maps whilst keeping the rest of the network unchanged. Experiments relating to the first case are called CNN+SOM, whereas CNNNIN+SOM refers to the second, for which Table 6.3 details the altered network design.
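A minimal Lasagne sketch of this front end is shown below; the helper names (build_som_front_end, scale_fn) are hypothetical, and removing the 'trainable' tag is one possible way of keeping the SOM-derived filters fixed during backpropagation.

from lasagne.layers import InputLayer, Conv2DLayer
from lasagne.nonlinearities import rectify

def build_som_front_end(som_filters, scale_fn):
    # som_filters: (M, d, s, s) array of SOM prototypes (M = map size, e.g. 20*20)
    # scale_fn: applies the Glorot-range rescaling of Equation 6.1
    M, d, s, _ = som_filters.shape
    net = InputLayer((None, d, 32, 32))
    net = Conv2DLayer(net, M, (s, s), pad='same', nonlinearity=rectify,
                      W=scale_fn(som_filters).astype('float32'))
    net.params[net.W].discard('trainable')   # freeze the SOM-derived filter bank
    # the 1 x 1 convolution learns combinations of the fixed SOM responses and
    # maps them to the 64 channels expected by the unchanged remainder of the CNN
    net = Conv2DLayer(net, 64, (1, 1), nonlinearity=rectify)
    return net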

6.3.4.2 3D Network Architecture

For the 3D case, experiments are not performed using a network with an additional 1 × 1 layer. Thus, all experiments performed use the same networks as the 3D baselines, apart from the first filter layer, which is replaced by SOM-trained filter banks. These experiments are therefore simply referred to as CNN+SOM.

6.4 Object Classification Experiments and Discussion on the CIFAR-10 and CIFAR-100 Datasets

6.4.1 Convolutional Neural Networks

The baseline CNN network was trained using stochastic gradient descent with Nesterov momentum (NAG) with a batch size of 100, a learning rate of 0.01 and momentum of 0.9. The learning rate was annealed using a constant of 0.989 and L2 weight decay was applied with a λ of 0.0005. Training was performed until convergence, for 300 and 400 epochs for CIFAR-10 and CIFAR-100, respectively. The weights for all layers were initialised as W ∼ U[−a, a], where a = √(12/(fan_in + fan_out)) [57]. In terms of preprocessing, whitening was performed on both CIFAR-10 and CIFAR-100.

The baseline results are shown in Table 6.4. The second two experiments used batch normalisation, whereas the first did not. The difference between the second two experiments is the number of layers batch normalisation is applied to: for the first, batch normalisation was used on all convolutional and fully-connected layers, whereas for the second, batch normalisation was used on all the same layers apart from the first. It is standard to apply batch normalisation to all convolutional and fully connected layers; however, since the proposed methodology uses fixed, independently trained filters, a more comparable baseline was sought. As can be seen from the results, the accuracies when batch normalisation was applied are far superior. In addition, it appears that the difference between the two batch normalisation results is minimal. Thus, for all further experiments which include batch normalisation, batch normalisation is applied to every layer apart from the first one, to maintain consistency with the proposed methodology.

Table 6.4: Baseline 2D CNN accuracy on CIFAR-10 (p-values represent the comparison between each method with the method labelled not applicable (N/A)).

Batch normalisation     Accuracy (%)    p-value
No                      88.39±0.17      0.0013
Yes                     89.57±0.17      0.5671
Yes (except first)      89.67±0.22      N/A

Table 6.5: Baseline 2D CNN accuracy on CIFAR-100 (p-values represent the comparison between each method with the method labelled not applicable (N/A)).

First layer filter size     Accuracy (%)    p-value
3 × 3                       60.22±0.19      0.9643
5 × 5                       60.08±0.17      0.5032
7 × 7                       60.23±0.31      N/A

The baseline experiments for CIFAR-100 are shown in Table 6.5. Experiments were performed using different size filters in the first convolutional layer. Again, batch normalisation was applied to all layers apart from the first. The results indicated that a first layer filter size of either 3 × 3 or 7 × 7 was optimal, although the differences between the results were not statistically significant.
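For concreteness, the baseline training configuration described above (NAG with a batch size of 100, a learning rate of 0.01 annealed by a factor of 0.989, momentum 0.9 and L2 weight decay with λ = 0.0005) could be assembled in Theano/Lasagne along the following lines; the structure is illustrative rather than a record of the code actually used.

import numpy as np
import theano
import theano.tensor as T
import lasagne

def compile_training(network, base_lr=0.01, momentum=0.9, weight_decay=5e-4):
    X, y = T.tensor4('X'), T.ivector('y')
    lr = theano.shared(np.float32(base_lr))          # shared so it can be annealed
    prediction = lasagne.layers.get_output(network, X)
    loss = lasagne.objectives.categorical_crossentropy(prediction, y).mean()
    loss += weight_decay * lasagne.regularization.regularize_network_params(
        network, lasagne.regularization.l2)
    params = lasagne.layers.get_all_params(network, trainable=True)
    updates = lasagne.updates.nesterov_momentum(loss, params,
                                                learning_rate=lr, momentum=momentum)
    train_fn = theano.function([X, y], loss, updates=updates)
    anneal = lambda: lr.set_value(np.float32(lr.get_value() * 0.989))  # call once per epoch
    return train_fn, anneal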

6.4.2 Filter Replacement with Self-Organising Maps

6.4.2.1 Optimising SOM Parameters

In order to produce optimal overall results, the parameters of the SOM were first investigated. Training the CNN was conducted in the same way as for the baseline experiments. Since the CNN parameters were initialised using Glorot initialisation, the SOM filters used to replace the first layer filters were scaled similarly using the following:

W_scaled = (W − min(W)) / (max(W) − min(W)) · (Γ_max − Γ_min) + Γ_min    (6.1)

where Γ_min = −a, Γ_max = a, and a = √(12/(fan_in + fan_out)). Different scaling operations were experimented with; however, the above method achieved the best results. Many parameters for the SOM were tested, and it was found that the following parameters performed well for an 8 × 8 map. The SOM was trained for 600000 iterations, on a set of 900000 whitened patches, with initial values of 0.1 and 1.0 for the learning rate and σ, respectively. The learning rate and σ were annealed using LRb = 50000 and σb = 11111. However, when the SOM size was increased and combined using the 1 × 1 convolutional layer, the results were poor. Since the additional convolutional layer introduces combinations of first layer filters, it was thought that perhaps, by using a small σ, the SOM-learned features were too independent and therefore the resultant combinations were not useful. Therefore, further experiments were conducted which maintained σ at half the spatial dimensions of the map. Firstly, experiments (Table 6.6) were performed for SOM sizes of

M = 10 × 10, 15 × 15, 20 × 20, 25 × 25 and 30 × 30 with a fixed σb on CIFAR-10. However, it was observed that the best results were obtained between the 10 × 10 and 20 × 20 sizes. Therefore, SOM sizes within this range (M = 12 × 12, 14 × 14, 16 × 16, 18 × 18) were investigated further. In addition, a further set of experiments was performed using these same

SOM sizes, but with a fixed final σ of 0.2. The final σ of 0.2 was chosen because the results using the fixed σb showed the smallest standard deviations around this final σ value, suggesting that these results converged better, despite not showing the highest accuracy. Although M = 15 × 15 achieved the best accuracy for each method, there were no real trends observed in the data. In general, the two methodologies showed no real differences in terms of accuracy; however, it was decided to proceed with the first methodology, since, by varying σb, the larger maps could struggle to converge, as σ would be annealed too quickly.
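For reference, Equation 6.1 amounts to the following min-max rescaling into the Glorot range [−a, a] with a = √(12/(fan_in + fan_out)); the NumPy function below is an illustrative implementation rather than the exact code used.

import numpy as np

def scale_to_glorot_range(W, fan_in, fan_out):
    # Rescale a SOM filter bank W into [-a, a], matching the CNN weight initialisation.
    a = np.sqrt(12.0 / (fan_in + fan_out))
    W01 = (W - W.min()) / (W.max() - W.min())
    return W01 * (2.0 * a) - a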

Table 6.6: 2D CNNNIN+SOM accuracy on CIFAR-10

SOM size    Initial σ   σb      Final σ     Accuracy (%)
10 × 10     5.0         11111   0.091       86.34±0.13
12 × 12     6.0         11111   0.109       86.47±0.28
14 × 14     7.0         11111   0.127       86.17±0.23
15 × 15     7.5         11111   0.136       87.54±0.26
16 × 16     8.0         11111   0.145       86.46±0.13
18 × 18     9.0         11111   0.164       87.01±0.08
20 × 20     10.0        11111   0.182       86.16±0.09
25 × 25     12.5        11111   0.227       87.13±0.09
30 × 30     15.0        11111   0.273       86.31±0.08
10 × 10     5.0         25000   0.2         87.23±0.18
12 × 12     6.0         20690   0.2         87.22±0.13
14 × 14     7.0         17647   0.2         86.72±0.31
15 × 15     7.5         16438   0.2         87.41±0.14
16 × 16     8.0         15385   0.2         87.05±0.33
18 × 18     9.0         13636   0.2         86.92±0.29
20 × 20     10.0        12245   0.2         87.17±0.11
25 × 25     12.5        9756    0.2         86.97±0.08
30 × 30     15.0        8163    0.2         86.83±0.30

6.4.2.2 Comparing SOM and SOMTI Trained Filters

To evaluate the two different ways of training the SOM, further experiments were conducted with SOMTI so that a comparison with the experiments in Section 6.4.2.1 could be made. Specifically, a CNN with both SOM and SOMTI filters was trained on CIFAR-10. Again the CNNs were trained in exactly the same way as for the baseline experiments and the SOM filters were scaled as in Section 6.4.2.1. For the CNN+SOMTI experiments, batch normalisation was applied, as this was found necessary to keep the CNN weights stable during training. The SOM was trained for 600000 iterations, on a set of 900000 whitened patches, with LR = 0.1 and σ set to half the spatial dimensions of the map. The learning rate and σ were annealed using LRb = 50000 and σb = 11111. The SOMTI was trained using the same settings, except for 1.2 million iterations using LRb = 100000 and σb = 22222. Since the SOMTI processes an input which has double the height and width of its filters, it was thought best to double the number of training iterations, and both LRb and σb were doubled accordingly. Only 3 × 3 filter sizes were used. Experiments were performed for a variety of SOM sizes from 10 × 10 to 30 × 30. The results of the experiments can be seen in Table 6.7. The p-value compares the accuracy between the two methodologies (CNNNIN+SOM vs CNNNIN+SOMTI) for each SOM size.

The results show a significant improvement (p < 0.01) when using the SOMTI filters. Some of this improvement can be attributed to the use of batch normalisation, as this demonstrated an improvement when applied to the baseline (Table 6.4). However, the improvement can also be considered due to the SOMTI learning improved translation-invariant features. Since the filters are applied via convolution when used as first layer CNN filters, shifted filters do not provide additional information. Therefore, SOMTI filters provide improved representations of the input by removing this redundancy.

Table 6.7: 2D CNNNIN+SOM accuracy on CIFAR-10

SOM size    CNNNIN+SOM Accuracy (%)    CNNNIN+SOMTI Accuracy (%)    p-value
10 × 10     86.34±0.13                 89.40±0.15                   0.0001
12 × 12     86.47±0.28                 89.32±0.12                   0.0001
14 × 14     86.17±0.23                 88.51±0.21                   0.0002
15 × 15     87.54±0.26                 89.44±0.15                   0.0004
16 × 16     86.46±0.13                 89.58±0.10                   0.0001
18 × 18     87.01±0.08                 89.44±0.26                   0.0001
20 × 20     86.16±0.09                 89.69±0.10                   0.0001
25 × 25     87.13±0.09                 89.34±0.14                   0.0001
30 × 30     86.31±0.08                 89.47±0.14                   0.0001

6.4.2.3 Comparing Different Ratios of Supervised and Unsupervised Training Data

To evaluate how differing ratios of supervised and unsupervised training data affected the accuracy, a set of experiments was carried out. Specifically, the number of unsupervised examples available to sample from was fixed at the full size of the respective dataset's training set and the number of supervised samples was varied. Experiments were performed on both CIFAR-10 and CIFAR-100. For the CIFAR-10 experiments, batch normalisation was not used on either the baseline CNN or the CNN+SOM experiments, whereas for the CIFAR-100 results, batch normalisation was applied to both the baseline CNN and the CNN+SOMTI experiments.

To evaluate the ability of the trained SOM filters to generalise to different datasets, for the CIFAR-100 experiments the filters were trained on CIFAR-10 only. Whilst these datasets are similar, there is no overlap in terms of examples, as the categories used are different. In addition, the sizes of the datasets are exactly the same, making it straightforward to use the same experimental protocols. The CNNs were trained in exactly the same way as for the baseline experiments and the SOM filters were scaled as in Section 6.4.2.1.

The SOM filters were trained using an 8 × 8 SOM, for 600000 iterations, on a set of 900000 whitened patches, with initial values of 0.1 and 1.0 for the learning rate and σ, respectively. The learning rate and σ were annealed using LRb = 50000 and

σb = 11111. The SOMTI filters were trained using the same settings, except for 1.2 million iterations using LRb = 100000 and σb = 22222. Only 3 × 3 filter sizes were used. Experiments were performed for five different ratios from 1:1 to 1:25. For the

CIFAR-100 experiments, further ratios between 1:5/2 and 1:25 were tried. The results for the CIFAR-10 and CIFAR-100 experiments can be found in Table 6.8 and Table 6.9, respectively. The p-value compares the accuracy between the baseline CNN and CNN+SOM/CNN+SOMTI for each corresponding ratio. As can be seen from the results, when the ratio is set to 1:25, the network hardly performs any better than random choice (10% and 1% for CIFAR-10 and CIFAR-100, respectively). Focussing on the CIFAR-10 results, it is clear that with a high ratio of supervised to unsupervised training examples the baseline CNN performs better than CNN+SOM. However, as the ratio decreases, the CNN+SOM accuracy increases over the baseline, although the difference is not significant. The results for CIFAR-100 better illustrate this point. Again the baseline CNN achieves higher accuracies for higher ratios, although the difference between the two techniques is less pronounced,

and actually not statistically significant for both the 1:5/2 and 1:25/8 ratios. For lower ratios,

CNN+SOMTI significantly outperforms the baseline at the 5% level or less for all but the 1:25/6 ratio.
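The per-class subsetting used in these experiments could be realised with a simple sampler such as the sketch below (NumPy, illustrative names); the full training set remains available as unlabelled data for the SOM, while only the returned subset carries labels for supervised training.

import numpy as np

def subsample_per_class(X, y, examples_per_class, seed=0):
    # Keep a fixed number of labelled examples per class.
    rng = np.random.RandomState(seed)
    keep = []
    for c in np.unique(y):
        idx = np.flatnonzero(y == c)
        keep.append(rng.choice(idx, size=examples_per_class, replace=False))
    keep = np.concatenate(keep)
    return X[keep], y[keep]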

6.4.2.4 Evaluating SOM Size and Filter Size

In this section, different size SOMs as well as different size SOM filters are explored. Specifically, for CIFAR-10, SOM sizes of M = 10 × 10, 20 × 20, 30 × 30, 40 × 40 and 50 × 50 are investigated. The same size SOMs are also explored for CIFAR-100, in addition to varying the filter size using s = 3, 5 and 7.

Table 6.8: 2D baseline CNN vs CNN+SOM subset accuracy on CIFAR-10

Supervised training examples    Supervised size per class    Supervised to unsupervised ratio    Baseline CNN Accuracy (%)    CNN+SOM Accuracy (%)    p-value
50000    5000    1:1      88.39±0.17    87.07±0.08    0.0003
20000    2000    1:5/2    83.06±0.13    81.63±0.09    0.0001
10000    1000    1:5      75.41±0.28    73.56±0.42    0.0032
5000     500     1:10     42.11±3.85    44.97±0.83    0.2769
2000     200     1:25     10.00±0.00    10.45±0.62    -

Table 6.9: 2D baseline CNN vs CNN+SOMTI subset accuracy on CIFAR-100

Supervised training examples    Supervised size per class    Supervised to unsupervised ratio    Baseline CNN Accuracy (%)    CNN+SOMTI Accuracy (%)    p-value
50000    500    1:1       60.22±0.19    59.52±0.27    0.0213
20000    200    1:5/2     47.28±0.22    47.14±0.47    0.6646
18000    180    1:25/9    45.67±0.10    44.88±0.26    0.0080
16000    160    1:25/8    43.96±0.18    43.62±0.48    0.3147
14000    140    1:25/7    40.45±0.16    40.45±0.20    1.0000
12000    120    1:25/6    37.61±0.25    37.84±0.10    0.2131
10000    100    1:5       30.43±0.60    32.20±0.53    0.0186
9000     90     1:50/9    27.16±1.16    29.81±0.29    0.0185
8000     80     1:25/4    19.00±2.05    25.33±0.17    0.0060
7000     70     1:50/7    14.77±2.23    21.60±0.79    0.0075
6000     60     1:25/3    5.09±0.20     12.52±0.55    0.0001
5000     50     1:10      4.11±0.09     7.11±0.73     0.0021
2000     20     1:25      1.00±0.00     1.00±0.00     -

For the CIFAR-100 experiments, again, SOMTI filters trained on CIFAR-10 are used. Batch normalisation is applied to the CNNs and they are trained in exactly the same way as for the baseline experiments. The SOM filters were scaled as in Section 6.4.2.1.

Table 6.10 shows the results for CIFAR-10. The results indicate that altering the SOM size has no real impact on the accuracy on CIFAR-10 and there is no trend with increasing SOM sizes. A SOM size of M = 20 × 20 was found to have the greatest accuracy, but this was not statistically greater than the baseline. Additionally, altering the SOM size resulted in no statistically significant difference in accuracy versus the baseline for any of the SOM sizes used. The filters for the optimal SOM size are displayed in Figure 6.1.

Tables 6.11 - 6.13 show the results for CIFAR-100 using filter sizes s = 3, 5 and 7, respectively. Using a filter size of s = 3, CNNNIN+SOMTI was found to improve accuracy versus CNN with M = 20 × 20 and 30 × 30, although this difference was not statistically significant for either result. By increasing the SOM size with

CNNNIN+SOMTI with s = 3, the optimum result was found using a SOM size of M = 20 × 20, after which the accuracy decreases. Using a filter size of s = 5, CNNNIN+SOMTI was found to improve accuracy versus CNN with M = 20 × 20, 30 × 30 and 40 × 40, and in this case the results were statistically significant at the 5% level for all three SOM sizes. By increasing the SOM size with CNNNIN+SOMTI with s = 5, the accuracy increases to the optimum result, which was found using a SOM size of M = 30 × 30, after which the accuracy decreases. Using a filter size of s = 7, CNNNIN+SOMTI was not found to improve accuracy versus CNN with any of the SOM sizes tested. By increasing the SOM size with CNNNIN+SOMTI with s = 7, accuracy increased to an optimum result with a SOM size of M = 30 × 30; with bigger SOM sizes there are small fluctuations in the accuracy that are not statistically different versus the optimum result for this filter size. The filters for the optimal

Table 6.10: 2D CNNNIN+SOMTI accuracy on CIFAR-10 (p-values represent the comparison between each method with the method labelled not applicable (N/A) in each column).

Type             SOM size    Initial σ    Accuracy (%)    p-value    p-value
CNN              -           -            89.67±0.22      N/A        0.8929
CNNNIN+SOMTI     10 × 10     5.0          89.40±0.15      0.1539     0.0495
CNNNIN+SOMTI     20 × 20     10.0         89.69±0.10      0.8929     N/A
CNNNIN+SOMTI     30 × 30     15.0         89.47±0.14      0.2548     0.0911
CNNNIN+SOMTI     40 × 40     20.0         89.55±0.16      0.4847     0.2681
CNNNIN+SOMTI     50 × 50     25.0         89.61±0.15      0.7162     0.4850

Table 6.11: 2D CNNNIN+SOMTI using 3×3 filters accuracy on CIFAR-100 (p-values represent the comparison between each method with the method labelled not applicable (N/A) in each column).

Type             SOM size    Initial σ    Accuracy (%)    p-value    p-value
CNN              -           -            60.22±0.19      N/A        0.1444
CNNNIN+SOMTI     10 × 10     5.0          60.02±0.30      0.3845     0.0790
CNNNIN+SOMTI     20 × 20     10.0         60.54±0.24      0.1444     N/A
CNNNIN+SOMTI     30 × 30     15.0         60.45±0.31      0.3348     0.7112
CNNNIN+SOMTI     40 × 40     20.0         60.17±0.22      0.7806     0.1204
CNNNIN+SOMTI     50 × 50     25.0         59.88±0.18      0.0876     0.0189

SOM sizes for s = 3, 5 and 7 are displayed in Figures 6.1–6.3. Overall, as can be seen in the above tables, there was a suggested trend between the filter size and the SOM size at which the peak accuracy occurred. Indeed, as the filter size increases the optimum SOM size gets larger (i.e. the peak accuracy using s = 3 was found using a SOM size of M = 20 × 20, and the peak accuracy using s = 5 was found using a SOM size of M = 30 × 30). Although the peak accuracy using s = 7 was also found using a SOM size of M = 30 × 30, the trend in the data suggests that perhaps the optimum SOM size has not been identified. This trend is further illustrated in Fig. 6.4, which shows a comparison of the results obtained in the above experiments.

Table 6.12: 2D CNNNIN+SOMTI using 5×5 filters accuracy on CIFAR-100 (p-values represent the comparison between each method with the method labelled not applicable (N/A) in each column).

Type             SOM size    Initial σ    Accuracy (%)    p-value    p-value
CNN              -           -            60.08±0.17      N/A        0.0133
CNNNIN+SOMTI     10 × 10     5.0          59.94±0.46      0.6469     0.0495
CNNNIN+SOMTI     20 × 20     10.0         60.58±0.19      0.0274     0.3439
CNNNIN+SOMTI     30 × 30     15.0         60.76±0.22      0.0133     N/A
CNNNIN+SOMTI     40 × 40     20.0         60.73±0.35      0.0444     0.9060
CNNNIN+SOMTI     50 × 50     25.0         59.92±0.19      0.3382     0.0075

Table 6.13: 2D CNNNIN+SOMTI using 7×7 filters accuracy on CIFAR-100 (p-values represent the comparison between each method with the method labelled not applicable (N/A) in each column).

Type             SOM size    Initial σ    Accuracy (%)    p-value    p-value
CNN              -           -            60.23±0.31      N/A        0.5389
CNNNIN+SOMTI     10 × 10     5.0          57.57±0.13      0.0002     0.0004
CNNNIN+SOMTI     20 × 20     10.0         59.79±0.12      0.0836     0.3383
CNNNIN+SOMTI     30 × 30     15.0         60.04±0.38      0.5389     N/A
CNNNIN+SOMTI     40 × 40     20.0         59.77±0.25      0.1160     0.3620
CNNNIN+SOMTI     50 × 50     25.0         59.92±0.03      0.1598     0.6145

Figure 6.1: Learned SOMTI filters on CIFAR-10 using M = 20 × 20 and s = 3.

Figure 6.2: Learned SOMTI filters on CIFAR-10 using M = 30 × 30 and s = 5.

Figure 6.3: Learned SOMTI filters on CIFAR-10 using M = 30 × 30 and s = 7.

Figure 6.4: Accuracy of different SOM sizes (M) for CNNNIN+SOMTI on CIFAR-100.

6.4.2.5 Discussion and Comparison to State-of-the-Art

Table 6.14 compares the best results from the experiments described in this chapter with the best baseline, state-of-the-art results, and the results published by Peng, Hankins and Yin [144]. Also included in this table are the parameter counts for each proposed methodology. For the results produced in this chapter, as well as the paper by Peng, Hankins and Yin [144], only the supervised parameters are counted. Therefore, where SOM-derived filters are used, these filter banks are not included in the total number of parameters, although, in most cases, since only the first layer is replaced, these exclusions are minimal.

The results shown here for CIFAR-10 indicate that the proposed method achieves similar results to the baseline. When compared with results from Peng, Hankins and Yin [144], the proposed method shows comparable accuracy to one of the methods tested previously, but is statistically inferior to the other result. However, both methods from Peng, Hankins and Yin [144] use twice as many parameters as the method proposed in this study. In terms of other state-of-the-art results, the proposed approach offers comparable accuracy, even against methods that utilise more parameters, such as Maxout [61]. However, the performance is not as good as RCNN [115] and DenseNet [82], which use far greater numbers of layers. In addition, all state-of-the-art methods mentioned use supervised learning, compared with the proposed method and the methods used in Peng, Hankins and Yin [144], which use a combination of supervised, unsupervised and/or generative approaches.

The results shown here for CIFAR-100 suggest that the proposed method improves upon the best baseline, and this is statistically significant at the 10% level. However, in comparison with the results from Peng, Hankins and Yin [144], accuracy is significantly decreased. This is also true when compared with the state-of-the-art methods, apart from Stochastic Pooling, which used six times fewer parameters [195].

However, it is worth noting that the CIFAR-100 SOM filters were not trained on the target dataset; instead they were trained on CIFAR-10, which may explain the lower accuracy of the current approach. The experiments in this section have demonstrated that the first layer filters of a 2D CNN can be successfully replaced with unsupervisedly trained filters. Simply replacing the filters like-for-like shows a small decrease in performance; however, experiments performed on different ratios of supervised and unsupervised data show the advantages of using unsupervised approaches when labelled training data is scarce. These results are as expected, since the SOM-trained filters provide the CNN with a better initialisation point than training from scratch. The generalisation of the unsupervisedly trained filters is further explored through the experiments on CIFAR-100. Remarkably, the proposed approach improves on the baseline even when the first layer filters were trained on a separate dataset, perhaps indicating that the baseline is prone to overfitting. CIFAR-100 is more complex than CIFAR-10, since it has 10 times the number of classes whilst featuring the same number of training examples and therefore fewer examples per class, which suggests that more generalised filters could provide improved performance. When experiments are performed using larger SOM sizes the results are similar, if not better, than the baselines. This suggests that there are advantages to approaches that combine different learning paradigms over standard supervised-only methodologies. In terms of the SOM size, evidence suggests that increasing the filter size leads to larger optimum SOM sizes. This is understandable, given that smaller filter sizes have fewer degrees of freedom and therefore fewer potential states to occupy.

Table 6.14: Accuracy on CIFAR-10 and CIFAR-100 (p-values represent the comparison between each method with the method labelled not applicable (N/A) for each section). The table lists, for each method, the number of parameters and the accuracy (%) and p-value on CIFAR-10 and on CIFAR-100; the methods compared include Stochastic Pooling [195], Maxout [61], DropConnect [181], NIN + Dropout [116], ALL-CNN [166], RCNN-96 and RCNN-128 [115], DenseNet-BC-100 [82], the SOM-based CNNs of Peng, Hankins and Yin [144], the baseline CNNs and the proposed CNNNIN+SOMTI.

6.5 Action Classification Experiments and Discussion on the UCF-50 Dataset

6.5.1 Convolutional Neural Networks

The baseline CNN was trained with stochastic gradient descent with Nesterov momentum, using a small mini-batch of 32. The learning rate was set to 0.003 and was annealed at a rate of 0.989 every epoch, with momentum set to 0.9. To further combat over-fitting, L2 weight decay was applied with a λ of 0.0005. During training, all video frames were first resized to 73 × 98 from the original 240 × 320. Spatial and temporal augmentation was performed by cropping 3 × 16 × 64 × 64 video clips from the videos to form a training batch. Horizontal flipping with 50% probability was also optionally applied as additional augmentation. Due to the sampling process, an epoch was defined as selecting a clip from each video roughly once (the number of videos does not divide evenly by the mini-batch size). Training from scratch was performed until convergence for 300 epochs. Weights for all layers were initialised as W ∼ U[−a, a], where a = √(12/(fan_in + fan_out)) [57]. In terms of pre-processing, each batch was normalised by standardising the inputs by the batch mean and standard deviation. Lastly, in keeping with the 2D experiments, batch normalisation was applied to every layer of the network apart from the first one. At test time, for each video in the test set, ten clips were extracted at random temporal locations. For each of these clips, only the central 64 × 64 crop was taken. The overall prediction rate reported was the average over all clips.

The results in Table 6.15 show the accuracy on UCF-50, with and without horizontal flipping, using both 3 × 3 and 5 × 5 filters in the first layer, across 5 folds. The authors of UCF-50 recommend using 25-fold cross validation; however, due to time constraints, 5-fold cross validation is used instead. The results indicated that the use of horizontal flipping improves performance by 2% for 3 × 3 and 1.5% for 5 × 5 (significant at the 10% and 5% level, respectively). In addition, it appears that using 3 × 3 filters in the first layer is optimal, which concurs with previous work [173].

Table 6.15: Baseline 3D CNN accuracy on UCF-50 (p-values represent the comparison between each method with the method labelled not applicable (N/A) for each section in each column). The table reports, for first layer filter sizes of 3 × 3 and 5 × 5 with and without horizontal flipping, the accuracy (%) on each of the five folds, the overall accuracy and the associated p-values.
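A minimal NumPy sketch of the clip sampling protocol described above is given below; the function names are illustrative, and videos are assumed to have already been resized to 73 × 98 and stored as arrays of shape (3, T, 73, 98).

import numpy as np

def sample_training_clip(video, n_frames=16, crop=64, flip_prob=0.5):
    # Random 16-frame window, random 64 x 64 spatial crop and optional horizontal flip.
    d, T, H, W = video.shape
    t = np.random.randint(T - n_frames + 1)
    y = np.random.randint(H - crop + 1)
    x = np.random.randint(W - crop + 1)
    clip = video[:, t:t + n_frames, y:y + crop, x:x + crop]
    if np.random.rand() < flip_prob:
        clip = clip[:, :, :, ::-1]
    return clip

def sample_test_clips(video, n_clips=10, n_frames=16, crop=64):
    # Ten clips at random temporal locations, central crop only; predictions are averaged.
    d, T, H, W = video.shape
    y, x = (H - crop) // 2, (W - crop) // 2
    starts = np.random.randint(T - n_frames + 1, size=n_clips)
    return np.stack([video[:, t:t + n_frames, y:y + crop, x:x + crop] for t in starts])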

6.5.2 Filter Replacement with Self-Organising Maps

6.5.2.1 Optimising SOM Parameters

As with the 2D work described previously, experiments were performed here to ascertain the best parameters for the SOM. The CNN for these experiments was trained in exactly the same way as for the 3D baseline experiments, except for the SOM filter banks, which were scaled as in Section 6.4.2.1. All experiments were performed using the first fold of the dataset with no augmentation in the form of horizontal flipping.

Based on the results from the 2D experiments, the SOMTI was first trained using 900000 patches for 1200000 iterations, with initial values of 0.1 and 1.0 for the learning rate and σ, respectively. The learning rate and neighbourhood were annealed using LRb = 100000 and σb = 22222. However, given that the data space is considerably larger, the SOMTI had difficulties converging. To aid convergence, the number of patches was reduced and further experiments were performed to fine-tune the parameters of the SOM. Table 6.16 shows the results for these experiments. Firstly, the number of patches was reduced by half, from 900000 to 450000, and the iterations, LRb and σb were kept constant at 1200000, 100000 and 22222, respectively. Secondly, the number of patches was halved again to 225000. Thirdly, to compensate for the reduced number of patches, the SOM parameters were also halved. Lastly, 112500 patches were tested using the same parameters as the third set of experiments. In addition to the settings listed in the table, the initial learning rate and σ were 0.1 and 1.0, respectively. All experiments were performed on both whitened and unwhitened patches.

The results in Table 6.16 show that using 225000 patches and training for 600000 iterations using LRb = 50000 and σb = 11111 achieved the best results, regardless of whether the patches were whitened, although overall the whitened patches produced the superior performance. It is interesting to note that the SOMs trained on whitened

Table 6.16: 3D CNN+SOMTI accuracy on UCF-50

Patch number    Patch whitening    Iterations    LRb       σb       Accuracy (%)
450000          No                 1200000       100000    22222    48.79±1.25
450000          Yes                1200000       100000    22222    50.91±1.35
225000          No                 1200000       100000    22222    48.83±0.27
225000          Yes                1200000       100000    22222    48.24±1.32
225000          No                 600000        50000     11111    50.57±1.06
225000          Yes                600000        50000     11111    51.46±0.77
112500          No                 600000        50000     11111    49.08±1.39
112500          Yes                600000        50000     11111    50.51±1.11

patches generally achieve better accuracy, even though no whitening is applied to the input to the CNN.

6.5.2.2 Evaluating Filter Size

In this section, different size filters in the first layer were experimented with. Experiments were first carried out on the first split to ascertain the best filter size. The best performing size was then used for the remaining four splits. 8 × 8 SOMTIs were trained using 225000 patches for 600000 iterations. The initial learning rate and neighbourhood size were set at 0.1 and 1, respectively, and were annealed using LRb = 50000 and

σb = 11111. Again the CNN was trained in exactly the same way as for the baseline experiments and the SOM filters were scaled as in Section 6.4.2.1. In order to directly compare against the baseline, augmentation, in the form of horizontal flipping, was also evaluated.

The results in Table 6.17 show that, using the proposed method and testing on the first fold, s = 3 gives comparable accuracy to the baseline. Interestingly, for this case, augmentation for the first fold showed a small, non-significant decrease in performance (p = 0.3409); however, since the filters are trained unsupervisedly, perhaps augmentation has less of an effect on accuracy. Furthermore, when using the proposed method, the 5 × 5 results are substantially worse than the 3 × 3 results, with or without augmentation (p < 0.001 for both). The 2D experiments demonstrated that larger filter sizes required larger SOMs in order to cover a wider variety of features. The results here appear to follow a similar trend, given that the results for 5 × 5 filters were 5–8% lower than the 3 × 3 results. In fact, it appears that the effect is amplified given the added temporal dimension. Additionally, with regard to augmentation for the 5 × 5 case, this appears to have a greater influence than for 3 × 3; however, this could be due to the filters not providing a thorough representation of the input space given the small SOM size. For these reasons, the 5 × 5 experiments were not extended to all five splits.

Overall, when the 3 × 3 experiments using the proposed approach were carried out over the five folds, the mean accuracy was found not to improve versus the baseline results. Augmentation of the 3 × 3 experiments using the proposed method resulted in a small increase in accuracy; however, this was not statistically significant (p = 0.1701). Additionally, in terms of augmentation for the proposed method, while a small increase is observed, it is less significant than the difference observed with augmentation of the baseline results, for the same reasons stated above. The SOMTI filters for td = s = 3 are displayed in Figure 6.5.

Table 6.17: 3D CNN+SOMTI accuracy on UCF-50 (p-values represent the comparison between each method with the method labelled not applicable (N/A) for each section). The table reports, for the baseline CNN and CNN+SOMTI with first layer filter sizes of 3 × 3 and 5 × 5, with and without horizontal flipping, the accuracy (%) on each of the five folds, the overall accuracy and the associated p-values.

6.5.2.3 Discussion and Comparison to State-of-the-Art

Table 6.18 shows the best results from this section, along with the best baseline and other state-of-the-art approaches. For clarity, the results for Gist [139] and Laptev et al. [105] are those reported by Kuehne et al. [102]. In this chapter, five-fold cross validation is used to evaluate performance, as also used by [102, 154]. Most of the studies on UCF-50 follow the authors' [149] suggested experimental setup and use 25-fold cross validation. Therefore, whilst methods using this split are included in the table, it would be unfair to make direct comparisons, since the training and testing sets are scaled by ×1.2 and ×0.2, respectively. This explains the generally higher accuracies reported for these methods [149, 163, 182, 185]. Other methods [175] which choose arbitrary train and test splits risk the training set bleeding into the testing set, as video clips within each group come from the same long video. Therefore, the results of these methods are not reported here.

Among the methods that use the same split as the work in this chapter, Gist [139] (from [102]) extracted coarse low-level representations based on orientations computed over three frames, Laptev et al. [105] (from [102]) used a combination of HOG and HOF descriptors from spatio-temporal interest points, and Action Bank [154] proposed hand-crafted high-level semantic action representations. The results for the proposed method indicate that it does not improve on the baseline performance. However, it achieves superior performance to both Gist and Laptev et al. Yet, the performance is not as good as Action Bank, which achieves the best five-fold cross validation accuracy of 57.90%.

Comparisons to more similar approaches are not possible, since most convolutional neural network-based approaches tend to opt for the larger UCF-101 dataset. Since that dataset is larger, it is more suitable for training from scratch, although generally models are pre-trained on larger datasets, such as Sports-1M [173]. Although it was planned

(a) td = 0

(b) td = 1

(c) td = 2

Figure 6.5: Learned SOMTI filters on UCF-50 using M = 8 × 8 and td = s = 3 (therefore each filter is of size 3 × 3 × 3). (a)–(c) represent each slice of td, where td is the temporal depth or the number of frames.

Table 6.18: Accuracy on UCF-50

Method                              Folds    Accuracy (%)
Solmaz et al. [163]                 25       73.70
Reddy and Shah [149]                25       76.90
Wang et al. [185]                   25       85.7
Wang and Schmid [182]               25       91.2
Gist [139] (from [102])             5        38.8
Laptev et al. [105] (from [102])    5        47.9
Action Bank [154]                   5        57.90
CNN (s = 3)                         5        53.89±0.73
CNN+SOMTI (s = 3)                   5        51.91±0.41

to extend the experiments in this section to UCF-101, since the results on UCF-50 were inferior to the baseline, these proposed experiments were not carried out.

The experiments in this section have not been able to improve on the baseline results. However, experiments have only been conducted with CNN+SOMTI. As evidenced by the 2D results, only CNNNIN+SOMTI showed superior accuracy to the baseline. Therefore, it is reasonable to presume that similar results could be expected if the same were carried out on the 3D dataset. In terms of filter size, the results show similar trends to the 2D results, with smaller filters providing superior results compared to larger filters for small SOM sizes. Furthermore, the results suggest that augmentation has a smaller effect when the filters are trained unsupervisedly.

6.6 Conclusion and Future Work

The experiments on 2D data showed comparable, if not improved, results over the baselines and other state-of-the-art methods of similar scale. Whilst improvements were not demonstrated for all scenarios, the results suggest that, with additional fine-tuning of parameters, more improvement could be made. In addition, these results were achieved using unsupervisedly trained first layer filters, which require no costly annotations and, as shown, can be trained on limited data. Furthermore, the advantages of unsupervised filter training were further outlined by the results of the transfer learning experiments conducted on CIFAR-100. The same increase in performance was not observed on the 3D data; however, larger SOM maps were not explored in this regard. Nonetheless, the results are promising and demonstrate scope for future improvement. The fact that hand-crafted solutions are still producing superior results on these smaller-scale datasets shows that more work is needed on the application of 3D deep learning based methods. In general, the work in this chapter has demonstrated that a combination of unsupervised and supervised filter learning can be successful and can lead to significant improvements in terms of accuracy, showing greater improvements when data is scarce and weakly labelled.

However, so far filter replacement has been restricted to the first layer. Since the filters are trained independently, it could be problematic to extend this idea to subsequent layers. Thus, an alternative implementation should be sought. With this in mind, it could be beneficial to explore jointly training supervised and unsupervised approaches. One potential avenue of exploration could be combining a CNN with a convolutional stacked autoencoder. Since autoencoders are trained using backpropagation, this may not provide an efficient means of learning large numbers of features in a single layer, as done here. However, it would provide the means to efficiently extend this method to additional layers and therefore recover what has been lost in terms of width by extending the depth. Yet, other methods explored here, such as the use of a 1 × 1 convolutional layer, could still be investigated to allow the CNN to learn additional auxiliary features. In addition, experiments should be conducted with different CNN architectures, such as ResNet and DenseNet. For further details, and some initial findings, please see Section 7.2.3.

In terms of other potential possibilities of combining CNNs with SOM, there are a number of interesting lines of enquiry that could be explored. For example, SOM clustering could be used to provide weakly labelled data to train a standard CNN. Specifically, a SOM could be applied to the output features of a CNN and the resultant cluster assignments could provide a weakly supervised signal to update the weights of the CNN. Alternatively, SOM could be considered as a way of reducing redundancy in the feature space, by applying SOMs to existing supervisedly trained features, in a process known as pruning. This could assist in reducing the number of features; however, connections between layers would need further training, similar to Dundar et al. [45].

Chapter 7

Conclusions and Future Work

7.1 Conclusions

Overall, the results presented in this thesis suggest that it is worthwhile using unsupervised feature learning approaches in the task of image and video classification. Although unsupervised approaches have fallen out of favour in the deep learning era, which is dominated by supervised methodologies trained on large datasets, the work carried out in this thesis demonstrates that unsupervised approaches are still useful and worth exploring further. In addition, this work shows that, in particular, SOM-based unsupervised learning, which is a popular method in the unsupervised community, can still provide comparable results to other state-of-the-art approaches, when used independently or in conjunction with other techniques. Indeed, the variety of applications employing SOM-based feature learning shown here demonstrates their robustness and versatility.

In Chapter 4, a simple SOM-based multi-layer convolutional architecture was proposed for image classification. This novel method demonstrated comparable or improved performance against the PCANet and DCTNet baselines when tested on two datasets

or one dataset, respectively. Although based on PCANet and DCTNet, this new approach demonstrated a modified encoding which reduced the dimensionality of the feature vector, enabling greater numbers of features to be utilised. Furthermore, the use of SOM meant that the number of filters learned was not constrained by the filter size, as is the case with PCANet- and DCTNet-derived filters. Therefore, the new method proposed here, which used more filters, resulted in a potentially better representation of the input and thus increased performance in some cases. SOMNet's use of an alternative encoding scheme dramatically reduced the final output feature vector. For the proposed approach, this was taken advantage of by increasing the number of filters in the first and second layers, whilst keeping the output feature vector the same size as with PCANet and DCTNet. However, the encoding scheme could also be exploited by increasing architecture depth (i.e. number of layers) instead of width (i.e. number of filters). In general, more traditional convolutional deep learning approaches trained via backpropagation can be difficult to understand and are often used as a 'black box'. However, a benefit of the proposed SOMNet methodology is that it provides a simpler, more transparent approach, which is easier to understand and modify; attributes which could make it more suitable for industrial applications.

The importance of encoding was made clear in Chapter 4 (where compact encoding allowed the use of more filters), and the work of others has also highlighted the importance of encoding [28]. In Chapter 5, this was built upon by exploring the use of channel aggregation techniques between layers of a two-layer SOMNet architecture. A novel extension of SOMNet was proposed in an attempt to learn hierarchical features using simple pooling mechanisms. A variety of pooling mechanisms were proposed in order to learn complex representations based on combinations of features from previous layers, which required no additional parameter learning over the previously proposed SOMNet. Results on MNIST suggested that combining greater numbers of features (up to a limit) using FAC and SAC statistically improved performance, although, when experiments were replicated on a larger architecture with an increased number of features, only minor, statistically insignificant improvements against the baseline were demonstrated. This suggests that the choice of encoding between layers is less important than the total number of filters employed. Out of all of the pooling methods tested, GAC provided the worst results for MNIST, indicating that grouping filters by map position may not aid performance. In contrast, for the more complex CIFAR-10, GAC pooling with fewer filters provided the best results, although, given that the features were more independent, the neighbourhood of the map may have had less influence on the pooling groupings. However, none of the proposed methods showed improved performance against the baseline, apart from when experiments were conducted on a subset of the training data, indicating that the improvement could possibly be due to reduced redundancy and noise, leading to better generalisation of the filters. However, statistically, no strong conclusions could be made from the CIFAR-10 results.
Overall, results indicated that these pooling techniques may be beneficial for simple datasets when utilising limited numbers of filters, but did not show improvements when more complex datasets were used. Translating to real-world applications, this could mean that the technique may be beneficial for simple tasks with constraints on model size due to memory limitations, where architectures with limited numbers of filters could be useful. However, use of the proposed methodologies on more complex datasets requires further exploration.

In Chapter 6, SOM was combined with convolutional neural networks in order to examine whether low-level features could be replaced by unsupervisedly trained alternatives to improve performance. Specifically, the first-layer filters of deep convolutional neural networks were replaced by efficiently trained SOM-based filters, which can take advantage of unlabelled data. Extensive experiments were conducted on both 2D and 3D datasets. Results on 2D image classification indicated that comparable, if not improved, performance versus the baseline and other similarly scaled state-of-the-art approaches was achieved when using larger maps via CNNNIN+SOMTI. Furthermore, since the filters were trained using unsupervised learning, it could be suggested that the proposed approach could achieve superior performance with fewer supervised training examples than other purely supervised state-of-the-art approaches. This is evidenced by the improved performance realised against the baseline when the number of supervised training examples was systematically decreased. In addition, the ability of unsupervisedly trained filters to generalise to other datasets was further outlined by experiments conducted on CIFAR-100, for which the SOM-derived filters were trained on CIFAR-10, leading to significant increases over the baseline. Whilst these new methods did not provide comparable results to the state-of-the-art, they are promising given that the first-layer filters were trained on an unrelated dataset. However, results on 3D action recognition were not as successful, outlining the increased inherent complexity of 3D trained models. Yet, unlike in the 2D case, larger SOM maps were not explored, which could lead to improved performance. Nonetheless, the results are promising and demonstrate scope for future improvement. In general, results demonstrated that SOM-based filters could be trained on limited amounts of unlabelled data and be successfully inserted into supervised convolutional neural networks, showing improved or comparable accuracy in most cases. However, more work is required to explore whether this approach can be extended to further layers and whether additional improvements can be made whilst utilising fewer labelled training examples.

Across all of the work presented here, SOM was used as the main feature learning methodology. SOM [98] uses competitive learning in order to quantise an input space whilst preserving the topological structure of the data. Although the choice of unsupervised learning method was evaluated in Chapter 4 (SOM vs. PCA vs. k-means), and it was found that the choice of feature learning method did not majorly affect performance, SOM offers additional benefits over the other methods. For instance, SOM is not limited by the PCA covariance matrix constraints or constrained to find orthogonal principal directions.
In addition, SOM is more immune to initialisation and outliers than k-means. Whilst other, more complicated approaches such as restricted Boltzmann machines (RBMs) and autoencoders have been proposed, they can be inferior to simple vector-quantisation-based approaches when using shallow structures [27]. Yet, for satisfactory performance, the use of SOM requires the inputs to have adequate pre-processing, such as PCA whitening. This was evidenced by comparing the results of Chapter 5 with those of Chapter 4, where PCA whitening was used to improve performance. Whilst whitening involves the use of PCA, it can be performed effectively on small patches and is therefore not computationally expensive. However, there is no natural extension of SOM to multi-layer architectures. The results in Chapter 4 and Chapter 5 demonstrated that a multi-layer architecture could be trained greedily; however, the filters are of a single depth. When vector quantisation techniques have been applied to high-dimensional inputs, results have been poor [29]. Therefore, for a more elaborate architecture, such as those employed in convolutional neural networks, which feature full connections between layers, an alternative unsupervised approach may be beneficial to extend the work described in Chapter 6. Overall, the results presented here demonstrate that simple SOM-based features can provide good representations which are useful for data classification under a wide variety of scenarios. Although these simple approaches are generally ignored by the deep-learning community, this work shows that efficient approaches to unsupervised learning, such as SOM, can still be of use in this field.

The use of unsupervised learning has fallen out of favour in recent years and has been replaced by supervised approaches, such as the convolutional neural network [101]. Techniques such as transfer learning [172] enable the transfer of knowledge between applications, which has replaced the need for unsupervised pre-training. However, transfer learning still requires a large labelled dataset to train on, which requires significant human effort in terms of curation and annotation. At the same time, hand-crafted solutions have been penalised due to their reliance on human knowledge. Yet, supervised deep learning approaches merely shift the focus of human know-how from model design to data design. Whilst recently there has been more progress on architectural advances, these are only made on the back of efforts into dataset curation. When we consider other areas such as 3D data, advances have been limited because the available large-scale datasets are only weakly labelled. This reliance on strongly labelled large-scale datasets is problematic, and therefore methods that make use of alternative strategies, such as unsupervised learning or approaches that combine supervised and unsupervised learning, should not be dismissed.
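Relating to the patch whitening and SOM training discussed above, the following is a minimal illustrative sketch of the generic pipeline: PCA (ZCA-style) whitening fitted on small flattened patches, followed by standard Kohonen updates on the whitened patches. The map size, annealing schedule and regularisation constant are assumptions rather than the exact settings used in this thesis.

import numpy as np

def whiten_patches(patches, eps=1e-2):
    # patches: (N, D) flattened image patches, e.g. D = 5*5*3 = 75, so the covariance is tiny
    mean = patches.mean(axis=0)
    centred = patches - mean
    cov = centred.T @ centred / len(patches)
    eigval, eigvec = np.linalg.eigh(cov)
    W = eigvec @ np.diag(1.0 / np.sqrt(eigval + eps)) @ eigvec.T   # ZCA-style whitening matrix
    return centred @ W, mean, W

def train_som(data, rows=8, cols=8, iters=20000, lr0=0.5, sigma0=3.0, seed=0):
    # Standard Kohonen rule: competitive step plus a Gaussian neighbourhood on the 2D grid
    rng = np.random.default_rng(seed)
    weights = rng.standard_normal((rows * cols, data.shape[1])) * 0.01
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    for t in range(iters):
        x = data[rng.integers(len(data))]
        bmu = np.argmin(((weights - x) ** 2).sum(axis=1))          # best-matching unit
        frac = t / iters
        lr, sigma = lr0 * np.exp(-3 * frac), sigma0 * np.exp(-3 * frac)
        h = np.exp(-((grid - grid[bmu]) ** 2).sum(axis=1) / (2 * sigma ** 2))
        weights += lr * h[:, None] * (x - weights)                 # topology-preserving update
    return weights   # each row can be reshaped into a convolutional filter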

7.2 Future Work

Future work could be explored in a few different areas. Firstly, further extensions could be made to the SOMNet architecture introduced in Chapter 4 by employing deeper architectures and investigating different encoding strategies. Secondly, the pooling structures introduced in Chapter 5 could be investigated with a supervised CNN in order to improve the generalisation of features through reduced redundancy of the feature space. Thirdly, whilst CNN features were successfully replaced with SOM-based features in Chapter 6, other unsupervised algorithms could be investigated, as well as extending the approach to further layers. Lastly, in terms of action recognition specifically, temporal information was implicitly encoded in the form of 3D spatio-temporal filters and therefore explicit models were not explored. Thus, the application of explicit temporal models could be investigated further. The following sections will discuss these potential avenues of exploration in more detail.

7.2.1 SOMNet

In Chapter 4, a multi-layer, unsupervised architecture named SOMNet was proposed; however, for accurate comparison to other similar methods, the architecture was restricted to two layers. In the future, this architecture could be extended to additional layers in order to investigate how network depth affects the classification accuracy. The compact encoding introduced by SOMNet allows for additional filters to be used here, yet this more compact representation could instead be used to add depth to the architecture. Currently, the encoding groups output activations into fours (a simple illustration is sketched at the end of this subsection); however, different group sizes could also be explored. Furthermore, alternative encodings could be investigated, such as orthogonal matching pursuit [117]. A truly hierarchical SOM could be a useful and worthwhile pursuit as a simple alternative to more complex hierarchical models, such as autoencoders.

Further fine-tuning of the SOMNet approach proposed in this thesis could be tackled in several ways. Firstly, although different unsupervised learning techniques were experimented with (k-means, PCA), other SOM-based techniques could be considered, such as Neural Gas or Growing Neural Gas. Secondly, in terms of generative filters, MRF and DCT filters could be more extensively investigated. The use of MRF and DCT filters in this work was restricted to an experiment which replicated filters over both layers; therefore, experiments could be conducted in which the second layer learns filters from MRF and DCT activations. In Chapter 4, learning filters in the second layer was demonstrated to show improved results versus learning in the first layer only and replicating them over both layers; therefore, this approach could also be applied to MRF and DCT filters. Thirdly, experiments with different sized filters could be conducted.
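As a simple illustration of the grouping-into-fours idea mentioned above, one plausible reading (not necessarily the exact SOMNet scheme) is to binarise the activation maps and pack each group of four into a single 4-bit code map, shrinking the number of output maps, and hence the final feature vector, by a factor of four.

import numpy as np

def group_of_four_encoding(maps):
    # maps: (F, H, W) real-valued activation maps, with F divisible by 4
    binary = (maps > 0).astype(np.int32)          # simple sign binarisation (an assumption)
    f, h, w = binary.shape
    grouped = binary.reshape(f // 4, 4, h, w)
    weights = np.array([1, 2, 4, 8]).reshape(1, 4, 1, 1)
    return (grouped * weights).sum(axis=1)        # (F/4, H, W), integer codes in 0..15

Changing the 4 to another group size directly realises the "different sized groups" suggested above, at the cost of a different code range.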

7.2.2 Supervised Channel Pooling

In Chapter 5, multiple channel aggregation layers were proposed and thoroughly investigated on two benchmark datasets. Whilst the results demonstrated improvements on MNIST, results were inferior when applied to the more complex dataset of CIFAR-10. Yet, some of the results on the subset of CIFAR-10 showed promise, namely with the SAC and GAC aggregation layers. Therefore, further refinements could be made to these layers, such as introducing linearly weighted combinations conditioned on the activations.

Since the improvement demonstrated on the subset was attributed to reduced redundancy in the feature space, thus improving the generalisation of the filters, it could be worthwhile pursuing these techniques in a supervised context. Specifically, experiments with these aggregation layers could be conducted with a supervised CNN. Much recent work on supervised CNNs is concerned with dimensionality reduction techniques in order to improve feature learning [116, 170]. Therefore, it is justifiable to consider the application of the pooling methods proposed in this thesis to supervised paradigms. Whilst similar work to that proposed here has already been carried out [61], there are notable differences which would still make this proposal novel. Specifically, the SAC methodology or similar should be considered, as it could provide input-based regularisation in a similar respect to self-supervised approaches.
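A generic sketch of what such an aggregation layer could look like inside a supervised CNN is given below: a fixed variant that simply sums each group of adjacent feature maps, and a learnable variant realised as a grouped 1 × 1 convolution, i.e. a linearly weighted combination per group. The FAC, SAC and GAC layers of Chapter 5 form their groups differently, so this is illustrative only.

import torch
import torch.nn as nn

class SumAggregation(nn.Module):
    def __init__(self, group_size):
        super().__init__()
        self.g = group_size
    def forward(self, x):                       # x: (N, C, H, W), C divisible by group_size
        n, c, h, w = x.shape
        return x.view(n, c // self.g, self.g, h, w).sum(dim=2)

class WeightedAggregation(nn.Module):
    def __init__(self, channels, group_size):
        super().__init__()
        # one 1x1 filter per group: a learnable weighted combination of its member maps
        self.mix = nn.Conv2d(channels, channels // group_size, kernel_size=1,
                             groups=channels // group_size, bias=False)
    def forward(self, x):
        return self.mix(x)

Making the group weights depend on the input activations (e.g. via a small gating branch) would be one way to realise the conditioned, linearly weighted combinations suggested above.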

7.2.3 Combining Supervised and Unsupervised Learning

In Chapter 6, unsupervised feature learning was successfully applied to the first layer of a CNN. A natural extension to this work would be to apply the proposed methodology to additional layers. Yet, given that the filters are trained independently in a greedy manner, this would be computationally expensive. With this in mind, it could be beneficial to explore jointly training supervised and unsupervised approaches.

In pursuit of this, a small experiment was conducted on CIFAR-100 to explore the joint learning of CNN and SOM. Specifically, during each forward pass of the CNN, a SOM was used to train the first-layer filters only. Backpropagation was performed as usual during the backward pass. Initial results indicated an average accuracy of

59.22%, which compares favourably to the result of 59.52% for CNN+SOMTI (Table 6.9), for which the SOM weights were trained separately prior to being employed as first-layer features of the CNN. The SOM used in the jointly trained approach was annealed so that the parameters achieved the same final values as the SOM used in the

CNN+SOMTI. The parameters for the CNN remained the same.

Alternatively, a more promising line of enquiry could involve using autoencoders instead of SOM; an encoder is used to map high-dimensional data to a low-dimensional code and a corresponding decoder is used to recover the original data from the code. The entire network is trained to minimise the difference between the original data and the reconstruction. Autoencoders can be considered a non-linear generalisation of PCA [108]. For the proposed methodology, for L unsupervised layers, a convolutional autoencoder with L convolutional encoders and decoders would be employed. The output from the last encoder convolutional layer would be fed to a traditional supervised CNN structure, and the objective function would be modified so that it minimises both the reconstruction loss of the autoencoder and the predictive loss of the CNN. Thus, the modified autoencoder should learn features that can both reconstruct the input and discriminate it. Since autoencoders can be trained via backpropagation, optimisation of multi-layer networks is straightforward. While the use of backpropagation will not provide an efficient means for learning larger numbers of features in a single layer, as SOM does, by extending to multiple layers, information lost in terms of width can be recovered in terms of depth. Recent studies have begun to explore this area [108, 197]; however, 3D applications, which arguably have greater potential due to difficulties with large-scale data collection, have not been explored.
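A minimal sketch of the combined objective described above is given below; the architecture (layer widths, kernel sizes, weighting factor) is an illustrative assumption rather than the configuration used in the preliminary experiments.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SupervisedConvAE(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.encoder = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
                                     nn.Conv2d(64, 64, 3, padding=1), nn.ReLU())
        self.decoder = nn.Sequential(nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
                                     nn.Conv2d(64, 3, 3, padding=1))
        self.classifier = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                        nn.Linear(64, num_classes))
    def forward(self, x):
        code = self.encoder(x)               # shared unsupervised/supervised features
        return self.decoder(code), self.classifier(code)

def joint_loss(model, images, labels, lam=0.1):
    recon, logits = model(images)
    # predictive loss of the CNN plus weighted reconstruction loss of the autoencoder
    return F.cross_entropy(logits, labels) + lam * F.mse_loss(recon, images)

Because both terms are differentiated through the shared encoder, the same filters are pushed to be both reconstructive and discriminative, which is the behaviour argued for above.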

Preliminary experiments in this area have been performed, and mean accuracies of 52.22% and 53.23% were achieved for one-layer and two-layer (first and second layer) replacement respectively, against a baseline of 51.12% on the first split of

UCF-50 (Table 6.17). For further reference, CNN+SOMTI achieved 51.46% for this split (Table 6.17). However, the approach uses slightly different parameters during training compared to the baseline and CNN+SOMTI. Namely, the batch size is decreased from 32 to 16, and no weight decay is applied. Nonetheless, the results are encouraging and therefore warrant further investigation.

An extension of this could include the use of 1 × 1 convolutions to create an auxiliary network for which the autoencoder-trained features function as anchors, which the CNN can use as building blocks for more discriminatively driven features. The use of hand-crafted features as anchors has been explored previously [91]; however, to my knowledge it has not yet been explored with autoencoders. Both 2D and 3D convolutional autoencoders could be explored, with 1 × 1 layers used to combine the resultant unsupervised features into discriminative features of the necessary width and depth for each layer. For example, multiple 3D filters could be combined into alternative 3D filters. Alternatively, 2D filters could also be combined to produce 3D filters, with additional 1 × 1 layers mapping to each temporal dimension.
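The anchor-plus-1 × 1 idea can be sketched as follows; this is only an illustration in the spirit of [91], with random placeholder weights standing in for unsupervisedly learned anchor filters, and all names (AnchoredConv, load_anchor_filters) hypothetical.

import torch
import torch.nn as nn

class AnchoredConv(nn.Module):
    def __init__(self, in_ch=3, n_anchors=64, out_ch=32, k=5):
        super().__init__()
        self.anchors = nn.Conv2d(in_ch, n_anchors, k, padding=k // 2, bias=False)
        self.anchors.weight.requires_grad_(False)   # frozen, e.g. SOM- or autoencoder-derived
        self.mix = nn.Conv2d(n_anchors, out_ch, kernel_size=1)  # only the 1x1 mixing is learned
    def forward(self, x):
        return self.mix(torch.relu(self.anchors(x)))

def load_anchor_filters(layer, filters):
    # filters: tensor of shape (n_anchors, in_ch, k, k) obtained from unsupervised training
    with torch.no_grad():
        layer.anchors.weight.copy_(filters)

The same pattern extends to 3D by swapping nn.Conv2d for nn.Conv3d, which is how 2D anchors could be recombined into 3D filters as suggested above.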

7.2.4 Temporal Models

For action recognition, temporal information is currently encoded implicitly by a 3D CNN; however, it is evident from the literature that the adoption of more explicit temporal methods can aid the classification of sequential information [22, 161]. RNNs have achieved state-of-the-art results in many challenging sequential learning problems such as speech [64] and handwriting [65] recognition. Traditional RNNs use gradient-based optimisation; however, training can prove problematic since backpropagation gradients can vanish or explode [110], which leads to difficulties learning long-term dependencies [194]. Recent advances such as the LSTM have mitigated these problems; however, its application to action recognition does not always demonstrate significant improvements when more traditional temporal representations such as optical flow are used [194]. However, more significant improvements have been demonstrated when combined with 3D CNNs [186], and therefore more work in this area could prove beneficial.

Another alternative to training RNNs is the Reservoir Computing (RC) paradigm, in which the recurrent connections (named the reservoir) are assigned randomly and only a linear output layer is learned. Since training is restricted to a linear readout, which can be performed using simple linear regression, training can be carried out with comparative ease. This allows for very large numbers of internal recurrent connections. One of the common types of RC is the architecture known as an Echo State Network (ESN) [87]. Echo State Networks have traditionally been applied to a wide variety of tasks such as non-linear control and chaotic attractors [87]; however, their application to action recognition [127], and their leveraging of deep features, is limited [90]. Hence this could be a worthwhile avenue for future work.
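A minimal sketch of an echo state network in this spirit is given below: the reservoir is fixed after random initialisation and only the linear readout is fitted, here by ridge regression. The reservoir size, spectral radius and leak rate are illustrative assumptions; the input sequence could, for example, be per-frame CNN features.

import numpy as np

class ESN:
    def __init__(self, n_in, n_res=500, spectral_radius=0.9, leak=0.3, seed=0):
        rng = np.random.default_rng(seed)
        self.w_in = rng.uniform(-0.5, 0.5, (n_res, n_in))
        w = rng.uniform(-0.5, 0.5, (n_res, n_res))
        # rescale so the largest eigenvalue magnitude equals the chosen spectral radius
        w *= spectral_radius / np.max(np.abs(np.linalg.eigvals(w)))
        self.w_res, self.leak = w, leak
    def states(self, sequence):                  # sequence: (T, n_in)
        x = np.zeros(self.w_res.shape[0])
        out = []
        for u in sequence:
            pre = np.tanh(self.w_in @ u + self.w_res @ x)
            x = (1 - self.leak) * x + self.leak * pre   # leaky-integrator state update
            out.append(x.copy())
        return np.array(out)

def fit_readout(states, targets, ridge=1e-2):
    # ridge-regression readout: states (N, n_res), targets (N, n_classes) one-hot
    A = states.T @ states + ridge * np.eye(states.shape[1])
    return np.linalg.solve(A, states.T @ targets)   # readout weights (n_res, n_classes)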

Bibliography

[1] Sami Abu-El-Haija, Nisarg Kothari, Joonseok Lee, Paul Natsev, George Toderici, Balakrishnan Varadarajan, and Sudheendra Vijayanarasimhan. Youtube-8m: A large-scale video classification benchmark. arXiv preprint arXiv:1609.08675, 2016.

[2] Nasir Ahmed, T Natarajan, and Kamisetty R Rao. Discrete cosine transform. Transactions on Computers, 100(1):90–93, 1974.

[3] Timo Ahonen, Abdenour Hadid, and Matti Pietikainen. Face description with local binary patterns: Application to face recognition. Transactions on Pattern Analysis & Machine Intelligence, 28(12):2037–2041, 2006.

[4] Saleh Aly. Learning invariant local image descriptor using convolutional maha- lanobis self-organising map. Neurocomputing, 142:239–247, 2014.

[5] Saleh Aly, Naoyuki Tsuruta, Rin-Ichiro Taniguchi, and Atsushi Shimada. Vi- sual feature extraction using variable map-dimension hypercolumn model. In International Joint Conference on Neural Networks (IJCNN), pages 845–851. IEEE, 2008.

[6] Moez Baccouche, Franck Mamalet, Christian Wolf, Christophe Garcia, and


Atilla Baskurt. Sequential deep learning for human action recognition. In Inter- national Workshop on Human Behavior Understanding, pages 29–39. Springer, 2011.

[7] Pierre Baldi, Peter Sadowski, and Daniel Whiteson. Searching for exotic parti- cles in high-energy physics with deep learning. Nature communications, 5:4308, 2014.

[8] Herbert Bay, Tinne Tuytelaars, and Luc Van Gool. Surf: Speeded up robust features. In European Conference on Computer Vision (ECCV), pages 404–417. Springer, 2006.

[9] Yoshua Bengio. Practical recommendations for gradient-based training of deep architectures. In Neural networks: Tricks of the trade, pages 437–478. Springer, 2012.

[10] Yoshua Bengio, Nicolas Boulanger-Lewandowski, and Razvan Pascanu. Ad- vances in optimizing recurrent networks. In International Conference on Acous- tics, Speech and Signal Processing (ICASSP), pages 8624–8628. IEEE, 2013.

[11] Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798–1828, 2013.

[12] Yoshua Bengio, Pascal Lamblin, Dan Popovici, and Hugo Larochelle. Greedy layer-wise training of deep networks. In Advances in Neural Information Pro- cessing Systems (NIPS), pages 153–160. MIT Press, 2007.

[13] Julian Besag. Spatial interaction and the statistical analysis of lattice systems. Journal of the Royal Statistical Society. Series B (Methodological), 36(2):192–236, 1974.

[14] Christopher Bishop. Pattern recognition and machine learning. Springer, 2006.

[15] Aaron F. Bobick and James W. Davis. The recognition of human movement using temporal templates. Transactions on Pattern Analysis and Machine Intel- ligence, 23(3):257–267, 2001.

[16] Bernhard E Boser, Isabelle M Guyon, and Vladimir N Vapnik. A training al- gorithm for optimal margin classifiers. In Annual Workshop on Computational Learning Theory (COLT), pages 144–152. ACM, 1992.

[17] Raphael C Brito and Hansenclever F Bassani. Self-organizing maps with vari- able input length for motif discovery and word segmentation. In International Joint Conference on Neural Networks (IJCNN), pages 1–8. IEEE, 2018.

[18] Joan Bruna and Stéphane Mallat. Invariant scattering convolution networks. Transactions on Pattern Analysis and Machine Intelligence, 35(8):1872–1886, 2013.

[19] Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 6299–6308. IEEE, 2017.

[20] Tsung-Han Chan, Kui Jia, Shenghua Gao, Jiwen Lu, Zinan Zeng, and Yi Ma. PCANet: A simple deep learning baseline for image classification? Transac- tions on Image Processing, 24(12):5017–5032, 2015.

[21] Hsuan-Sheng Chen, Hua-Tsung Chen, Yi-Wen Chen, and Suh-Yin Lee. Hu- man action recognition using star skeleton. In International Workshop on Video Surveillance and Sensor Networks (VSSN), pages 171–178. ACM, 2006.

[22] Guilhem Chéron, Ivan Laptev, and Cordelia Schmid. P-cnn: Pose-based cnn

features for action recognition. In International Conference on Computer Vision (ICCV), pages 3218–3226. IEEE, 2015.

[23] Anna Choromanska, Mikael Henaff, Michael Mathieu, Gérard Ben Arous, and Yann LeCun. The loss surfaces of multilayer networks. Journal of Machine Learning Research, 38:192–204, 2015.

[24] Dan Cireşan, Ueli Meier, Luca Maria Gambardella, and Jürgen Schmidhuber. Deep big simple neural nets excel on handwritten digit recognition. Neural Computation, 22(12):3207–3220, 2010.

[25] Dan Cireşan, Ueli Meier, Jonathan Masci, and Jürgen Schmidhuber. A committee of neural networks for traffic sign classification. In International Joint Conference on Neural Networks (IJCNN), pages 1918–1921. IEEE, 2011.

[26] Peter Clifford. Markov random fields in statistics. Disorder in physical systems: A volume in honour of John M. Hammersley, 19, 1990.

[27] Adam Coates, Andrew Y Ng, and Honglak Lee. An analysis of single-layer networks in unsupervised feature learning. In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 215–223. JMLR, 2011.

[28] Adam Coates and Andrew Y Ng. The importance of encoding versus training with sparse coding and vector quantization. In International Conference on Machine Learning (ICML), pages 921–928. ACM, 2011.

[29] Adam Coates and Andrew Y Ng. Selecting receptive fields in deep networks. In Advances in Neural Information Processing Systems (NIPS), pages 2528–2536. Curran Associates, Inc., 2011.

[30] Macario O Cordel, Arren Matthew C Antioquia, and Arnulfo P Azcarraga.

Self-organizing maps as feature detectors for supervised neural network pat- tern recognition. In International Conference on Neural Information Processing (ICONIP), pages 618–625. Springer, 2016.

[31] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine learn- ing, 20(3):273–297, 1995.

[32] Camille Couprie, Clément Farabet, Laurent Najman, and Yann Lecun. Indoor semantic segmentation using depth information. In International Conference on Learning Representations (ICLR), pages 1–8, 2013.

[33] George R Cross and Anil K Jain. Markov random field texture models. Trans- actions on Pattern Analysis and Machine Intelligence, (1):25–39, 1983.

[34] Gabriella Csurka, Christopher Dance, Lixin Fan, Jutta Willamowski, and Cédric Bray. Visual categorization with bags of keypoints. In Workshop on Statistical Learning in Computer Vision, European Conference on Computer Vision

(ECCV), pages 1–22. Springer, 2004.

[35] Eugenio Culurciello, Jonghoon Jin, Aysegul Dundar, and Jordan Bates. An anal- ysis of the connections between layers of deep neural networks. arXiv preprint arXiv:1306.0152, 2013.

[36] George Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4):303–314, 1989.

[37] George E Dahl, Navdeep Jaitly, and Ruslan Salakhutdinov. Multi-task neural networks for qsar predictions. arXiv preprint arXiv:1406.1231, 2014.

[38] Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human detection. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 886–893. IEEE, 2005.

[39] Óscar Déniz, Gloria Bueno, Jesús Salido, and Fernando De la Torre. Face recognition using histograms of oriented gradients. Pattern Recognition Letters, 32(12):1598–1603, 2011.

[40] Piotr Dollár, Vincent Rabaud, Garrison Cottrell, and Serge Belongie. Behavior recognition via sparse spatio-temporal features. In International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance

(PETS), pages 65–72. IEEE, 2005.

[41] Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, and Trevor Darrell. Decaf: A deep convolutional activation feature for generic visual recognition. In International Conference on Machine Learning (ICML), pages 647–655. ACM, 2014.

[42] Le Dong, Ling He, Gaipeng Kong, Qianni Zhang, Xiaochun Cao, and Ebroul Izquierdo. CUNet: A compact unsupervised network for image classification. arXiv preprint arXiv:1607.01577, 2016.

[43] Stephan Dreiseitl and Lucila Ohno-Machado. Logistic regression and artifi- cial neural network classification models: a methodology review. Journal of Biomedical Informatics, 35(5-6):352–359, 2002.

[44] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.

[45] Aysegul Dundar, Jonghoon Jin, and Eugenio Culurciello. Convolutional clus- tering for unsupervised learning. arXiv preprint arXiv:1511.06241, 2015.

[46] Alexei A Efros, Alexander C Berg, Greg Mori, and Jitendra Malik. Recognizing

action at a distance. In International Conference on Computer Vision (ICCV), pages 726–733. IEEE, 2003.

[47] Dumitru Erhan, Yoshua Bengio, Aaron Courville, Pierre-Antoine Manzagol, Pascal Vincent, and Samy Bengio. Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research, 11(Feb):625–660, 2010.

[48] Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. Liblinear: A library for large linear classification. Journal of Machine Learning Research, 9(Aug):1871–1874, 2008.

[49] Li Fei-Fei and Pietro Perona. A bayesian hierarchical model for learning natural scene categories. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 524–531. IEEE, 2005.

[50] Pedro F Felzenszwalb, Ross B Girshick, David McAllester, and Deva Ramanan. Object detection with discriminatively trained part-based models. Transactions on Pattern Analysis and Machine Intelligence, 32(9):1627–1645, 2010.

[51] Bernd Fritzke. A growing neural gas network learns topologies. In Advances in Neural Information Processing Systems (NIPS), pages 625–632, 1995.

[52] Kunihiko Fukushima. Cognitron: A self-organizing multilayered neural net- work. Biological Cybernetics, 20(3-4):121–136, 1975.

[53] Kunihiko Fukushima and Sei Miyake. Neocognitron: A self-organizing neural network model for a mechanism of visual pattern recognition. In Competition and Cooperation in Neural Nets, pages 267–285. Springer, 1982.

[54] Stuart Geman and Donald Geman. Stochastic relaxation, gibbs distributions, and the bayesian restoration of images. Transactions on Pattern Analysis and Machine Intelligence, (6):721–741, 1984.

[55] Martin A Giese and Tomaso Poggio. Cognitive neuroscience: Neural mech- anisms for the recognition of biological movements. Nature Reviews Neuro- science, 4(3):179–192, 2003.

[56] Georgia Gkioxari and Jitendra Malik. Finding action tubes. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 759–768. IEEE, 2015.

[57] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In International Conference on Artificial Intelli- gence and Statistics (AISTATS), pages 249–256. JMLR, 2010.

[58] Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep sparse rectifier neural networks. In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 315–323. JMLR, 2011.

[59] Melvyn A Goodale and A David Milner. Separate visual pathways for percep- tion and action. Trends in neurosciences, 15(1):20–25, 1992.

[60] Ian Goodfellow, Yoshua Bengio, Aaron Courville, and Yoshua Bengio. Deep learning. MIT press Cambridge, 2016.

[61] Ian J Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron Courville, and Yoshua Bengio. Maxout networks. arXiv preprint arXiv:1302.4389, 2013.

[62] Lena Gorelick, Moshe Blank, Eli Shechtman, Michal Irani, and Ronen Basri. Actions as space-time shapes. Transactions on Pattern Analysis and Machine Intelligence, 29(12):2247–2253, 2007.

[63] Kristen Grauman and Trevor Darrell. The pyramid match kernel: discriminative classification with sets of image features. In International Conference on Computer Vision (ICCV), pages 1458–1465. IEEE, 2005.

[64] Alex Graves and Navdeep Jaitly. Towards end-to-end speech recognition with recurrent neural networks. In International Conference on Machine Learning (ICML), pages 1764–1772. ACM, 2014.

[65] Alex Graves, Marcus Liwicki, Santiago Fernández, Roman Bertolami, Horst Bunke, and Jürgen Schmidhuber. A novel connectionist system for unconstrained handwriting recognition. Transactions on Pattern Analysis and Machine Intelligence, 31(5):855–868, 2009.

[66] Ziad M Hafed and Martin D Levine. Face recognition using the discrete cosine transform. International Journal of Computer Vision, 43(3):167–188, 2001.

[67] John M Hammersley and Peter Clifford. Markov fields on finite graphs and lattices. 1971.

[68] Richard Hankins, Yao Peng, and Hujun Yin. SOMNet: unsupervised feature learning networks for image classification. In International Joint Conference on Neural Networks (IJCNN), pages 1221–1228. IEEE, 2018.

[69] Richard Hankins, Yao Peng, and Hujun Yin. Towards complex features: com- petitive receptive fields in unsupervised deep networks. In International Confer- ence on Intelligent Data Engineering and Automated Learning (IDEAL), pages 838–848. Springer, 2018.

[70] Hedi Harzallah, Frédéric Jurie, and Cordelia Schmid. Combining efficient object localization and image classification. In International Conference on Computer Vision (ICCV), pages 237–244. IEEE, 2009.

[71] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In International Conference on Computer Vision (ICCV), pages 2961–2969. IEEE, 2017.

[72] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. Transactions on Pattern Analysis and Machine Intelligence, 37(9):1904–1916, 2015.

[73] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778. IEEE, 2016.

[74] G. E. Hinton, J. L. McClelland, and D. E. Rumelhart. Parallel distributed pro- cessing: Explorations in the microstructure of cognition, vol. 1. chapter Dis- tributed Representations, pages 77–109. MIT Press, 1986.

[75] Geoffrey E Hinton. To recognize shapes, first learn to generate images. Progress in Brain Research, 165:535–547, 2007.

[76] Geoffrey E Hinton, Simon Osindero, and Yee-Whye Teh. A fast learning algo- rithm for deep belief nets. Neural computation, 18(7):1527–1554, 2006.

[77] Geoffrey E Hinton and Ruslan R Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.

[78] Geoffrey E Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Rus- lan R Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.

[79] Sepp Hochreiter and Jurgen¨ Schmidhuber. LSTM can solve hard long time lag problems. In Advances in Neural Information Processing Systems (NIPS), pages 473–479, 1997.

[80] Harold Hotelling. Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology, 24(6):417, 1933.

[81] Chih-Wei Hsu, Chih-Chung Chang, Chih-Jen Lin, et al. A practical guide to support vector classification. 2003.

[82] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In Conference on Computer Vision and Pattern Recognition, pages 4700–4708. IEEE, 2017.

[83] David H Hubel. Eye, brain, and vision. Scientific American Library/Scientific American Books, 1995.

[84] David H Hubel and Torsten N Wiesel. Receptive fields, binocular interaction and functional architecture in the cat’s visual cortex. The Journal of Physiology, 160(1):106–154, 1962.

[85] David H Hubel and Torsten N Wiesel. Receptive fields and functional archi- tecture of monkey striate cortex. The Journal of Physiology, 195(1):215–243, 1968.

[86] Sergey Ioffe and Christian Szegedy. Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.

[87] Herbert Jaeger. The echo state approach to analysing and training recurrent neu- ral networks-with an erratum note. Bonn, Germany: German National Research Center for Information Technology GMD Technical Report, 148(34):13, 2001.

[88] Kevin Jarrett, Koray Kavukcuoglu, Yann LeCun, et al. What is the best multi- stage architecture for object recognition? In International Conference on Com- puter Vision (ICCV), pages 2146–2153. IEEE, 2009.

[89] Shuiwang Ji, Wei Xu, Ming Yang, and Kai Yu. 3d convolutional neural networks

for human action recognition. Transactions on Pattern Analysis and Machine Intelligence, 35(1):221–231, 2013.

[90] Doreen Jirak, Pablo Barros, and Stefan Wermter. Dynamic gesture recognition using echo state networks. In European Symposium on Artificial Neural Net- works (ESANN), pages 475–480, 2015.

[91] Felix Juefei-Xu, Vishnu Naresh Boddeti, and Marios Savvides. Local binary convolutional neural networks. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 19–28. IEEE, 2017.

[92] J-K Kamarainen, Ville Kyrki, and Heikki Kalviainen. Invariance properties of gabor filter-based features-overview and applications. Transactions on Image Processing, 15(5):1088–1099, 2006.

[93] Juho Kannala and Esa Rahtu. Bsif: Binarized statistical image features. In Inter- national Conference on Pattern Recognition (ICPR), pages 1363–1366. IEEE, 2012.

[94] Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Suk- thankar, and Li Fei-Fei. Large-scale video classification with convolutional neural networks. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 1725–1732. IEEE, 2014.

[95] Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheen- dra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017.

[96] Ho-Joon Kim, Joseph S Lee, and Hyun-Seung Yang. Human action recognition

using a modified convolutional neural network. In International Symposium on Neural Networks, pages 715–723. Springer, 2007.

[97] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimiza- tion. arXiv preprint arXiv:1412.6980, 2014.

[98] Teuvo Kohonen. Self-organized formation of topologically correct feature maps. Biological cybernetics, 43(1):59–69, 1982.

[99] Philipp Krähenbühl, Carl Doersch, Jeff Donahue, and Trevor Darrell. Data-dependent initializations of convolutional neural networks. arXiv preprint arXiv:1511.06856, 2015.

[100] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.

[101] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NIPS), pages 1097–1105. Curran Associates, Inc., 2012.

[102] Hildegard Kuehne, Hueihan Jhuang, Estíbaliz Garrote, Tomaso Poggio, and Thomas Serre. Hmdb: a large video database for human motion recognition. In International Conference on Computer Vision (ICCV), pages 2556–2563. IEEE, 2011.

[103] Tian Lan, Yang Wang, and Greg Mori. Discriminative figure-centric models for joint action localization and recognition. In International Conference on Computer Vision (ICCV), pages 2003–2010. IEEE, 2011.

[104] Ivan Laptev. On space-time interest points. International Journal of Computer Vision, 64(2-3):107–123, 2005.

[105] Ivan Laptev, Marcin Marszalek, Cordelia Schmid, and Benjamin Rozenfeld. Learning realistic human actions from movies. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–8. IEEE, 2008.

[106] Fabien Lauer, Ching Y Suen, and Gérard Bloch. A trainable feature extractor for handwritten digit recognition. Pattern Recognition, 40(6):1816–1824, 2007.

[107] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 2169–2178. IEEE, 2006.

[108] Lei Le, Andrew Patterson, and Martha White. Supervised autoencoders: Im- proving generalization performance with unsupervised regularizers. In Ad- vances in Neural Information Processing Systems (NIPS), pages 107–117, 2018.

[109] Quoc V Le, Will Y Zou, Serena Y Yeung, and Andrew Y Ng. Learning hierar- chical invariant spatio-temporal features for action recognition with independent subspace analysis. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 3361–3368. IEEE, 2011.

[110] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436, 2015.

[111] Yann LeCun, Bernhard E Boser, John S Denker, Donnie Henderson, Richard E Howard, Wayne E Hubbard, and Lawrence D Jackel. Handwritten digit recog- nition with a back-propagation network. In Advances in Neural Information Processing Systems (NIPS), pages 396–404. Morgan-Kaufmann, 1990.

[112] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

[113] Honglak Lee, Roger Grosse, Rajesh Ranganath, and Andrew Y Ng. Convo- lutional deep belief networks for scalable unsupervised learning of hierarchi- cal representations. In International Conference on Machine Learning (ICML), pages 609–616. ACM, 2009.

[114] Breiman Leo, Jerome H Friedman, Richard A Olshen, and Charles J Stone. Classification and regression trees. Chapman and Hall/CRC, 1984.

[115] Ming Liang and Xiaolin Hu. Recurrent convolutional neural network for ob- ject recognition. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 3367–3375. IEEE, 2015.

[116] Min Lin, Qiang Chen, and Shuicheng Yan. Network in network. arXiv preprint arXiv:1312.4400, 2013.

[117] Tsung-Han Lin and HT Kung. Stable and efficient representation learning with nonnegativity constraints. In International Conference on Machine Learning (ICML), pages 1323–1331. JMLR, 2014.

[118] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: common objects in context. In European Conference on Computer Vision (ECCV), pages 740–755. Springer, 2014.

[119] Dimitri A Lisin, Marwan A Mattar, Matthew B Blaschko, Erik G Learned-Miller, and Mark C Benfield. Combining local and global image features for object class recognition. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 47–47. IEEE, 2005.

[120] Jun Liu, Gang Wang, Ping Hu, Ling-Yu Duan, and Alex C Kot. Global context- aware attention lstm networks for 3d action recognition. In Conference on Com- puter Vision and Pattern Recognition (CVPR), pages 1647–1656. IEEE, 2017.

[121] David G Lowe et al. Object recognition from local scale-invariant features. In International Conference on Computer Vision (ICCV), pages 1150–1157. IEEE, 1999.

[122] Shangzhen Luan, Chen Chen, Baochang Zhang, Jungong Han, and Jianzhuang Liu. Gabor convolutional networks. Transactions on Image Processing, 27(9):4357–4366, 2018.

[123] S. Marčelja. Mathematical description of the responses of simple cortical cells. JOSA, 70(11):1297–1300, 1980.

[124] Stephen Marsland, Jonathan Shapiro, and Ulrich Nehmzow. A self-organising network that grows when required. Neural networks, 15(8-9):1041–1058, 2002.

[125] Thomas Martinetz and Klaus Schulten. A “neural-gas” network learns topolo- gies. In International Conference on Artificial Neural Networks (ICANN), pages 397–402, 1991.

[126] Warren S McCulloch and Walter Pitts. A logical calculus of the ideas imma- nent in nervous activity. The bulletin of mathematical biophysics, 5(4):115–133, 1943.

[127] Luiza Mici, Xavier Hinaut, and Stefan Wermter. Activity recognition with echo state networks using 3d body joints and objects category. In European Sympo- sium on Artificial Neural Networks, Computational Intelligence and Machine

Learning (ESANN), pages 465–470, 2016.

[128] Luiza Mici, German I Parisi, and Stefan Wermter. Recognition and prediction of human-object interactions with a self-organizing architecture. In International Joint Conference on Neural Networks (IJCNN), pages 1–8. IEEE, 2018.

[129] Marvin Minsky and Seymour Papert. Perceptrons: An introduction to computa- tional geometry. MIT Press, 1969.

[130] Ishan Misra, C Lawrence Zitnick, and Martial Hebert. Shuffle and learn: unsu- pervised learning using temporal order verification. In European Conference on Computer Vision (ECCV), pages 527–544. Springer, 2016.

[131] Ehsan Mohebi and Adil Bagirov. A convolutional recursive modified self orga- nizing map for handwritten digits recognition. Neural Networks, 60:104–118, 2014.

[132] Kevin P Murphy. Machine Learning: A Probabilistic Perspective. MIT Press, 2012.

[133] Natalia Neverova, Christian Wolf, Graham Taylor, and Florian Nebout. Mod- drop: adaptive multi-modal gesture recognition. Transactions on Pattern Anal- ysis and Machine Intelligence, 38(8):1692–1706, 2016.

[134] Cong Jie Ng and Andrew Beng Jin Teoh. Dctnet: a simple learning-free ap- proach for face recognition. In Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA), pages 761–768. IEEE, 2015.

[135] Jiquan Ngiam, Zhenghao Chen, Daniel Chia, Pang W Koh, Quoc V Le, and Andrew Y Ng. Tiled convolutional neural networks. In Advances in Neural Information Processing Systems (NIPS), pages 1279–1287. Curran Associates, Inc., 2010.

[136] Anh Nguyen, Jason Yosinski, and Jeff Clune. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 427–436. IEEE, 2015.

[137] Timo Ojala, Matti Pietikäinen, and David Harwood. A comparative study of texture measures with classification based on featured distributions. Pattern recognition, 29(1):51–59, 1996.

[138] Timo Ojala, Matti Pietikäinen, and Topi Mäenpää. Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. Transactions on Pattern Analysis and Machine Intelligence, (7):971–987, 2002.

[139] Aude Oliva and Antonio Torralba. Modeling the shape of the scene: A holis- tic representation of the spatial envelope. International Journal of Computer Vision, 42(3):145–175, 2001.

[140] Maxime Oquab, Leon Bottou, Ivan Laptev, and Josef Sivic. Learning and trans- ferring mid-level image representations using convolutional neural networks. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 1717– 1724. IEEE, 2014.

[141] Tom Le Paine, Pooya Khorrami, Wei Han, and Thomas S Huang. An anal- ysis of unsupervised pre-training in light of recent advances. arXiv preprint arXiv:1412.6597, 2014.

[142] Sinno Jialin Pan, Qiang Yang, et al. A survey on transfer learning. Transactions on Knowledge and Data Engineering, 22(10):1345–1359, 2010.

[143] German Ignacio Parisi and Stefan Wermter. Hierarchical som-based detection

of novel behavior for 3d human tracking. In International Joint Conference on Neural Networks (IJCNN), pages 1–8. IEEE, 2013.

[144] Yao Peng, Richard Hankins, and Hujun Yin. Data-independent feature learning with markov random fields in convolutional neural networks. Neurocomputing, In press 2019.

[145] Yao Peng and Hujun Yin. Markov random field based convolutional neural networks for image classification. In International Conference on Intelligent Data Engineering and Automated Learning (IDEAL), pages 387–396. Springer, 2017.

[146] Ramprasad Polana and Randal Nelson. Low level recognition of human motion (or how to get your man without finding his body parts). In Workshop on Motion of Non-Rigid and Articulated Objects, pages 77–82. IEEE, 1994.

[147] Ronald Poppe. A survey on vision-based human action recognition. Image and Vision Computing, 28(6):976–990, 2010.

[148] Marc’Aurelio Ranzato, Christopher Poultney, Sumit Chopra, and Yann LeCun. Efficient learning of sparse representations with an energy-based model. In Advances in Neural Information Processing Systems (NIPS), pages 1137–1144, 2007.

[149] Kishore K Reddy and Mubarak Shah. Recognizing 50 human action categories of web videos. Machine Vision and Applications, 24(5):971–981, 2013.

[150] Mikel D Rodriguez, Javed Ahmed, and Mubarak Shah. Action mach a spatio-temporal maximum average correlation height filter for action recognition. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–8. IEEE, 2008.

[151] Frank Rosenblatt. The perceptron: a probabilistic model for information storage and organization in the brain. Psychological review, 65(6):386, 1958.

[152] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning rep- resentations by back-propagating errors. Nature, 323(6088):533, 1986.

[153] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.

[154] Sreemanananth Sadanand and Jason J Corso. Action bank: A high-level repre- sentation of activity in video. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 1234–1241. IEEE, 2012.

[155] Wojciech Samek, Alexander Binder, Grégoire Montavon, Sebastian Lapuschkin, and Klaus-Robert Müller. Evaluating the visualization of what a deep neural network has learned. Transactions on Neural Networks and Learning Systems, 28(11):2660–2673, 2017.

[156] Syed Shakib Sarwar, Priyadarshini Panda, and Kaushik Roy. Gabor filter as- sisted energy efficient fast learning convolutional neural networks. In Interna- tional Symposium on Low Power Electronics and Design (ISLPED), pages 1–6. IEEE, 2017.

[157] Pierre Sermanet, David Eigen, Xiang Zhang, Michaël Mathieu, Rob Fergus, and Yann LeCun. Overfeat: Integrated recognition, localization and detection using convolutional networks. arXiv preprint arXiv:1312.6229, 2013.

[158] Ali Sharif Razavian, Hossein Azizpour, Josephine Sullivan, and Stefan Carls- son. Cnn features off-the-shelf: an astounding baseline for recognition. In Con- ference on Computer Vision and Pattern Recognition Workshops, pages 806– 813. IEEE, 2014.

[159] Laurent Sifre and Stéphane Mallat. Rotation, scaling and deformation invariant scattering for texture discrimination. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 1233–1240, 2013.

[160] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneer- shelvam, Marc Lanctot, et al. Mastering the game of go with deep neural net- works and tree search. Nature, 529(7587):484, 2016.

[161] Karen Simonyan and Andrew Zisserman. Two-stream convolutional networks for action recognition in videos. In Advances in Neural Information Processing Systems (NIPS), pages 568–576, 2014.

[162] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

[163] Berkan Solmaz, Shayan Modiri Assari, and Mubarak Shah. Classifying web videos using a global video descriptor. Machine Vision and Applications, 24(7):1473–1485, 2013.

[164] Khurram Soomro and Amir R Zamir. Action recognition in realistic sports videos. In Computer Vision in Sports, pages 181–208. Springer, 2014.

[165] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. UCF101: a dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.

[166] Jost Tobias Springenberg, Alexey Dosovitskiy, Thomas Brox, and Martin Ried- miller. Striving for simplicity: The all convolutional net. arXiv preprint arXiv:1412.6806, 2014.

[167] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Rus- lan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014.

[168] Kartick Subramanian and Sundaram Suresh. Human action recognition using meta-cognitive neuro-fuzzy inference system. International Journal of Neural Systems, 22(06):1250028, 2012.

[169] Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization and momentum in deep learning. In International Conference on Machine Learning (ICML), pages 1139–1147. ACM, 2013.

[170] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabi- novich. Going deeper with convolutions. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–9. IEEE, 2015.

[171] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Er- han, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.

[172] Lisa Torrey and Jude Shavlik. Transfer learning. In Handbook of Research on Machine Learning Applications and Trends: Algorithms, Methods, and Tech-

niques, pages 242–264. IGI Global, 2010.

[173] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar

Paluri. Learning spatiotemporal features with 3d convolutional networks. In International Conference on Computer Vision, pages 4489–4497. IEEE, 2015.

[174] Du Tran, Jamie Ray, Zheng Shou, Shih-Fu Chang, and Manohar Paluri. Con- vnet architecture search for spatiotemporal feature learning. arXiv preprint arXiv:1708.05038, 2017.

[175] Md Azher Uddin, Joolekha Bibi Joolee, Aftab Alam, and Young-Koo Lee. Hu- man action recognition using adaptive local motion descriptor in spark. IEEE Access, 5:21157–21167, 2017.

[176] Eladio Alvarez Valle and Oleg Starostenko. Recognition of human walking/run- ning actions based on neural network. In International Conference on Electri- cal Engineering, Computing Science and Automatic Control (CCE), pages 239– 244. IEEE, 2013.

[177] Koen EA Van de Sande, Jasper RR Uijlings, Theo Gevers, Arnold WM Smeul- ders, et al. Segmentation as selective search for object recognition. In Interna- tional Conference on Computer Vision (ICCV), pages 1879–1886, 2011.

[178] Joost Van Doorn. Analysis of deep convolutional neural network architectures. In Twente Student Conference on IT, Enschede, The Netherlands, pages 1–7, 2014.

[179] Paul Viola and Michael Jones. Rapid object detection using a boosted cascade of simple features. Conference on Computer Vision and Pattern Recognition (CVPR), pages 511–518, 2001.

[180] Raimar Wagner, Markus Thom, Roland Schweiger, Gunther Palm, and Albrecht

Rothermel. Learning convolutional neural networks from few samples. In In- ternational Joint Conference on Neural Networks (IJCNN), pages 1–7. IEEE, 2013.

[181] Li Wan, Matthew Zeiler, Sixin Zhang, Yann Le Cun, and Rob Fergus. Regular- ization of neural networks using dropconnect. In International Conference on Machine Learning (ICML), pages 1058–1066. ACM, 2013.

[182] Heng Wang and Cordelia Schmid. Action recognition with improved trajec- tories. In International Conference on Computer Vision (ICCV), pages 3551– 3558. IEEE, 2013.

[183] Heng Wang, Muhammad Muneeb Ullah, Alexander Klaser, Ivan Laptev, and Cordelia Schmid. Evaluation of local spatio-temporal features for action recog- nition. In British Machine Vision Conference (BMVC), pages 124–1. BMVA Press, 2009.

[184] Jian-Gang Wang, Jun Li, Chong Yee Lee, and Wei-Yun Yau. Dense sift and gabor descriptors-based face representation with applications to gender recog- nition. In International Conference on Control Automation Robotics and Vision, pages 1860–1864. IEEE, 2010.

[185] LiMin Wang, Yu Qiao, and Xiaoou Tang. Mining motion atoms and phrases for complex action recognition. In International Conference on Computer Vision (ICCV), pages 2680–2687. IEEE, 2013.

[186] Xuanhan Wang, Lianli Gao, Jingkuan Song, and Hengtao Shen. Beyond frame-level cnn: saliency-aware 3-d cnn with lstm for video action recognition. Signal Processing Letters, 24(4):510–514, 2017.

[187] Yunchao Wei, Wei Xia, Junshi Huang, Bingbing Ni, Jian Dong, Yao Zhao, and Shuicheng Yan. Cnn: single-label to multi-label. arXiv preprint arXiv:1406.5726, 2014.

[188] Daniel Weinland, Remi Ronfard, and Edmond Boyer. A survey of vision-based methods for action representation, segmentation and recognition. Computer Vision and Image Understanding, 115(2):224–241, 2011.

[189] Junji Yamato, Jun Ohya, and Kenichiro Ishii. Recognizing human action in time-sequential images using hidden markov model. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 379–385. IEEE, 1992.

[190] Jianchao Yang, Kai Yu, Yihong Gong, and Thomas Huang. Linear spatial pyra- mid matching using sparse coding for image classification. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 1794–1801. IEEE, 2009.

[191] Hu Yao, Li Chuyi, Hu Dan, and Yu Weiyu. Gabor feature based convolutional neural network for object recognition in natural scene. In International Con- ference on Information Science and Control Engineering (ICISCE), pages 386– 390. IEEE, 2016.

[192] Hujun Yin. The self-organizing maps: background, theories, extensions and applications. In Computational intelligence: A compendium, pages 715–762. Springer, 2008.

[193] Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. How transferable are features in deep neural networks? In Advances in Neural Information Processing Systems (NIPS), pages 3320–3328, 2014.

[194] Joe Yue-Hei Ng, Matthew Hausknecht, Sudheendra Vijayanarasimhan, Oriol Vinyals, Rajat Monga, and George Toderici. Beyond short snippets: Deep net- works for video classification. In Conference on Computer vision and Pattern Recognition (CVPR), pages 4694–4702. IEEE, 2015.

[195] Matthew D Zeiler and Rob Fergus. Stochastic pooling for regularization of deep convolutional neural networks. arXiv preprint arXiv:1301.3557, 2013.

[196] Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In European Conference on Computer Vision (ECCV), pages 818– 833. Springer, 2014.

[197] Yuting Zhang, Kibok Lee, and Honglak Lee. Augmenting supervised neural networks with unsupervised objectives for large-scale image classification. In International Conference on Machine Learning (ICML), pages 612–621. ACM, 2016.

[198] Bolei Zhou, Agata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Tor- ralba. Places: A 10 million image database for scene recognition. Transactions on Pattern Analysis and Machine Intelligence, 40(6):1452–1464, 2017.