UNSUPERVISED IMAGE FEATURE LEARNING FOR CONVOLUTIONAL NEURAL NETWORKS

A THESIS SUBMITTED TO THE UNIVERSITY OF MANCHESTER FOR THE DEGREE OF DOCTOR OF PHILOSOPHY IN THE FACULTY OF SCIENCE AND ENGINEERING

2019

By Richard Hankins
Candidate
Department of Electrical and Electronic Engineering
School of Engineering

Contents

Abstract

Declaration

Copyright

Acknowledgements

List of Publications

1 Introduction
  1.1 Background
  1.2 Aims and Objectives
  1.3 Scope
  1.4 Structure

2 Classical Methodologies
  2.1 Feature Representations and Learning
    2.1.1 Hand-crafted Feature Representations
    2.1.2 Unsupervised Feature Learning
  2.2 Classifiers
    2.2.1 Logistic Regression
    2.2.2 Support Vector Machines
    2.2.3 k-nearest Neighbours and Decision Trees
  2.3 Datasets
    2.3.1 Image Datasets
    2.3.2 Video Datasets

3 Deep Neural Networks
  3.1 Introduction
    3.1.1 Feedforward Networks
  3.2 Related Work
  3.3 Convolutional Neural Networks
    3.3.1 Layers and Architectures
    3.3.2 Other Layers
    3.3.3 Optimisation and the Backpropagation Algorithm
    3.3.4 Pre-processing
    3.3.5 Issues
  3.4 2D Convolutional Neural Networks Case Studies
    3.4.1 Image Classification on the MNIST Dataset
    3.4.2 Action Classification on the Weizmann Dataset
  3.5 3D Convolutional Neural Networks Case Study
    3.5.1 Action Classification on the UCF Sports Dataset

4 Self-Organising Map Network
  4.1 Introduction
  4.2 Related Work
  4.3 Methodology
    4.3.1 Convolutional Self-Organising Map
    4.3.2 Discrete Cosine Transform (DCT)
    4.3.3 Markov Random Field
    4.3.4 Self-Organising Map Network (SOMNet)
    4.3.5 Markov Random Field Self-Organising Map Network (MRF-SOMNet)
    4.3.6 Computational Complexity
  4.4 Experiments and Discussion
    4.4.1 Comparison of Different Features and Encodings
    4.4.2 Evaluation on the MNIST Dataset
    4.4.3 Optimising Parameters on the CIFAR-10 Dataset
    4.4.4 Evaluation on the CIFAR-10 Dataset
  4.5 Conclusions and Future Work

5 SOMNet with Aggregated Channel Connections
  5.1 Introduction

  5.2 Related Work
  5.3 Methodology
    5.3.1 Proposed Method
  5.4 Experiment and Discussion
    5.4.1 Whitening
    5.4.2 Digit Classification on the MNIST Dataset
    5.4.3 Object Classification on the CIFAR-10 Dataset
  5.5 Conclusion and Future Work

6 Filter Replacement in Convolutional Networks using Self-Organising Maps
  6.1 Introduction
  6.2 Related Work
  6.3 Methodology
    6.3.1 Proposed Method
    6.3.2 Self-Organising Maps
    6.3.3 Convolutional Neural Networks
    6.3.4 Filter Replacement with Self-Organising Maps
  6.4 Object Classification Experiments and Discussion on the CIFAR-10 and CIFAR-100 Datasets
    6.4.1 Convolutional Neural Networks
    6.4.2 Filter Replacement with Self-Organising Maps
  6.5 Action Classification Experiments and Discussion on the UCF-50 Dataset
    6.5.1 Convolutional Neural Networks
    6.5.2 Filter Replacement with Self-Organising Maps
  6.6 Conclusion and Future Work

7 Conclusions and Future Work
  7.1 Conclusions
  7.2 Future Work
    7.2.1 SOMNet
    7.2.2 Supervised Channel Pooling
    7.2.3 Combining Supervised and Unsupervised Learning
    7.2.4 Temporal Models

Bibliography

Word Count: 40269

List of Tables

3.1 Validation and test error as well as intra-class error using different subjects as the test set on Weizmann.
3.2 Absolute misclassification for each class using different subjects as the test set on Weizmann.
3.3 3D CNN architecture for UCF Sports
3.4 3D baseline: accuracy on UCF Sports
3.5 3D bounding box: accuracy on UCF Sports
3.6 Accuracy on UCF Sports

4.1 Computational Complexity
4.2 Comparing features and encodings
4.3 Error rate on MNIST
4.4 Variations in block size and overlapping ratio of SOMNet on CIFAR-10
4.5 Variations in feature numbers of SOMNet on CIFAR-10
4.6 Accuracy on CIFAR-10

5.1 FAC layer: error rate on MNIST (p-values represent the comparison between each method with the method labelled not applicable (N/A) for each section).
5.2 SAC layer: error rate on MNIST (p-values represent the comparison between each method with the method labelled not applicable (N/A) for each section).
5.3 GAC layer: error rate on MNIST (p-values represent the comparison between each method with the method labelled not applicable (N/A) for each section).
5.4 Error rate on MNIST (p-values represent the comparison between each method with the method labelled not applicable (N/A) for each section).

5.5 FAC layer: accuracy on CIFAR-10 (the p-values represent the comparison between each method with the method labelled not applicable (N/A)).
5.6 SAC layer: accuracy on CIFAR-10 (p-values represent the comparison between each method with the method labelled not applicable (N/A)).
5.7 GAC layer: accuracy on CIFAR-10 (p-values represent the comparison between each method with the method labelled not applicable (N/A)).
5.8 Accuracy on CIFAR-10 (p-values represent the comparison between each method with the method labelled not applicable (N/A) for each section).

6.1 Baseline 2D CNN architecture for CIFAR-10/CIFAR-100
6.2 Baseline 3D CNN architecture for UCF-50

6.3 Proposed 2D CNN_{NIN}+SOM architecture for CIFAR-10/100
6.4 Baseline 2D CNN accuracy on CIFAR-10 (p-values represent the comparison between each method with the method labelled not applicable (N/A)).
6.5 Baseline 2D CNN accuracy on CIFAR-100 (p-values represent the comparison between each method with the method labelled not applicable (N/A)).
6.6 2D CNN_{NIN}+SOM accuracy on CIFAR-10
6.7 2D CNN_{NIN}+SOM accuracy on CIFAR-10
6.8 2D baseline CNN vs CNN+SOM subset accuracy on CIFAR-10
6.9 2D baseline CNN vs CNN+SOM_{TI} subset accuracy on CIFAR-100
6.10 2D CNN_{NIN}+SOM_{TI} accuracy on CIFAR-10 (p-values represent the comparison between each method with the method labelled not applicable (N/A) in each column).
6.11 2D CNN_{NIN}+SOM_{TI} using 3 × 3 filters accuracy on CIFAR-100 (p-values represent the comparison between each method with the method labelled not applicable (N/A) in each column).
6.12 2D CNN_{NIN}+SOM_{TI} using 5 × 5 filters accuracy on CIFAR-100 (p-values represent the comparison between each method with the method labelled not applicable (N/A) in each column).
6.13 2D CNN_{NIN}+SOM_{TI} using 7 × 7 filters accuracy on CIFAR-100 (p-values represent the comparison between each method with the method labelled not applicable (N/A) in each column).
6.14 Accuracy on CIFAR-10 and CIFAR-100 (p-values represent the comparison between each method with the method labelled not applicable (N/A) for each section).
6.15 Baseline 3D CNN accuracy on UCF-50 (p-values represent the comparison between each method with the method labelled not applicable (N/A) for each section in each column).
6.16 3D CNN+SOM_{TI} accuracy on UCF-50
6.17 3D CNN+SOM_{TI} accuracy on UCF-50 (p-values represent the comparison between each method with the method labelled not applicable (N/A) for each section).
6.18 Accuracy on UCF-50

List of Figures

2.1 Logistic sigmoid function
2.2 A selection of examples from the MNIST dataset
2.3 Frames from example videos from each class of the Weizmann dataset.
2.4 Frames from example videos from each class of the UCF Sports dataset.
2.5 Frames from example videos from each class of the UCF-50 dataset.
2.6 Frames from example videos from each class of the UCF-101 dataset.

3.1 Perceptron model (adapted from Rosenblatt 1958 [151])
3.2 Multi-layer perceptron with a single hidden layer
3.3 Non-linear sigmoidal functions
3.4 Simple cells (adapted from Hubel 1995 [83])
3.5 Complex cells (adapted from Hubel 1995 [83])
3.6 Non-linear activation functions
3.7 LeNet-5 architecture (adapted from LeCun et al. 1998 [112]).
3.8 First layer convolutional filters at different stages of training on MNIST.
3.9 Confusion matrix for the MNIST experiment
3.10 Comparison of 2D (a) and 3D (b) convolutions (the temporal depth of the 3D filter is equal to 3). The colours indicate shared weights (adapted from Ji et al. 2013 [89])
3.11 Confusion matrix for the 3D baseline experiment on UCF Sports.
3.12 Confusion matrix for the 3D bounding box experiment on UCF Sports.

4.1 Block diagram of SOMNet. SOM was used to derive the convolutional layer filter banks, as depicted in Figure 4.2 [68].
4.2 Training the SOM-based filter banks [68].
4.3 Learned SOMNet filters on MNIST. Top row: the first layer filter bank. Bottom row: the second layer filter bank.

4.4 Learned PCANet filters on MNIST. Top row: the first layer filter bank. Bottom row: the second layer filter bank.
4.5 Generated DCTNet filters for MNIST. Replicated across both filter banks.
4.6 Clustered MRF filters for MNIST. Replicated across both filter banks.
4.7 Learned SOMNet filters on CIFAR-10. Top five rows: the first layer filter bank. Bottom four rows: the second layer filter bank.
4.8 Learned PCANet filters on CIFAR-10. Top five rows: the first layer filter bank. Bottom row: the second layer filter bank.

5.1 Application of proposed aggregation layers to a two layer SOMNet. The SOM-based filter banks correspond to the convolutional layers in Fig. 4.1. Each SOM layer is trained and frozen prior to training any subsequent SOM layer [69].
5.2 Learned second layer SOMNet + FAC filters on MNIST. Top four rows: SOMNet_{8-32} architecture. Bottom four rows: SOMNet_{8-32} + FAC architecture.
5.3 Learned second layer SOMNet + FAC filters on MNIST. Top four rows: SOMNet_{16-32} architecture. Bottom four rows: SOMNet_{16-32} + FAC architecture.
5.4 Learned second layer SOMNet + SAC filters on MNIST. Top four rows: SOMNet_{8-32} architecture. Bottom four rows: SOMNet_{8-32} + SAC_6 architecture.
5.5 Learned second layer SOMNet + SAC filters on MNIST. Top four rows: SOMNet_{16-32} architecture. Bottom four rows: SOMNet_{16-32} + SAC_8 architecture.
5.6 Learned second layer SOMNet + GAC filters on MNIST. Top four rows: SOMNet_{8-32} architecture. Bottom four rows: SOMNet_{8-32} + GAC_4 architecture.
5.7 Learned second layer SOMNet + GAC filters on MNIST. Top four rows: SOMNet_{16-32} architecture. Bottom four rows: SOMNet_{16-32} + GAC_4 architecture.
5.8 Learned second layer SOMNet + FAC filters on CIFAR-10. Top four rows: SOMNet_{40-32} architecture. Bottom four rows: SOMNet_{40-32} + FAC architecture.
5.9 Learned second layer SOMNet + SAC filters on CIFAR-10. Top four rows: SOMNet_{40-32} architecture. Bottom four rows: SOMNet_{40-32} + SAC_2 architecture.
5.10 Learned second layer SOMNet + GAC filters on CIFAR-10. Top four rows: SOMNet_{40-32} architecture. Bottom four rows: SOMNet_{40-32} + GAC_2 architecture.

6.1 Learned SOM_{TI} filters on CIFAR-10 using M = 20 × 20 and s = 3.
6.2 Learned SOM_{TI} filters on CIFAR-10 using M = 30 × 30 and s = 5.
6.3 Learned SOM_{TI} filters on CIFAR-10 using M = 30 × 30 and s = 7.
6.4 Accuracy of different SOM sizes (M) for CNN_{NIN}+SOM_{TI} on CIFAR-100

6.5 Learned SOM_{TI} filters on UCF-50 using M = 8 × 8 and t_d = s = 3 (therefore each filter is of size 3 × 3 × 3). (a)–(c) represent each slice of t_d, where t_d is the temporal depth or the number of frames.

Abstract

Robust image classification is a challenging task, since approaches must successfully discriminate between the different classes whilst being able to generalise across a large amount of intra-class variation. In an extension of image classification to the temporal domain, video classification aims to assign accurate human action labels to video sequences. Recently, deep learning and in particular the convolutional neural network (CNN) has made great strides in many computer vision and machine learning tasks. CNNs implicitly learn data-specific hierarchies of salient features with multiple levels of abstraction. However, the increased capacity of these convolutional networks requires vast labelled datasets in order to optimise their parameters. Unsupervised learning offers potential solutions to this problem as it does not require labels and can simply learn the structure of the data. In this work, the current state-of-the-art convolutional networks for both image and video classification, and alternative strategies for feature learning using unsupervised learning, are investigated. In particular, the use of the self-organising map (SOM) to learn unsupervised features, to be used independently or in conjunction with other supervised feature learning methods, in the application of image and video classification, is explored. Firstly, the versatile nature of SOMs is exploited to extend and improve a simple multi-layer unsupervised architecture inspired by PCANet, named SOMNet; secondly, SOMNet is further extended via the proposal of novel unsupervised feature aggregation layers; thirdly, SOMs are used as fixed lower layer weights of CNNs in a novel approach to deep learning pre-training. Comprehensive experiments are conducted on a wide range of datasets, and SOM-based filters are found to maintain or improve classification performance in the majority of cases, even when labelled data is scarce. The wide variety of uses and applications explored in this work demonstrates the robust and versatile nature of simple unsupervised SOM-based approaches and warrants their continued relevance in feature learning, even in the age of deep learning.

Declaration

No portion of the work referred to in this thesis has been submitted in support of an application for another degree or qualification of this or any other university or other institute of learning.

Copyright

i. The author of this thesis (including any appendices and/or schedules to this thesis) owns certain copyright or related rights in it (the “Copyright”) and s/he has given The University of Manchester certain rights to use such Copyright, including for administrative purposes.

ii. Copies of this thesis, either in full or in extracts and whether in hard or electronic copy, may be made only in accordance with the Copyright, Designs and Patents Act 1988 (as amended) and regulations issued under it or, where appropriate, in accordance with licensing agreements which the University has from time to time. This page must form part of any such copies made.

iii. The ownership of certain Copyright, patents, designs, trade marks and other intellectual property (the “Intellectual Property”) and any reproductions of copyright works in the thesis, for example graphs and tables (“Reproductions”), which may be described in this thesis, may not be owned by the author and may be owned by third parties. Such Intellectual Property and Reproductions cannot and must not be made available for use without the prior written permission of the owner(s) of the relevant Intellectual Property and/or Reproductions.

iv. Further information on the conditions under which disclosure, publication and commercialisation of this thesis, the Copyright and any Intellectual Property and/or Reproductions described in it may take place is available in the University IP Policy (see http://documents.manchester.ac.uk/DocuInfo.aspx?DocID=487), in any relevant Thesis restriction declarations deposited in the University Library, The University Library's regulations (see http://www.manchester.ac.uk/library/aboutus/regulations) and in The University's policy on presentation of Theses.

Acknowledgements

Firstly, I would like to thank my supervisor Dr Hujun Yin for his continued support and patience, and Dr Roelof van Silfhout for his useful comments during the first and second year vivas. I would like to thank my colleagues Ali AlSuwaidi, Shireen Zaki, Jing Huo and especially Yao Peng, for their invaluable input and counsel. I would also like to thank my family for all their love and encouragement. Last, but by no means least, I would like to thank my partner Rebecca, for without her love and support I do not think this would have been possible.

List of Publications

[1] Richard Hankins, Yao Peng, and Hujun Yin. SOMNet: Unsupervised feature learning networks for image classification. In International Joint Conference on Neural Networks (IJCNN), pages 1221–1228. IEEE, 2018.

[2] Richard Hankins, Yao Peng, and Hujun Yin. Towards complex features: Competitive receptive fields in unsupervised deep networks. In International Conference on Intelligent Data Engineering and Automated Learning (IDEAL), pages 838–848. Springer, 2018.

[3] Yao Peng, Richard Hankins, and Hujun Yin. Data-independent feature learning with Markov random fields in convolutional neural networks. Neurocomputing, In press 2019.

[4] Corneliu T.C. Arsene, Richard Hankins, and Hujun Yin. Deep learning models for denoising of ECG signals. In European Signal Processing Conference (EUSIPCO). IEEE, In press 2019.

Chapter 1

Introduction

1.1 Background

The classification of images has been a long-standing task of the machine learning and computer vision communities. There are numerous applications for successful implementations, such as medical image analysis, visual geolocation, image retrieval, autonomous systems and augmented reality. For example, medical image analysis could assist in diagnoses and potentially suggest suitable treatment options. With regard to image retrieval, images could be searched for by content instead of their associated title. Or, for a given image, it could be possible to locate particular key regions, such as the location of a particular object. For the successful implementation of autonomous vehicles, the detection of objects in the road, such as cars, people and road signs, will be necessary. Robust object location will also be useful in other autonomous applications such as robotics or manufacturing.

In recent years, recognition applied to 3D data in the form of videos has gained considerable traction, in part due to its many applications, both online and offline, in a wide range of areas. For instance, the classification of human actions has potential applications including surveillance and security, video retrieval, healthcare and entertainment, among others [133, 143]. In particular, the automatic recognition of

abnormal behaviour would be very advantageous in surveillance applications and in the monitoring of patients in healthcare environments. Similarly, action recognition could help in the rehabilitation of patients by examining rehabilitation exercises to ensure they are performed correctly. From a biological perspective, action recognition, which can also encompass other related areas such as gesture recognition and facial recognition, is highly important. Historically, the ability to recognise such movements was necessary for detecting predators and selecting prey [55]. Whilst in modern times these skills are less useful, recognition still plays a vital role in areas such as social communication [55].

The task of vision-based data classification is to assign a named label to an input. Classical approaches to vision-based data classification vary greatly; however, most methods involve two main steps: data representation and classification. Many challenges arise regarding these two steps which make data classification a difficult proposition. The main difficulties for vision-based classification stem from the large amount of intra-class variability attributed to differences in appearance, scale, viewpoint, illumination, deformations, misalignments, occlusions and backgrounds. There are also very specific variations concerning action classification from videos. For example, actions can deviate greatly due to style and duration, and the backgrounds in which they are filmed can be dynamic. Moreover, the classification must allow for the many sub-actions that do not belong to any one action class.

Considerable effort has been made in the area of data representations, where features are often manually designed; however, this requires considerable knowledge and time. In addition, for some tasks it can be difficult to know what features should be extracted. Consider the amount of knowledge required to complete everyday tasks; much of this is intuitive, subjective and difficult to describe [60]. Examples of good hand-crafted approaches include Gabor features [123], local binary patterns (LBP)

[137], histogram of oriented gradients (HOG) [38] and scale invariant feature transform (SIFT) [121]. The job of the classifier is then to learn how each of the features correlates with the different labels; however, it cannot influence how the features are defined, placing great importance on the representation of the data it is presented with [60]. Whilst more complex classifiers have been proposed [16, 31], they still rely heavily on the representations to adequately separate factors of variation.

In recent years, the field of deep learning has made many advances in computer vision and machine learning tasks [11]. Deep learning has previously been known under many names, including connectionism and artificial neural networks (ANN). Its recent rebranding can be attributed to the work of Hinton et al. on a type of neural network called the deep belief network (DBN) [75]. This focused the research community's attention on the concept of depth and coined the term “deep learning”. Deep learning offers a plausible way of jointly extracting features at multiple levels of abstraction and learning a classifier based on these multiple levels of features. This allows the learning of not only the mapping from representation to output, but also the representations themselves. Deep learning is responsible for the current state-of-the-art in areas such as object detection and localisation, recognition and image segmentation [32], as well as other related problems such as speech recognition [37]. It has also been successfully used to assist pharmaceutical companies in the design of new drugs [37], to beat the current world champion at the game of Go [160] and to search for subatomic particles [7]. Its main advantages over previous shallow feed-forward networks are the multiple layers of representations made possible by the deeper structure. Different layers of the network are responsible for different levels of data abstraction, where representations are expressed as the combination of other, more primitive representations [110]. The convolutional neural network (CNN) [111, 112] is a type of deep learning model that uses shared weights to learn local features globally, and it is currently achieving the state-of-the-art in image classification [101], even demonstrating recognition rates that exceed human performance in some cases [25]. However, the cost of such performance is a significant increase in the complexity of the networks, and subsequently the number of parameters, and the need for vast labelled datasets. Yet, the inherent parallel nature of neural networks and convolutional operations can be taken advantage of in order to increase throughput. In particular, the availability of general-purpose computing solutions such as CUDA and OpenCL, which allow GPUs to be used for general-purpose processing, has made them a popular option in the pursuit of faster training times [24]. In fact, much of deep learning's recent success can be attributed to improvements in hardware and data availability. Given these recent improvements, algorithms that have existed since the 1980s are now known to work reasonably well [60]. Initially, there was a focus on unsupervised methods which could generalise well from small datasets, as well as using pre-training to overcome the problems induced by increased complexity.
However, as larger labelled datasets have become standard, these methods have fallen out of favour, and tasks for which data is limited are often tackled by training the network on a larger, related dataset first, using a process called “transfer learning”. Yet, larger datasets present problems with annotation; automatic approaches prove difficult [60, 105] and manually verifying labels is labour intensive.

Despite the popularity of these convolutional networks, the feature learning mechanism and optimal configurations are not well understood [18], and they are often used as a black box. In addition, it has been found that deep networks can confidently classify unrecognisable images [136] and have difficulties recognising adversarial examples which are modified in ways imperceptible to human vision [171]. This research aims to help understand deep learning more fully, particularly the convolutional neural network, as well as to investigate alternative unsupervised representation learning techniques. Whilst unsupervised learning has become unpopular in recent years, it appears that supervised deep learning can only progress with ever-increasing complexity and datasets, which is unsustainable. In contrast, unsupervised learning can take advantage of the plentiful unlabelled data in the age of “big data”.

1.2 Aims and Objectives

Over the course of this PhD research it is intended that I will develop a thorough understanding of the field of data classification, its current practices and their limitations. Considering the advances made by deep learning in the area of data classification, the main aim of this work is to continue investigating deep learning and, in particular, the convolutional neural network for both image and video classification. However, given the drawbacks of the current supervised approach, as discussed in Section 1.1, I will be focusing on alternative unsupervised strategies for representation learning. This aim will be accomplished by completing the following objectives:

• Investigate the principles of feature learning in deep learning and, in particular, the convolutional neural network

• Investigate alternative unsupervised representation learning techniques while maintaining a convolutional structure

• Develop a deep unsupervised convolutional structure for image classification

• Combine unsupervised representation learning with a deep convolutional network for image and video classification

1.3 Scope

Firstly, in reference to data classification, this research is mainly concerned with image (2D) and video (3D) examples. Whilst other types of data will be mentioned in this research, they will only be discussed in reference to the main areas of focus. In addition, whilst there are other related lines of enquiry, such as detection and segmentation, these will not be discussed in any great detail. Specifically, in terms of image classification, only digit and object classification are considered, and therefore other tasks such as facial recognition are not included. Whereas for video classification, only human action classification is considered, and other similar tasks, such as gesture recognition, are not included. Furthermore, whilst data classification could be conducted using a variety of sensory systems, this report will only consider visual-based approaches, from images and videos. However, methods that use multiple cameras will not be explored.

With particular regard to the classification of actions from video, there are a wide range of taxonomies used in the literature. In [147], the author uses a hierarchy of action primitive, action and activity. Action primitives describe limb-level motions, such as moving a leg forward. Actions describe whole-body motions which contain action primitives, such as walking or running, and activities describe events which encompass multiple actions. Therefore, playing football would be considered an activity because it contains actions such as running, kicking and jumping. However, there are some instances where the differences are not always clear; the environment and the interactions between persons or objects can alter perceptions. For example, lifting an object such as a cup could be considered an action primitive; however, the sport of weightlifting could be considered an activity. In addition, the datasets used for evaluating differing approaches do not always make a distinction, often combining examples of action primitives, actions and activities. In consideration of this, this report will use the word action in a generic way to refer to action primitives, actions and activities.

Lastly, whilst this research is conducted in the fields of computer vision and machine learning, no attempt will be made here to describe the entirety of these vast subjects, and only areas directly relevant to the project as a whole will be covered.

1.4 Structure

This thesis is arranged in the following order:

In Chapter 2 an overview of classical approaches to image and video classification is provided. Classical methodologies usually split the task of data classification into two distinct stages. Firstly, inputs are generally represented as higher-level abstractions which can be hand-crafted or learned. Secondly, the resultant descriptors are used to train a classifier. Various methods for both stages are discussed and reviewed, along with video-specific representations and classifiers. In addition, an overview of the datasets used during this research is presented along with an examination of their usage.

In Chapter 3 deep learning and its ability to learn multiple levels of features is discussed. Specifically, convolutional neural networks (CNNs) are explained in detail and some pre-existing approaches to both image and video classification using CNNs are examined. A brief history of progress in neural networks from the early perceptron to the current state-of-the-art CNN is presented. Along the way, the benefits of the multi-layer perceptron, which enabled the joint learning of features and classifier via the backpropagation algorithm, are highlighted. A more thorough discussion of CNNs, their application and current constraints, especially with regard to supervised learning, is given. Lastly, a selection of case study experiments on image and video classification is presented.

In Chapter 4 a multi-level unsupervised SOM-based convolutional architecture for image classification is proposed and examined on two datasets. The architecture provides a simple, transparent approach to multi-level unsupervised feature learning which can take advantage of numerous unlabelled data, in comparison to more complex supervised deep learning models. An examination of different unsupervised and generated feature types is presented, along with their application to a multi-level architecture. Furthermore, various experiments are performed in order to fine-tune the proposed architecture. A modified encoding technique is also proposed.

In Chapter 5 the connections between layers for unsupervised multi-layer architectures are explored, using the proposed method from Chapter 4. Various channel aggregation techniques are proposed in order to facilitate the efficient learning of high-level representations based on the features from previous layers, which requires no additional parameter learning. A thorough investigation of the proposed aggregation layers under different conditions is undertaken.

In Chapter 6 a CNN is combined with filters trained in an unsupervised manner, in order to investigate whether improvements in accuracy can be achieved when replacing low-level features with efficiently trained unsupervised alternatives. Various experiments are performed, including examinations of the filter size and number, on both image and video datasets. In addition, the proposed method is further explored when labelled training data is scarce and using transfer learning.

The report ends with a conclusion which summarises the main findings of this thesis, followed by a thorough discussion on future work.

Chapter 2

Classical Methodologies

This section reviews and discusses the conventional approaches to data classification. Whilst this report attempts to cover a wide variety of methods, it should not be considered an exhaustive list of techniques, only a review of what the author considers to be the most relevant avenues of past investigation.

A task in machine learning is usually determined by how it processes data. A collection of examples is defined as a dataset. Prior to the explosion in the popularity of supervised deep learning, the task of classification was traditionally split into two parts. Firstly, an example that the system is tasked with processing is represented in some measurable way. For instance, for an image or video, the example input or set of features could simply be raw pixel values or some further hand-crafted or learned abstraction. Typically, an input example is described by a vector x ∈ R^n where each x_i represents a separate feature. Common representations and feature learning methodologies are discussed in Section 2.1. Secondly, once an example is sufficiently described, it is the job of the classifier to determine a label for the input example. Examples of commonly employed classifiers are discussed in Section 2.2. The task of classification is generally considered as supervised learning. Feature representations

on the other hand are not considered to be fixed to any one paradigm and many approaches are actually considered learning-free. In addition, datasets that have been used during this research are described and discussed in Section 2.3. Whilst the author has endeavoured to include the most relevant information, for further information on some of the areas discussed here, please consider the following review articles [147, 188]. For more information on machine learning in general, please see [14, 132].

2.1 Feature Representations and Learning

Features are extracted such that they can be used to robustly discriminate between classes, whilst maintaining the ability to generalise over intra-class variations. Methods for extracting features vary greatly. In addition, there are very specific methods for action classification from videos. Representations are generally hand-crafted or learned unsupervisedly. These two main areas are discussed in this section, as well as methods which have been specifically developed for action classification.

Features can be considered as global or local representations. Global features encode the input or a region-of-interest (ROI) as a whole in a dense manner, and tend to deliver robust, computationally inexpensive results [188]. Examples of simple global features are shape, contours [119] or silhouettes [188]. Yet these methods can overly rely on accurate localisation for success, with many applications assuming that an input contains a single example, or the segmentation of an example [119]. Although automatic detectors exist for certain objects, such as faces [179], they also suffer from occlusions, clutter [119], viewpoint and appearance variations [147], and sensitivity to noise due to poor localisation [147]. In order to partly overcome some of the problems associated with global representation, grid-based approaches are often used. Whilst the input assumption still applies, the input is split into different spatial and/or temporal regions, which provide invariance to local variations [38].

When using local feature representations, images are described as a collection of independent patches which are sampled at generally sparse interest points or dense locations. Local features are more robust to changes in viewpoint and appearance, background clutter and occlusion, and do not necessarily require localisation [119]. The location of local features can be determined by interest point detectors, which should locate regions that are composed of high local information. In addition, interest points should be robust to global and local variations, such that interest points can be accurately reproduced [38, 119]. Generally, corner detectors and blob detectors are used to detect interest points. Corner detectors, such as Harris [104], locate local regions which have large variations in all directions, whereas blob detectors, such as the Difference of Gaussians (DoG) [121], locate local extrema of transform responses. Other options include edge detectors or wavelet operators, such as Gabor filters [40]. Space-time interest point detectors can generally be considered adaptations of 2D interest point detectors to 3D [40, 104, 183]. In this way, videos are considered as volumes, and the interest points indicate the location in space and time where significant local spatial and temporal variations in the videos occur.

Due to the often different numbers of unordered descriptors created by interest point detectors, final representations are often made using bag-of-features (BoF) or bag-of-words (BoW) methods [34]. BoF or BoW methods describe a set of high-dimensional keypoint descriptors by a histogram of visual words [177]. In order to first construct the visual vocabulary, clustering techniques are used and the visual word for a given descriptor is assigned based on its proximity to clustering prototypes or other descriptors. However, some work has shown that dense sampling of regularly placed regions outperforms sparse interest point methods in realistic scenarios [183].
In this way, the number of descriptors is known a priori and therefore BoF or BoW methods are not necessary. Yet, it is noted that dense sampling produces a large number of features compared with other methods, and therefore methods to decrease the number of descriptors, such as BoF or BoW, can still be applied [107].
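To make the BoF/BoW encoding concrete, the following is a minimal sketch, assuming scikit-learn and NumPy are available; the vocabulary size and function names are illustrative choices rather than values taken from the cited works. A vocabulary is clustered from a pool of local descriptors, and each image is then encoded as a normalised histogram of its nearest visual words.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(descriptor_pool, k=200, seed=0):
    """Cluster a pool of local descriptors (n x d array) into k visual words."""
    return KMeans(n_clusters=k, random_state=seed, n_init=10).fit(descriptor_pool)

def bow_histogram(vocab, image_descriptors):
    """Encode one image's descriptors as an L1-normalised histogram of visual words."""
    words = vocab.predict(image_descriptors)                   # nearest prototype per descriptor
    hist = np.bincount(words, minlength=vocab.n_clusters).astype(float)
    return hist / (hist.sum() + 1e-12)                         # normalise so image size does not matter
```

The resulting fixed-length histograms can then be fed to any of the classifiers discussed in Section 2.2, regardless of how many local descriptors each image produced.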

2.1.1 Hand-crafted Feature Representations

Hand-crafted representations, as their name implies, are generally designed for the target problem, and therefore can be very problem dependent and can rely overly on domain-specific knowledge. Whilst some methods have been demonstrated to work for many different tasks, their success depends on some degree of adaptation or fine-tuning. In addition, hand-crafted methods can be quite complex and thus expert knowledge is sometimes required to employ them.

2.1.1.1 Histogram of Oriented Gradients (HOG)

One of the most popular grid-based methods is the histogram of oriented gradients (HOG). HOG [38] is a feature descriptor, primarily used for object detection, that counts occurrences of gradient orientations within localised regions of an image. The gradient is computed for each pixel location, as well as its corresponding magnitude and direction, using suitable filter kernels. For example, the following derivative kernels can be applied in the x and y directions of the image via convolution [38]:

  −1       Dx = −1 0 1 ,Dy =  0  (2.1)     1

Ix = I ∗ Dx,Iy = I ∗ Dy (2.2) where I is the input image, Ix and Iy are the derivative images in the x and y directions, and ∗ represents the convolution operator. The magnitude |G| and the orientation θ can CHAPTER 2. CLASSICAL METHODOLOGIES 29 be calculated using the following:

q 2 2 Ix |G| = Ix + Iy ,θ = arctan (2.3) Iy

The image is then divided into either rectangular or radial cells. Within each cell a histogram is constructed by cumulatively adding the magnitude of each pixel's gradient vector into quantised bins of signed or unsigned direction. If the direction of the gradient vector is between two bins, then the magnitude is proportionally split between them. Lastly, blocks are formed by sliding rectangular or radial windows and the corresponding histograms are concatenated into a one-dimensional vector. The vector is normalised to have unit length, which provides local contrast normalisation and better invariance to illumination [38]. The final HOG feature vector is the concatenation of all block vectors.

The use of gradients or edge detectors has been shown to capture object appearance and shape well [38]. However, gradients are sensitive to variations in textures, material properties and illumination [188]. Furthermore, HOG is generally invariant to local affine transformations due to spatial pooling of the histograms [38]. Although HOG was primarily designed for object detection it has also been used for face [39], object [50] and action [183] recognition. Felzenszwalb et al. [50] used HOG to extract both global and local information for deformable parts-based object recognition. The authors captured a structural global root filter and several local part-based appearance filters to better handle deformable objects.
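The following is a minimal NumPy sketch of the gradient and cell-histogram stages described above (Eqs. 2.1–2.3), omitting the block normalisation step; the cell size, number of bins and orientation convention are illustrative assumptions rather than the exact configuration of [38].

```python
import numpy as np

def hog_cells(image, cell=8, bins=9):
    """Unsigned-orientation HOG cell histograms for a greyscale image (no block normalisation)."""
    img = image.astype(float)
    # Central differences approximate convolution with D_x = [-1 0 1] and its transpose D_y.
    gx = np.zeros_like(img); gy = np.zeros_like(img)
    gx[:, 1:-1] = img[:, 2:] - img[:, :-2]
    gy[1:-1, :] = img[2:, :] - img[:-2, :]
    mag = np.hypot(gx, gy)                                   # gradient magnitude |G|
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0             # unsigned orientation in [0, 180)

    h, w = img.shape
    ch, cw = h // cell, w // cell
    bin_width = 180.0 / bins
    hist = np.zeros((ch, cw, bins))
    for i in range(ch):
        for j in range(cw):
            m = mag[i * cell:(i + 1) * cell, j * cell:(j + 1) * cell].ravel()
            a = ang[i * cell:(i + 1) * cell, j * cell:(j + 1) * cell].ravel()
            b = np.minimum((a // bin_width).astype(int), bins - 1)
            # Accumulate gradient magnitudes into quantised orientation bins.
            hist[i, j] = np.bincount(b, weights=m, minlength=bins)
    return hist
```

In a full descriptor these cell histograms would then be grouped into overlapping blocks, normalised to unit length and concatenated, as described in the text.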

2.1.1.2 Gabor Filters

Gabor filters are linear filters which have been claimed to model simple cells in the visual cortex of mammalian brains [123], and have been found useful for applications such as texture classification and face recognition [92]. The filters can be tuned to various frequencies and orientations. In 2D, Gabor filters are a Gaussian modulated by a complex plane wave [92]:

$$ \Re[\psi(x,y)] = \frac{f^2}{\pi\gamma\eta}\, e^{-\left(\frac{f^2}{\gamma^2}x'^2 + \frac{f^2}{\eta^2}y'^2\right)} \cos(2\pi f x') \tag{2.4} $$

$$ \Im[\psi(x,y)] = \frac{f^2}{\pi\gamma\eta}\, e^{-\left(\frac{f^2}{\gamma^2}x'^2 + \frac{f^2}{\eta^2}y'^2\right)} \sin(2\pi f x') \tag{2.5} $$

where x' = x cos θ + y sin θ and y' = −x sin θ + y cos θ, and f and θ define the central frequency and orientation of the filter, respectively. γ and η determine the spread or bandwidth of the filter in the x and y axes, respectively, and γ/η the aspect ratio of the Gaussian. Whilst the filters have a complex form, the real or imaginary components can be used individually. The Gabor response is achieved by the convolution of Gabor filters of different orientations and frequencies with an image, which results in the extraction of useful feature responses which can be combined to form a final descriptor [184]. The standard Gabor filter bank has five frequencies and eight orientations [184], although other configurations can be used [92].
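As an illustration, the following NumPy sketch samples the real part of Eq. 2.4 on a discrete grid and builds a small bank; the kernel size and the particular frequency values are assumptions for demonstration, not parameters taken from [92] or [184].

```python
import numpy as np

def gabor_kernel(f, theta, gamma=1.0, eta=1.0, size=31):
    """Real part of the 2D Gabor filter of Eq. 2.4, sampled on a size x size grid."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    xp = x * np.cos(theta) + y * np.sin(theta)       # rotated coordinate x'
    yp = -x * np.sin(theta) + y * np.cos(theta)      # rotated coordinate y'
    envelope = np.exp(-(f**2 / gamma**2 * xp**2 + f**2 / eta**2 * yp**2))
    return (f**2 / (np.pi * gamma * eta)) * envelope * np.cos(2 * np.pi * f * xp)

# An illustrative bank: five frequencies and eight orientations, as noted in the text.
bank = [gabor_kernel(f, t)
        for f in (0.25, 0.177, 0.125, 0.088, 0.0625)
        for t in np.arange(8) * np.pi / 8]
```

Convolving an image with each kernel in the bank and pooling the responses yields the kind of multi-frequency, multi-orientation descriptor described above.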

2.1.1.3 Local Binary Patterns (LBP)

Originally designed for texture description, local binary patterns [137] evaluate each pixel by considering its value with respect to its neighbours. Specifically, for a given central pixel, a comparison is made by thresholding it against each of its eight neighbours in a 3 × 3 neighbourhood. If the value of the central pixel is greater than its neighbour, the neighbour is denoted 0, otherwise 1. This results in an eight-digit binary readout called a local binary pattern or LBP code, with 2^8 possible combinations. Each neighbourhood is considered in the same order to produce consistent codes, which are generally converted to decimal so that a histogram can be computed. Formally, the

LBP code is given as:

$$ \mathrm{LBP}(x,y) = \sum_{p=0}^{P-1} 2^{p}\, s(i_p - i) \tag{2.6} $$

where (x, y) are the coordinates of the central pixel with value i, i_p is the value of neighbouring pixel p and s is the sign function:

$$ s(x) = \begin{cases} 1, & \text{if } x \geq 0 \\ 0, & \text{otherwise} \end{cases} \tag{2.7} $$

An extension of LBP [138] used a variable neighbourhood size and an arbitrary number of neighbours to allow for variations in scale and feature type. To facilitate this, circular neighbourhoods and bi-linear interpolation of pixel values were used. Another extension is the use of uniform patterns, for which patterns are considered uniform if they contain at most two bit-wise transitions, which demonstrated improved performance on texture classification [138]. In addition, LBP's ability to handle variations in illumination has made it useful for a number of other applications, such as face recognition [3]. Ahonen et al. [3] combined histograms of local binary patterns constructed over separate regions to create a global descriptor which encodes both local textures and shape to represent face images.
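A minimal NumPy sketch of the basic 3 × 3 operator of Eqs. 2.6–2.7 and its histogram descriptor is given below; the particular neighbour ordering is an arbitrary but consistent choice, as the text only requires that it be the same for every pixel.

```python
import numpy as np

def lbp_image(img):
    """Basic 3x3 LBP codes (Eqs. 2.6-2.7) for the interior pixels of a greyscale image."""
    img = img.astype(int)
    c = img[1:-1, 1:-1]                                   # central pixels
    # Eight neighbours in a fixed clockwise order; each contributes one bit 2^p.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros_like(c)
    for p, (dy, dx) in enumerate(offsets):
        neighbour = img[1 + dy:img.shape[0] - 1 + dy, 1 + dx:img.shape[1] - 1 + dx]
        codes += (neighbour >= c) * (1 << p)              # s(i_p - i) = 1 when neighbour >= centre
    return codes

def lbp_histogram(img):
    """256-bin histogram of LBP codes, the usual texture descriptor."""
    return np.bincount(lbp_image(img).ravel(), minlength=256)
```

As in [3], such histograms can be computed over separate image regions and concatenated to combine local texture with coarse spatial layout.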

2.1.1.4 Scale Invariant Feature Transform (SIFT)

SIFT [121] combines an interest point detector and descriptor to provide local features for object detection. Firstly, it finds interest points at local extrema using the Difference-of-Gaussian (DoG) function on scale-space (multi-scale) representations. Local extrema are found by comparing each pixel with its neighbours over multiple scales to find local maxima and minima of the DoG volume function. Keypoints with low contrast, or those close to an edge response, are discarded. The remaining keypoints are then oriented by the dominant orientation and local descriptors are formed at each site. Descriptors are formed over 16 × 16 regions around each keypoint. Within each region, histograms of gradients are formed for each 4 × 4 subregion, using eight bins of orientation and linear interpolation. The final descriptor is formed by the concatenation of each histogram within the region around the keypoint, and therefore has a dimensionality of 16 × 8 = 128. The magnitudes of the gradients are weighted by a Gaussian function with the standard deviation set to half the region size. Since the descriptor incorporates multiple scales and is represented relative to its dominant orientation, it provides both scale and rotation invariance [121]. In addition, SIFT is also partially robust to illumination variations and local affine distortions [121].

SIFT descriptors are generally computed at sparse, scale-invariant key locations which are rotated for orientation alignment. However, there are other cases, such as dense SIFT [184], where the SIFT descriptor is used at dense locations, which can improve efficiency and performance [49, 107]. Lazebnik [107] employs a global dense SIFT, based on spatial pyramids, for object recognition, and shows improvements against orderless bag-of-features approaches. However, the dataset used contained objects which are centred and occupy the majority of the image, and is therefore more suitable for global statistics.

Another approach similar to SIFT is Speeded Up Robust Features (SURF) [8], which uses an Integral Image representation and employs a Determinant of the Hessian blob detector to locate interest points. Haar wavelet responses are used for orientation and keypoint description. SURF has been shown to be faster than SIFT, and can lead to superior accuracy in some applications [8].
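For reference, the detector/descriptor pipeline outlined above is available off the shelf; the sketch below assumes the opencv-python package (cv2) with SIFT support and simply extracts DoG keypoints and 128-dimensional descriptors from a greyscale image. These descriptors could, for example, feed the BoF/BoW encoding sketched earlier in this section.

```python
import cv2

def sift_descriptors(grey_image):
    """Detect DoG keypoints and compute 128-D SIFT descriptors for a greyscale image."""
    sift = cv2.SIFT_create()
    keypoints, descriptors = sift.detectAndCompute(grey_image, None)
    # descriptors has shape (num_keypoints, 128), or is None if no keypoints were found.
    return keypoints, descriptors
```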

2.1.1.5 Action Representations

This section discusses some of the approaches to representations for action classification. Although not an exhaustive review, it gives an overview of methods specifically used for action classification from video. Whilst some methods explicitly extract temporal information, others choose to extract frame-wise spatial information only. In this case, changes in the temporal domain are generally handled during the classification stage; however, there are some methods which completely disregard temporal information, relying solely on spatial representations.

Chen et al. [21] used a star skeleton which was extracted by first finding the posture contours of a background-subtracted agent. However, both silhouettes and contours can be sensitive to viewpoint variations and occlusions. Efros et al. [46] used optical flow in agent-bounded boxes to recognise actions in sport footage. Optical flow is used to describe relative motion by tracking features between frames. However, the very nature of optical flow means that it considers all image differences to be a result of motion and not caused by other variations between frames, such as illumination [188]. Whilst optical flow does not use background subtraction, it can benefit from accurate localisation through tracking, in order to reduce noise from dynamic backgrounds and motion caused by camera movement [147]. Polana and Nelson [146] applied a grid-based approach by accumulating the optical flow in separate non-overlapping regions. Subramania and Suresh [168] did similar work by applying several different grid designs to a human-centred bounding box. The feature vector was derived by calculating the mean optical flow within each cell on each grid (a minimal sketch of such a descriptor is given at the end of this section).

Other methods attempt to encode both spatial and temporal information together. Bobick and Davis [15] used motion energy images (MEI) and motion history images (MHI), which incorporate both spatial and temporal information into a single static 2D representation by aggregating a sequence of successive background-subtracted frames. This was extended further by Gorelick et al. [62], who concatenate multiple frames to achieve spatio-temporal volumes or so-called Space-Time Volumes (STV), to capture agent pose and dynamics.
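To illustrate the grid-based mean optical flow descriptors discussed above (in the spirit of [146, 168]), the following sketch assumes OpenCV's dense Farnebäck flow and greyscale 8-bit frames; the grid size and flow parameters are illustrative assumptions rather than the configurations used in those works.

```python
import cv2
import numpy as np

def grid_flow_descriptor(prev_frame, next_frame, grid=(4, 4)):
    """Mean dense optical flow per grid cell, concatenated into one feature vector."""
    flow = cv2.calcOpticalFlowFarneback(prev_frame, next_frame, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)  # (H, W, 2) flow field
    h, w = flow.shape[:2]
    gh, gw = grid
    feats = []
    for i in range(gh):
        for j in range(gw):
            cell = flow[i * h // gh:(i + 1) * h // gh,
                        j * w // gw:(j + 1) * w // gw]
            feats.append(cell.reshape(-1, 2).mean(axis=0))          # mean (dx, dy) in the cell
    return np.concatenate(feats)
```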

2.1.2 Unsupervised Feature Learning

Unsupervised algorithms attempt to learn useful properties about the structure of a dataset without any additional information. Specifically, they attempt to either explicitly or implicitly learn the probability distribution p(x). Unsupervised algorithms attempt to represent the input in a way that preserves as much information as possible, whilst at the same time being a simpler representation of the input. Since unsupervised algorithms do not require a label during the learning process, they can take advantage of unlabelled or weakly labelled data. However, they rely on the assumption that unsupervised representations can successfully discriminate between classes, which may not be the case. A subset of unsupervised algorithms are clustering algorithms, which attempt to group similar examples.

2.1.2.1 Principal Component Analysis

Principal component analysis (PCA) [80] is an unsupervised algorithm that effectively reduces the dimensionality of an input by identifying the principal directions in which the data varies. PCA learns a representation that has a lower dimensionality than the input and whose elements have no linear correlation. It achieves this by projecting from M to N (N < M) dimensions, for which PCA will define a matrix W with N vectors. PCA finds principal directions in the projected space which maximise the variance. In particular, the first principal direction or component accounts for the most variance, with each succeeding principal component accounting for the next greatest variance whilst also being orthogonal to all preceding principal components. Consider a set of input examples X = [x_1, ..., x_n] which is mean-normalised by subtracting the mean from each example. Projecting X along the proposed axis W is given by Y = W^T X and therefore:

$$
\begin{aligned}
\mathrm{var}[Y] &= E\big[(W^T X - E[W^T X])^2\big] \\
&= E\big[(W^T (X - E[X]))^2\big] \\
&= W^T E\big[(X - E[X])(X - E[X])^T\big] W \\
&= W^T C W
\end{aligned} \tag{2.8}
$$

where C is the sample covariance matrix defined as:

$$ C = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})(x_i - \bar{x})^T = \frac{1}{n} X X^T \tag{2.9} $$

In order to maximise the variance, W is chosen to maximise W^T C W. This is achieved by using eigendecomposition, which decomposes a matrix into eigenvectors and eigenvalues. The principal components are given by the eigenvectors of C. The eigenvector with the highest eigenvalue corresponds to the first principal component. W can also be factorised using singular value decomposition (SVD). PCA is one of the simplest and earliest forms of dimensionality reduction.
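A minimal NumPy sketch of PCA via eigendecomposition of the sample covariance (Eqs. 2.8–2.9) is given below; it assumes the rows of X are examples and the columns are features, which is the transpose of the column-example convention used in the equations above.

```python
import numpy as np

def pca(X, n_components):
    """PCA via eigendecomposition of the sample covariance matrix.
    X is (n_samples, n_features); returns the projection matrix W and the projected data."""
    Xc = X - X.mean(axis=0)                      # mean-normalise each feature
    C = Xc.T @ Xc / Xc.shape[0]                  # sample covariance matrix (Eq. 2.9)
    eigvals, eigvecs = np.linalg.eigh(C)         # eigh: symmetric matrix, eigenvalues ascending
    order = np.argsort(eigvals)[::-1]            # sort components by decreasing variance
    W = eigvecs[:, order[:n_components]]
    return W, Xc @ W
```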

2.1.2.2 k-means Clustering

k-means clustering is a form of vector quantisation which aims to learn k centroids, whose means each represent a cluster of closest examples. This results in the data being partitioned into Voronoi cells or regions, whereby each cluster consists of a region which is occupied by data that are closer to their centroid than any other. k-means is a popular tool for cluster analysis and feature learning.

Specifically, given a set of input examples X = [x_1, ..., x_m]^T, k-means aims to cluster the examples into k (≤ m) centroids {s_1, ..., s_k} by minimising the variance at each centroid:

$$ \underset{s}{\arg\min} \sum_{i=1}^{k} \sum_{x \in s_i} \lVert x - \mu_i \rVert^2 \tag{2.10} $$

It achieves this by iterating between two steps until convergence. Firstly, each example is assigned to the randomly initialised centroid whose mean has the least squared Euclidean distance. Secondly, each centroid's mean is updated to reflect the new assignment of examples. The algorithm has converged when the assignments no longer change. Each centroid is referred to as a prototype for its respective cluster. The simplicity and efficiency of k-means have made it attractive for feature learning, and competitive results have been demonstrated on image classification with patch-based k-means trained filters [27], although the correct pre-processing (PCA-whitened input) and encoding are necessary [27, 117].
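The two alternating steps can be written down directly; the following is a minimal NumPy sketch of Lloyd's algorithm for Eq. 2.10, with random initialisation from the data and a fixed iteration cap as illustrative choices.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Lloyd's algorithm: alternate assignment and centroid update until convergence."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # initialise from random examples
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned examples.
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):              # assignments have stabilised
            break
        centroids = new_centroids
    return centroids, labels
```

When used for patch-based feature learning as in [27], X would hold whitened image patches and the learned centroids would act as convolution-style filters.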

2.1.2.3 Self-Organising Maps

First introduced by Kohonen [98], the self-organising map (SOM) uses a competitive Hebbian learning-based approach in order to quantise an input space, whilst maintaining the input's topological structure. As with k-means, the input space is represented by a reduced set of discrete prototypes.

Specifically, let each unit i in the SOM denote a reference vector w_i = [w_1, w_2, ..., w_z]^T ∈ R^z with equal dimension to the input vector. Prototype vectors or neurons are commonly arranged in a 2D rectangular or hexagonal grid. Iteratively, at each time t, the winner or best matching unit is found by minimising the distance between the input x(t) and all the neurons of the map:

$$ bmu(t) = \underset{i \in \Omega}{\arg\min}\, \lVert x(t) - w_i \rVert \tag{2.11} $$

and the weights of the winner and its neighbours are updated according to

$$ \Delta w_i(t) = LR(t)\, \eta(bmu, i, t)\, [x(t) - w_i(t)] \tag{2.12} $$

where Ω is the set of neuron indices, LR is the monotonically decreasing learning rate, $\eta(bmu, i, t) = \exp\left[-\frac{\lVert r_{bmu} - r_i \rVert^2}{2\sigma(t)^2}\right]$ is the Gaussian neighbourhood function with r_i being the location of neuron i on the map and σ the effective range of the neighbourhood, which decreases with time t. The neighbourhood is important to the function of the SOM and helps avoid the sensitivity to initialisation and outliers that other clustering algorithms, such as k-means, can suffer from [4]. However, the initialisation and topology are still important factors regarding the convergence of the map [131]. Alternative neighbourhood functions, such as the “bubble”, can be used; however, the Gaussian is the most popular [192]. In addition to being monotonically decreasing, LR also satisfies the following [192]:

1. $0 < LR(t) < 1$
2. $\lim_{t \to \infty} \sum LR(t) \to \infty$
3. $\lim_{t \to \infty} LR(t) \to 0$

Considered as a non-linear version of PCA [192], SOM is a well-known data mining tool and has been used extensively for computer vision tasks [128, 131]. In recent years, more attention has been given to applying SOM to the task of local feature learning [4, 30]. There have been many extensions to SOM, such as: neural gas (NG) [125], which defines the neighbourhood based on the input space instead of a pre-defined output space; growing neural gas (GNG) [51], which, given the absence of any topological constraint, allows nodes to be created and destroyed; and grow when required (GWR) [124], which enables nodes to be created and destroyed at each iteration, unlike GNG which only allows this at iterations which are a multiple of a pre-defined constant.
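Returning to the standard SOM, the following is a minimal NumPy sketch of online training with Eqs. 2.11–2.12 on a rectangular grid; the map size and the exponential decay schedules for the learning rate and neighbourhood width are illustrative assumptions, not the settings used later in this thesis.

```python
import numpy as np

def train_som(X, rows=8, cols=8, n_iters=10_000, lr0=0.5, sigma0=3.0, seed=0):
    """Online SOM training: find the BMU (Eq. 2.11) and apply the update rule (Eq. 2.12)."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(rows * cols, X.shape[1]))              # one weight vector per map unit
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    for t in range(n_iters):
        x = X[rng.integers(len(X))]                             # present one random example
        bmu = np.argmin(((W - x) ** 2).sum(axis=1))             # best matching unit (Eq. 2.11)
        lr = lr0 * np.exp(-t / n_iters)                         # decaying learning rate LR(t)
        sigma = sigma0 * np.exp(-t / n_iters)                   # shrinking neighbourhood range
        dist2 = ((grid - grid[bmu]) ** 2).sum(axis=1)           # squared grid distance to the BMU
        h = np.exp(-dist2 / (2 * sigma ** 2))                   # Gaussian neighbourhood function
        W += lr * h[:, None] * (x - W)                          # update winner and neighbours (Eq. 2.12)
    return W.reshape(rows, cols, -1)
```

Trained on image patches, the reshaped weight grid can be visualised directly as a topologically ordered filter bank, which is the use made of SOMs in later chapters.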

2.2 Classifiers

Once appropriate features have been captured, the next stage is to perform classification. The task of classification is to assign one of k category labels to an input example.

Specifically, the algorithm learns a function f : R^n → {1, ..., k}. In some cases f outputs a probability distribution over the classes. The following section details some of the many classification techniques employed. Classification is generally considered to be supervised, for which additional information in the form of an associated value or label y is afforded to the task, and classifiers typically learn to predict y given x by estimating the distribution p(y|x). The label y acts as a teacher signal, instructing the algorithm what to learn. The labels can be collected automatically or by a human, although in many cases automatic annotation can be difficult [60, 105].

For the classification of videos there are some particular models that use sequences of inputs in order to make a prediction. Unlike direct classification methods, where the representation would be required to encode temporal information, if at all, these methods place the emphasis on the classifier to capture motion information. Chen et al. [21] used hidden Markov models (HMMs) to model the probabilistic dependence between states and observations in order to recognise actions from star skeleton posture sequences. Yamato et al. [189] also used HMMs to represent different tennis strokes. Mici et al. [127] used a recurrent echo state network for the task of daily action recognition. They integrate pose information, in the form of three-dimensional articulated body joint positions from RGB-D videos, with ground-truth labels for the presence of manipulated objects.

Some methods for classification combine multiple classifiers. Boosting frameworks, such as AdaBoost, can be used to combine multiple weak learners to create a single strong classifier [179]. Another popular method for combining classifiers is the cascaded classifier [70] introduced by Viola and Jones [179] for object and face detection. This method decomposes a strong classifier into a pipeline of several classifiers which discard irrelevant regions in stages, only outputting a non-negative detection if all stages of the cascade are passed. By filtering out less promising regions, more focus can be given to regions which contain more relevant features.

Figure 2.1: Logistic sigmoid function

2.2.1 Logistic Regression

Logistic regression generalises linear regression for the purposes of classification by using a binomial probability distribution. Unlike linear regression, which uses a linear function, logistic regression uses a logistic sigmoid function φ (Fig. 2.1) to model binary values, as opposed to numeric ones. The logistic function effectively squashes the output of a linear function so that the values lie in the interval (0,1). These values can then be interpreted as a probability (with the probability of one class determining the probability of the other):

p(y = 1 | x; θ) = φ(θᵀx)   (2.13)

where y is the output, x is the input and θ represents the parameters. Unlike linear regression, logistic regression has no closed-form solution and therefore a solution can be found by maximising the log-likelihood. The maximum likelihood estimator of θ is defined as:

θ_ML = argmax_θ ∏_{i=1}^{m} p_model(x_i; θ)   (2.14)

where m is the number of input examples and p_model(x; θ) is a parametric family of probability distributions. For convenience, the product can be converted into a sum by taking logarithms:

θ_ML = argmax_θ ∑_{i=1}^{m} log p_model(x_i; θ)   (2.15)

Multinomial logistic regression, also known as softmax, is an extension of logistic regression to multi-class problems (where there are more than two states of the dependent variable), which produces a probability distribution over the classes. Softmax is calculated using equation 2.16.

p(y = j | x; θ) = softmax(θᵀx)_j = exp(θ_jᵀx) / ∑_{k=1}^{K} exp(θ_kᵀx)   (2.16)

where y is the output, x is the input, θ_j are the parameters for the output neuron that represents the jth class and K is the number of classes.
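As a small illustration of equation 2.16, the NumPy sketch below computes the softmax probabilities for a single input; subtracting the maximum logit is a standard numerical-stability trick and an implementation detail rather than part of the definition, and the parameter values are arbitrary.

```python
import numpy as np

def softmax(logits):
    """Turn the class scores theta_j^T x into a probability distribution."""
    z = logits - np.max(logits)       # stabilise the exponentials
    e = np.exp(z)
    return e / np.sum(e)

theta = np.array([[0.2, -0.5], [1.0, 0.3], [-0.4, 0.8]])   # one parameter row per class
x = np.array([1.5, -0.7])
probs = softmax(theta @ x)            # p(y = j | x; theta) for each class j
print(probs, probs.sum())             # the probabilities sum to one
```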

2.2.2 Support Vector Machines

Support vector machines (SVMs) [16, 31] are one of the most popular classification tools in machine learning [27, 70, 105, 177, 183]. The goal of the SVM is to maximise the margin between classes in a feature space. It achieves this by selecting a hyperplane which maximises the distance to the closest training examples. These points are called the support vectors. This separating hyperplane then categorises new examples. This model is similar to logistic regression in that it uses a linear function; however, it is non-probabilistic and therefore does not output probabilities, but provides a class identity instead [60].

Consider data that has two classes and is linearly separable in two dimensions. Two parallel hyperplanes are selected which separate the classes whilst maximising the distance between themselves. The region bounded by these two hyperplanes is termed the “margin” and the hyperplane that resides halfway between them is the maximum-margin hyperplane. Given this maximum-margin hyperplane, a positive output predicts the positive class and, conversely, a negative output the negative class. The introduction of slack variables, which control the violation of the maximum-margin hyperplane by the training data, allows for a soft-margin classifier. The SVM can be extended to multiclass problems by reducing them to multiple binary classifiers (one-vs-all [183] or one-vs-one). SVMs can also be applied to non-linear problems by the application of a kernel [70, 81, 105, 177], which maps inputs to a high-dimensional feature space in which classes become linearly separable. In fact, the application of a kernel SVM is equivalent to applying a kernel to the input and then using a linear model in the transformed space. It can be shown that the linear function can be rewritten as [60]:

wᵀx + b = b + ∑_{i=1}^{m} α_i xᵀx_i = b + ∑_{i=1}^{m} α_i K(x, x_i)   (2.17)

where x_i is a training example, α is a vector of coefficients and K(x, x_i) is the kernel function. Since the kernel function is fixed, only α needs optimising, and therefore, whilst the kernel may transform the input to a non-linear space, the decision function remains linear in the transformed space [60].

SVMs learn an α vector which is sparse, and therefore only non-zero terms are evaluated at validation time [60]. Yet, non-linear support vector machines can be expensive to train when the training set is large [81]. However, given a sufficiently large number of features, linear methods can work just as well, and also limit the number of parameters to optimise [81]. Training of SVMs is usually performed by quadratic programming [16]. All experiments using SVMs in this thesis were conducted using the LIBLINEAR toolbox [48].
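As a brief, hedged illustration (not the experimental setup used in this thesis), the snippet below trains a linear soft-margin SVM on a toy two-class problem using scikit-learn's LinearSVC, which is itself built on the LIBLINEAR library; the data and the value of C are purely illustrative.

```python
import numpy as np
from sklearn.svm import LinearSVC   # linear SVM backed by LIBLINEAR

# Two roughly separable 2D clusters as toy training data.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 1.0, (50, 2)), rng.normal(2.0, 1.0, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

clf = LinearSVC(C=1.0)              # C controls the soft-margin (slack) penalty
clf.fit(X, y)
print(clf.predict([[-1.5, -1.0], [2.5, 1.0]]))   # expected classes: 0 and 1
```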

2.2.3 k-nearest Neighbours and Decision Trees

Non-parametric nearest neighbour approaches, such as k-nearest neighbours, are simple tools for classification. A representation is classified based on the majority vote of its k nearest neighbours in the feature space formed by a training set. Therefore, there is no explicit learning process, and only the choice of k and the neighbourhood definition must be considered. By varying k, the model can be made more (small k) or less (large k) flexible [43]. Many different distance metrics can be employed, which affect the results of the classification. Gorelick et al. [62] used a 1-NN classifier combined with a Euclidean distance metric. Dollár et al. [40] used the χ2 distance. Often, inverse distance weightings are applied to the neighbours so that the contribution from neighbours of greater proximity is higher than that of those further away. However, the accuracy is dependent on a large training set, which results in a higher computational cost [147]. Thus, nearest neighbour algorithms are sometimes used in conjunction with clustering or dimensionality reduction methods [146]. Due to the simplicity of this algorithm, k-nearest neighbours can provide uncomplicated explanations of results [43].

Decision trees [114] also separate the input space into specific class regions by using a tree structure, where each node effectively represents a conditional statement. Each node represents a different region of the input, with each child node representing further subregions. The end of a branch or edge is called a leaf node, where each leaf represents a non-overlapping region of the input. Leaf nodes are usually mapped to specific output classes. Greedy algorithms are often used to split the input space at the nodes at various points, until the best split is found [43]. For classification, a cost function, usually information gain, measures how well each node splits the input data into its respective classes, maximising the decrease in entropy [43]. The algorithm can be constrained by choosing stopping criteria, such as setting a maximum tree depth or a minimum number of training examples per node. Since decision boundaries are axis-aligned, even simple linear decision functions along an arbitrary direction would need to be estimated with multiple axis-aligned splits which step back and forth across the true decision function [60]. Whilst conditional decisions can be adapted to encompass linear combinations of features, this can still result in complex trees when combined to solve non-trivial cases. However, decision trees can be simply expressed as a set of rules, enabling transparency of results [43].

Whilst nearest neighbour classifiers and decision trees have their limitations, they are extremely efficient and generally easy to understand.
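The following NumPy sketch illustrates the k-nearest neighbour voting scheme described above with a Euclidean distance metric; the toy data and the choice of k are illustrative only.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by majority vote among its k nearest training examples
    under the Euclidean distance."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.1], [0.9, 1.0]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.95, 1.05]), k=3))   # -> 1
```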

2.3 Datasets

Publicly available datasets make it straightforward to evaluate and compare differing approaches. Datasets consist of domain-specific examples, which can be used for training and testing algorithms. Approaches are generally compared quantitatively, based on their respective recognition rates. However, it can sometimes be difficult to make true comparisons when different evaluation methodologies are used, or indeed when the procedure is unclear. Thus, it can be difficult to draw strong conclusions, and as such these comparisons can be somewhat misleading. To mitigate this, authors sometimes recreate the work of others in order to provide a more reliable comparison. Yet, this can be time consuming and is not without criticism, since the works that are chosen for comparison can be limited in number and scope. Moreover, due to the inherent nature of datasets, whilst examples can be of real-world situations, they are prepared in a way that cannot always be assumed when applying algorithms to practical scenarios.

Whilst there are problems with datasets, they currently provide the best mechanism for studying different methodologies, and, as the field of deep learning has grown in recent years, there has been an increase in the scale and complexity of those available. In Sections 2.3.1 and 2.3.2, respectively, the image and video datasets that are used during this research are discussed. There are a large number of image datasets, indicating the maturity of the research area. Whilst some are very limited in variation and captured under controlled conditions, such as MNIST (Section 2.3.1.1), others, such as ImageNet (Section 2.3.1.4), contain over 14 million examples. In contrast, video classification is still quite a new area and, as such, up until recently many of the datasets were captured in controlled environments, such as Weizmann (Section 2.3.2.1), and were limited in size. However, in the last few years, realistic larger datasets, which contain more natural sequences from YouTube and other sources, have been made available, such as UCF-50 (Section 2.3.2.3) and UCF-101 (Section 2.3.2.4). However, these datasets are still modest in size compared to ImageNet. More recently, Sports-1M (Section 2.3.2.5) was released, which contains over 1 million automatically labelled videos.

2.3.1 Image Datasets

2.3.1.1 MNIST

Formed from the larger NIST dataset, the MNIST (modified NIST) image dataset [112] is a collection of 70000 grayscale 28×28 images of the handwritten digits 0–9 (Fig. 2.2). The dataset is split into 60000 training images and 10000 test images. There is considerable variation in the shape of the digits; however, all have been size-normalised and centred in the images.

Figure 2.2: A selection of examples from the MNIST dataset

2.3.1.2 CIFAR-10

The CIFAR-10 image dataset [100] is a labelled collection of 60000 32×32 colour images taken from the 80 million tiny images dataset. The dataset is formed of ten classes: airplane; automobile; bird; cat; deer; dog; frog; horse; ship; and truck. Each class contains 6000 images. The dataset is divided into 50000 training images and 10000 test images. The training set consists of 5 batches of 10000 images each. Whilst the training and test sets contain an equal number of examples of each class, the five batches that make up the training set may contain more examples of one class than another. Whilst each image contains one notable instance of its respective class object, there are variations in viewpoint and scale.

2.3.1.3 CIFAR-100

The CIFAR-100 image dataset [100] is similar to the CIFAR-10 dataset (Section 2.3.1.2) and features the same total number of images, except that it has 100 classes containing only 600 images each. Each class features 500 training images and 100 test images. The 100 classes are further grouped into 20 coarse superclasses: aquatic animals; fish; flowers; food containers; fruit and vegetables; household electrical devices; household furniture; insects; large carnivores; large man-made outdoor things; large natural outdoor scenes; large omnivores and herbivores; medium-sized mammals; non-insect invertebrates; people; reptiles; small animals; trees; vehicles 1; and vehicles 2.

2.3.1.4 ImageNet

ImageNet [153] is currently the largest image dataset. It contains over 14 million labelled images. Each year the ImageNet project hosts an object detection and image classification challenge. The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) uses a subset of the main dataset to evaluate different algorithms.

Figure 2.3: Frames from example videos from each class of the Weizmann dataset.

2.3.2 Video Datasets

2.3.2.1 Weizmann

The Weizmann human action dataset [62] consists of nine agents performing ten actions (bend, jumping jack, jump forward, jump in place, run, gallop sideways, skip, walk, wave one-handed, wave two-handed) against a static background (Fig. 2.3). As well as providing the 90 original 180 × 144 video sequences, the authors also provide aligned and non-aligned extracted masks, which are obtained through background subtraction. All sequences differ in duration and the viewpoint is constant; however, there is variation in action performance and agent appearance. In addition, there are two robustness datasets, which are included for further evaluation.

2.3.2.2 UCF Sports

The UCF Sports Action dataset [150, 164] consists of 150 action sequences at a resolution of 720 × 480, collected from stock footage of broadcast television. The dataset contains ten actions (diving, golf swinging, kicking, weight lifting, riding horse, running, skateboarding, swinging-bench, swinging-side, walking) performed against natural, task-specific backgrounds (Fig. 2.4). The dataset contains variations in action performance, agent appearance, camera movement, viewpoint and illumination. Bounding box annotations of the agents are provided.

Figure 2.4: Frames from example videos from each class of the UCF Sports dataset.

2.3.2.3 UCF-50

UCF-50 is an action recognition dataset [149] featuring 50 categories of realistic videos taken from YouTube. The dataset is very challenging to accurately categorise, due to large variations in viewpoint, appearance and pose, scale, illumination, camera motion and background. For each category, videos are further grouped into 25 groups, each consisting of more than four action clips which share common features.

2.3.2.4 UCF-101

UCF-101 [165] is an extension of the UCF-50 (Section 2.3.2.3) action recognition dataset, incorporating a further 51 categories. As with UCF-50, this is a very challenging dataset, featuring much variation. The videos are further grouped into 25 groups, each consisting of more than four clips which share common features.

2.3.2.5 Sports-1M

The Sports-1M dataset [94] is one of the largest video datasets, containing over one million video URLs which have been automatically labelled into 487 sports categories using the YouTube Topics API. The dataset is very challenging due to the large number of classes and the wide variety of examples. In addition, due to the use of automated labelling, the labels are considered weak, as opposed to the strong labels present in manually annotated datasets.

Figure 2.5: Frames from example videos from each class of the UCF-50 dataset.

Figure 2.6: Frames from example videos from each class of the UCF-101 dataset.

Chapter 3

Deep Neural Networks

3.1 Introduction

Traditionally, in machine learning, features are extracted using often complex hand-crafted techniques which require a certain amount of prior problem-dependent knowledge. Instead of making assumptions about the problem, a better approach would be to learn the task-specific salient features in a more intuitive way, akin to biological neural networks. The algorithms underpinning deep learning have been around for a long time; however, for many years it was deemed too difficult to train deep architectures. The linear perceptron [151] from the 1950s, based on the McCulloch-Pitts neuron model [126], was the first model that could learn a decision function, given examples from two categories. Back-propagation [152] allowed multilayer models, such as the feedforward multi-layer perceptron (Section 3.1.1), to jointly learn both features and classifier. By placing hidden layers between the input and the output layer, the classifier could operate on features found in the hidden layers instead of the raw input. In addition, by training both the features and the classifier, the classifier could aid in the definition of the filters, leading to more discriminative features.

The introduction of non-linear activation functions enabled these multilayer networks to learn more complicated decision functions. Whilst there are other types of algorithms for training deep models, such as contrastive divergence [76], backpropagation is currently the dominant approach given its relative ease of training.

However, despite these advances, progress in neural networks and deep learning was held up principally for two main reasons: there was not enough data to train them, nor sufficiently advanced hardware to run them. Neural networks fell out of vogue until 2006, when a paper by Hinton [76] encouraged renewed interest in the field by demonstrating the importance of depth through the training of a deep belief network consisting of stacked restricted Boltzmann machines. The authors performed greedy layer-wise unsupervised pre-training of the network before performing fine-tuning with backpropagation. This pre-training was key to solving the vanishing gradient problem which prevented deep networks from training, as the error signal could not propagate far enough. With access to larger datasets and more powerful dedicated hardware, research in deep learning has exploded in recent years. The breakthrough in large scale image classification can be attributed to Krizhevsky et al. with the convolutional AlexNet [101] architecture, which achieved the best performance in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2012. The authors used a new form of neuron model called the rectified linear unit, which negates the vanishing gradient problem without the use of pre-training. This present day model [58] is a simplified version of a model inspired by brain function [52], which replaced more traditional probabilistic models such as the logistic function (Fig. 3.3).

Convolutional networks can be traced back to the work of Fukushima in the 1980s. His Neocognitron [53] was inspired by the mammalian visual system and featured multiple convolutional and pooling layers trained using a reinforcement scheme for the processing of images. By taking advantage of the natural local correlations present, convolution-based architectures are well suited to the classification of images. The first proper incarnation of the convolutional neural network was made by LeCun in 1989, in which he combined a multilayer convolutional network with backpropagation to classify handwritten digits. More recently, CNNs have been extended to three dimensions and are being applied to the problem of action classification from videos. Like images, videos have high local correlations, which makes the problem suitable for convolution-based methods.

This chapter gives an overview of the field of deep learning and pays particular attention to deep convolutional neural networks. Section 3.2 summarises some of the current deep learning based approaches, and Section 3.3 discusses CNNs in detail. In Sections 3.4 and 3.5 the results of preliminary experiments on images and videos are discussed.

3.1.1 Feedforward Networks

The feedforward neural network, or multilayer perceptron (MLP), is a universal function approximator [36] composed of three or more layers of neurons (Fig. 3.2). Each network features at least an input layer L1, a hidden layer L2 and an output layer L3; the input of which is typically a set of features. Deeper structures are possible with additional hidden layers. MLPs are fully connected networks and therefore each neuron in a given layer connects to every neuron in the following layer. Each neuron is based on Rosenblatt's perceptron [151] (Fig. 3.1), which, given an input x, outputs:

h(θ; x) = φ(θᵀx) = φ( ∑_{i=1}^{n} w_i x_i + b )   (3.1)

Figure 3.1: Perceptron model (adapted from Rosenblatt 1958 [151])

where φ is an activation function, w is the weight vector, b is the bias, and n is the number of input neurons. Let a(l) be the activations of layer l; then the activations for layer l + 1, with a(l) providing the input, are given as:

z(l+1) = (W(l+1))T a(l) + b(l+1) (3.2)

a(l+1) = φ(z(l+1)) (3.3)

Whilst a perceptron performs binary classification using a Heaviside step function, an MLP can perform either regression or classification depending on its activation function. Typically, for classification, a non-linear sigmoidal function is used, in the form of a logistic (equation 3.4) or hyperbolic tangent (tanh) (equation 3.5) function (Fig. 3.3).

Figure 3.2: Multilayer perceptron with a single hidden layer

When used for classification, the MLP typically learns a non-linear function to map the input to the desired output class. Unlike the perceptron, which famously could not learn the XOR function [129], MLPs, given their non-linear activation functions, can differentiate data that is not linearly separable.

φ(z) = logistic(z) = 1 / (1 + e^{−z})   (3.4)

φ(z) = tanh(z) = (e^{2z} − 1) / (e^{2z} + 1)   (3.5)
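To tie equations 3.1 to 3.5 together, a minimal NumPy sketch of a forward pass through a small MLP with logistic activations is given below; the layer sizes and random weights are illustrative assumptions.

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))          # equation 3.4

def mlp_forward(x, params):
    """Forward pass through a fully connected network (equations 3.2 and 3.3)."""
    a = x
    for W, b in params:
        z = W.T @ a + b                      # linear pre-activation
        a = logistic(z)                      # non-linear activation
    return a

rng = np.random.default_rng(0)
params = [(rng.normal(size=(4, 8)), np.zeros(8)),   # input layer -> hidden layer
          (rng.normal(size=(8, 3)), np.zeros(3))]   # hidden layer -> output layer
print(mlp_forward(rng.normal(size=4), params))      # three output activations in (0, 1)
```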

Training MLPs is typically performed with stochastic (mini-batch) gradient descent via supervised backpropagation (Section 3.3.3). MLPs require no a priori assumptions regarding the underlying distribution of the input data, and can be trained directly from the data.

Figure 3.3: Non-linear sigmoidal functions

Yet, MLPs can suffer from overfitting due to the full connections between layers [112].

3.2 Related Work

Deep architectures simply consist of multiple layers of feature detectors. Layers learn features which represent different levels of abstraction of the input data: lower layers learn simple features, whereas higher layers are responsible for more complex ones. This style of neural network first gained attention when Hinton et al. [76] proposed an efficient algorithm for unsupervised training of deep belief networks (DBNs) by greedily training each layer based on the previous layer's activations. DBNs are formed by stacking restricted Boltzmann machines (RBMs), which are generative models that encode statistical dependencies between two groups of units forming a bipartite graph. In the case of DBNs, each layer or group of units would be trained to capture the main variations of its input (the output from the previous layer). Once trained, a

DBN can undergo further gradient-based training. For example, Hinton and Salakhutdinov [77] used this technique to initialise a deep autoencoder, which was then further optimised using supervised criteria. The rationale being that unsupervised pre-training initialised the parameters close to a region of good local minima and therefore assisted optimisation [12]. The initialisation of deep networks using unsupervised techniques was also explored in several other works [12, 47, 148].

Interest in deep convolutional neural networks accelerated with the introduction of AlexNet by Krizhevsky et al. [101]. AlexNet achieved the winning top-5 error rate on the image classification task at ILSVRC 2012. The overall structure of the proposed architecture was similar to that of LeNet-5 introduced by LeCun et al. [112]; however, it featured more layers and demonstrated major advancements to training by employing non-saturating ReLU non-linearities for its neurons. AlexNet also employed the recently developed DropOut to aid regularisation of the network [101]. ZFNet [196] by Zeiler and Fergus used knowledge gained from a deconvolutional network in order to improve the architecture of AlexNet. The visualisation of features through deconvolution enabled certain hyperparameters to be investigated in greater detail, which in turn led to improvements in design, such as a reduced stride to reduce aliasing effects and smaller filters to prevent extremes in filter frequency. Sermanet et al. introduced an integrated framework for classification, localisation and detection named OverFeat [157]. A sliding window multi-scale approach was adopted for this framework, which incorporated an additional regressor for predicting bounding-box locations for localisation and detection. GoogLeNet [170] proposed the multi-scale inception module, which features parallel convolutions, using different receptive field sizes, and pooling. They also incorporated techniques borrowed from Network in Network (NIN) [116], such as 1 × 1 convolutions and global average pooling in place of fully connected layers, to reduce dimensions and enable deeper and wider architectures without increased computational overhead. Whereas AlexNet and ZFNet used relatively large receptive fields in the first convolution layer (11 × 11 and 7 × 7 respectively), VGGNet [162] proposed using 3 × 3 convolutions throughout the whole network. Through stacking convolutional layers without subsampling, they achieved larger effective receptive fields whilst introducing additional non-linearities to increase the discrimination of the decision function. The use of small filters also reduced parameters, allowing for deeper networks (up to 19 weight layers), similar to GoogLeNet. ResNet introduced skip connections to allow very deep networks of up to 152 layers [73]. By using skip connections, very deep networks do not suffer from the vanishing or exploding gradients which can present optimisation problems for more traditional plain networks, which, as such, do not display performance improvements beyond a certain depth. By incorporating shortcut connections the network effectively eases the learning process by approximating residual functions. DenseNet [82] advanced this idea further by connecting each layer to every other layer. Specifically, the feature maps of a given layer are used as inputs to all downstream layers. Feature maps are aggregated using depth-concatenation.
By providing dense connections between all layers, features become more diversified and both low and high level features affect the final decision boundary.

Regarding video classification, since CNNs have primarily been used for 2D image classification, some techniques have simply used 2D representations of actions as inputs. Valle and Starostenko [176] used this approach to classify between walking and running from images which were extracted from consecutive frames using background subtraction. Simonyan and Zisserman [161] simply average multiple frame-level predictions to get video-level predictions. Others attempt to aggregate features in a more sophisticated manner, utilising more of the information source. Chéron et al. realise a final video descriptor by aggregating deep representations over time [22]. They integrate the static frame descriptors by combining features from the last fully connected layer of a CNN, using four different min and max schemes.

For most of the state-of-the-art approaches to action recognition, the embedding of temporal information is key to their success. Indeed, when considered individually, temporal representations typically outperform their spatial counterparts [22, 161]. Many works also show the complementary nature of spatial and temporal information [22, 56, 161]. However, the best way of incorporating temporal information is still an open question.

To create more dedicated video models, some methods have added a third temporal dimension to create 3D CNNs, in order to implicitly capture the evolution of actions over multiple frames of input. 3D CNNs have been shown to be better suited to modelling temporal information, due to spatio-temporal convolution and pooling operators [173]. In [173], the temporal depth of 3D filters was extensively evaluated and compared to 2D filters. The authors found that 3D filters perform significantly better than 2D convolutions for video classification. Baccouche et al. [6] used inputs consisting of nine consecutive frames, which were extracted using an agent-centred bounding box. Apart from down-sampling the videos by a factor of two horizontally and vertically, the only pre-processing undertaken was to perform 3D local contrast normalisation on the extracted images. Many types of architectures that implicitly capture spatio-temporal features were investigated by Karpathy et al. [94] and compared against a single frame-based baseline. A model that employed slow fusion, where each ten-frame clip is separated and processed individually and slowly fused together (and which is the only network that uses 3D convolutions), proved to provide the best level of performance. Although, interestingly, the difference between the slow fusion model and the single frame model was minimal; potentially suggesting that 3D CNNs do not implicitly capture key motion information. However, [173] have more recently reported impressive results. Their network is first trained on Sports-1M and then used as a generic video feature extractor on a much smaller dataset. Activations from the first fully connected layer of multiple clips are averaged and then normalised to form a final descriptor, which is then used to train a multi-class SVM. Yet a deeper analysis of their results shows their model only improves upon the multi-stream approach of [161] when combined with state-of-the-art hand-crafted features.
In comparison to feed-forward networks for which each prediction is a new, Re- current Neural Networks (RNNs) exhibit memory which allow them to exploit the temporal evolution of data present in videos. RNNs contain feedback which allows a neuron in a hidden layer to receive both activation from another neuron in a lower layer and its own activation from the previous step. This internal state captures the dynam- ics of the input over time. RNNs have been successfully applied to many sequential problems [64, 65]. One of the most popular models of RNNs is the Long Short-Term Memory (LSTM) developed by Hochreiter and Schmidhuber [79] in which neurons are replaced with memory cells. The memory cells are able to remember inputs for longer than the activation of a simple artificial neuron, and thus it has been shown that LSTMs are more practical than traditional RNNs, markedly when they feature CHAPTER 3. DEEP NEURAL NETWORKS 62 multiple layers per time step [110]. Ng et al. [194] aggregate deep appearance and motion representations over time to form a global video descriptor for video classifi- cation. They utilise a multi-stream architecture and explicitly model the video as an ordered sequence of frames using LSTMs. However, the introduction of the LSTMs only slightly increases on the performance of the multi-stream method of [161] and the authors own baseline. Wang et al. [186] integrate LSTMs with implicitly learned spatiotemporal features from 3D CNNs, which demonstrated significant performance improvement against a 3D baseline. Liu et al. [120] propose a LSTM based model with attention mechanism, which can focus in on the most informative joints from a skeletal sequence. Whilst there was previously much attention on unsupervised pre-training tech- niques and their ability to generalise well from small datasets, these methods have generally been replaced by a process called transfer learning [142,172]. Transfer learn- ing leverages the knowledge learnt from large datasets and transfers this to a new task for which there may be limited data. The pre-trained features can be used as an ini- tialisation and fine-tuned for the new tasks [161] or the network can simply be used as a feature extractor for which only a new classifier will need to be trained [41, 158]. Transfer learning has demonstrated state-of-the-art performances through its ability to assist in the generalisation of image representations especially in the lower layers of CNNs [193]. Low-level features have been observed to be similar across tasks with more specialist features occurring in higher layers. This shows that generic low-level features can be easily transferred and the rest of the network may only require little ad- justment. In the spirit of this, some approaches only fine-tune the higher layers, leaving the lower layers fixed [94, 193]. However, the transferring of knowledge relies on the initial availability of large labelled datasets from which to learn transferable features, as well as the computation and time resources to train them. Whilst some approaches have transferred knowledge from similar tasks [173], others use networks pre-trained CHAPTER 3. DEEP NEURAL NETWORKS 63 on very large datasets, such as ImageNet, which has proven to provide features that can generalise well to many other domains [161, 193]. 
The resemblance of low level features to Gabor filters across different domains has led to much work on hybrid deep networks, which combine learned and hand-crafted features that make use of prior feature knowledge. Gabor filters have been widely used in filter design due to their useful decomposition properties. Yao et al. [191] used three different orientations of Gabor filters to produce a single descriptor, which is used as an input to a traditional CNN. Luan et al. [122] enhanced the learned convolutional filters through modulation with Gabor filters. Furthermore, Sarwar et al. [156] replaced certain trainable filters with fixed-weight Gabor filters in convolutional layers, which increased training efficiency whilst sacrificing a tolerable decrease in accuracy. They found that a network consisting of two layers, where the entire first layer featured Gabor filters and the second layer featured a mixture of Gabor and trainable filters, produced the best results in terms of accuracy and efficiency. Other hand-crafted features have also been combined with CNNs. Kim et al. [96] used background subtraction to first segment actions from a small sub-sequence of frames. The extracted images are then used to create view-invariant outer-boundary STVs, which are used as the input to a 3D CNN. Ji et al. [89] used hand-wired kernels to generate multiple channels of information from each video input. Channels which represented the grey-scale values, gradients along the horizontal and vertical axes, and optical flow information along the horizontal and vertical axes were used. Head-based agent detection was used to extract bounding boxes from the frames, which were used as input to the network. Whilst only a small number of stacked frames were input into the 3D CNN, auxiliary outputs composed of features which encoded the action information over a higher number of frames were used to regularise the network, by encouraging the 3D CNN to minimise the difference between the extracted features and the auxiliary outputs. The auxiliary outputs were formed using BOF and SIFT descriptors. Regularising the network demonstrated an improved performance over the un-regularised version [89].

Optical flow is often combined with deep networks for action recognition. However, it has also been shown that optical flow is only helpful on high quality datasets [194]. Datasets which feature videos in the wild create problems for optical flow, due to excessive noise caused by dynamic backgrounds and camera motion. Inspired by the what-where pathways, Simonyan and Zisserman [161] introduced separate 2D CNNs for spatial and temporal information. Current understanding of biological movement recognition is based on the two-streams hypothesis [59], which states that the brain uses both spatial and temporal information from the ventral and dorsal pathways. The ventral pathway (or what pathway) is responsive to shape, colour, and texture, whilst the dorsal pathway (or where pathway) is responsive to spatial transformations and movement. The aim of the network was to accumulate complementary information from both individual frames and the evolution of motion between frames. The spatial stream was trained on individual frames' RGB values, whilst the temporal stream was trained on multi-frame dense optical flow to explicitly represent motion. The two streams were then combined using late fusion.
Whilst both streams showed competitive performance individually, by combining both, accuracy improved significantly (6% over the temporal and 14% over the spatial). Chéron et al. [22] also used a multi-stream CNN to realise deep appearance and motion based features. Their results showed that an optical flow-based descriptor consistently outperforms an appearance-based one, and demonstrated that a combination of appearance and optical-flow based features improves performance over the individual cases. Gkioxari and Malik [56] used a similar approach for action localisation and classification. They also demonstrated that appearance and motion are complementary sources of information, and that using both leads to a significant improvement in performance.

Whilst deep learners are undoubtedly responsible for much progress in machine learning and computer vision in recent years, there are concerns regarding their inner workings. The inherent complexity of multiple non-linear layers makes them far from transparent [155], leading to their usage as a ‘black box’. This has therefore fostered new interest in the features they learn, and how similar features can be achieved using alternative, more widely understood methods. Wavelet scattering networks (ScatNet) [18] used wavelet operators as data-independent filters. In keeping with CNNs, convolution, non-linear and averaging operators were used in a cascaded architecture, which demonstrated state-of-the-art performance in handwritten digit and texture classification. PCANet [20] and DCTNet [134] used principal component analysis and discrete cosine transform derived filters, respectively, to process input images via convolution. The resulting activations were binarised and further processed using block-wise histograms, before classification by a linear SVM. Despite the simple architecture of these approaches, they achieved comparable performance to other, more complex methodologies in handwritten digit and object classification.

3.3 Convolutional Neural Networks

Neural networks, like the feedforward network, automatically learn features and, as such, it is possible that no pre-processing steps will be needed and that the network can be fed directly with raw input images. However, there are certain limitations with the standard fully connected feedforward model. To begin with, for high dimensional inputs (the curse of dimensionality), the number of learnable parameters would be extremely large, and therefore the learning would be computationally expensive and would require a larger dataset, due to the increased capacity [112], in order to prevent overfitting. In addition, feedforward networks do not acknowledge the topology of the input [112]. In certain circumstances this is not an issue; however, for inputs such as images and video, which have a definite topological structure, exploiting it is very beneficial. Furthermore, traditional feedforward networks have limited ability when it comes to providing invariance to scale, translation or distortion of the inputs [112]. Although this is possible in theory, in practice it would involve similarly weighted neurons in multiple spatial locations in order to detect the same features appearing in different locations, which again is impractical. By implementing shared weights, local connectivity and subsampling, CNNs address the aforementioned issues.

• Local connectivity: Each neuron in the network is only connected to a local receptive field on the previous layer. Therefore the connectivity between adjacent layers is spatially restricted. The extent to which they are restricted is called the receptive field of the neuron, which is equivalent to the filter size. These local filters make the features invariant to translations and minor deformations and reduce the number of parameters.

• Shared weights: Multiple distinct filters are mapped to different output feature maps, which thus represent the activations for a particular feature over the entire input; if a feature is useful in one region of an image it is likely to be useful in another [111]. This form of distributed representation [74], where each input is represented by multiple features and each feature is responsible for many inputs, allows the network to have greater expressive ability whilst simultaneously reducing parameter overhead.

• Spatial subsampling: The input is aggregated over overlapping or non-overlapping regions which provides better invariance to scale, translations and deformations in addition to further reducing the number of parameters.

When these attributes are combined in a deep architecture, rich global representations consisting of multiple layers of abstraction are formed, which are invariant to multiple transformations. In addition, this is achieved with a reduced number of parameters without affecting the network's computational power.

CNNs take their inspiration from the visual cortex; in particular, from experiments on cats and monkeys performed by Hubel and Wiesel [84, 85], who showed that the striate cortex (V1) of a monkey contains two types of cells: simple and complex. In addition, they found that the system is hierarchical, with geniculate cells converging on simple cells, simple cells converging on complex cells and complex cells converging on hypercomplex cells. In simple cells the receptive field is split into spatially distinct subregions separated by parallel straight lines. One of these inhibits the response whilst the other excites the response. Therefore, when a stimulus of the right size and shape is orientated correctly (above the excitatory sub-region), the response is optimal. Simple cells respond maximally to a slit, dark bar or edge within their receptive field. In contrast to simple cells, complex cells have no separation in their receptive field. Therefore, whilst they still respond maximally to stimuli of a certain orientation, they do this over the entirety of their receptive field, giving a certain spatial invariance. Fig. 3.4 and Fig. 3.5 show the response of simple and complex cells to various stimulus geometries, respectively. The line at the bottom of both figures indicates when the stimulus (white bar) is turned on and off. As can be seen in Fig. 3.4, when a stimulus of optimum size and shape is orientated above the excitatory region (first stimulus geometry) the response is optimal, whereas, in Fig. 3.5, the optimal response is achieved when a stimulus of the correct orientation is placed anywhere above the receptive field (first three stimulus geometries).

3.3.1 Layers and Architectures

3.3.1.1 Convolution

Equivalent to simple cells, convolution layers extract local features via the convolution of an input with a learned filter kernel. The input can be the original input images or the previous layer's feature maps. Each output feature map can incorporate convolutions with multiple input feature maps. The number of output feature maps produced is chosen arbitrarily, although generally the number increases with each layer in the overall architecture. The general form is defined in equation 3.6.

Figure 3.4: Simple cells (adapted from Hubel 1995 [83])

Figure 3.5: Complex cells (adapted from Hubel 1995 [83])

a_j^l = φ( b_j^l + ∑_{i=1}^{n} a_i^{l−1} ∗ k_ij^l )   (3.6)

where a_j^l represents the jth output feature map in layer l, a_i^{l−1} is the ith output feature map from layer l − 1 and also the ith input feature map to layer l, k_ij^l is the filter kernel connecting the respective input and output feature maps, b_j^l is the bias for the jth output feature map, n is the number of feature maps from layer l − 1, and ∗ represents the convolution operator. The non-linear activation function φ was traditionally sigmoidal, either a logistic function (equation 3.4) or a hyperbolic tangent (equation 3.5); however, the rectified linear unit (ReLU) (equation 3.7) and its variants are now standard (Fig. 3.6). Traditional non-linear sigmoidal activation functions are bounded by maximum and minimum values. Therefore, a neuron using a logistic activation function, whilst good for representing probability, will be in saturation when nearing a value of zero or one. Once saturated, the back-propagated error signal is diminished, since it is proportional to the gradient of the current layer's activation function, which approximates to zero at saturation. Therefore, as the size of the network increases, the back-propagated error signal approaches zero, making it harder to train the lower layers. The ReLU is a piece-wise linear function that can represent any non-negative real number and is therefore only bounded by its minimum value. Gradients for the ReLU can only take two values: 0 if z < 0 and 1 if z > 0. Therefore the gradient will not vanish as z is increased. Furthermore, since it has a true zero activation value, the ReLU can provide sparsity in the network [178]. Moreover, rectified linear units have been shown to speed up training. Krizhevsky et al. [101] demonstrated a six-fold decrease in the time taken for training a four layer CNN, using rectified linear units to achieve the same error rate as an equivalent network using tanh activation functions. A similar activation function was proposed by Jarrett [88].

Figure 3.6: Non-linear activation functions

φ(z) = rect(z) = max(0,z) (3.7)
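The sketch below is a naive NumPy illustration of equations 3.6 and 3.7: each output feature map is the ReLU of a bias plus the sum of 2D convolutions over all input maps. As is common in CNN implementations, cross-correlation (no kernel flipping) is used, and the map counts and kernel size are illustrative assumptions.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)                          # equation 3.7

def conv2d_valid(a, k):
    """'Valid' 2D cross-correlation of one feature map with one kernel."""
    H, W = a.shape
    kh, kw = k.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.sum(a[y:y + kh, x:x + kw] * k)
    return out

def conv_layer(inputs, kernels, biases):
    """Equation 3.6: each output map sums convolutions over all input maps."""
    return [relu(b + sum(conv2d_valid(a, k) for a, k in zip(inputs, ks)))
            for ks, b in zip(kernels, biases)]

rng = np.random.default_rng(0)
inputs = [rng.normal(size=(8, 8)) for _ in range(3)]                        # 3 input maps
kernels = [[rng.normal(size=(3, 3)) for _ in range(3)] for _ in range(4)]   # 4 output maps
out = conv_layer(inputs, kernels, biases=np.zeros(4))
print(len(out), out[0].shape)                                               # 4 maps of 6x6
```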

3.3.1.2 Pooling

Equivalent to complex cells, pooling or subsampling layers combine the activations of local receptive fields into a single activation, thus increasing the feature maps' invariance to translations and distortions. Once the features are located, their exact spatial position becomes less important than their relative position [112]. In fact, the exact location could potentially be harmful to classification, due to the variation of inputs for a particular class. Inherent in this process is a reduction in the resolution of the feature maps, which reduces computational complexity and helps to prevent overfitting. Unlike convolution layers, for n input feature maps there will be exactly n output feature maps. The general form of sub-sampling is described in equation 3.8.

a_j^l = φ( β_j^l down(a_j^{l−1}) + b_j^l )   (3.8)

where β_j^l and b_j^l are the multiplicative and additive biases for the jth output feature map in layer l and down is the pooling function.

There are many different types of pooling function. Typical examples include mean pooling and max pooling. Mean pooling simply outputs the average over each local receptive field, whilst max pooling selects the maximum value in each. Max pooling is often used in state-of-the-art networks [101, 162]. Recently, other types of pooling, such as stochastic pooling, have been proposed, which demonstrate improved performance [195].
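As a small illustration of the pooling operation, the following NumPy sketch performs non-overlapping max pooling over 2 × 2 regions; the pooling size and input map are illustrative.

```python
import numpy as np

def max_pool(a, size=2):
    """Non-overlapping max pooling: keep the largest activation in each
    size x size region, halving the spatial resolution for size = 2."""
    H, W = a.shape
    H, W = H - H % size, W - W % size                 # crop so the map divides evenly
    a = a[:H, :W].reshape(H // size, size, W // size, size)
    return a.max(axis=(1, 3))

fmap = np.arange(36, dtype=float).reshape(6, 6)
print(max_pool(fmap))                                 # 3x3 map of local maxima
```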

3.3.1.3 Fully Connected and Output

Fully connected layers are the same as hidden layers from traditional feedforward networks, where, as the name suggests, neurons are connected to all activations in the previous layer. As with convolutional layers, an activation function is applied to the output of each neuron. Once connected to the output layer, which is generally a softmax layer (Section 2.2.1), these form a standard feedforward classifier.

3.3.1.4 Architecture

CNNs are usually comprised of multiple alternating convolution and pooling layers. As more layers are added, the feature abstraction increases, automatically creating more complex, higher-level representations of the input. The low layers contain generic features, whilst features which are more related to the input data can be found in the higher layers [94]. In general, the number of feature maps increases as their spatial resolution decreases with each layer. Other considerations which affect the output size are stride and zero-padding. During convolution and pooling, the stride determines the distance between each operation. For example, for a stride of one, after each inner product of the convolution of a filter and an input, the filter is moved a single pixel. Pooling is generally non-overlapping and therefore the stride is often equal to the pooling size. In certain circumstances it can be convenient to pad the input with zeros. This allows the preservation of spatial size during convolution operations and also ensures that salient features that occupy edge regions are not discarded. The convolution and pooling layers are then generally followed by fully connected layers and an output layer. Fig. 3.7 shows the LeNet-5 CNN architecture that was originally used by LeCun [111] for handwritten digit recognition. The basic algorithm for CNNs is as follows:

1. Convolve different filter kernels with the input to form output feature maps
2. Apply pooling to each output feature map to form sub-sampled output feature maps
3. Repeat 1 and 2 to achieve high-level features
4. Use the resultant features as input to a classifier

3.3.2 Other Layers

Over recent years there have been many additions to the basic layers, as described in Sections 3.3.2.1, 3.3.2.2 and 3.3.2.3.

3.3.2.1 Batch Normalisation

Batch normalisation [86] normalises the input data distribution so that it has a fixed mean and variance. Each activation of an input mini-batch, X = [x1, x2, ..., xn], is normalised to zero mean and unit variance and then linearly transformed using:

Figure 3.7: LeNet-5 architecture (adapted from LeCun et al. 1998 [112]).

x′ = ((x − µ) / √(σ² + ε)) γ + E   (3.9)

where µ and σ² are the mean and variance of the current mini-batch X and ε is a small number to prevent numerical instability. Both µ and σ² are calculated across the batch dimension for each activation. Optimised during training, γ and E provide scale and shift respectively and prevent the normalisation from constraining the representational capacity of the layer (by setting γ = √(σ² + ε) and E = µ the original activations can be recovered). At test time the average mean and variance, calculated as exponential moving averages during training, are used. Batch normalisation layers are generally inserted between a linear transform and its respective non-linearity and can improve performance and stability, thus allowing for faster training times and the use of larger learning rates. Batch normalisation layers can also improve a network's ability to generalise to new data distributions.
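A minimal NumPy sketch of the training-time transformation in equation 3.9 is shown below; the scale and shift parameters are named gamma and shift (the latter standing in for E), and the randomly generated mini-batch is purely illustrative.

```python
import numpy as np

def batch_norm_train(X, gamma, shift, eps=1e-5):
    """Equation 3.9 at training time: normalise each activation over the
    mini-batch dimension, then apply the learned scale and shift."""
    mu = X.mean(axis=0)
    var = X.var(axis=0)
    X_hat = (X - mu) / np.sqrt(var + eps)
    return gamma * X_hat + shift, mu, var   # mu and var also feed the running averages

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=3.0, size=(32, 10))      # mini-batch of 32, 10 activations
out, mu, var = batch_norm_train(X, gamma=np.ones(10), shift=np.zeros(10))
print(out.mean(axis=0).round(3), out.std(axis=0).round(3))   # approximately 0 and 1
```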

3.3.2.2 DropOut

DropOut [167] is a regularisation technique that can mitigate the overfitting problem by injecting noise into the network structure. Effectively, it artificially creates new models by imposing restrictions on certain neurons within the existing architecture, but only increases training time by about a factor of two [78]. It achieves this by randomly removing or dropping neurons from the network. Specifically, given a constant probability p, neurons are temporarily removed from the network during training. Therefore, since these neurons do not contribute to forward or back propagation, the architecture of the model has effectively changed. This prevents neurons from relying on the output of others or co-adapting, forcing the network to learn more robust features which are less reactive to noise. In order to prevent scaling issues, the weights are scaled by 1/p. Previously, connection strategies that employed subsets were generally used to decrease the number of parameters and break up the symmetry of the network, thus forcing it to extract different features [112]. In addition to DropOut there are now other similar layers, such as DropConnect, which works by removing the weights or edges between neurons instead of the neurons themselves, such that the neurons can remain partially active [181].
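The sketch below illustrates the widely used “inverted” formulation of DropOut, in which p denotes the keep probability and the surviving activations are rescaled by 1/p during training so that no rescaling is required at test time; this is one common way of handling the scaling issue mentioned above, not necessarily the exact variant used elsewhere in this thesis.

```python
import numpy as np

def dropout(a, p=0.5, training=True, rng=None):
    """Inverted DropOut: keep each activation with probability p and rescale
    the survivors by 1/p during training; do nothing at test time."""
    if not training:
        return a
    rng = rng or np.random.default_rng()
    mask = rng.random(a.shape) < p
    return a * mask / p

activations = np.ones((4, 6))
print(dropout(activations, p=0.5))   # roughly half the units zeroed, survivors doubled
```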

3.3.2.3 1 × 1 Convolutions or Network in Network

1×1 layers are a special case of convolutional layer that effectively allow the depth, or number of channels, of an input volume to be altered whilst keeping the width and height constant.

For example, a given input of size h × w × d_a could be reduced in size to h × w × d_b by a linear projection via a 1 × 1 × d_a × d_b convolutional layer. This is effectively a full connection via shared weights that results in the parametric pooling of channels or feature maps via a weighted summation at each spatial location, which enables the learning of cross-channel interactions [116]. 1 × 1 convolutional layers are often used

for dimensionality reduction, where d_b < d_a, which can also help to reduce overfitting. However, they can also be used to increase non-linearity, by introducing a ReLU whilst keeping the input and output dimensions the same. This technique was originally introduced in Network in Network [116] to enhance local modelling and discriminability; however, its dimensionality reduction ability has since been adopted by many others, allowing for considerably deeper architectures, notably [73, 82, 170].
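The following NumPy sketch shows the channel-mixing view of a 1 × 1 convolution: at every spatial location the d_a input channels are linearly projected to d_b output channels; the volume sizes are illustrative.

```python
import numpy as np

def conv_1x1(volume, weights):
    """1x1 convolution: a weighted sum across channels at every spatial
    location, changing the depth from d_a to d_b."""
    # volume: (h, w, d_a); weights: (d_a, d_b)
    return np.tensordot(volume, weights, axes=([2], [0]))   # -> (h, w, d_b)

rng = np.random.default_rng(0)
volume = rng.normal(size=(16, 16, 64))                 # h x w x d_a input
weights = rng.normal(size=(64, 8))                     # project 64 channels down to 8
reduced = np.maximum(0.0, conv_1x1(volume, weights))   # optional ReLU non-linearity
print(reduced.shape)                                   # (16, 16, 8)
```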

3.3.3 Optimisation and the Backpropagation Algorithm

Optimisation of the CNN is usually performed by stochastic gradient descent. Whilst more advanced (second order) methods could be used, this method generally finds a satisfactory minimum. Whilst the objective is to find the global minimum of the error function, given the dimensionality of the space, a local minimum is often adequate. Choromanska et al. [23] demonstrated that in larger networks local minima are concentrated and their respective test errors are similar. In addition, it was found that a global minimum could lead to overfitting. The CNN is first trained on a training set of known example input-output pairs, before validation on an unseen test set is performed. The predicted output of the CNN is the class j with maximal probability under the categorical distribution given by the output layer, which is defined as:

y_pred = argmax_j p(y = j | x; θ)   (3.10)

Naturally, the objective of the CNN is to reduce the number of errors between the predicted labels and the ground truth (true labels). The zero-one loss calculates these errors as:

loss_{0,1} = ∑_{j=0}^{n−1} Φ_{h(x_j) ≠ y_j}   (3.11)

where n is the number of examples to be tested, x_j and y_j are the input and the ground truth of the jth example respectively, h(x) is the predicted output y as given in equation

3.10, and Φ is the indicator function, defined as:

Φ_i = { 1 if i is true; 0 otherwise }   (3.12)

Since the zero-one loss is not differentiable, it would be intractable to optimise for models which contain many parameters; therefore, the maximum likelihood estimator is used instead (Section 2.2.1). For mathematical convenience, the negative logarithm of the maximum likelihood estimator is used, which is known as the negative log likelihood (NLL) or, for multi-class problems, the cross-entropy loss. The objective of training the model is to minimise the NLL of the predictions. If p(y = j|x;θ) is the softmax probability of label j, given example x, then the loss over a batch of training examples of size n is defined by the equation:

NLL(θ; X) = −(1/n) ∑_{j=0}^{n−1} log p(y = y_j | x_j; θ)   (3.13)
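As a small worked illustration of equation 3.13, the snippet below computes the mean negative log-likelihood of the true labels from a batch of softmax outputs; the probabilities and labels are made up for the example.

```python
import numpy as np

def nll_loss(probs, labels):
    """Equation 3.13: mean negative log-likelihood of the true labels,
    given softmax probabilities for a mini-batch."""
    n = probs.shape[0]
    return -np.mean(np.log(probs[np.arange(n), labels] + 1e-12))

probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.3, 0.3, 0.4]])
labels = np.array([0, 1, 2])
print(nll_loss(probs, labels))   # low, since each true class has the highest probability
```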

To achieve this objective, the CNN's parameters θ are updated using the following equation:

∆θ_iter = −LR (∂E / ∂θ_iter)   (3.14)

where LR is the learning rate, iter is the iteration index, and E is the mean of the cost function for a single mini-batch. The backpropagation algorithm is used to find the partial derivatives of the cost function with respect to the parameters. The backpropagation algorithm performs the following steps:

1. For a given input, set the input activation a^{(0)} to the input itself and perform a forward

pass for each layer l = 1,2,3,...,L

z^{(l)} = W^{(l)} a^{(l-1)} + b^{(l)}, \quad a^{(l)} = φ(z^{(l)})    (3.15)

z^{(l)} = W^{(l)} φ(z^{(l-1)}) + b^{(l)}    (3.16)

2. Calculate the output error δ^{(L)} for the cost function E(θ).

3. Backpropagate the errors from the output layer back to the inputs. To this end,

derive an equation for the error δ^{(l)} in terms of the error δ^{(l+1)} using the chain rule:

δ^{(l)} = \frac{∂E(θ)}{∂z^{(l)}} = \frac{∂E}{∂z^{(l+1)}} \frac{∂z^{(l+1)}}{∂z^{(l)}} = δ^{(l+1)} \frac{∂z^{(l+1)}}{∂z^{(l)}}    (3.17)

δ^{(l)} = \left( W^{(l+1)T} δ^{(l+1)} \right) ⊙ φ'(z^{(l)})    (3.18)

4. Finally compute the gradients for each layer with respect to the parameters:

\frac{∂E(θ)}{∂W^{(l)}} = δ^{(l)} a^{(l-1)T}    (3.19)

\frac{∂E(θ)}{∂b^{(l)}} = δ^{(l)}    (3.20)

where ⊙ indicates the element-wise product and W and b are the weights and biases respectively. Stochastic gradient descent can be extended by the implementation of a concept called momentum. Momentum can help prevent a network getting stuck in local minima and generally leads to better convergence and reduced oscillation. It achieves this by making large steps in the direction of the velocity, which accumulates in any direction that has a consistent gradient. Stochastic gradient descent with momentum uses the following equations:

v_{iter} = ϖ \, v_{iter-1} - LR \, ∇_θ E(θ_{iter})    (3.21)

Δθ_{iter} = v_{iter}    (3.22)

where ϖ is the momentum coefficient (typically 0.9) and v is the velocity. In addition, there is also Nesterov [10, 169] momentum, which, unlike classical momentum, first makes a step in the direction of the previous velocity and then makes a correction based on the evaluation of the gradient at the new location. This foresight prevents it from making too much progress in an unfavourable direction and, in practice, makes it work better than classical momentum. Nesterov momentum is performed using the following equations:

v_{iter} = ϖ \, v_{iter-1} - LR \, ∇_θ E(θ_{iter} + ϖ \, v_{iter-1})    (3.23)

Δθ_{iter} = v_{iter}    (3.24)
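A minimal NumPy sketch of the update rules in equations 3.21-3.24 is given below; the toy cost function, parameter vector and hyper-parameter values are illustrative assumptions rather than settings used in any experiment here.

```python
import numpy as np

def grad(theta):
    # Hypothetical gradient of the cost E at theta (here: E = ||theta||^2 / 2).
    return theta

theta = np.array([1.0, -2.0, 0.5])
v = np.zeros_like(theta)
LR, momentum = 0.01, 0.9          # learning rate and momentum coefficient

for it in range(100):
    # Classical momentum (equations 3.21 and 3.22):
    #   v = momentum * v - LR * grad(theta); theta += v
    # Nesterov momentum (equations 3.23 and 3.24) evaluates the gradient
    # at the "look-ahead" point theta + momentum * v instead:
    v = momentum * v - LR * grad(theta + momentum * v)
    theta = theta + v

print(theta)  # converges towards the minimum of E
```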

Decaying the learning rate during training generally yields a better solution. However, finding the optimal rate of annealing can be difficult. Decay the learning rate too quickly and the solution is more likely to get stuck in local minima; conversely, decay it too slowly and unnecessary training iterations waste computation. Many different annealing strategies are employed, such as stepped and exponential decay. Care must be taken when setting the initial learning rate, as training is not guaranteed to make progress on the loss function if it is set too high. In contrast, momentum can be increased in the later stages of training.

In order to regularise the weight update, weight decay can be used. This penalises the magnitude of the parameters by introducing a second term into the objective function to prevent overfitting. Generally there are two types of weight decay: L1 and L2. L2 regularisation effectively prevents large values in the parameter space, instead preferring a more even distribution. This prevents the network relying on a few strong features and encourages it to learn features that generalise better. L2 regularisation uses the squared L2 norm and the loss becomes NLL + λ \sum_i θ_i^2, where λ is the regularisation strength. L1 regularisation, on the other hand, has the effect of sparsifying the weight vectors, which helps when the inputs are “noisy”. For L1 regularisation the loss becomes NLL + λ \sum_i |θ_i|. In practice, L2 generally leads to better performance. Lastly, it is important to note that there are other more advanced optimisation algorithms such as Adagrad [44] and Adam [97].
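As a simple illustration of the two weight decay variants, the following sketch adds an L1 or L2 penalty to a data loss; the function, parameter values and regularisation strength shown are hypothetical, for illustration only.

```python
import numpy as np

def regularised_loss(nll, theta, lam, kind="L2"):
    """Add a weight-decay penalty to the data loss (NLL).

    nll   : scalar negative log likelihood over a mini-batch
    theta : flattened parameter vector
    lam   : regularisation strength lambda
    """
    if kind == "L2":
        penalty = lam * np.sum(theta ** 2)      # discourages large weights
    else:  # "L1"
        penalty = lam * np.sum(np.abs(theta))   # encourages sparse weights
    return nll + penalty

theta = np.array([0.5, -1.2, 3.0])
print(regularised_loss(2.3, theta, lam=1e-4, kind="L2"))
print(regularised_loss(2.3, theta, lam=1e-4, kind="L1"))
```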

3.3.4 Pre-processing

Whilst CNNs have the ability to learn from raw inputs, in practice, methods generally perform better when combined with pre-processing techniques. This section will briefly summarise some of the most popular techniques. Data normalisation is perhaps the most commonly used form of pre-processing. There are many different types depending on the application:

1. Rescaling simply scales the data so that it lies within a particular range, usually either [0,1] or [−1,1]. This can be achieved using the following:

x' = \frac{(b - a)(x - \min(x))}{\max(x) - \min(x)} + a    (3.25)

where x is the original value, x' is the rescaled value, and a and b are the target range [a,b]. Another option is to rescale data so that it has unit length, by dividing the vector by its Euclidean or L2 norm ‖x‖. However, in some applications it may be preferable to use the L1 norm.

2. Feature standardisation independently normalises each feature of the data to have

zero mean and unit variance. This is achieved by using the following:

x'_i = \frac{x_i - \bar{x}}{ε}    (3.26)

where \bar{x} = \frac{1}{n} \sum_{i=0}^{n-1} x_i and ε = \sqrt{\frac{1}{n} \sum_{i=0}^{n-1} (x_i - \bar{x})^2} are the mean and standard deviation of each feature.

3. Per-example standardisation uses the same equation as feature standardisation,

however, x̄ and ε are the mean and standard deviation of each example. Thus each example of the data is normalised separately. More often it is performed by only subtracting the mean, although examples can also be divided by the standard deviation. Removing the mean has the effect of removing the average intensity or brightness of each example. A short code sketch of these normalisations is given below.
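The following NumPy sketch illustrates the three normalisation schemes above; the array shape and the small epsilon added for numerical stability are illustrative assumptions.

```python
import numpy as np

X = np.random.rand(100, 32 * 32) * 255.0   # assumed: 100 examples, flattened images

# 1. Rescaling to a target range [a, b] (equation 3.25).
a, b = 0.0, 1.0
X_rescaled = (b - a) * (X - X.min()) / (X.max() - X.min()) + a

# 2. Feature standardisation: zero mean, unit variance per feature (column).
X_feat = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-8)

# 3. Per-example standardisation: each row normalised by its own statistics.
X_ex = (X - X.mean(axis=1, keepdims=True)) / (X.std(axis=1, keepdims=True) + 1e-8)
```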

Whitening is a natural step to follow normalisation. Due to the highly correlated nature of images and videos, much of the raw input is redundant. Whitening removes this redundancy by linearly transforming the input data using PCA (Section 2.1.2.1), so that its features have a covariance equal to the identity matrix (i.e. uncorrelated) and each feature has equal variance. Therefore, for an input vector x:

x_{white} = W \, \mathrm{diag}\!\left( \frac{1}{\sqrt{\mathrm{diag}(S) + ε}} \right) W^T x    (3.27)

where the columns of W are the eigenvectors, S holds the eigenvalues, and ε is a small constant to prevent numerical instability.

CNNs are complex models which require sufficiently large datasets to train. One method, which artificially increases the size of a dataset, is called augmentation. In general, augmentation is the application of noise to a training set in order to increase its variability. In practice, this is often achieved by applying affine combinations of transformations (translation, reflection, scaling, rotation and shear) and elastic distortions [106]. Krizhevsky et al. [101] create additional training images by randomly cropping patches and applying horizontal flipping. These simple transformations have also been successfully applied to action recognition [6, 94, 161]. Other techniques directly alter the intensity values of pixels. For example, Krizhevsky et al. [101] altered RGB values of training images by adding multiples of the principal components. In addition, standard CNNs require a fixed input and therefore inputs usually need to be resized, although there are adaptations that can allow for variable resolution inputs, such as spatial pyramid pooling [72].
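Relating back to equation 3.27, the following is a minimal NumPy sketch of whitening; the data matrix and epsilon value are assumptions for illustration, and this is not the pre-processing code used in the experiments.

```python
import numpy as np

X = np.random.rand(1000, 64)                 # assumed: 1000 examples, 64 features
X = X - X.mean(axis=0)                       # centre the data first

cov = X.T @ X / X.shape[0]                   # feature covariance matrix
S, W = np.linalg.eigh(cov)                   # eigenvalues S, eigenvectors W (columns)

eps = 1e-5                                   # small constant for numerical stability
whitener = W @ np.diag(1.0 / np.sqrt(S + eps)) @ W.T   # equation 3.27

X_white = X @ whitener.T                     # whitened data: ~identity covariance
print(np.round(np.cov(X_white, rowvar=False)[:3, :3], 2))
```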

3.3.5 Issues

Whilst highly competitive given a large dataset, CNNs, like other neural networks, suffer performance problems when the amount and quality of training data is insufficient, and unfortunately, as previously mentioned, acquiring a reliable dataset can be laborious and time-consuming [180]. In addition, due to the capacity of the architectures used, limited datasets can cause overfitting [101].

Many strategies to overcome the problems associated with limited datasets have been attempted. Some of the most common strategies employed are DropOut and augmentation, as described in Sections 3.3.2.2 and 3.3.4, respectively. Other techniques provide an optimised starting point, which is beneficial to convergence. Whilst random weight initialisation works well when the datasets are of sufficient size [180], previous work on object classification has demonstrated that transfer learning and unsupervised pre-training have also helped to improve performance. Transfer learning involves the use of previously obtained knowledge. For example, Simonyan and Zisserman [161] successfully used the ImageNet (Section 2.3.1.4) database to pre-train the spatial stream of their action recognition model. Karpathy et al. [94] showed that by fine-tuning the last few layers of a CNN, pre-trained on a related dataset, it is possible to achieve better performance (65.4%, increased from 41.3%) than when trained from scratch on the new dataset. Instead of using knowledge gained from domain-related datasets, it is also possible to use more generic knowledge such as edges and colour blobs, which are common visual patterns. Transferring knowledge from unrelated datasets was demonstrated by [180], although it was also found that the direction of transfer learning can prove significant. In addition, the use of transfer learning can constrain the target architecture. The aim of unsupervised pre-training is to achieve an internal representation of the input that can reproduce the original data without significant loss. It was demonstrated that using unsupervised pre-training for weight initialisation, instead of assigning weights randomly, improved the test error by 41% on average when using only 3% of the MNIST dataset [180]. However, unsupervised pre-training has fallen out of favour in recent years, with transfer learning gaining more traction in the literature.

Whilst many methods have been proposed to increase the performance of CNNs and prevent overfitting, it is still acknowledged that the main increases in the performance of CNNs are attributed to increases in complexity, which therefore require more labelled data for optimisation. However, given the problems acquiring labelled data, this is not a sustainable pursuit.

3.4 2D Convolutional Neural Networks Case Studies

This section details the implementation and outcome of experiments performed using 2D CNNs on two separate datasets. It was thought prudent to first perform some simple initial experiments to confirm the validity of CNNs and the author’s implementation. Section 3.4.1 details an experiment on 2D image classification using the MNIST dataset. In Section 3.4.2, an experiment on action classification using frame-wise inputs is performed. Information about the datasets used can be found in Section 2.3.1. All experiments were carried out using Theano and an NVIDIA GeForce GT620 GPU.

3.4.1 Image Classification on the MNIST Dataset

The first experiment was conducted on the MNIST image dataset. Since this is an image dataset, the classifier was simply trained on the raw pixel values of each image. The only pre-processing undertaken on the images was to scale them to the range [0,1]. The 60000 training examples from the original MNIST dataset are split, so that 10000 are used as a validation set. By monitoring the performance of the model on the validation set, it is possible to select hyper-parameters which prevent overfitting. Therefore, the final dataset consisted of 50000, 10000 and 10000 images in the training, validation and test sets, respectively.

3.4.1.1 Network Architecture

The network architecture used in these experiments contains six layers, excluding the input, which roughly corresponds to the architecture used by LeCun [112]. The first four layers were two alternating convolution and pooling layers, which were followed by one fully connected layer, which was then connected to a softmax output layer. The hyperbolic tanh function was used as the activation function. The network is described as 1 × 28 × 28 − 20C5 − MP2 − 50C5 − MP2 − 500FC − 10Soft, where C, MP, FC and Soft represent convolutional, max-pooling, fully connected and softmax layers, respectively. This shorthand represents a network architecture with an input image whose spatial dimensions are equal to 28 pixels with a single channel, two convolutional layers featuring 20 and 50 feature maps using 5 × 5 filters, two max-pooling layers of size 2 × 2, a single fully connected hidden layer of 500 neurons and a

Softmax output layer of 10 neurons. The following describes each layer in more detail where the layers are labelled Cl, Pl and Fl, where l represents the layer index:

Layer C1 was composed of 20 feature maps (C1.1,C1.2,C1.3,...,C1.20) of size 24 × 24. Each feature map contains 576 nodes. Each node in each feature map was con- nected to an over-lapping 5×5 neighbourhood in the input plane. Since the connection weights between the 5×5 neighbourhood and each feature map node were constrained to be equal, there were 25 trainable weights per feature map and therefore 500 trainable parameters in total.

Layer P1 was composed of 20 feature maps of size 12 × 12. Each feature map contained 144 nodes. Each node took its input from a non-overlapping 2 × 2 neigh- bourhood in the corresponding C1 feature map and performed a max-pooling operation.

P1 had 20 trainable parameters.

Layer C2 was composed of 50 feature maps of size 8 × 8. Each feature map contained 64 nodes. This performed a similar function to that of C1; however, each node in each feature map was connected to overlapping 5 × 5 neighbourhoods in identical positions in all of the P1 maps. Once again, the connection weights for each feature map were constrained to be equal; however, since every C2 map was formed from convolutions between each P1 map and a distinct filter kernel, there must be separate connecting weights for each one. Thus, C2 had 25000 trainable parameters.

Layer P2 performed the same role as P1 and is composed of 50 feature maps of size

4 × 4. P2 had 50 trainable parameters.

Layer F3 was a standard hidden layer composed of 500 hidden nodes and was fully connected to P2. F3 had 400500 trainable parameters.

Finally, the output layer S4 was a softmax layer composed of 10 output nodes; one for each class. The softmax layer had 5010 trainable parameters. In total, there were 431080 trainable parameters.
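As a quick sanity check, the layer-wise counts quoted above can be tallied with a few lines of Python; this simply reproduces the totals stated in the text and is not part of the original implementation.

```python
# Trainable parameter counts per layer, as quoted in the text above.
params = {
    "C1": 20 * 5 * 5,               # 20 shared 5x5 kernels               = 500
    "P1": 20,                       # one trainable coefficient per map   = 20
    "C2": 50 * 20 * 5 * 5,          # 50 maps, each connected to 20 maps  = 25000
    "P2": 50,                       # one trainable coefficient per map   = 50
    "F3": 50 * 4 * 4 * 500 + 500,   # fully connected layer + biases      = 400500
    "S4": 500 * 10 + 10,            # softmax layer + biases              = 5010
}
print(sum(params.values()))         # 431080, matching the stated total
```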

3.4.1.2 Training

The model was trained using mini-batch stochastic gradient descent, with a constant learning rate of 0.1 and a batch size of 500. For each epoch, the model was first trained using the training set, then tested on the validation set to calculate the zero-one loss. The model was only tested on the test set if the validation error for the current epoch was the lowest error of all previous epochs. At this point, the model parameters were also saved as the best model. Therefore, the best model parameters refer to the best performing model on the validation set. The algorithm utilises early-stopping parameters to avoid overfitting, so that if a relative improvement on the validation error is not made after a certain number of epochs, the algorithm ceases further optimisation.

As recommended by Bengio [9], the weights for each layer apart from the softmax layer were initialised from a uniform distribution between −a and a, where a = \sqrt{\frac{6}{fan_{in} + fan_{out}}}, and fan_{in} and fan_{out} are the number of inputs and outputs for a hidden neuron in that layer, respectively. The weights for the softmax layer and all the biases were initialised to zero.
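A minimal sketch of the weight initialisation described above is given below; the layer sizes used are hypothetical.

```python
import numpy as np

def uniform_init(fan_in, fan_out, rng=np.random):
    """Sample a weight matrix from U[-a, a] with a = sqrt(6 / (fan_in + fan_out))."""
    a = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-a, a, size=(fan_in, fan_out))

W_hidden = uniform_init(800, 500)    # e.g. a fully connected hidden layer
b_hidden = np.zeros(500)             # biases initialised to zero
```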

3.4.1.3 Results and Discussion

Over three runs, the model achieved an average test error of 0.97 ± 0.04%, which compares favourably to the 0.95% test error reported in [112]. However, it should be noted that, whilst the architecture implemented in this experiment used fewer layers, it utilised more feature maps, as well as a denser connection between layers P1 and C2. As such, it has many more trainable parameters.

(a) 0 iterations (b) 1000 iterations (c) 16999 iterations

Figure 3.8: First layer convolutional filters at different stages of training on MNIST.

Fig. 3.8 shows the learned filters for the first convolutional layer at 0 (Fig. 3.8a), 1000 (Fig. 3.8b) and 16999 (Fig. 3.8c) iterations, respectively. The filters at iteration 16999 represent the best filters for the model, since with these filters the model achieved the best results on the validation set. As can be seen, the filters for the best model exhibited various frequencies and orientations, although most are low frequency. Comparing the filters showed that as the iterations increase and the model converges towards a minimum, the filters change until some eventually appear to resemble Gabor-like edge detectors. This is to be expected, since in the lower layers we should see more generic feature extractors. Edges are elementary visual features that are present in all images and, from observation, are representative of the input distribution.

By inspecting the confusion matrix, which can be seen in Fig. 3.9, the most common mistakes were made on digits three and nine. Threes were sometimes misclassified as twos or nines, whilst nines were sometimes misclassified as zeros or sevens. However, these mistakes are justifiable, considering the similarity of the shapes of the misclassified digits.

Figure 3.9: Confusion matrix for the MNIST experiment.

3.4.2 Action Classification on the Weizmann Dataset

The next experiment was conducted on the Weizmann (Section 2.3.2) action dataset. To simplify classification, it was performed per frame; thus the experiment is effectively the same as that in Section 3.4.1, using the frames of the videos as images. To further simplify the problem, only three action classes were used (bend, run, wave1). These classes were purposely chosen because they exhibit very different image representations. The experiment was evaluated using leave-one-out cross-validation. Therefore, the model was run nine times and each time a different agent was used as the test set. From the remaining eight agents, one was randomly selected as the validation set and the rest were used for training.

3.4.2.1 Pre-processing

Instead of using the original video files as inputs, the aligned background-subtracted video files were used, thus providing a global image representation. However, these were of variable spatial resolution, and since the CNN requires a fixed-size input, some pre-processing steps were undertaken. Firstly, MATLAB's blob analysis function was used to detect the centroid of each frame. A 58 × 80 crop of each of the first 30 frames of each video was then taken, centred at the centroid. This crop resolution was chosen based on a compromise between the minimum spatial resolution of all the source videos and the objective of extracting the complete anthropometry of each agent, using the smallest resolution possible. The duration of the shortest video was 31 frames, therefore crops were only taken from the first 30 frames, in order to ensure that there were the same number of examples per class in the dataset, so that it was not biased towards a particular class. All images were then resized to 70 × 70 to provide the CNN with a square input. In order to augment the dataset, each frame was flipped horizontally to generate horizontal reflections, then ten random 64 × 64 crops were taken from each. It is these crops that formed the final dataset. This therefore increased the size of the dataset by a factor of 20, to help combat overfitting. However, it should be noted that the resulting images exhibited strong similarities. Each image was then reduced in scale by 0.5, so that the CNN could be implemented with a similar architecture to that used in Section 3.4.1. A larger input would require a deeper network and thus increase the complexity and therefore require a larger dataset for training. In total, the dataset consisted of 12600 images in the training set, and 1800 images in both the validation and test sets.
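The flip-and-crop augmentation described above can be sketched as follows; NumPy is used here purely for illustration (the actual pre-processing used MATLAB), and the frame array and random-number handling are assumptions.

```python
import numpy as np

def augment_frame(frame, rng=np.random, crop=64, n_crops=10):
    """Horizontal reflection plus random crops, as described above.

    frame : 70 x 70 greyscale frame already cropped about the centroid
    Returns 2 * n_crops cropped images (original + mirrored versions).
    """
    out = []
    for img in (frame, frame[:, ::-1]):            # original and horizontal flip
        for _ in range(n_crops):
            y = rng.randint(0, img.shape[0] - crop + 1)
            x = rng.randint(0, img.shape[1] - crop + 1)
            out.append(img[y:y + crop, x:x + crop])
    return np.stack(out)

frame = np.random.rand(70, 70)                     # placeholder for a resized frame
print(augment_frame(frame).shape)                  # (20, 64, 64): a 20x augmentation
```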

3.4.2.2 Architecture

The architecture used in this experiment was very similar to that used in Section 3.4.1, apart from a few details. Firstly, the input size was increased to 32 × 32, therefore the size of the feature maps in each layer increased. Thus, layer P2 was now composed of 50 feature maps of size 5 × 5. In addition, since only three classes were used, it was decided that only 100 hidden neurons should be used in F3. Therefore, there were

125100 and 303 trainable parameters in F3 and the softmax layer, respectively. In total, the model had 150973 trainable parameters, 280107 fewer than the model in Section 3.4.1.

3.4.2.3 Training and Testing

The model for each sub-experiment was trained using mini-batch stochastic gradient descent with the learning rate initially set to 0.01, a learning rate decrease constant of 0.0001 and a batch size of 200. The learning rate was annealed each epoch using the following equation:

LR_{epoch+1} = \frac{LR_{epoch}}{1 + Λ \times epoch}    (3.28)

where epoch is the epoch index and Λ is the learning rate decrease constant.
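A minimal sketch of the annealing schedule of equation 3.28, using the initial values stated above:

```python
def anneal(lr, epoch, decrease_constant=0.0001):
    """Equation 3.28: LR_{epoch+1} = LR_epoch / (1 + lambda * epoch)."""
    return lr / (1.0 + decrease_constant * epoch)

lr = 0.01
for epoch in range(5):
    print(epoch, lr)
    lr = anneal(lr, epoch)
```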

3.4.2.4 Results and Discussion

Averaging the errors across each of the nine sub-experiments gave a validation error of 0.90% and a test error of 0.88%. The results for each individual sub-experiment can be seen in Table 3.1. Columns one and two show the errors for the validation and test sets. Columns three, four and five show the intra-class percentage error for the test set. The intra-class error is the number of incorrect classifications within each class. For example, for the experiment in the first row, the result in the fourth column indicates that 4.3% of the run examples contained in the test set were misclassified as either bend or wave1. It is interesting to note that, apart from when Denis is used as the test subject, the misclassifications are not really spread out among the classes for each sub-experiment, indicating that the test subjects used for particular sub-experiments were prone to the same misclassifications by the CNN.

Table 3.2 shows the absolute errors for each class across all sub-experiments. For example, the first two columns show the instances when bend was misclassified as run and wave1, and in which sub-experiments they occurred. Columns 3-4 and 5-6 show the same for run and wave1, respectively. Overall, the run class was the most misclassified, with 79 out of a total of 142 misclassifications. This was followed by 45 wave1 misclassifications and 18 bend misclassifications. In general, the majority of bend misclassifications were classified as wave1, with 14 misclassifications out of a total of 18. This is mirrored by wave1, for which 44 of its 45 misclassifications were classified as bend. In fact, the only time that each class was misclassified as run was during the same sub-experiment. The run class was misclassified as bend and wave1 in almost equal measure, with 40 classified as bend and 39 classified as wave1.

Whilst the results for this experiment showed that the accuracy for frame-level recognition is high, they also highlighted distinct problems when classifying frames of actions in isolation, even when using three highly different classes. Whilst bend, run and wave1 appear very different to humans - they were explicitly chosen for this purpose - at frame level, high inter-class similarities occur. Yet, it should be stressed that no strong conclusions can be drawn from this set of experiments due to the reduced dataset. Moreover, due to this fact, the results cannot be compared against other reports in the literature.

Table 3.1: Validation and test error as well as intra-class error using different subjects as the test set on Weizmann.

Test Subject   Val. Error (%)   Test Error (%)   Bend Error (%)   Run Error (%)   Wave1 Error (%)
Shahar         0.11             1.44             0.00             4.30            0.00
Moshe          1.39             0.06             0.20             0.00            0.00
Lyoya          0.00             0.67             2.00             0.00            0.00
Lena           0.00             1.72             0.00             0.00            5.20
Ira            0.00             1.89             0.00             5.60            0.00
Ido            0.00             0.00             0.00             0.00            0.00
Eli            0.94             0.06             0.20             0.00            0.00
Denis          1.06             1.33             0.70             3.20            0.20
Daria          4.61             0.72             0.00             0.00            2.20
Avg            0.90             0.88             0.34             1.46            0.84

Table 3.2: Absolute misclassification for each class using different subjects as the test set on Weizmann.

                 True class = Bend      True class = Run       True class = Wave1
Test Subject     Run       Wave1        Bend      Wave1        Bend      Run
Shahar           0         0            0         26           0         0
Moshe            0         1            0         0            0         0
Lyoya            0         12           0         0            0         0
Lena             0         0            0         0            31        0
Ira              0         0            21        13           0         0
Ido              0         0            0         0            0         0
Eli              0         1            0         0            0         0
Denis            4         0            19        0            0         1
Daria            0         0            0         0            13        0
Total            4         14           40        39           44        1

3.5 3D Convolutional Neural Networks Case Study

In 2D convolutions, an image input results in an image output (the same happens for multiple input images, which are treated as separate channels). Therefore, features are learned over the spatial dimensions only. In contrast, 3D convolutions encode motion information over multiple contiguous frames to produce volumetric outputs that preserve the temporal as well as the spatial dimensions. To perform 3D convolution, the network is presented with a volumetric input, formed of multiple stacked frames of a video or similar, which is convolved with a 3D kernel. The difference between 2D and 3D convolutions is illustrated in Fig. 3.10. 3D pooling is an extension of 2D pooling to the temporal dimension.

3.5.1 Action Classification on the UCF Sports Dataset

For this set of experiments, the UCF Sports dataset (Section 2.3.2.2) was used. The authors of the dataset recommend using Leave-One-Out (LOO) cross-validation. However, there are problems with this method. There are strong correlations between videos in multiple classes. Many video clips are created from the same root video and display high similarities. Some feature the same actor performing highly similar actions and are captured from the same viewpoint. Thus, these similarities can be exploited in training. Furthermore, it would take too long to train the proposed deep networks for each round of validation.

Lan et al. [103] suggested splitting the data in order to mitigate the correlation issues, by taking one third of the videos from each class to form a test set. They do this by simply selecting the first third of the videos for each class from their alphabetical arrangement. However, certain classes (golf and kicking) are split into subclasses and therefore this method means certain subclasses are disproportionately represented in the splits. For example, golf has 18 videos in total, divided into three subclasses (golf swing-back, golf swing-front, golf swing-side) featuring five, eight and five videos, respectively. By selecting the first third to form the test set, there are no golf swing-back videos present in the training set. Instead, for these experiments a third of the videos were randomly selected from each class to form the test set. Therefore, whilst the total number of videos in each split is the same as Lan et al. [103], the splits differ in the actual videos that make up each split. The same randomly selected split was used throughout the experiments, in order to compare performances.

Section 3.5.1.5 includes an experiment conducted using a standard 3D CNN in order to classify the video in a global sense. In Section 3.5.1.6, an experiment was performed using figure-centric bounding box annotations and compared against the baseline 3D experiment. Both experiments used the same architecture and all experiments were carried out using Theano/Lasagne and NVIDIA GeForce GTX 980 and GTX TITAN X GPUs.

(a) 2D convolution

(b) 3D convolution

Figure 3.10: Comparison of 2D (a) and 3D (b) convolutions (the temporal depth of the 3D filter is equal to 3). The colours indicate shared weights (adapted from Ji et al. 2013 [89]).

3.5.1.1 Architecture

The network used was based on that of Tran et al. [173]. They took inspiration from Simonyan and Zisserman's VGG [162], which proposed using very deep networks of 16-19 weight layers and very small convolution filters (3 × 3). Spatial pooling is not used after every convolution layer, since a stack of two 3 × 3 filters, which achieves the effective receptive field of a single 5 × 5 filter, has been shown to be more discriminative (two non-linearities are used instead of one), in addition to using fewer parameters than the single-filter counterpart. The network architecture is presented in Table 3.3 and is described as 3 × 8 × 112 × 112 − 64C3 − MP − 128C3 − MP − 256C3 − 256C3 − MP2 − 512C3 − 512C3 − MP2 − 512C3 − 512C3 − MP2 − 4096FC − 4096FC − 10Soft. The network featured five convolutional and pooling layers, two fully connected layers and a softmax output layer. The last three convolutional and pooling layers featured two stacked convolutional layers. The number of convolutional filters was relatively small, with 64 in the first layer, and with the number doubling in each layer, apart from the fifth layer, which had the same number (512) as the fourth. All convolutional filters were 3 × 3 × 3, with stride 1 × 1 × 1, and all pooling layers were 2 × 2 × 2, with stride 2 × 2 × 2, except for the first and second layers, which used a size of 1 × 2 × 2 and stride 1 × 2 × 2 in order to preserve temporal information early on, as in [173]. Both fully connected layers had 4096 units. The ReLU non-linear activation function was applied to the output of all convolutional and fully connected layers. In spite of the depth, the number of parameters was 78M with an input size of 112 × 112, which, because of the number and size of the filters, is not much bigger than equivalent 2D versions [162].

Table 3.3: 3D CNN architecture for UCF Sports

Input: 3 × 8 × 112 × 112 RGB video clip
64 3 × 3 × 3 conv, ReLU, stride 1 × 1 × 1
1 × 2 × 2 max-pooling, stride 1 × 2 × 2
128 3 × 3 × 3 conv, ReLU, stride 1 × 1 × 1
1 × 2 × 2 max-pooling, stride 1 × 2 × 2
256 3 × 3 × 3 conv, ReLU, stride 1 × 1 × 1
256 3 × 3 × 3 conv, ReLU, stride 1 × 1 × 1
2 × 2 × 2 max-pooling, stride 2 × 2 × 2
512 3 × 3 × 3 conv, ReLU, stride 1 × 1 × 1
512 3 × 3 × 3 conv, ReLU, stride 1 × 1 × 1
2 × 2 × 2 max-pooling, stride 2 × 2 × 2
512 3 × 3 × 3 conv, ReLU, stride 1 × 1 × 1
512 3 × 3 × 3 conv, ReLU, stride 1 × 1 × 1
2 × 2 × 2 max-pooling, stride 2 × 2 × 2
4096 fully connected
4096 fully connected
10-class softmax

3.5.1.2 Pre-training

Considering the dataset used was limited in size, and the large number of parameters that needed to be trained, it would be problematic to train from scratch. Thus, because of the well-reported advantages [140, 162, 187] of using pre-trained weights, the network was initialised with the publicly available weights of [173]. These weights were trained on Sports-1M (Section 2.3.2.5), which is presently one of the largest video classification benchmarks, featuring over one million videos split into 487 classes. Using pre-trained weights from a dataset with such a large number of classes and videos should give additional performance benefits [140]. Since the pre-trained dataset has many more classes, the provided parameters were only used up to the last fully connected layer. The weights in the last layer were randomly initialised so that W ∼ U[−a, a], where a = \sqrt{\frac{12}{fan_{in} + fan_{out}}}, and the biases were set to zero [57].

3.5.1.3 Training

Since the dataset contains videos of varying length, clips consisting of eight consecutive frames were randomly selected to form the input to the network. Once selected, each frame was resized to 128 × 171, then the central 112 × 112 crop was taken so that the final input dimensions were 3 × 8 × 112 × 112. To form a mini-batch, each video was sampled with replacement uniformly across the entire training set. Therefore, for each clip in the batch, first a random video was selected, from which a clip was randomly sampled. To use the pre-trained features without fine-tuning would be inappropriate, since they were trained on a dataset with many more classes and, although of a similar nature, the dataset statistics will inevitably be different. The whole network was fine-tuned using mini-batch stochastic gradient descent with Nesterov momentum (NAG). The batch size was 32, with the learning rate and momentum set to 0.001 and 0.9, respectively. Training ran for 50 epochs. No augmentation was performed on the dataset since the network had been pre-trained on a similar dataset. Furthermore, since each input was randomly selected, some form of augmentation was performed as a result. In addition, this process also mitigated any problems with temporal correlations associated with the dataset.

Table 3.4: 3D baseline: accuracy on UCF Sports

Run   Clip Loss   Clip Acc. (%)   Vid. Acc. (%)
1     0.6299      86.59           87.94
2     0.4033      88.44           88.65
3     0.4883      84.61           85.11
Avg   -           86.55±1.56      87.23±1.53
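The clip sampling and cropping procedure described in Section 3.5.1.3 can be sketched as follows; this is NumPy pseudocode with assumed array shapes, whereas the real data pipeline was built with Theano/Lasagne.

```python
import numpy as np

def sample_clip(video, rng=np.random, clip_len=8, crop=112):
    """video: array of shape (n_frames, 128, 171, 3), i.e. already resized frames."""
    start = rng.randint(0, video.shape[0] - clip_len + 1)
    clip = video[start:start + clip_len]                 # 8 consecutive frames
    # Central 112 x 112 crop of each frame.
    y0 = (clip.shape[1] - crop) // 2
    x0 = (clip.shape[2] - crop) // 2
    clip = clip[:, y0:y0 + crop, x0:x0 + crop, :]
    return clip.transpose(3, 0, 1, 2)                    # -> (3, 8, 112, 112)

video = np.random.rand(90, 128, 171, 3)                  # placeholder video
print(sample_clip(video).shape)                          # (3, 8, 112, 112)
```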

3.5.1.4 Testing

At test time, ten clips were randomly selected from each video in the test set. Each clip underwent the same treatment as during training. Loss was measured on a clip-wise basis, averaged across all examples. Accuracy was measured on a clip-wise and video-wise basis. To calculate the video-wise accuracy, the softmax outputs of all ten clips for each video were averaged.
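A minimal sketch of the video-wise aggregation described above, assuming the ten clip-wise softmax outputs for a single video are already available:

```python
import numpy as np

def video_prediction(clip_softmax):
    """clip_softmax: (10, n_classes) softmax outputs for the ten clips of one video."""
    video_probs = clip_softmax.mean(axis=0)   # average the clip-wise distributions
    return int(np.argmax(video_probs))        # video-level predicted class

clip_softmax = np.random.dirichlet(np.ones(10), size=10)  # placeholder outputs
print(video_prediction(clip_softmax))
```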

3.5.1.5 Baseline Results and Discussion

The network was trained three times and then each model was independently evaluated on the test set a total of three times, using the model parameters from the last epoch of training, and the results were then averaged. This yielded a mean clip and video accuracy of 86.55% and 87.23% over the three separately trained networks, respectively (Table 3.4). Since the video accuracy is found by averaging the softmax output of all ten clips of a video, a misclassification with a high softmax weight can have a considerable impact on the value.

Fig. 3.11 shows the confusion matrix for the clip-wise predictions. Only the confusion matrix for the first test of the first run is shown; however, this is fairly representative of the other evaluations. As shown, the most common mistakes are on running and skateboarding videos. There are also mistakes for kicking and walking. Some of these mistakes seem reasonable. However, further investigation of the errors highlights some shortcomings, specifically to do with the holistic approach. Since the model is using the central crop as its input, it assumes that things will be canonical in scale and position. When these assumptions do not hold, it can lead to misclassifications. If the actor performing the action is small in scale compared to the resolution of the input, the actor will present a weak signal to the system. A particular example that highlights the problems with position is a video from the walking class. In the video the actor occupies the extreme right of the frame and thus the target is missing from the input. The video was taken on a golf course and was misclassified as golf, which also highlights the background sensitivity of holistic methods. While not present in the test set, the dataset contains some videos which feature multiple actors performing actions, not all of which are consistent with the ground-truth label. Since this approach has no prior knowledge of the target instance, features that correspond to actors that do not represent the ground-truth action could be harnessed, which could present problems during training. Whilst deep models are somewhat robust to noise, the dataset is relatively small, so this could lead to overfitting. Another problem that is evident from some of the errors observed is that eight frames are not always long enough to capture the most discriminative part of the action.

Figure 3.11: Confusion matrix for the 3D baseline experiment on UCF Sports.

3.5.1.6 Figure-centric Bounding Box Results and Discussion

Since in some instances the holistic approach of the previous method led to problems with position and scale, figure-centric bounding box annotations were used. Bounding boxes are one of the most popular ways of extracting global representations which also indicate the target instance. Since they are annotated about the target, they implicitly provide scale and positional information. In addition, selecting the target instance will eliminate the possibility of extracting features associated with the wrong actor in a video containing multiple actors performing different actions. The use of bounding boxes also mitigates the effects of scene correlations associated with the dataset.

During training and testing, for each randomly selected clip, figure-centric bounding box annotations were used to extract the final cropped input. The bounding box annotations contain x and y coordinates, as well as height and width measurements. To ensure the input to the network was square, the smallest side of the bounding box was made equal to its largest side. For each clip, only the coordinates corresponding to the first frame were used, and the remaining seven frames were cropped according to the same coordinates. Once cropped, the input was resized to 112 × 112, so that the final input dimensions were consistent with 3 × 8 × 112 × 112.

The network was trained three times. Each model was then independently evaluated on the test set a total of three times, using the model parameters from the last epoch of training, and the results were averaged. This yielded a mean clip and video accuracy of 92.10% and 93.85% for the three trained networks, respectively (Table 3.5). This is a 6% mean improvement over the holistic approach used in Section 3.5.1.5.

Table 3.5: 3D bounding box: accuracy on UCF Sports

Test   Clip Loss   Clip Acc. (%)   Vid. Acc. (%)
1      0.2314      92.82           93.62
2      0.2341      90.14           92.91
3      0.2158      93.33           95.03
Avg    -           92.10±1.40      93.85±0.88

The addition of bounding boxes clearly increased performance. However, there were some instances where the model was still making errors. Fig. 3.12 shows the confusion matrix for the clip-wise predictions. Again, only the predictions from the first test of the first run are shown, as the results for the remaining tests were similar. As with the holistic approach in Section 3.5.1.5, the main misclassifications are associated with the running and skateboarding classes. There are still minor misclassifications related to the kicking actions, although the performance has improved. The majority of these improvements were probably due to the limitations of the holistic approach, as discussed above. Notably, the misclassifications appeared more spread out than previously. The results in Section 3.5.1.5 show that the misclassifications were more consistent. Studying the results in more detail suggests that the consistency was perhaps due to correlations in the background. Since the background noise had been reduced, the network was less able to exploit these correlations; instead, the network is encouraged to find more figure-centric spatio-temporal features. Although, it should be noted that the background was still causing problems in some cases.

Figure 3.12: Confusion matrix for the 3D bounding box experiment on UCF Sports.
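Returning to the cropping procedure described at the start of Section 3.5.1.6, a minimal sketch of making a bounding box square about the actor, cropping and resizing is given below; the frame, the nearest-neighbour resize and the coordinate convention are simplifying assumptions rather than the exact implementation used here.

```python
import numpy as np

def square_bbox_crop(frame, x, y, w, h, out_size=112):
    """Grow the bounding box's smaller side to match the larger, then crop and resize.

    frame : (H, W, 3) frame; x, y, w, h : bounding box (top-left corner and size)
    """
    side = max(w, h)                                  # smaller side grown to the larger
    cx, cy = x + w / 2.0, y + h / 2.0                 # keep the box centred on the actor
    x0 = int(max(0, cx - side / 2.0))
    y0 = int(max(0, cy - side / 2.0))
    crop = frame[y0:y0 + side, x0:x0 + side]
    # Nearest-neighbour resize to out_size x out_size (a library resize would be used in practice).
    ys = (np.arange(out_size) * crop.shape[0] / out_size).astype(int)
    xs = (np.arange(out_size) * crop.shape[1] / out_size).astype(int)
    return crop[ys][:, xs]

frame = np.random.rand(240, 320, 3)
print(square_bbox_crop(frame, x=100, y=60, w=60, h=120).shape)   # (112, 112, 3)
```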

3.5.1.7 Overall Discussion and Comparison to State-of-the-art

Table 3.6 shows the results for the baseline experiment as well as the figure-centric bounding box experiment (Baseline + BB), along with other state-of-the-art results. Results are included for the recommended Leave-One-Out (LOO) evaluation strategy, as well as the train/test split used by Lan et al. [103] (1/3 split (first)), in addition to the train/test split used here (1/3 split (rand.)). Firstly, the results for the 3D CNNs indicate that the use of bounding box annotations can significantly increase performance (93.85% vs 87.23%). In addition, the results are better than the other state-of-the-art approaches, which all use hand-crafted feature descriptors. Lan et al. [103] and Wang et al. [183] both used dense HOG3D, and Action MACH [150] employed a template-based method using a Maximum Average Correlation Height (MACH) filter. The results for the 3D CNNs even show improved performance over the results which use LOO, which should demonstrate improved performance over the other two evaluation strategies due to its reduced test set and increased training set sizes (evidenced by the 10% increase for Lan et al. [103]). However, the 3D CNNs utilised pre-trained features which were fine-tuned on the target dataset, and thus leveraged the inbuilt knowledge of a large labelled dataset. The importance of well-annotated datasets was further highlighted by the increased performance when bounding box annotations were used.

Table 3.6: Accuracy on UCF Sports

Method               Evaluation strategy   Accuracy (%)
Action MACH [150]    LOO                   69.2
Lan et al. [103]     LOO                   83.7
Wang et al. [183]    LOO                   85.6
Lan et al. [103]     1/3 split (first)     73.1
Baseline             1/3 split (rand.)     87.23±1.53
Baseline + BB        1/3 split (rand.)     93.85±0.88

Chapter 4

Self-Organising Map Network

4.1 Introduction

The task of vision-based image classification has garnered much attention in recent years. Robust solutions to this task have many wide-ranging applications. However, the task is extremely challenging due to well-known intra-class variations, such as lighting, deformations, occlusions and misalignments. Up until recently, low-level hand-crafted features such as Gabor [92], local binary patterns (LBP) [137] and the scale invariant feature transform (SIFT) [121] were used extensively to successfully overcome such problems. Yet, manually designed features are very problem-dependent, overly relying on domain-specific knowledge; consequently, they may not generalise well to other related data. In addition, hand-crafted features can often be complex, and thus expert knowledge is required to apply them to new conditions. Using data-driven representation learning techniques to learn task-specific salient features has been proposed to obviate the need for complicated and costly engineered features. Deep learning has recently made numerous advances in many applications, including object detection and localisation [157], image recognition [73], as well as human action classification [89, 161]. Of these methods, the convolutional neural network


(CNN) [111] has become ubiquitous in the field of image classification [101] due to its superior performance, demonstrating above human-level accuracy in some cases [25], and its relative ease of use compared to more traditional hand-crafted approaches. The cornerstone of its performance lies in its ability to learn increasingly complex data-specific features through a multi-stage non-linear architecture, providing greater robustness to intra-class variations.

Despite the CNN's widespread use, there is limited understanding of its complex learning mechanism and it is often used as a “black box”. In addition, the supervised backpropagation algorithm used to train CNNs can be costly in terms of time and the need for vast amounts of labelled data. Furthermore, obtaining the best performance often requires manually fine-tuning its numerous parameters and the use of various expedient add-ons [78]. Scattering convolutional networks (SCATNet) [18] shone some much-needed light on the underlying processes governing CNNs by using generated wavelet operators as learning-free filters. Used in a similar cascaded architecture, SCATNet outperformed both CNN and deep neural networks on two challenging datasets [18, 159].

Inspired by SCATNet, PCANet [20] was introduced as a simple deep-learning framework for image classification. It utilises multi-stage principal component analysis (PCA) derived filter banks, followed by binarisation and local histogramming for indexing and pooling. Despite its simple structure, it performed surprisingly well on many image classification tasks, outperforming many more complex methodologies. This simple approach was further developed into a learning-free, data-independent version called DCTNet [134]. DCTNet uses discrete cosine transform (DCT)-derived filter banks, as it has been demonstrated that the DCT is a good approximation for the most important principal components. However, there are a number of inherent constraints that limit the architecture of both networks.

This chapter presents a self-organising map (SOM)-based alternative to PCANet and DCTNet. Both unsupervised data-driven and generative approaches are demonstrated, and a simple trick in the binarisation process is introduced which greatly reduces the final feature vector size without significantly affecting performance. In addition, SOM offers many benefits over the use of PCA and DCT, especially that the number of possible filters is not affected by the filter size.

4.2 Related Work

Chan et al. [20] proposed PCANet, which uses a filter bank derived by PCA. The resultant filters are convolved with the input to form activations. This is repeated for each subsequent layer. No activation function is used between layers; instead, a non-linearity is introduced in the final layer. Their experiments with different architectures found that a two-layer network achieves the best results. The output layer produces the final feature vector by first binarising the activation maps of the previous layer and then local histogramming. Binarisation involves converting the activation maps to binary using a Heaviside function, then transforming each pixel's bit string to decimal, to produce a single activation map. Binarised Statistical Image Features (BSIF) [93] used this binarisation technique prior to PCANet; however, it was only used as an image descriptor. Due to the size of the resultant output feature vectors, further reduction techniques are sometimes necessary. Whilst PCANet produces excellent results for such a compact network, there are some limitations. Specifically, the filter number and network depth are limited by the aforementioned processes used to achieve the final feature vector (all architectures experimented with restrict the number of filters in the second layer to eight). In addition, since PCA is used to learn the filters, the maximum number achievable is limited by the covariance matrix. The size of the filters also governs the mixture of low- and high-frequency components in the filter bank.

In DCTNet [2], DCT basis functions are chosen as an alternative to PCA, due to their data independence and low complexity. DCT functions are cosine functions which oscillate at different frequencies in both horizontal and vertical directions, and which can be summed together to express data. Similar to the characteristics used in DCT-based JPEG compression, DCT filters are selected in a zig-zag fashion, excluding the mean component. In addition to using a different basis, DCTNet also employs a further tied-rank (TR) normalisation step, which regulates the histogram and leads to improved performance over PCANet on certain data. However, DCTNet suffers from similar drawbacks to PCANet, namely, the maximum number of filters is limited by the filter size. In addition, the learning-free approach relies on the input distribution following a local high correlation assumption, which can be considered as a 2D first-order Markov process.

The self-organising map (SOM) [98] is an unsupervised algorithm, first introduced by Kohonen. It uses competitive learning in order to quantise an input space, whilst preserving the topological structure of the data on the map. A set of input vectors is reduced or mapped to a smaller set of prototype vectors, which are representative of the data distribution. In this respect, SOM is similar to PCA, except SOM is not restricted to finding orthogonal principal directions. In fact, SOM can be considered as a non-linear version of PCA [192]. Recently, SOMs have been used to learn features [5] for tasks including face recognition [4] and handwritten digit recognition [4, 30]. SOMs have also recently been used for tasks such as the recognition of human-object interactions [128], and word segmentation [17].

As an extension of the simple Markov process of DCT, Markov random fields (MRFs) have a long history in computer vision and have found applications in texture synthesis [33] and image classification [145].
Generated MRF features are also explored in this chapter.

4.3 Methodology

4.3.1 Convolutional Self-Organising Map

In the case that convolution is used as the similarity measure, the winner or best matching unit at each time step is found by maximising the activation between the input x(t) and all the neurons of the map:

bmu(t) = \arg\max_{i \in Ω} \left[ w_i ∗ x(t) \right]    (4.1)

and the weights of the winner and its neighbours are updated according to

w_i(t+1) = \frac{w_i(t) + LR(t)\,η(bmu, i, t)\,[x(t) - w_i(t)]}{\left\| w_i(t) + LR(t)\,η(bmu, i, t)\,[x(t) - w_i(t)] \right\|}    (4.2)

where ∗ denotes convolution, Ω is the set of neuron indices, LR (0 < LR(t) < 1) is the monotonically decreasing learning rate, η(bmu, i, t) = \exp\!\left[ -\frac{\| r_{bmu} - r_i \|^2}{2σ(t)^2} \right] is the Gaussian neighbourhood function, with r_i being the location of neuron i on the map and σ the effective range of the neighbourhood, which decreases with time t. The learning rate LR and the neighbourhood σ are annealed using the following:

LR(t) = LR(t-1)\,\frac{LR_b}{LR_b + t}    (4.3)

σ(t) = σ(t-1)\,\frac{σ_b}{σ_b + t}    (4.4)

where LR_b and σ_b are the decrease constants for the learning rate and neighbourhood, respectively.
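A minimal NumPy sketch of one training pass of the convolutional SOM (equations 4.1-4.4) follows; the map size, patch size and hyper-parameter values are illustrative assumptions, and the convolution at a single patch location is treated as a dot product between the flattened patch and each filter.

```python
import numpy as np

H, patch_dim = 8, 5 * 5              # assumed: a 1 x 8 map of flattened 5x5 filters
W = np.random.randn(H, patch_dim)
W /= np.linalg.norm(W, axis=1, keepdims=True)
positions = np.arange(H)             # neuron locations on the 1D map
lr, sigma = 0.1, 2.0                 # assumed initial learning rate and neighbourhood
lr_b, sigma_b = 1000.0, 1000.0       # decrease constants LR_b and sigma_b

for t in range(1, 1001):
    x = np.random.randn(patch_dim)   # placeholder for a randomly sampled image patch
    x -= x.mean()                    # mean-patch removal

    bmu = int(np.argmax(W @ x))                                   # eq. 4.1
    eta = np.exp(-((positions - bmu) ** 2) / (2.0 * sigma ** 2))  # Gaussian neighbourhood
    W_new = W + lr * eta[:, None] * (x - W)
    W = W_new / np.linalg.norm(W_new, axis=1, keepdims=True)      # eq. 4.2

    lr = lr * lr_b / (lr_b + t)              # eq. 4.3
    sigma = sigma * sigma_b / (sigma_b + t)  # eq. 4.4
```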

4.3.2 Discrete Cosine Transform (DCT)

The discrete cosine transform, first introduced by Ahmed et al. [2], decomposes a signal into a sum of weighted sinusoids of varying frequency. The DCT is a popular method for lossy compression and is used in the JPEG and MPEG standards. Specifically, an M × N input (or block of an image) can be represented as a sum of MN 2D basis functions of the form [66]:

α_p α_q \cos\!\left( \frac{π(2m+1)p}{2M} \right) \cos\!\left( \frac{π(2n+1)q}{2N} \right)    (4.5)

where

p = 0, 1, \ldots, M-1 \quad \text{and} \quad q = 0, 1, \ldots, N-1    (4.6)

α_p = \begin{cases} 1/\sqrt{M}, & \text{if } p = 0 \\ \sqrt{2/M}, & \text{otherwise} \end{cases}    (4.7)

α_q = \begin{cases} 1/\sqrt{N}, & \text{if } q = 0 \\ \sqrt{2/N}, & \text{otherwise} \end{cases}    (4.8)

The DCT coefficients B_{pq} can then be considered as the weights applied to each basis function. As shown by equation 4.5, the total number of basis functions is restricted by their size.
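For illustration, equation 4.5 can be evaluated directly to generate a bank of DCT basis functions; the block size chosen here is an assumption.

```python
import numpy as np

def dct_basis(M, N, p, q):
    """2D DCT basis function of equation 4.5 for frequencies (p, q) on an M x N block."""
    alpha_p = np.sqrt(1.0 / M) if p == 0 else np.sqrt(2.0 / M)
    alpha_q = np.sqrt(1.0 / N) if q == 0 else np.sqrt(2.0 / N)
    m = np.arange(M)[:, None]
    n = np.arange(N)[None, :]
    return alpha_p * alpha_q * np.cos(np.pi * (2 * m + 1) * p / (2 * M)) \
                             * np.cos(np.pi * (2 * n + 1) * q / (2 * N))

# e.g. the full bank of 7x7 basis functions (49 in total); the mean (p = q = 0)
# component would be excluded when forming a DCTNet-style filter bank.
bank = [dct_basis(7, 7, p, q) for p in range(7) for q in range(7)]
print(len(bank), bank[0].shape)   # 49 (7, 7)
```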

4.3.3 Markov Random Field

A Markov random field is a set of random variables which can be described by an undirected graph having Markovian properties. Given a random field F = {F1, F2, ··· , Fk} indexed by the set S, and an undirected graph G = (S, N), where N is the neighbourhood system, each random variable takes a value fi from the label set L. F forms an MRF with respect to G providing the properties below are satisfied:

1. Positivity: Any configuration has positive probability, and is therefore possible to realise.

P( f ) > 0, ∀ f ∈ F, (4.9)

where f = { f1, f2,··· , fk} is one possible configuration and F denotes the set of all possible configurations of the field F.

2. Markovianity: The conditional probability that a random variable takes a value depends upon the random variables in its neighbourhood only.

P( fi | fS−{i}) = P( fi | f j, j ∈ Ni) = P( fi | Ni), (4.10)

where S − {i} denotes all sites excluding site i, fS−{i} is the set of labels at the sites in S − {i}.

The Hammersley-Clifford theorem was introduced to overcome the difficulty of specifying an MRF from conditional dependencies. Proposed by Hammersley and Clifford and developed by Besag [13, 26, 67], it demonstrates the equivalence of MRFs and Gibbs distributions for the same graph: the joint probability of any MRF can be written as a Gibbs distribution, and for any Gibbs distribution there exists an MRF model. While MRFs only specify the conditional dependencies, Gibbs distributions provide a way of expressing the conditional dependencies of the random variables as functions of the cliques in a graph, thus providing a set of functions which represent an MRF. A clique is any fully connected subset or configuration of the graph. Let P(f) denote a Gibbs distribution on the set S. Then the joint probability P(f) takes the form

P(f) = \frac{1}{Z}\, e^{-\frac{1}{T} U(f)}    (4.11)

where

Z = \sum_{f \in F} e^{-\frac{1}{T} U(f)}    (4.12)

Z is a normalising constant, also called the partition function, T is a constant called the temperature and U(f) is the energy function. The energy

U(f) = \sum_{c \in C} V_c(f)    (4.13)

sums clique potentials V_c(f) over all possible cliques C. The value of V_c(f) depends upon the clique configuration and remains positive for all possible configurations.

Natural images are highly locally correlated: the intensity of a single pixel depends on the intensity of its immediate neighbours. Therefore MRFs are a natural choice for modelling image features. Given an image, MRFs treat pixel values as random variables and model the conditional probabilities between them. Auto-models and multi-level logistic models are typical models for establishing neighbourhood functions and conditional probability distributions. Given the contextual constraints, simulated annealing [54] is a commonly used method for approximating the global minimum.

4.3.4 Self-Organising Map Network (SOMNet)

The architecture of SOMNet closely resembles PCANet and DCTNet, except the encoding is adapted during the binarisation process. The structure is depicted in Fig. 4.1. The following sections detail the behaviour of each layer. Features were learned from randomly sampled s × s × d patches from the input, which were mean normalised by subtracting the patch mean from each pixel (Fig. 4.2). Using a patch-based convolutional SOM reduces the number of parameters, effectively training local receptive fields through shared weights. When sampling from activations, a single s × s patch is sampled from a single activation map uniformly. To avoid too much redundancy in the feature space, a 1D SOM map was used. Therefore, for a filter bank of size 8, a 1 × 8 SOM was used.

Figure 4.1: Block diagram of SOMNet. SOM was used to derive the convolutional layer filter banks, as depicted in Figure 4.2 [68].

Each SOM was trained until convergence, with the second SOM being trained for twice as long as the first. Given that the activation data space was larger than the input, it was thought best to scale the training time of the second SOM by the same ratio. However, preliminary results showed that this was unnecessary (data not shown). Whilst the first SOM was able to reach a global minimum, due to the bootstrap sampling process and the increased input space of the activations, the second layer SOM only converged to local minima. Yet, this does not appear to be detrimental to the performance.


Figure 4.2: Training the SOM-based filter banks [68].

4.3.4.1 Convolution Layer

Given an h × w × d input image I_d, where h and w are the height and width of the input respectively and d is the number of channels, the input is convolved with a filter bank of size s^2 × d × H_l, which remains the same for each layer. Zero padding was applied with pad size (s − 1)/2 so that the output was the same size as the input. Convolution of the input I_d with the filter bank W_l = [w_l^1, \ldots, w_l^{H_l}] ∈ R^{s^2 × d × H_l}, where H_l is the number of filters in layer l, provides O_l = [o_l^1, \ldots, o_l^{H_l}], where

o_l^i = I_d ∗ w_l^i, \quad i = 1, 2, \ldots, H_l    (4.14)

Each o_l^i was taken as input to subsequent layers.

4.3.4.2 Binarisation and Histogram Layer

The final layer has an input of size $h \times w \times H_1 \times H_2$. By applying the Heaviside step function, $He(\cdot)$, each set of $H_1$ real-valued outputs is binarised by thresholding the values at zero as $He(O)$. Each resultant binary string is grouped into four-bit nibbles (hexadecimal), $G_o$, and encoded using

$$\sum_{\kappa=1}^{4} 2^{\kappa-1} G_o^{\kappa}, \quad o = 1, 2, \ldots, \frac{H_2}{4} \tag{4.15}$$

which produced $\frac{H_2}{4} \times H_1$ encoded output images where each pixel had the range $[0, 2^4 - 1]$. Each of these images is split into $B_{size} \times B_{size}$ blocks with overlapping ratio $\tau$. The final output feature vector $\Xi \in \mathbb{R}^{2^4 \frac{H_2}{4} H_1 B}$ is formed by concatenating each block's histogram together, where $B$ is the number of local histograms. Splitting the images into local histograms encodes spatial information and also provides some degree of invariance to translations. In comparison, PCANet has $\hat{\Xi} \in \mathbb{R}^{2^{H_2} H_1 B}$, giving an output feature vector ratio

$$\frac{\hat{\Xi}}{\Xi} = \frac{2^{H_2}}{4 H_2} \tag{4.16}$$

which is exponentially increasing. Therefore $\Xi$ will always be smaller than $\hat{\Xi}$ when $H_2 > 4$. The same is also true for DCTNet and other PCANet derivatives which employ the same encoding strategies. Given that using only four filters is uncommon in state-of-the-art architectures, it is safe to assume that this binarisation process offers a more compact representation. The final output feature vector is then classified using a linear support vector machine (SVM) [48] classifier.
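To make the encoding concrete, the sketch below binarises the H2 second-layer maps belonging to one first-layer branch, packs them into nibble-valued images following Eq. (4.15), and builds overlapping block histograms of the kind concatenated into Ξ. It is a hedged NumPy sketch; the function names, the block-stepping strategy and the handling of image borders are assumptions rather than the thesis implementation.

```python
import numpy as np

def binarise_and_encode(second_layer_maps):
    """second_layer_maps: (H2, h, w) real-valued outputs for one first-layer map.
    Returns (H2 // 4, h, w) integer images with pixel values in [0, 2**4 - 1]."""
    binary = (second_layer_maps > 0).astype(np.int64)       # Heaviside step at zero
    weights = 2 ** np.arange(4)                              # 2^(kappa - 1), Eq. (4.15)
    encoded = [np.tensordot(weights, binary[4 * o: 4 * o + 4], axes=1)
               for o in range(binary.shape[0] // 4)]         # one image per nibble G_o
    return np.stack(encoded, axis=0)

def block_histograms(encoded_image, block_size=7, overlap=0.5, n_bins=16):
    """Histogram overlapping B_size x B_size blocks of one encoded image."""
    step = max(1, int(round(block_size * (1 - overlap))))
    h, w = encoded_image.shape
    hists = [np.bincount(encoded_image[y:y + block_size, x:x + block_size].ravel(),
                         minlength=n_bins)
             for y in range(0, h - block_size + 1, step)
             for x in range(0, w - block_size + 1, step)]
    return np.concatenate(hists)                             # one slice of Xi
```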

4.3.5 Markov Random Field Self-Organising Map Network (MRF-SOMNet)

When the generated MRF filters were reduced to a subset using SOM (named MRF-SOM) and used as filter banks, the resulting network is referred to as MRF-SOMNet. MRF-SOM is trained in a similar way to SOM to ensure convergence. Apart from the filters, which are replicated across both layers, all other parts of MRF-SOMNet remain the same as SOMNet. To produce the original set of MRF filters, a multi-level logistic model with a fifth order neighbourhood system and label set L = {0,1} was used. Specifically, the values of β were randomly varied to model different distributions using an anisotropic neighbourhood described as,

$$\beta = (\beta_{11}, \beta_{12}; \beta_{21}, \beta_{22}; \beta_{31}, \beta_{32}; \beta_{41}, \beta_{42}; \beta_{51}, \beta_{52})$$

$$= \begin{pmatrix}
\beta_{51} & \beta_{41} & \beta_{32} & \beta_{42} & \beta_{52} \\
\beta_{41} & \beta_{21} & \beta_{12} & \beta_{22} & \beta_{42} \\
\beta_{31} & \beta_{11} & 0 & \beta_{11} & \beta_{31} \\
\beta_{42} & \beta_{22} & \beta_{12} & \beta_{21} & \beta_{41} \\
\beta_{52} & \beta_{42} & \beta_{32} & \beta_{41} & \beta_{51}
\end{pmatrix} \tag{4.17}$$

A total of 20000 100 × 100 MRF images were simulated using 500 iterations of a Metropolis sampler with simulated annealing, and were then sampled to produce the filters. The temperature was initialised at T = 1 and was annealed by a factor of 0.999 each iteration. For colour filters, each image was sampled three times (once for each channel).
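For illustration, the sketch below simulates one binary MRF texture with a Metropolis sampler and the annealing schedule described above. It is only a rough, slow sketch under simplifying assumptions: a generic pairwise agreement energy over the 5 × 5 neighbourhood stands in for the exact multi-level logistic clique potentials, and the random choice of β weights is illustrative only.

```python
import numpy as np

def simulate_mrf(size=100, n_sweeps=500, t0=1.0, anneal=0.999, beta=None, rng=None):
    """Simulate one binary MRF image with Metropolis sampling and annealing.

    beta is a 5 x 5 weight matrix over the fifth-order neighbourhood (centre set
    to zero); a positive weight rewards agreement with that neighbour.
    """
    rng = np.random.default_rng() if rng is None else rng
    if beta is None:
        beta = rng.uniform(-1.0, 1.0, size=(5, 5))     # random anisotropic weights
    beta[2, 2] = 0.0
    labels = rng.integers(0, 2, size=(size, size))     # label set L = {0, 1}
    temp = t0
    for _ in range(n_sweeps):
        for y in range(2, size - 2):                   # visit every interior site once
            for x in range(2, size - 2):
                patch = labels[y - 2:y + 3, x - 2:x + 3]
                current, proposal = labels[y, x], 1 - labels[y, x]
                e_cur = -np.sum(beta * (patch == current))   # agreement energy
                e_new = -np.sum(beta * (patch == proposal))
                delta = e_new - e_cur
                if delta <= 0 or rng.random() < np.exp(-delta / temp):
                    labels[y, x] = proposal            # Metropolis acceptance rule
        temp *= anneal                                 # simulated annealing
    return labels
```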

4.3.6 Computational Complexity

The computational complexity of the proposed methods is now considered. SOMNet is considered first, for which the complexity includes mean patch removal and convolution for both convolution stages, as well as binary hashing and histogramming for the final stage. In addition to this, there is the complexity of the SOM itself, which is $s^2(H_1 + H_2)$. The overall complexity can therefore be calculated as $O\left(hws^2(H_1 + H_2)\right)$. Similarly, PCANet, DCTNet and MRF-SOMNet have the same network complexities; however, PCANet has the added operations involved in eigendecomposition, as well as the increased dimensionality in the second convolutional stage. Table 4.1 summarises the final complexity results. For DCTNet and MRF-SOMNet, the operations taken to generate the filters were ignored, since this occurs only once.

Table 4.1: Computational Complexity

Method | Complexity
SOMNet | $O\left(hws^2(H_1 + H_2)\right)$
PCANet | $O\left(hws^2(H_1 + H_1H_2) + hw(H_1H_2)^2\right)$
DCTNet | $O\left(hws^2(H_1 + H_2)\right)$
MRF-SOMNet | $O\left(hws^2(H_1 + H_2)\right)$

4.4 Experiments and Discussion

In this section the performance of SOMNet and MRF-SOMNet is compared against PCANet, DCTNet and other state-of-the-art methods on benchmark data. Due to the high dimensionality of the output feature vector, all experiments used linear SVM for classification.

4.4.1 Comparison of Different Features and Encodings

To investigate the use of different feature types and encoding strategies, experiments were conducted to compare three types of unsupervised features, namely SOM, PCA and k-means, and two different styles of encoding: SOMNet and PCANet. These experiments were conducted using the first 10000 examples of the CIFAR-10 dataset. The CIFAR-10 [100] object recognition dataset consists of 60000 32 × 32 colour images of 10 classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship and truck. The class distribution is uniform and the training and testing sets contain 50000 and 10000 examples, respectively. In the experiments conducted here, a double convolution layer architecture was used, where H1, H2, s1, s2, Bsize, and τ were kept constant. The respective filters were trained once and then evaluated using different encodings for a fair comparison. For the k-means and PCA filters, the procedures in [42] and [20] were followed, respectively. The first layer SOM was trained for 600000 iterations, using an initial LR and σ of 0.1 and 20 respectively, which were annealed using LRb = 50000 and σb = 5263. The second layer SOM was trained for 2400000 iterations using LR = 0.1, σ = 4, LRb = 200000 and σb = 133333. In addition, both k-means and SOM were trained using whitened data. Although the results were not statistically tested, Table 4.2 indicates that using different features has little impact on the overall accuracy, whereas using the PCANet encoding does show a small increase in accuracy. However, the SOMNet encoding produces a code that is 8 times smaller than that of PCANet.

Table 4.2: Comparing features and encodings

Feature type | Encoding | Accuracy (%)
SOM | SOMNet | 68.56
PCA | SOMNet | 68.68
k-means | SOMNet | 68.23
SOM | PCANet | 71.31
PCA | PCANet | 72.44
k-means | PCANet | 71.25
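The SOM training schedules above only give the initial LR and σ together with the annealing constants LRb and σb. The sketch below shows a plain 1D SOM training loop of the kind described, assuming exponential decay of both the learning rate and neighbourhood width; the decay form, the weight initialisation and the function name are assumptions, since the exact schedule is defined elsewhere in the thesis.

```python
import numpy as np

def train_som_1d(patches, n_nodes=8, n_iter=600000,
                 lr0=0.1, sigma0=20.0, lr_b=50000.0, sigma_b=5263.0, rng=None):
    """Train a 1 x n_nodes SOM on mean-normalised (and whitened) patch vectors.

    patches: (N, D) array of flattened s*s*d patches.
    Returns the (n_nodes, D) weight matrix whose rows form the filter bank.
    """
    rng = np.random.default_rng() if rng is None else rng
    weights = 0.01 * rng.standard_normal((n_nodes, patches.shape[1]))
    positions = np.arange(n_nodes)                   # node coordinates on the 1D map
    for t in range(n_iter):
        x = patches[rng.integers(len(patches))]      # bootstrap-sample one patch
        bmu = np.argmin(np.linalg.norm(weights - x, axis=1))
        lr = lr0 * np.exp(-t / lr_b)                 # assumed exponential annealing
        sigma = sigma0 * np.exp(-t / sigma_b)
        h = np.exp(-((positions - bmu) ** 2) / (2.0 * sigma ** 2))
        weights += lr * h[:, None] * (x - weights)   # pull nodes towards the patch
    return weights
```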

4.4.2 Evaluation on the MNIST Dataset

Formed from the larger NIST dataset, the MNIST image dataset [112] is a collection of 70000 grayscale 28 × 28 images of handwritten digits 0-9. For the purposes of these experiments, 60000 were used for training and 10000 for testing. With regard to the parameters of SOMNet, a double convolution layer architecture with H1 = H2 = 8, s1 = s2 = 7, Bsize = 7, and τ = 0.5 was used, which is in line with PCANet. The first layer SOM was trained for 300000 iterations, using an initial LR and σ of 0.1 and 4 respectively, which were annealed using LRb = 50000 and σb = 33333. The second layer SOM was trained for 600000 iterations using the same settings apart from LRb = 100000 and σb = 66666. The MRF-SOM was trained for 5000000 iterations with an initial LR and σ of 0.1 and 4 respectively, which were both annealed using LRb = σb = 33333. To demonstrate the benefits of the new binarisation process, results for a SOMNet with H1 = 16 and H2 = 32 (all other parameters remain the same as above) are also given, referred to as SOMNet16−32. For SOMNet16−32, the first layer SOM was trained for 300000 iterations, using an initial LR and σ of 0.1 and 8 respectively, which were annealed using LRb = 50000 and σb = 14286, whereas the second layer SOM was trained for 600000 iterations using LR = 0.1, σ = 16, LRb = 100000 and σb = 13333. For PCANet, both the published results and the results of the experiments conducted using the provided code are shown. DCTNet does not provide published results on MNIST, so again the provided code was used. In addition, DCTNet does not use overlapping histograms, so results are provided for the default version as well as a version which does, to match SOMNet and PCANet. These are referred to as DCTNet and DCTNetoverlap, respectively. For both of these methods, an SVM with default parameters was used for a fair comparison with SOMNet (DCTNet originally uses the nearest neighbour for classification). Results for SOMNet, after fine-tuning of the SVM using five fold cross-validation, are also shown. Given that SOMNet's second layer filters converge to local minima, the results are the average of three runs.

For all other methods (PCANet-2 (mine), DCTNet, DCTNetoverlap, SOMNetrep and MRF-SOMNet) only a single result is necessary. In addition, to explore whether it is essential to learn filters in the second layer, a result for when the first layer filters of the SOMNet are replicated over both filter banks (referred to as SOMNetrep) is given.

Whilst SOMNet16−32 uses more filters than the PCANet and DCTNet methods, the output feature vector is equal in size to those of PCANet and DCTNetoverlap. Table 4.3 shows the results along with other state-of-the-art methods. The results for DCTNet are surprising given their simplicity, and are the best in the second part of the table (excluding the published PCANet result, which could not be replicated here). DCTNet has already produced credible results on face recognition [134], where the equivalence between PCA and DCT is clear. Therefore, the results shown here suggest that DCT filters are actually quite flexible in their application. It is noted that the TR normalisation does not seem to have much effect here, and that the use of overlapping histograms improves the accuracy. Whilst SOMNet provides slightly inferior results compared to PCANet (mine) and DCTNet (0.86% vs 0.77% and 0.74% respectively), it provides similar results to other unsupervised methods in the first part of the table, such as CDBN (more complex) and CSOM (larger dictionary). Furthermore, SOMNet achieves these results using an output feature vector that is eight times smaller than those of PCANet and DCTNetoverlap and four times smaller than that of DCTNet. The benefits of this more compact representation are further highlighted by the results of SOMNet16−32 (0.65%), which utilises additional features for a more over-complete representation, providing, alongside the published PCANet and DCTNetoverlap results, the best performance in the second and third parts of the table. However, this result is still inferior to other state-of-the-art approaches such as ConvNet (trained supervisedly), ScatNet (which used an RBF SVM) and MRF-CNN (which used a deeper architecture with many more parameters). The results for MRF-SOMNet are the worst in the table.

Comparing SOMNet to SOMNetrep clearly shows that learning filters based on the activations of the previous layer improves performance over using replicated ones. Figs. 4.3 - 4.6 show the filter banks for SOMNet, PCANet, DCTNet and MRF-SOMNet, respectively. Since the DCT filters are generated, they appear quite generic, as expected. However, they share some similarities with the PCANet filters. The most obvious similarities are the more complex filters (second two from the right). Given their stated equivalence in some applications and similar accuracy, this is not surprising. The PCANet and SOMNet filters also share some similarities; however, as both are data-driven, the filters appear more specific to the dataset. In addition, the complex features present in DCTNet and PCANet are absent in SOMNet, which could explain why SOMNet requires a larger number of features to achieve similar accuracy. The MRF-SOMNet filters contain a lot more high frequency components, which are absent in the input space (handwritten digits), and this explains the poorer result.

Table 4.3: Error rate on MNIST

Method | Error rate (%)
CDBN [113] | 0.82
CSOM (linear SVM) [4] | 0.82
ConvNet [88] | 0.53
ScatNet-2 (RBF SVM) [18] | 0.43
MRF-CNN [145] | 0.38
PCANet-2 [20] | 0.66
PCANet-2 (mine) | 0.77
DCTNet | 0.74
DCTNet (TR Norm) | 0.76
DCTNetoverlap | 0.68
DCTNetoverlap (TR Norm) | 0.68
SOMNet | 0.86±0.05
SOMNet (fine-tune) | 0.83±0.07
SOMNet16−32 | 0.65±0.02
SOMNetrep | 1.01
MRF-SOMNet | 1.23

Figure 4.3: Learned SOMNet filters on MNIST. Top row: the first layer filter bank. Bottom row: the second layer filter bank.

4.4.3 Optimising Parameters on the CIFAR-10 Dataset

To determine the optimal block size and overlapping ratio, experiments were conducted using SOM features and SOMNet encoding on the first 10000 examples of CIFAR-10. In the experiments conducted here, a double convolution layer architecture was used, where H1, H2, s1 and s2 were kept constant. The SOMNet filters were trained once and then evaluated using different block sizes Bsize and overlapping ratios τ for a fair comparison. Whilst not statistically tested, Table 4.4 indicates that a block size of 8×8 and an overlapping ratio of 0.5 are optimal. This is aligned with the parameters used by Chan et al. [20].

In order to further optimise the CIFAR-10 parameters, experiments were also carried out using different numbers of filters in both the first and second layers of a double convolutional architecture. As with the previous experiments, the first 10000 examples of CIFAR-10 were used, but here the block size and overlapping ratio were kept constant at 8×8 and 0.5, respectively. Each result is for a single run only. Whilst not statistically tested, the preliminary results shown in Table 4.5 indicate that an increased number of total features is optimal, as shown by the corresponding output feature vector size Ξ for each combination of H1 and H2. This is aligned with the results for MNIST (0.83% vs 0.65% for SOMNet and SOMNet16−32 respectively). Furthermore, of the feature configurations tested, an architecture of H1 = 40 and H2 = 32 resulted in the highest accuracy.

Figure 4.4: Learned PCANet filters on MNIST. Top row: the first layer filter bank. Bottom row: the second layer filter bank.

Figure 4.5: Generated DCTNet filters for MNIST. Replicated across both filter banks.

Figure 4.6: Clustered MRF filters for MNIST. Replicated across both filter banks.

Table 4.4: Variations in block size and overlapping ratio of SOMNet on CIFAR-10

Block size (Bsize) | Overlapping ratio (τ) | Accuracy (%)
2 × 2 | 0 | 61.54
4 × 4 | 0 | 65.51
8 × 8 | 0 | 66.19
16 × 16 | 0 | 65.01
8 × 8 | 0.25 | 67.51
8 × 8 | 0.5 | 68.60
8 × 8 | 0.75 | 68.53

Table 4.5: Variations in feature numbers of SOMNet on CIFAR-10

Number of filters (H1) | Number of filters (H2) | Output size (Ξ) | Accuracy (%)
40 | 16 | 40960 | 71.59
80 | 8 | 40960 | 71.24
40 | 32 | 81920 | 72.83
80 | 16 | 81920 | 72.43
160 | 8 | 81920 | 72.40

4.4.4 Evaluation on the CIFAR-10 Dataset

As with the previous MNIST experiment, a double convolutional architecture was used. Like PCANet, the SOMNet parameters were set to s1 = s2 = 5, Bsize = 8, and τ = 0.5. However, unlike PCANet, the reduced output feature vector size was taken advantage of to increase the filter numbers to H1 = 40 and H2 = 32. Since SOM is incapable of dealing with correlations, before training the data was first whitened so that it was uncorrelated, exaggerating the high frequency content. The SOMNet layers were then trained using the whitened data. The first layer SOM was trained for 600000 iterations, using an initial LR and σ of 0.1 and 20 respectively, which were annealed using LRb = 50000 and σb = 5263. The second layer SOM was trained for 1200000 iterations using LR = 0.1, σ = 16, LRb = 100000 and σb = 13333. Max-pooling was applied over 4 × 4 sub-regions so that the maximum response in each histogram bin was pooled, producing 16 pooled histograms. PCANet attaches spatial pyramid pooling (SPP) [63] to the output to leverage responses in 4 × 4, 2 × 2, and 1 × 1 sub-regions; however, this was found to be unnecessary here. Once again, due to the different local minima found during training, the results for SOMNet were averaged over three runs, and the SVM was fine-tuned using five fold cross-validation.

The published PCANet result, the fine-tuned SOMNet result, and the results for other state-of-the-art methods are presented in Table 4.6. SOMNet achieves an accuracy of 78.51%, which is 1.34% higher than PCANet. In addition, PCANet performs an extra dimensionality reduction step in producing this result, the effects of which are not examined in this work. In comparison to other state-of-the-art approaches, SOMNet is comparable to other unsupervised approaches such as k-means (which uses 4000 features) but is slightly worse than CUNet (which uses a similar pipeline to PCANet but employs k-means based filters and additional pooling). The other approaches all achieve significantly superior results; however, they all employ supervised training of features.

Table 4.6: Accuracy on CIFAR-10

Method | Accuracy (%)
k-means (Triangle, 4000 features) [27] | 79.60
CUNet + Weighted Pooling | 80.31
Stochastic Pooling ConvNet [195] | 84.87
NIN + Dropout [116] | 89.59
MRF-CNN [145] | 91.25
PCANet-2 [20] | 77.14
SOMNet (fine-tune) | 78.51±0.06

Figure 4.7: Learned SOMNet filters on CIFAR-10. Top five rows: the first layer filter bank. Bottom four rows: the second layer filter bank.

Fig. 4.7 and Fig. 4.8 show the filter banks for SOMNet and PCANet, respectively. Although the PCANet filters are data-driven, they appear quite generic, which could be attributed to the high variance assumption of PCA. On the other hand, the SOMNet filters appear more similar to the whitened input data, which could explain their superior performance on this more complex task, in comparison to the simpler features learned on MNIST.

Figure 4.8: Learned PCANet filters on CIFAR-10. Top five rows: the first layer filter bank. Bottom row: the second layer filter bank.

4.5 Conclusions and Future Work

In this chapter, a simple SOM based unsupervised deep learning framework was proposed. Like its predecessors, PCANet and DCTNet, it offers an uncomplicated alternative to more traditional supervised convolutional deep learning structures. Yet, it differs from the aforementioned approaches by learning features which are not constrained by limitations such as filter size or orthogonality. In addition, SOMNet introduced an alternative binarisation technique, which allowed for the introduction of more filters to produce a more over-complete representation, providing comparable or improved performance. Furthermore, the use of replicated features across both filter banks was explored in SOMNetrep, which demonstrated the importance of learning second layer filters. In summary, SOMNet demonstrated its ability to extract useful information and give competitive results against other similar approaches using few unsupervisedly trained features on both the MNIST and CIFAR-10 datasets.

Whilst in this chapter the parameters have been kept mostly in line with those used for PCANet and DCTNet, in the future, experiments with different filter sizes, as well as incorporating further layers to produce more complex architectures, should be explored. Given more filters of different sizes, and the increased network depth which this method allows for, it is conceivable that SOMNet could improve even further on its current performance. Furthermore, given the advantages of learning second layer filters and the strong performance of the generated DCT filters, it could be worthwhile investigating the effects of learning second layer filters from first layer activations for DCT, as well as MRF, both of which were used in a replicated fashion in this work. Although MRF-SOMNet performed poorly on MNIST, it was not examined on CIFAR-10, where it could demonstrate better performance due to the more complex nature of the input data. In addition, the binarisation technique could be explored in more detail, along with alternative encoding strategies. More generally, an evaluation on further datasets could be undertaken. For further details please see Section 7.2.1.

Lastly, the use of single channel filters, as explored here via the sampling of previous layer activations, could be explored in a supervised context. Currently, in supervised learning, the depth of the filters is typically equal to the number of features in the preceding layer. Thus, it could be interesting to pursue this sampling technique as a way of reducing the number of parameters required for a supervised CNN, and thus assist with overfitting.

Chapter 5

SOMNet with Aggregated Channel Connections

5.1 Introduction

The classification of images is a very challenging task that has received considerable attention over the last few years, due to its many potential applications. Good feature representations must discriminate between independent classes whilst being robust to many intra-class variations. Approaches to this problem generally fall into two categories. Traditionally, hand-crafted methods were prevalent, with models such as SIFT [121] and HOG [38] performing well. However, these methods can present problems when they are applied to other tasks, and therefore each new problem may involve the design of new methodologies. Recently, using deep models to learn representations from both labelled and unlabelled data has made much progress, with models such as the supervised convolutional neural network (CNN) [111] and its many variants achieving the current state-of-the-art on a number of tasks [25, 101]. Key to the success of deep models is their inherent ability to learn higher level data-specific features, in contrast to the low-level features of hand-crafted techniques. However, whilst successful, deep learning models can be difficult to train, as they are burdened by many parameters which can require expert fine-tuning. These parameters are in addition to the architecture parameters, such as receptive field size, feature number and depth. Furthermore, supervised approaches require vast labelled datasets, which present problems with annotations.

Recently, PCANet [20] demonstrated a simple approach to deep learning that used a limited filter bank and could perform as well as some deep learning models, whilst requiring less configuration and parametrisation. SOMNet [68] and CUNet [42] used the self-organising map (SOM) and k-means algorithms, respectively, to train filter bank 'dictionaries' and adopt similarly efficient pipelines to that of PCANet. In addition, SOMNet introduced an alternative encoding strategy which enabled the use of additional filters for a more over-complete solution, and improved performance. These methods demonstrated that unsupervised approaches could compete with more complex supervised methods. DCTNet [134] showed that generated features could also perform well. However, these methods learn similar features replicated over multiple levels, which visually do not always concur with higher level representations of the input. This is because the second layer of these methods does not learn from combinations of first layer features. In comparison, common deep learning approaches build feature hierarchies over many layers, with features from previous layers combining to form more complex features in subsequent ones. Early CNNs used a parsimonious approach when establishing the connections between layers [112]; however, in recent years it has become standard to use full connections [101, 162]. Yet, this requires that the features in a given layer have a depth equal to the number of features in the preceding layers, potentially performing unnecessary operations. In addition, it has been demonstrated that unsupervisedly trained features perform poorly in later layers when using full connections [29].

Recently, in supervised learning, techniques such as DropOut [167] have demonstrated that randomly dropping neurons from a network can actually be beneficial, forcing the network to learn from different combinations of neurons, which results in features that are more robust to noise. Other recent advances have been proposed which combine features more explicitly. Network in Network, or 1 × 1 convolutions [116], were proposed between layers, for which only a linear weighting is learned; this forces the network to learn combinations of features instead of offsets, and can also be used to reduce the dimensionality of the features, reducing redundancy. Maxout [61] also functions in a similar way, combining multiple feature maps through channel pooling. Given the problems with unsupervised approaches learning full connections, and the aforementioned problems with higher level feature learning in PCANet and SOMNet style networks, it is justifiable to investigate the application of reduced connection schemes for unsupervised multi-layer architectures.

Inspired by this, a novel approach is proposed in this chapter, which selects local receptive fields in order to learn data-dependent higher order features from more primitive features in lower layers. To this end, a two layer SOMNet architecture is utilised and a new layer is introduced, which is placed between the two layers of the SOMNet. This layer combines the activations of the previous layer, which encourages SOMNet to learn hierarchical features without increasing their dimensionality. This approach demonstrates competitive results on the handwritten digit recognition task and only uses the simple competitive SOM algorithm throughout. In contrast to the often complex deep learning models, this approach to feature learning avoids tuning millions of parameters for an elegant and time efficient solution, capable of handling vast amounts of data without the need for annotations.

5.2 Related Work

Vector quantisation (VQ) has been used extensively for computer vision tasks [34, 107]. Algorithms such as SOM and k-means are used to learn a visual dictionary of low level features (feature learning), which are then used to map or encode the input into higher level image representations (feature extraction). k-means has proved to be an efficient and competitive algorithm in recent literature when combined with appropriate data pre-processing and encoding [28]. In [27], a single layer network demonstrated competitive performance on a number of datasets, belying its simplicity. The self-organising map [98, 192] can be considered a topology-preserving alternative to k-means. It uses competitive learning to quantise an input distribution while maintaining a topographic structure. This neighbourhood preservation makes SOM more immune to initialisation and outliers than k-means. Feature learning with SOM has been explored for face [4] and handwritten digit [4, 30] recognition tasks. Chan et al. [20] proposed a simple framework named PCANet that used a novel approach to deep learning characterised by minimal filters and less parametrisation. Filter banks for each layer were trained by applying PCA to the input. Once learned, the filter banks were used in a traditional feedforward convolution architecture, as is commonplace in deep learning structures. The authors used a binarisation technique called Binarised Statistical Image Features (BSIF), as introduced by Kannala and Rahtu [93]. Once binarised, the activations were split into sub-regions and local histograms were formed, which were concatenated to form the final feature vector. SOMNet [68] proposed the use of the self-organising map algorithm to learn filter banks, which alleviated the constraints imposed by the use of PCA, namely orthogonality and the number of filters, which are limited by the filter size due to the covariance matrix. Unburdened by these limitations, SOMNet also adapted the binarisation process to enable the use of further filters. CUNet [42] used k-means to learn the filter banks and introduced a new weighted pooling method to further improve classification accuracy. DCTNet [134] used generated discrete cosine transform basis functions as filters.

Feature learning has always garnered a lot of attention. Yet, in recent years work has also focused on the importance of connections between layers. 1 × 1 convolutions provide parametric pooling of feature maps via a weighted summation, enabling the learning of cross-channel interactions [116]. The maxout activation function [61] groups and pools across feature maps, enabling the learning of more complex activation functions. A maxout unit can learn a piecewise linear approximation to a convex function for which the number of pieces is equal to the number of feature maps within each group [61]. There have also been many recent attempts to group features in unsupervised learning. In [35] and [29] it was shown that grouping features randomly or by similarity can aid performance. Dundar et al. [45] suggested learning 1 × 1 convolutions across channels between unsupervised layers using gradient descent. However, they proposed grouping the features randomly first, allowing the network to learn the best combinations. In addition, Coates and Ng [28] showed that the choice of encoding scheme for an unsupervised single layer network was actually more important than the choice of dictionary learning method.

5.3 Methodology

5.3.1 Proposed Method

The proposed method consists of a two-layer SOMNet (Chapter 4) with additional layers as described below. SOMNet (Fig. 4.1) is considered to have full utilisation between the first and second layers, since during second layer feature learning, activations from the first are sampled uniformly. Any of the new layers described below can be placed between the two layers of SOMNet, as shown in Fig. 5.1. These additional layers are used to encourage the learning of more complex features in the second SOMNet layer by promoting only the most competitive channels for a given spatial location. All layers reduce the dimensionality of the activations from the first layer of SOMNet through the use of pooling operators (Section 3.3.1.2), whilst hopefully increasing their complexity. Training involves three steps: feature learning, feature extraction and classification using a linear SVM [48]. Different pooling operators, such as mean and maximum, are investigated. Prior to classification, the output feature vectors are normalised to unit length.

5.3.1.1 Fully Aggregated Connections (FAC) Layer

This is a naive approach which simply pools the activations along the activation dimension to produce a single activation, from which a patch is sampled to learn the second layer filters. Specifically, for a set of activations from the first layer $A = [O^1, \ldots, O^{H_1}]$, where $O^i \in \mathbb{R}^{h \times w}$, pooling is performed at each spatial location $(i, j)$ with $i \in \{1, 2, \ldots, h\}$ and $j \in \{1, 2, \ldots, w\}$. Therefore, only the highest activated channel at each location $(i, j)$ is retained for second layer feature learning.
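As a sketch of this aggregation, assuming the activations for one input are stacked as an (H1, h, w) array, the FAC operation reduces to a pooling along the channel axis; both the max and mean variants investigated in this chapter are shown. The function name and signature are illustrative only and are not from the thesis.

```python
import numpy as np

def fac_layer(activations, pooling="max"):
    """Aggregate all first-layer activation maps into a single map.

    activations: (H1, h, w) array A = [O^1, ..., O^H1].
    Returns an (h, w) map; with max pooling only the highest-responding
    channel at each spatial location (i, j) survives.
    """
    if pooling == "max":
        return activations.max(axis=0)
    return activations.mean(axis=0)
```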

5.3.1.2 Sparse Aggregated Connections (SAC) Layer

This layer selects a subset of activations to be pooled by considering the correlations between each filter and a particular input. Specifically, for a given input, multiple s × s × d patches are sampled from a 2D Gaussian distribution and the winning filter is selected as the one with the highest activation. Majority voting is performed over the patches to select the top x filters, where the notation SACx is used. These filters are used to obtain activations over the whole input, which are then pooled as in Section 5.3.1.1.
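The sketch below illustrates the SAC idea on precomputed activation maps: patch locations are drawn from a 2D Gaussian, each sample votes for the filter with the highest response at that location, and only the majority-voted top-x channels are pooled. The Gaussian spread, the use of activation values at the sampled locations, and the function name are assumptions made to keep the sketch short.

```python
import numpy as np

def sac_layer(activations, top_x=4, n_samples=64, rng=None):
    """Sparse Aggregated Connections: pool only the top-x most competitive channels.

    activations: (H1, h, w) first-layer activation maps for one input.
    Patch locations are sampled from a 2D Gaussian centred on the image; the
    spread (one sixth of each dimension) is an assumption for illustration.
    """
    rng = np.random.default_rng() if rng is None else rng
    H1, h, w = activations.shape
    votes = np.zeros(H1, dtype=int)
    for _ in range(n_samples):
        cy = int(np.clip(rng.normal(h / 2, h / 6), 0, h - 1))
        cx = int(np.clip(rng.normal(w / 2, w / 6), 0, w - 1))
        # The winning filter for this location is the one with the highest activation.
        votes[np.argmax(activations[:, cy, cx])] += 1
    selected = np.argsort(votes)[-top_x:]            # majority-voted top-x channels
    return activations[selected].max(axis=0)         # pooled as in Section 5.3.1.1
```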


Figure 5.1: Application of proposed aggregation layers to a two layer SOMNet. The SOM-based filter banks correspond to the convolutional layers in Fig. 4.1. Each SOM layer is trained and frozen prior to training any subsequent SOM layer [69].

5.3.1.3 Grouped Aggregated Connections (GAC) Layer

This layer selects subsets of activations to be pooled by grouping them according to their respective filter position on the map. Since SOM clusters the filters whilst maintaining topology, neighbourhood structure is implicit in the groupings. Specifically, for a given input, the set of activations are grouped into non-overlapping subsets of twos and fours (named GAC2 and GAC4 respectively) and then pooled. As with the original SOMNet, a patch is sampled from the resultant pooled activations uniformly.
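Because the 1D SOM keeps topologically similar filters adjacent, the grouping can be expressed as a simple reshape followed by pooling within each group. The sketch below is illustrative only; the function name is an assumption.

```python
import numpy as np

def gac_layer(activations, group_size=2, pooling="max"):
    """Grouped Aggregated Connections: pool non-overlapping groups of adjacent maps.

    activations: (H1, h, w) maps ordered by their filter's position on the 1D SOM,
    so neighbouring channels correspond to topologically neighbouring filters.
    Returns (H1 // group_size, h, w) pooled maps (GAC2 or GAC4).
    """
    H1, h, w = activations.shape
    grouped = activations[: (H1 // group_size) * group_size]
    grouped = grouped.reshape(H1 // group_size, group_size, h, w)
    return grouped.max(axis=1) if pooling == "max" else grouped.mean(axis=1)
```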

5.4 Experiment and Discussion

5.4.1 Whitening

For all of the following experiments, whitening was applied to the training set prior to training the SOMNet filters. This de-correlates the input such that the input features have a covariance matrix equal to the identity, and therefore equal variance, enabling the SOM to learn more discriminative features. Whitening was carried out using equation 3.27. Whitening was not used during feature extraction.
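As a reference point, the sketch below is a common ZCA-style whitening of flattened training patches. It may not match equation 3.27 exactly; the regularisation constant and whitening over the whole set of sampled patches (rather than per image) are assumptions.

```python
import numpy as np

def zca_whiten(patches, eps=1e-5):
    """ZCA-whiten a matrix of flattened training patches (N x D).

    After whitening, the features are de-correlated with approximately unit
    variance. Returns the whitened patches plus the mean and transform so the
    same mapping can be reused when whitening new data.
    """
    mean = patches.mean(axis=0)
    centred = patches - mean
    cov = centred.T @ centred / len(centred)
    eigvals, eigvecs = np.linalg.eigh(cov)
    zca = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T
    return centred @ zca, mean, zca
```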

5.4.2 Digit Classification on the MNIST Dataset

Firstly, experiments were conducted using a SOMNet with 8 and 32 filters of size 7 × 7 in the first and second layer, respectively. The remaining parameters were set as Bsize = 7 and τ = 0.5. In later experiments, 16 and 32 filters are used, as in Chapter 4. For SOMNet8−32, the first layer SOM was trained for 300000 iterations, using an initial LR and σ of 0.1 and 4, respectively, which were annealed using LRb = 50000 and σb = 33333. The second layer SOM was trained for 600000 iterations using LR = 0.1, σ = 16, LRb = 100000 and σb = 13333. For SOMNet16−32, the first layer SOM was trained for 300000 iterations, using an initial LR and σ of 0.1 and 8, respectively, which were annealed using LRb = 50000 and σb = 14286. Since the second layer is the same size as SOMNet8−32, the same SOM settings apply. In addition, spatial max pooling was performed over 4 × 4 sub-regions of the output. It is noted that this step improves performance and reduces the size of the final output feature vector. Max pooling was selected as it retains the most competitive features and has been shown to work effectively with linear SVM [190]. The results reported are the average of three runs, and SVM parameters were fine-tuned using five-fold cross-validation for all results.

5.4.2.1 FAC Layer

Firstly, the FAC method was tested. In order to examine whether the proposed method should be employed during both feature learning and extraction, or just feature learning, both schemes were tested. Both average and max pooling strategies were also explored. Table 5.1 shows the results. FE indicates that the proposed layer was used during feature extraction in addition to the feature learning stage.

The results for SOMNet8−32 show that the error rate significantly increased when the FAC layer was used during both feature learning and feature extraction. Focussing on the results for which the FAC layer was only used during feature learning, both results show significant decreases in the error rate. With regards to the pooling strategy, more improvement was observed when max pooling was used, compared with mean pooling (significant at 1% versus 10%, respectively). As previously mentioned, the improvements seen with the addition of the FAC layer could be attributed to a better selection of first layer filters via the channel pooling.

When the same experiments are performed using SOMNet16−32, there is no statistical difference when compared with the baseline. This may be due to the increased number of filters that need to be combined, resulting in the aggregation of too much information into a single channel. Figures 5.2 and 5.3 show the second layer filters for both the SOMNet8−32 and SOMNet16−32 architectures with and without FAC. Both sets of filters appear more complex, with the proposed method producing notably more high frequency filters of different orientations. However, the filters for SOMNet16−32 appear noisier than those for the smaller architecture, perhaps confirming that too much information is being aggregated from the first layer.

When the proposed layer is used during feature extraction, as well as feature learning, the size of the final output feature vector is reduced in proportion to the number of feature maps that are pooled. In this case, all eight feature maps in the first layer are pooled and therefore the final output feature vector is eight times smaller. In order to conduct a fairer comparison, an additional experiment with a SOMNet8−256 architecture and max pooling was conducted (all other settings remained the same); however, an error rate of 0.72±0.09 was achieved, indicating no improvement over the baseline. Since the results indicate that it is best to employ FAC during feature learning only and to utilise max pooling, all the remaining experiments in the chapter use this methodology.

Table 5.1: FAC layer: error rate on MNIST (p-values represent the comparison between each method with the method labelled not applicable (N/A) for each section).

Method | Pooling | FE | Error rate (%) | p-value
SOMNet8−32 | - | - | 0.69±0.02 | N/A
SOMNet8−32 + FAC | Max | No | 0.57±0.01 | 0.0007
SOMNet8−32 + FAC | Max | Yes | 1.33±0.04 | 0.0001
SOMNet8−32 + FAC | Mean | No | 0.64±0.03 | 0.0742
SOMNet8−32 + FAC | Mean | Yes | 1.36±0.22 | 0.0063
SOMNet16−32 | - | - | 0.56±0.07 | N/A
SOMNet16−32 + FAC | Max | No | 0.55±0.03 | 0.8312

Figure 5.2: Learned second layer SOMNet + FAC filters on MNIST. Top four rows: SOMNet8−32 architecture. Bottom four rows: SOMNet8−32 + FAC architecture.

Figure 5.3: Learned second layer SOMNet + FAC filters on MNIST. Top four rows: SOMNet16−32 architecture. Bottom four rows: SOMNet16−32 + FAC architecture.

5.4.2.2 SAC Layer

Table 5.2 shows the results for the SAC layer when employed for feature learning only and using max pooling. The results for SOMNet8−32 suggest that using an increasing number of filters correlates with an improvement in error rate. Specifically, using SAC1 showed no improvement over the baseline; however, using SAC2, SAC4 or SAC6 showed increasing improvement, with the latter two being statistically significant at the 1% level versus the baseline. The worst performances were observed when SAC1 was used, and this may be because, when too few filters are selected, too much information is lost. The improvements observed with more filters are perhaps again attributable to a better selection of filters from the first layer.

Small improvements were observed with the larger SOMNet16−32 architecture; however, the decrease in error rate was not statistically significant. Interestingly, the best performance for both SOMNet sizes was observed when a similar number of combined filters was used.

Observing the filters for both SOMNet8−32 and SOMNet16−32 (Figures 5.4 and 5.5, respectively) again clearly shows that more complex filters are being learned with the proposed method. In addition, for SOMNet16−32, the filters appear less noisy when compared to the FAC case (Figure 5.3), which could be explained by half the number of filters from the first layer being combined. Therefore, perhaps the minor improvements recorded for SOMNet16−32 + SAC are due to the fact that the additional filters in the first layer compensate for the reduced complexity witnessed in the baseline.

Table 5.2: SAC layer: error rate on MNIST (p-values represent the comparison between each method with the method labelled not applicable (N/A) for each section).

Method | Error rate (%) | p-value
SOMNet8−32 | 0.69±0.02 | N/A
SOMNet8−32 + SAC1 | 0.69±0.02 | 1.0000
SOMNet8−32 + SAC2 | 0.62±0.06 | 0.1277
SOMNet8−32 + SAC4 | 0.56±0.02 | 0.0013
SOMNet8−32 + SAC6 | 0.54±0.04 | 0.0044
SOMNet16−32 | 0.56±0.07 | N/A
SOMNet16−32 + SAC2 | 0.55±0.04 | 0.8404
SOMNet16−32 + SAC4 | 0.53±0.02 | 0.5148
SOMNet16−32 + SAC8 | 0.53±0.01 | 0.5032
SOMNet16−32 + SAC12 | 0.54±0.03 | 0.6728

5.4.2.3 GAC Layer

The results of incorporating the GAC layer are shown in Table 5.3. The GAC layer is implemented during feature learning only and with max pooling. The results for SOMNet8−32 demonstrate that using either GAC2 or GAC4 yields an improvement in the error rate versus the baseline; however, only GAC4 showed a statistically significant improvement, at the 5% level. Again, by selecting four rather than two filters, more information is combined to produce the output. The results for SOMNet16−32 do not show an improvement against the baseline, although the difference is not significant.

Figure 5.4: Learned second layer SOMNet + SAC filters on MNIST. Top four rows: SOMNet8−32 architecture. Bottom four rows: SOMNet8−32 + SAC6 architecture.

Figure 5.5: Learned second layer SOMNet + SAC filters on MNIST. Top four rows: SOMNet16−32 architecture. Bottom four rows: SOMNet16−32 + SAC8 architecture.

Table 5.3: GAC layer: error rate on MNIST (p-values represent the comparison between each method with the method labelled not applicable (N/A) for each section).

Method | Error rate (%) | p-value
SOMNet8−32 | 0.69±0.02 | N/A
SOMNet8−32 + GAC2 | 0.66±0.03 | 0.2230
SOMNet8−32 + GAC4 | 0.61±0.03 | 0.0184
SOMNet16−32 | 0.56±0.07 | N/A
SOMNet16−32 + GAC2 | 0.58±0.04 | 0.6896
SOMNet16−32 + GAC4 | 0.58±0.02 | 0.6590

Figure 5.6: Learned second layer SOMNet + GAC filters on MNIST. Top four rows: SOMNet8−32 architecture. Bottom four rows: SOMNet8−32 + GAC4 architecture.

Observing the filters for SOMNet8−32 and SOMNet16−32 (Figures 5.6 and 5.7, respectively), the filters do not appear more complex than the baseline, in contrast to both the FAC and SAC methods. This can perhaps be explained by the grouping mechanism: since groupings are based on the filter's position in the map, fewer variations are captured when the filters are combined.

5.4.2.4 Discussion and Comparison to State-of-the-Art

The best results from the three above methodologies were compared to other similar approaches and the current state-of-the-art in Table 5.4. The proposed methods performed as well as, or better than, other more complex methodologies, apart from ConvNet (trained supervisedly), ScatNet (which used an RBF SVM) and MRF-CNN (which used a deeper architecture with many more parameters). These results are promising considering only 40 filters were used. Indeed, the proposed approach produced improved results over the SOMNet8−32 baseline for all three configurations. The significance of the top two results, when FAC and SAC6 were used, is demonstrated to be at the 1% level. Out of the three methodologies used here, SAC6 produced the largest improvement in error rate of -0.15, with FAC resulting in an improvement of -0.12. This could be explained by the fact that SAC ignores less relevant features, whereas FAC combines all filters and therefore applies no discrimination in its selection. GAC produced the least improvement in error rate (-0.08), suggesting that grouping by map position may not aid performance. This is further illustrated by comparing GAC4 and SAC4, which group the same number of filters but resulted in improved error rates with SAC (0.56% versus 0.61% with GAC).

Figure 5.7: Learned second layer SOMNet + GAC filters on MNIST. Top four rows: SOMNet16−32 architecture. Bottom four rows: SOMNet16−32 + GAC4 architecture.

The same methodologies were used on a SOMNet16−32 architecture. These are also compared against the baseline results from the previous chapter and publication [68]; however, the main improvement for the baseline observed in this chapter can be attributed to the use of whitening and the fine-tuning of parameters (0.56% versus 0.65%). When the layer aggregation experiments were carried out on the 16-32 architecture, only minor improvements were noted; however, a similar trend to that observed with the 8-32 architecture (both FAC and SAC providing the most improvement) is apparent. Yet, no statistical significance can be attributed to these improvements.

Table 5.4: Error rate on MNIST (p-values represent the comparison between each method with the method labelled not applicable (N/A) for each section).

Method | Error rate (%) | p-value
CDBN [113] | 0.82 | -
CSOM (linear SVM) [4] | 0.82 | -
PCANet-2 [20] | 0.66 | -
ConvNet [88] | 0.53 | -
ScatNet-2 (RBF SVM) [18] | 0.43 | -
SOMNet16−32 [68] | 0.65±0.02 | -
MRF-CNN [145] | 0.38 | -
SOMNet8−32 | 0.69±0.02 | N/A
SOMNet8−32 + FAC | 0.57±0.01 | 0.0007
SOMNet8−32 + SAC6 | 0.54±0.04 | 0.0044
SOMNet8−32 + GAC4 | 0.61±0.03 | 0.0184
SOMNet16−32 | 0.56±0.07 | N/A
SOMNet16−32 + FAC | 0.55±0.03 | 0.8312
SOMNet16−32 + SAC8 | 0.53±0.01 | 0.5032
SOMNet16−32 + GAC4 | 0.58±0.02 | 0.6590

5.4.3 Object Classification on the CIFAR-10 Dataset

Through experimentation, the optimal parameters of SOMNet were found to be H1 = 40, H2 = 32, s1 = s2 = 5, Bsize = 8 and τ = 0.5. The first layer SOM was trained for 600000 iterations, using an initial LR and σ of 0.1 and 20, respectively, which were annealed using LRb = 50000 and σb = 5263, whereas the second layer SOM was trained for 1200000 iterations using LR = 0.1, σ = 16, LRb = 100000 and σb = 13333. The output was divided into 4 × 4 sub-regions to which max pooling was applied to give 16 pooled histograms. In order to simplify the spatial pooling of the block histogram bins, zero padding was added to the final activation maps to produce an even sized output. In order to speed up the experimentation process, experiments were first conducted using a subset of data to compare the application of the different layers and fine-tune parameters. Once complete, experiments were conducted on the full dataset. The subset was formed by uniformly sampling 1000 examples from each class. The results reported on the subset are the average of three runs, whereas for the full set the results reported are the average of four runs. SVM parameters were fine-tuned using five-fold cross-validation for all results.

5.4.3.1 FAC Layer

Table 5.5 shows the results for the FAC layer applied during feature extraction only and using max pooling. The results indicate that the FAC layer has a detrimental effect on performance, with accuracy decreasing by 1.06% against the baseline. These results are perhaps due to the input space being more complex and the increased number of filters compared to the MNIST experiments, resulting in a reduced ability to learn useful features in the second layer.

Observing the filters (Figure 5.8), it appears that many more low frequency filters are learned when compared with the baseline SOMNet40−32 architecture. This could be attributed to the high frequency content being lost due to too many filters being combined.

Table 5.5: FAC layer: accuracy on CIFAR-10 (the p-value represents the comparison between each method with the method labelled not applicable (N/A)).

Method | Accuracy (%) | p-value
SOMNet40−32 | 73.25±0.19 | N/A
SOMNet40−32 + FAC | 72.19±0.15 | 0.0016

Figure 5.8: Learned second layer SOMNet + FAC filters on CIFAR-10. Top four rows: SOMNet40−32 architecture. Bottom four rows: SOMNet40−32 + FAC architecture.

Table 5.6: SAC layer: accuracy on CIFAR-10 (p-values represent the comparison between each method with the method labelled not applicable (N/A)).

Method | Accuracy (%) | p-value
SOMNet40−32 | 73.25±0.19 | N/A
SOMNet40−32 + SAC2 | 73.21±0.35 | 0.8703
SOMNet40−32 + SAC5 | 72.81±0.06 | 0.0187
SOMNet40−32 + SAC10 | 72.51±0.24 | 0.0138
SOMNet40−32 + SAC20 | 72.38±0.18 | 0.0045
SOMNet40−32 + SAC30 | 72.13±0.19 | 0.0020

5.4.3.2 SAC Layer

Table 5.6 shows the results for implementing the SAC layer with max pooling during feature learning. Here, the opposite trend to the MNIST results was observed, with a negative correlation between accuracy and the number of features selected in the SAC layer. This could again be attributed to the increased complexity of the input space and the subsequent filters, leading to detrimental effects on the learning of second-layer features. Indeed, all the results showed decreased accuracy versus the baseline, apart from SAC2, where the decrease in performance was not statistically significant.

The filters (Figure 5.9) showed no significant differences compared with the baseline, although this is hard to interpret given their inherent complexity and the fact that each second-layer SOM only converges to a local minimum.

5.4.3.3 GAC Layer

The results for the GAC layer are shown in Table 5.7. The GAC layer is implemented during feature extraction only and used max pooling for channel aggregation. As before, better results were observed when fewer features were combined, with GAC2 exhibiting better performance at 73.33% compared with GAC4 at 72.98%. Whilst the GAC2 result was 0.08% above the baseline, the increase was not statistically significant.

Figure 5.9: Learned second layer SOMNet + SAC filters on CIFAR-10. Top four rows: SOMNet40−32 architecture. Bottom four rows: SOMNet40−32 + SAC2 architecture.

Table 5.7: GAC layer: accuracy on CIFAR-10 (p-values represent the comparison between each method with the method labelled not applicable (N/A)).

Method | Accuracy (%) | p-value
SOMNet40−32 | 73.25±0.19 | N/A
SOMNet40−32 + GAC2 | 73.33±0.29 | 0.7098
SOMNet40−32 + GAC4 | 72.98±0.31 | 0.2678

The filters for both the baseline SOMNet40−32 and SOMNet40−32 + GAC are shown in Figure 5.10. Again, it is hard to compare the filters given their complexity and the fact that the SOMs are only locally optimised. However, it does appear that there was slightly less redundancy with the proposed method, which could explain the minor improvement in accuracy.

5.4.3.4 Discussion and Comparison to State-of-the-Art

This section includes the results for the best methods on both the subset and the full set, and compares them to similar approaches and the current state-of-the-art. The results for SOMNet, as well as PCANet, CUNet and other state-of-the-art methods, are shown in Table 5.8. Results on the subset can be found in the second section of the table and the results on the full set in the final section.

The results indicate that the baseline SOMNet (Table 4.6) is competitive with other similar approaches such as PCANet and CUNet. In terms of other unsupervised approaches, such as k-means, RF Learning and NOMP-20, the SOMNet approach used here is competitive with k-means, but has much lower accuracy than NOMP-20 and RF Learning, although all these approaches use far larger dictionaries. The other superior results shown in the table use supervised learning approaches (Stochastic Pooling ConvNet, NIN + Dropout, MRF-CNN). In this chapter, the objective was to improve the accuracy of SOMNet through the addition of the aggregation layers; however, a non-significant improvement was shown only when the filters were grouped in twos on the subset using GAC2, and this was not seen when the experiment was replicated on the full set. This is in contrast to the MNIST results, which demonstrated improvements when more features were combined, for which FAC, and SAC with greater numbers of filters, were superior. This was possibly due to the CIFAR-10 input space being inherently more complex compared to MNIST, and therefore the combination of too many filters diminished the ability of the resultant filters learned in the second layer to discern useful features. Furthermore, since the input space was far larger, more independent features were learned, and therefore the influence of the neighbourhood on the map was reduced, resulting in more arbitrary GAC groupings. This could explain why the GAC result performed better with CIFAR-10 than with MNIST.

A qualitative analysis revealed that there was no real observable difference between the filters of the baseline and the proposed methodologies, apart from FAC, which produced the worst results. However, perhaps the improvement observed on the subset with GAC2 was due to reduced redundancy or noise in the second layer, induced by the competitive aggregation, which leads to improved generalisation of the filters. Yet, statistically, no strong conclusions can be made.

Figure 5.10: Learned second layer SOMNet + GAC filters on CIFAR-10. Top four rows: SOMNet40−32 architecture. Bottom four rows: SOMNet40−32 + GAC2 architecture.

Table 5.8: Accuracy on CIFAR-10 (p-values represent the comparison between each method with the method labelled not applicable (N/A) for each section).

Method | Accuracy (%) | p-value
Tiled CNN [135] | 73.10 | -
PCANet-2 [20] | 77.14 | -
K-means (Triangle, 4000 features) [27] | 79.60 | -
CUNet + Weighted Pooling [42] | 80.31 | -
RF Learning [29] | 82.0 | -
NOMP-20 [117] | 82.9 | -
Stochastic Pooling ConvNet [195] | 84.87 | -
NIN + Dropout [116] | 89.59 | -
MRF-CNN [145] | 91.25 | -
SOMNet40−32 | 73.25±0.19 | N/A
SOMNet40−32 + FAC | 72.19±0.15 | 0.0016
SOMNet40−32 + SAC2 | 73.21±0.35 | 0.8703
SOMNet40−32 + GAC2 | 73.33±0.29 | 0.7098
SOMNet40−32 | 78.51±0.06 | N/A
SOMNet40−32 + FAC | 77.56±0.12 | 0.0001
SOMNet40−32 + SAC2 | 78.29±0.28 | 0.1753
SOMNet40−32 + GAC2 | 78.13±0.35 | 0.0761

5.5 Conclusion and Future Work

This chapter presents a novel approach to representation learning. The proposed method can build representations based on combinations of previous layer features. This approach to selecting receptive fields requires no additional parameter learning over the previously proposed SOMNet, but can significantly increase the performance on the MNIST dataset. In general, the results here suggest that using techniques that combine more filters may be beneficial for simple datasets such as MNIST. However, the improvements on MNIST were not significant when tried on a larger architecture with additional first-layer filters, indicating that the choice of encoding may be less important than the number of filters employed. In addition, no significant improvements were observed for the more complex CIFAR-10 dataset. In fact, the results indicated that the opposite may be true, i.e. that combining fewer filters may be beneficial, although no strong conclusions can be made due to statistically non-significant results.

Whilst the results were inferior or similar compared with the baseline for CIFAR-10 (GAC2 producing the best results out of the methods tested), more research may provide promising results on this more complex dataset. Specifically, further methods for combining the filters, such as linear weightings, could be explored.

In addition, the possibility of this methodology being used alongside more traditional pooling layers, to reduce redundancy and dimensionality in the feature space, should be investigated. Specifically, experiments with these channel aggregation layers could be conducted using alternative learning strategies, such as the supervised CNN, to assess whether it complements standard spatial pooling. Whilst operators such as the maxout unit have already been proposed, the aggregation layers proposed in this chapter are sufficiently different to warrant further investigation. Furthermore, experiments on more data and further subsets could be investigated. For more details on future work, please see Section 7.2.2.

In general, this study gives further evidence that simple unsupervised algorithms, such as SOM, may have a place alongside more complex deep models. Although some of the results were inferior, this is still a worthwhile line of enquiry, either as a completely unsupervised endeavour, or coupled with supervised learning.

Chapter 6

Filter Replacement in Convolutional Networks using Self-Organising Maps

6.1 Introduction

Deep learning methodologies have made great strides in recent years [73, 82]; however, the main advancements come from supervised models trained on millions of labelled examples. The labelling of data is expensive in terms of time and cost, and the problem is even worse for video data, where each frame may require individual annotations: Section 3.5.1.6 showed that the inclusion of bounding box annotations increased accuracy over the baseline. The existence of sufficiently large labelled data for image classification [118, 153] has made the transfer of knowledge from one task to the next possible, where knowledge exists as human-annotated labels. In fact, it has become almost ubiquitous to pre-train models on ImageNet before fine-tuning on the target task [71, 161]. Video datasets of the same thoroughness are lacking, and thus video models have not seen the same level of progress as their image-based cousins.


Indeed, an initial attempt to recognise human actions using deep-learning saw minimal improvement compared with hand-designed models [94]. Furthermore, training com- plex deep models from scratch can lead to disappointing results [94, 174]. In recent years, the use of pre-training has started to be used in video classification, enabling improvements in accuracy [173]. However, some of the larger gains have been fa- cilitated by a combination of pre-training alongside the availability of a large-scale video dataset [95]. In addition to the data hungry nature of supervised methodologies, they also have well documented problems with overfitting the training data [167,181]. Whilst certain regularisation techniques have been introduced to counteract this phe- nomenon, such as DropOut [78] (Section 3.3.2.2), augmentation [101] (Section 3.3.4), dimensionality reduction and weight decay (Section 3.3.3), the problem still persists and it is often necessary to combine multiple different techniques. Unlike supervised learning which requires both an input and its respective label, unsupervised learning can be trained on just the input. Specifically, supervised learn- ing methodologies model p(y|x) directly. In contrast, unsupervised techniques learn to model only the input p(x) and therefore do not require labelled data. Under the as- sumption that the input will contain information in order to differentiate between cate- gories p(y|x); a model that is capable of reconstructing an input should also be able to tell the difference between examples of different categories. However, this constraint does limit the degrees of freedom when seeking a solution. Yet, these constraints can prevent the model from learning noise and prevent overfitting. For these reasons, the unsupervised learning paradigm is believed to offer the most attractive solution. The combination of unsupervised and supervised methodologies is not new. Un- supervised pre-training has been used previously, however, it is generally used as an initialisation point. Models such as the deep belief networks [12,76] and autoencoders [77] and their convolutional variants have been used for unsupervised pre-training, but CHAPTER 6. FILTER REPLACEMENT IN CONVOLUTIONAL NETWORKS153 these are complex and time consuming to train. Additionally, the assumption that un- supervised features are useful to discriminate abstract classes has been weakened by the emergence of transfer learning [172], for which the transferred knowledge is usu- ally discriminative. Thus, these methods have fallen out of favour, replaced mainly by a resurgence in the interest in supervised convolutional neural networks [101]. This chapter concerns itself with the combination of unsupervised and supervised methodologies in the application of image and video classification. This is being ex- plored in the knowledge that a combination could help alleviate the need for vast labelled data whilst still achieving comparable results and improve generalisation. Specifically, it will investigate the combination of the convolutional neural network (CNN) and the self organising map (SOM). Whilst unsupervised pre-training has al- ready been explored, to my knowledge it has not been done with SOM. In fact, SOM is generally ignored by the deep-learning community. In addition, most previous ap- proaches use unsupervised pre-training as an initialisation point only [12, 76, 77] and subsequently fine tune, unlike the work proposed here which uses fixed unsupervisedly- trained filter banks. 
In contrast to other, more complex unsupervised pre-training techniques, SOM is very efficient. In addition, there is a definite need for alternative training methodologies which can make use of the abundance of unlabelled data. While much current work in deep learning is focused on incremental gains using supervised learning, more focus on the unsupervised paradigm could prove more fruitful in the longer term. Furthermore, is it practical to rely on the time-consuming manual curation of data for future advancements in data understanding?

6.2 Related Work

Over the last few years, deep learning has made tremendous strides in visual data classification [25, 73, 88, 116, 195]. Deep learners use multiple levels of neurons to learn local hierarchical features at multiple levels of abstraction and increasing complexity. Much of this success can be attributed to the progress of supervised convolutional neural networks [101] and their ability to leverage knowledge from large-scale curated datasets, setting major new benchmarks for computer vision tasks. Image classification in particular has reaped the benefits of large and well-labelled datasets such as ImageNet [101, 153] and Places [198], leading to more advanced architectures. AlexNet [101] used non-saturating ReLU non-linearities in order to increase depth and take advantage of the numerous labelled examples provided, leading to dramatically improved error rates over previous hand-crafted solutions. The existence of such large labelled datasets has also led to a resurgence in transfer learning [172] in recent years, where knowledge is transferred from one task to the next. In fact, it has become near ubiquitous to use pre-trained ImageNet weights for many computer vision tasks, from image classification on small datasets [99] and action classification [161] to detection and segmentation [71].

However, video classification has not seen the same level of success, with improvements being more minimal. Training deep spatio-temporal models of sufficient size for video classification is even more reliant on the availability of large labelled datasets, given the considerably larger parameter space. Unfortunately, until recently, most such datasets have been well labelled but small in size [102, 149, 165]. For video, the curation of large-scale datasets is even more challenging, since manual annotation takes far longer for video than for images, resulting in a few large but weakly labelled datasets [1, 94]. Whilst the existence of such datasets has improved the learning of deep spatio-temporal features over baselines trained from scratch, it has not always demonstrated improved performance when compared to some hand-crafted or even single-frame solutions. Karpathy et al. [94] only achieved marginal improvements over a single-frame baseline. Tran et al. [173] only improved on a two-stream RGB and optical flow based approach by combining deep and hand-crafted features. In fact, smaller action datasets are still dominated by hand-crafted solutions [149, 182]. Recently a new large-scale video dataset, Kinetics [95], has been released (600K videos and 600 categories), which has led to more significant improvements in accuracy when used for pre-training [19].

To make the job easier, some approaches rely on knowledge transferred from 2D, in terms of data or model design. Simonyan and Zisserman [161] proposed a two-stream deep network which used 2D convolutions on individual-frame RGB and optical flow inputs. The spatial RGB stream was initialised using ImageNet and fine-tuned on the target action dataset. More recent work by Carreira et al. [19] also initialised a flow stream from ImageNet weights. Karpathy et al. [94] used a large video dataset to train deep networks for action recognition; however, most variants (except one) used 2D convolutions instead of 3D. Carreira et al.
[19] extended pre-trained 2D kernels to 3D by simply replicating the spatial kernel along the temporal dimension and fine-tuning. However, networks of this type constrain 3D models to be identical to their 2D counterparts, which may not be optimal.

Unsupervised pre-training has also been used to take advantage of the abundance of unlabelled or weakly labelled data. Whilst unsupervised pre-training is not new, the majority of existing approaches use the pre-trained state as an initialisation point in order to position the parameters close to a region of local minima [12]. The pre-trained parameters are then further optimised using supervised backpropagation. Much work in this area has been applied to 2D datasets [47, 77, 141, 148]. However, recently transfer learning has become the most popular approach. There has also been previous work using unsupervised learning for video representation. Le et al. [109] learned unsupervised spatio-temporal features using cascaded independent subspace analysis (ISA), but the results were only slightly better than hand-crafted solutions. Other work in this area has produced results which can be inferior to supervised features [130].

The work in this chapter attempts to replace filters with unsupervisedly trained alternatives. Specifically, this chapter explores the use of self-organising maps as an effective and efficient means of replacing the first layer filters of a CNN. Some of this work has already been explored by Peng, Hankins and Yin [144]; however, this chapter explores the application of self-organising maps as replacement filter banks in more detail and applies the methodology to further datasets, including both object and action classification.

6.3 Methodology

6.3.1 Proposed Method

The experiments in this chapter are concerned with the replacement of CNN filters via SOM. Specifically, the proposed methodology is to replace the first layer filters of a standard CNN with SOM-derived filter banks, which are subsequently fixed during training. Experiments with both 2D and 3D CNNs are performed. For the 2D CNN, experiments are performed on two challenging image datasets, namely CIFAR-10 and CIFAR-100 (Section 2.3.1). Both CIFAR-10 and CIFAR-100 have 50000 and 10000 examples for training and testing, respectively. For the 3D CNN experiments, UCF-50 (Section 2.3.2.3) is used. All results reported are the average accuracy on the test set for the last five epochs of training over three independent runs of the CNN. Each SOM was trained until convergence. For simplicity, the notation used throughout will be the same. The size of the data inputs is referred to as d × td × h × w, where d is the number of channels, td is the number of frames, and h and w are the height and width of the frame, respectively. For 2D inputs the number of frames is 1 and therefore the size reduces to d × h × w. With regard to the filters, the size is d × td × s × s, where td is the temporal depth and s is the spatial size of the filter. For 2D filters the temporal depth is 1 and therefore the size of the filters is just s × s. All CNN models are trained using Theano/Lasagne and Nvidia GTX 980, GTX Titan X and Titan V GPUs. SOMs are trained using either MATLAB or Theano/Lasagne.
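To make the pipeline concrete, the following minimal Python/NumPy sketch shows how patches might be gathered from unlabelled images and how the prototypes of a trained SOM could be reshaped into a first layer filter bank; the helper names (extract_patches, build_filter_bank) are illustrative and are not the code used for the experiments.

import numpy as np

def extract_patches(images, s, n_patches):
    # Randomly crop n_patches patches of size d x s x s from images of shape (N, d, h, w).
    d, h, w = images.shape[1:]
    patches = np.empty((n_patches, d, s, s), dtype=images.dtype)
    for i in range(n_patches):
        img = images[np.random.randint(len(images))]
        y, x = np.random.randint(h - s + 1), np.random.randint(w - s + 1)
        patches[i] = img[:, y:y + s, x:x + s]
    return patches

def build_filter_bank(som_prototypes, d, s):
    # Reshape trained SOM prototypes (m x d*s*s) into a convolutional filter bank (m x d x s x s).
    return som_prototypes.reshape(-1, d, s, s)

The filter bank produced in this way is what replaces the first convolutional layer of the CNN and is kept fixed while the remaining layers are trained on the labelled data.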

6.3.2 Self-Organising Maps

Two varieties of SOM are experimented with in this chapter. The first variety, referred to simply as SOM, is the convolutional SOM used previously in Chapters 4 and 5. The second variety, named SOMTI, implements translation-invariant convolutional feature learning using the following methodology (similar to [45]). Each neuron of the SOMTI is convolved with an input whose dimensions are double the size of its nodes. Specifically, given a set of patches Υ = [υ1, υ2, ..., υn] ∈ R^(d(2s)² × n) and a set of nodes W = [w1, w2, ..., wm] ∈ R^(ds² × m), at each time step convolution is performed between each SOM node wi and a given patch υζ. The highest activated neuron wi and the corresponding d × s × s subpatch of υζ, ξ, are then used to update the SOMTI using equation 4.2, where the best matching unit bmu(t) = wi and the input x(t) = ξ. The SOM parameters LR and σ are annealed using equations 4.3 and 4.4 from Section 4.3.1.

For SOMs trained on 3D video data, each d × td × s × s input is simply collapsed such that wi ∈ R^(d td s²).
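A rough NumPy sketch of a single SOMTI update is given below. It treats the convolution between a node and the double-sized patch as exhaustive valid cross-correlation, and it assumes that equation 4.2 is the standard Kohonen update with a Gaussian neighbourhood on the map grid; the function name and argument layout are illustrative only.

import numpy as np

def somti_step(nodes, patch, lr, sigma, grid):
    # nodes: (m, d, s, s) SOM prototypes; patch: (d, 2s, 2s) input patch;
    # grid: (m, 2) map coordinates of each node.
    m, d, s, _ = nodes.shape
    best_resp, bmu, best_sub = -np.inf, 0, None
    for i in range(m):
        for y in range(s + 1):           # (2s - s + 1) valid vertical offsets
            for x in range(s + 1):       # (2s - s + 1) valid horizontal offsets
                sub = patch[:, y:y + s, x:x + s]
                resp = np.sum(nodes[i] * sub)
                if resp > best_resp:
                    best_resp, bmu, best_sub = resp, i, sub
    # Gaussian neighbourhood centred on the best matching unit, then Kohonen update
    dist2 = np.sum((grid - grid[bmu]) ** 2, axis=1)
    h = np.exp(-dist2 / (2.0 * sigma ** 2))
    nodes += lr * h[:, None, None, None] * (best_sub[None] - nodes)
    return nodes

In practice the learning rate lr and neighbourhood width sigma would be annealed over the training iterations, as described by equations 4.3 and 4.4.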

6.3.3 Convolutional Neural Networks

In this section the configurations of the standard CNNs are discussed. These are used as baselines against which the results in Section 6.3.4 are compared.

6.3.3.1 2D Network Architecture

The network employed is based on the VGG network proposed by Simonyan and Zisserman [162]. It uses three stacked convolutional layers followed by two fully connected layers and a softmax output layer. After each stacked convolution there is a max-pooling layer. DropOut is inserted after each max-pooling layer and fully connected layer. Whilst VGG uses 3 × 3 filters throughout, in this work experiments are performed using 3 × 3, 5 × 5 and 7 × 7 filters in the first layer only; the filters for all remaining layers are 3 × 3. This is similar to the ResNet [73] structure, which used different size filters in the first layer: specifically, 7 × 7 filters were used in the first layer and 3 × 3 in all others. The network employed in this work features padding in the first stacked convolutional layer, so that the output size remains the same as the input size and different size filters can therefore be used without altering the dimensions of the final output. Specifically, the network implemented is described as 3 × 32 × 32 − 64C − 64C3 − MP2 − D − 128C3 − 128C3 − MP2 − D − 256C3 − 256C3 − MP2 − D − 128FC − D − 128FC − D − Soft, where D represents a DropOut layer with a probability of 0.5 and the number of output softmax layer neurons is 10 and 100 for CIFAR-10 and CIFAR-100, respectively. The ReLU function is used as the activation function on all layers apart from the output layer. Table 6.1 details the architecture.
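For illustration, the baseline could be expressed in Theano/Lasagne (the framework stated in Section 6.3.1) roughly as follows; options such as weight initialisation and the optional batch normalisation are omitted, and the sketch is not a record of the exact code used.

from lasagne.layers import (InputLayer, Conv2DLayer, MaxPool2DLayer,
                            DropoutLayer, DenseLayer)
from lasagne.nonlinearities import rectify, softmax

def build_baseline_2d(s=3, n_classes=10):
    # VGG-style baseline for CIFAR-10/100; only the first layer uses an s x s filter.
    net = InputLayer((None, 3, 32, 32))
    net = Conv2DLayer(net, 64, (s, s), pad='same', nonlinearity=rectify)
    net = Conv2DLayer(net, 64, (3, 3), pad='same', nonlinearity=rectify)
    net = MaxPool2DLayer(net, (2, 2))
    net = DropoutLayer(net, p=0.5)
    for n_filters in (128, 256):
        net = Conv2DLayer(net, n_filters, (3, 3), nonlinearity=rectify)
        net = Conv2DLayer(net, n_filters, (3, 3), nonlinearity=rectify)
        net = MaxPool2DLayer(net, (2, 2))
        net = DropoutLayer(net, p=0.5)
    net = DenseLayer(net, 128, nonlinearity=rectify)
    net = DropoutLayer(net, p=0.5)
    net = DenseLayer(net, 128, nonlinearity=rectify)
    net = DropoutLayer(net, p=0.5)
    return DenseLayer(net, n_classes, nonlinearity=softmax)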

6.3.3.2 3D Network Architecture

The network implemented is similar to the network used in Section 3.5.1; however, it is altered as follows. The first two convolution and pooling layers are replaced by a stack of convolutional layers, and 1024 neurons are used in the hidden layers as opposed to 4096. Other sizes were experimented with; however, this was found to give a good balance between accuracy and complexity. In addition, the network accepts an input of 3 × 16 × 64 × 64,

Table 6.1: Baseline 2D CNN architecture for CIFAR-10/CIFAR-100

Input 3 × 32 × 32 RGB image
64 s × s conv ReLU with stride 1 × 1 and padding
64 3 × 3 conv ReLU with stride 1 × 1 and padding
2 × 2 max-pooling with stride 2 × 2
DropOut at 0.5
128 3 × 3 conv ReLU with stride 1 × 1
128 3 × 3 conv ReLU with stride 1 × 1
2 × 2 max-pooling with stride 2 × 2
DropOut at 0.5
256 3 × 3 conv ReLU with stride 1 × 1
256 3 × 3 conv ReLU with stride 1 × 1
2 × 2 max-pooling with stride 2 × 2
DropOut at 0.5
128 fully connected
DropOut at 0.5
128 fully connected
DropOut at 0.5
10/100-class softmax

thereby increasing the temporal dimension whilst reducing the spatial dimensions. Temporal information has been shown to be very important in previous work [22, 161], and it was noted in Section 3.5.1 that perhaps 8 frames did not provide enough temporal information. In addition, Karpathy et al. [94] showed that a 3D CNN which slowly fused the temporal information worked best. Therefore this network uses more input frames and maintains the temporal information for longer than the network used in Section 3.5.1. Due to these changes the network parameters have decreased from 78M to 16M. The reduction in parameters is necessary as these experiments are trained from scratch on a relatively small dataset, whereas the previous network was first pre-trained on a large dataset and only fine-tuned on a reduced dataset. With this in mind, DropOut was also added to aid generalisation. The exact structure is as follows: 3 × 16 × 64 × 64 − 64C − 64C3 − MP − 128C3 −

128C3 − MP − 256C3 − 256C3 − MP − 512C3 − 512C3 − MP − D − 1024FC − D − 1024FC − D − Soft, where the DropOut probability was set to 0.5. The convolutional filters are of size 3 × 3 × 3 with a stride of 1 × 1 × 1 throughout, apart from the first layer where, as with 2D, experiments are performed with both 3 × 3 × 3 and 3 × 5 × 5. Tran et al. [173] investigated the size of the temporal dimension of 3D filters and concluded that a size of 3 was optimal and that 3 × 3 × 3 filters produced superior accuracy for 3D CNNs. This is consistent with Simonyan and Zisserman [162] and the same is therefore followed here. Padding is applied to the first two stacked convolutional layers so that the output size remains the same as the input. For all the other convolutional layers no padding is applied. For the first two pooling layers a pool size and stride of one are used for the temporal dimension only, so as to keep the temporal information intact. For the spatial dimensions in the first two pooling layers, as well as in all the other pooling layers, a pool size and stride of two are used. By using this structure, after the second pooling layer both spatial dimensions and the temporal dimension of the input are the same size. ReLUs are used as the activation function on all layers apart from the output layer. Table 6.2 details the architecture.

6.3.4 Filter Replacement with Self-Organising Maps

In this section the configurations of the modified CNNs, which incorporate SOM filter banks, are discussed. These are used for comparison against the baseline CNN models.

6.3.4.1 2D Network Architecture

The network employed was exactly the same as for the 2D baselines apart from the first layer. For the experiments conducted with an M = 8 × 8 SOM, the network remains identical but with the first layer filters replaced with SOM-derived filter banks.

Table 6.2: Baseline 3D CNN architecture for UCF-50

Input 3 × 16 × 64 × 64 RGB video clip
64 3 × s × s conv ReLU with stride 1 × 1 × 1 and padding
64 3 × 3 × 3 conv ReLU with stride 1 × 1 × 1 and padding
1 × 2 × 2 max-pooling with stride 1 × 2 × 2
128 3 × 3 × 3 conv ReLU with stride 1 × 1 × 1 and padding
128 3 × 3 × 3 conv ReLU with stride 1 × 1 × 1 and padding
1 × 2 × 2 max-pooling with stride 1 × 2 × 2
256 3 × 3 × 3 conv ReLU with stride 1 × 1 × 1
256 3 × 3 × 3 conv ReLU with stride 1 × 1 × 1
2 × 2 × 2 max-pooling with stride 2 × 2 × 2
512 3 × 3 × 3 conv ReLU with stride 1 × 1 × 1
512 3 × 3 × 3 conv ReLU with stride 1 × 1 × 1
2 × 2 × 2 max-pooling with stride 2 × 2 × 2
DropOut at 0.5
1024 fully connected
DropOut at 0.5
1024 fully connected
DropOut at 0.5
50-class softmax

Table 6.3: Proposed 2D CNNNIN+SOM architecture for CIFAR-10/100

Input 3 × 32 × 32 RGB image
M s × s conv ReLU with stride 1 × 1 and padding
64 1 × 1 conv ReLU
64 3 × 3 conv ReLU with stride 1 × 1 and padding
2 × 2 max-pooling with stride 2 × 2
DropOut at 0.5
128 3 × 3 conv ReLU with stride 1 × 1
128 3 × 3 conv ReLU with stride 1 × 1
2 × 2 max-pooling with stride 2 × 2
DropOut at 0.5
256 3 × 3 conv ReLU with stride 1 × 1
256 3 × 3 conv ReLU with stride 1 × 1
2 × 2 max-pooling with stride 2 × 2
DropOut at 0.5
128 fully connected
DropOut at 0.5
128 fully connected
DropOut at 0.5
10/100-class softmax

However, where larger SOM maps are used, the first layer filters are replaced by the combination of a SOM filter bank layer and a 1 × 1 convolutional layer. This layer functions in two ways: firstly, it learns combinations of the SOM filter banks in the first layer and, secondly, it facilitates the insertion of larger maps whilst keeping the rest of the network unchanged. Experiments relating to the first case are called CNN+SOM, whereas CNNNIN+SOM refers to the second, for which Table 6.3 details the altered network design.
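A minimal Lasagne sketch of this front end is shown below; the helper names (build_som_front_end, scale_fn) are hypothetical, and removing the 'trainable' tag is one possible way of keeping the SOM-derived filters fixed during backpropagation.

from lasagne.layers import InputLayer, Conv2DLayer
from lasagne.nonlinearities import rectify

def build_som_front_end(som_filters, scale_fn):
    # som_filters: (M, d, s, s) array of SOM prototypes (M = map size, e.g. 20*20)
    # scale_fn: applies the Glorot-range rescaling of Equation 6.1
    M, d, s, _ = som_filters.shape
    net = InputLayer((None, d, 32, 32))
    net = Conv2DLayer(net, M, (s, s), pad='same', nonlinearity=rectify,
                      W=scale_fn(som_filters).astype('float32'))
    net.params[net.W].discard('trainable')   # freeze the SOM-derived filter bank
    # the 1 x 1 convolution learns combinations of the fixed SOM responses and
    # maps them to the 64 channels expected by the unchanged remainder of the CNN
    net = Conv2DLayer(net, 64, (1, 1), nonlinearity=rectify)
    return net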

6.3.4.2 3D Network Architecture

For the 3D case, experiments are not performed using a network with an additional 1 × 1 layer. Thus, all experiments performed use the same networks as the 3D baselines, apart from the first filter layer, which is replaced by SOM-trained filter banks. These experiments are therefore simply referred to as CNN+SOM.

6.4 Object Classification Experiments and Discussion on the CIFAR-10 and CIFAR-100 Datasets

6.4.1 Convolutional Neural Networks

The baseline CNN network was trained using stochastic gradient descent with Nesterov momentum (NAG) with a batch size of 100, a learning rate of 0.01 and momentum of 0.9. The learning rate was annealed using a constant of 0.989 and L2 weight decay was applied with a λ of 0.0005. Training was performed until convergence, for 300 and 400 epochs for CIFAR-10 and CIFAR-100, respectively. The weights for all layers were initialised as W ∼ U[−a, a], where a = √(12/(fan_in + fan_out)) [57]. In terms of preprocessing, whitening was performed on both CIFAR-10 and CIFAR-100.

The baseline results are shown in Table 6.4. The second two experiments used batch normalisation, whereas the first did not. The difference between the second two experiments is the number of layers batch normalisation is applied to: for the first, batch normalisation was used on all convolutional and fully-connected layers, whereas for the second, batch normalisation was used on all the same layers apart from the first. It is standard to apply batch normalisation to all convolutional and fully connected layers; however, since the proposed methodology uses fixed, independently trained filters, a more comparable baseline was sought. As can be seen from the results, the accuracies when batch normalisation was applied are far superior. In addition, it appears that the difference between the two batch normalisation results is minimal. Thus, for all further experiments which include batch normalisation, batch normalisation is applied to every layer apart from the first one, to maintain consistency with the proposed methodology.

Table 6.4: Baseline 2D CNN accuracy on CIFAR-10 (p-values represent the comparison between each method with the method labelled not applicable (N/A)).

Batch normalisation     Accuracy (%)    p-value
No                      88.39±0.17      0.0013
Yes                     89.57±0.17      0.5671
Yes (except first)      89.67±0.22      N/A

Table 6.5: Baseline 2D CNN accuracy on CIFAR-100 (p-values represent the comparison between each method with the method labelled not applicable (N/A)).

First layer filter size     Accuracy (%)    p-value
3 × 3                       60.22±0.19      0.9643
5 × 5                       60.08±0.17      0.5032
7 × 7                       60.23±0.31      N/A

The baseline experiments for CIFAR-100 are shown in Table 6.5. Experiments were performed using different size filters in the first convolutional layer. Again, batch normalisation was applied to all layers apart from the first. The results indicated that a first layer filter size of either 3 × 3 or 7 × 7 was optimal, although the differences between the results were not statistically significant.
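For concreteness, the baseline training configuration described above (NAG with a batch size of 100, a learning rate of 0.01 annealed by a factor of 0.989, momentum 0.9 and L2 weight decay with λ = 0.0005) could be assembled in Theano/Lasagne along the following lines; the structure is illustrative rather than a record of the code actually used.

import numpy as np
import theano
import theano.tensor as T
import lasagne

def compile_training(network, base_lr=0.01, momentum=0.9, weight_decay=5e-4):
    X, y = T.tensor4('X'), T.ivector('y')
    lr = theano.shared(np.float32(base_lr))          # shared so it can be annealed
    prediction = lasagne.layers.get_output(network, X)
    loss = lasagne.objectives.categorical_crossentropy(prediction, y).mean()
    loss += weight_decay * lasagne.regularization.regularize_network_params(
        network, lasagne.regularization.l2)
    params = lasagne.layers.get_all_params(network, trainable=True)
    updates = lasagne.updates.nesterov_momentum(loss, params,
                                                learning_rate=lr, momentum=momentum)
    train_fn = theano.function([X, y], loss, updates=updates)
    anneal = lambda: lr.set_value(np.float32(lr.get_value() * 0.989))  # call once per epoch
    return train_fn, anneal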

6.4.2 Filter Replacement with Self-Organising Maps

6.4.2.1 Optimising SOM Parameters

In order to produce optimal overall results, the parameters of the SOM were first investigated. Training the CNN was conducted in the same way as for the baseline experiments. Since the CNN parameters were initialised using Glorot initialisation, the SOM filters used to replace the first layer filters were scaled similarly using the following:

W_scaled = (W − min(W)) / (max(W) − min(W)) · (Γ_max − Γ_min) + Γ_min    (6.1)

where Γ_min = −a, Γ_max = a, and a = √(12/(fan_in + fan_out)). Different scaling operations were experimented with; however, the above method achieved the best results. Many parameters for the SOM were tested, and it was found that the following parameters performed well for an 8 × 8 map. The SOM was trained for 600000 iterations, on a set of 900000 whitened patches, with initial values of 0.1 and 1.0 for the learning rate and σ, respectively. The learning rate and σ were annealed using LRb = 50000 and σb = 11111. However, when the SOM size was increased and combined using the 1 × 1 convolutional layer, the results were poor. Since the additional convolutional layer introduces combinations of first layer filters, it was thought that perhaps, by using a small σ, the SOM-learned features were too independent and therefore the resultant combinations were not useful. Therefore, further experiments were conducted which maintained σ at half the spatial dimensions of the map. Firstly, experiments (Table 6.6) were performed for SOM sizes of

M = 10 × 10, 15 × 15, 20 × 20, 25 × 25 and 30 × 30 with a fixed σb on CIFAR-10. However, it was observed that the best results were obtained between the 10 × 10 and 20 × 20 sizes. Therefore, SOM sizes within this range (M = 12 × 12, 14 × 14, 16 × 16, 18 × 18) were investigated further. In addition, a further set of experiments was performed using these same

SOM sizes, but with a fixed final σ of 0.2. The final σ of 0.2 was chosen because the results using the fixed σb showed the smallest standard deviations around this final σ value, suggesting that these results converged better, despite not showing the highest accuracy. Although M = 15 × 15 achieved the best accuracy for each method, there were no real trends observed in the data. In general, the two methodologies showed no real differences in terms of accuracy; however, it was decided to proceed with the first methodology, since, by varying σb, the larger maps could struggle to converge, as σ would be annealed too quickly.
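For reference, Equation 6.1 amounts to the following min-max rescaling into the Glorot range [−a, a] with a = √(12/(fan_in + fan_out)); the NumPy function below is an illustrative implementation rather than the exact code used.

import numpy as np

def scale_to_glorot_range(W, fan_in, fan_out):
    # Rescale a SOM filter bank W into [-a, a], matching the CNN weight initialisation.
    a = np.sqrt(12.0 / (fan_in + fan_out))
    W01 = (W - W.min()) / (W.max() - W.min())
    return W01 * (2.0 * a) - a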

Table 6.6: 2D CNNNIN+SOM accuracy on CIFAR-10

SOM size    Initial σ   σb      Final σ     Accuracy (%)
10 × 10     5.0         11111   0.091       86.34±0.13
12 × 12     6.0         11111   0.109       86.47±0.28
14 × 14     7.0         11111   0.127       86.17±0.23
15 × 15     7.5         11111   0.136       87.54±0.26
16 × 16     8.0         11111   0.145       86.46±0.13
18 × 18     9.0         11111   0.164       87.01±0.08
20 × 20     10.0        11111   0.182       86.16±0.09
25 × 25     12.5        11111   0.227       87.13±0.09
30 × 30     15.0        11111   0.273       86.31±0.08
10 × 10     5.0         25000   0.2         87.23±0.18
12 × 12     6.0         20690   0.2         87.22±0.13
14 × 14     7.0         17647   0.2         86.72±0.31
15 × 15     7.5         16438   0.2         87.41±0.14
16 × 16     8.0         15385   0.2         87.05±0.33
18 × 18     9.0         13636   0.2         86.92±0.29
20 × 20     10.0        12245   0.2         87.17±0.11
25 × 25     12.5        9756    0.2         86.97±0.08
30 × 30     15.0        8163    0.2         86.83±0.30

6.4.2.2 Comparing SOM and SOMTI Trained Filters

To evaluate the two different ways of training the SOM, further experiments were conducted with SOMTI so that a comparison with the experiments in Section 6.4.2.1 could be made. Specifically, a CNN with both SOM and SOMTI filters was trained on CIFAR-10. Again the CNNs were trained in exactly the same way as for the baseline experiments and the SOM filters were scaled as in Section 6.4.2.1. For the CNN+SOMTI experiments, batch normalisation was applied, as this was found necessary to keep the CNN weights stable during training. The SOM was trained for 600000 iterations, on a set of 900000 whitened patches, with LR = 0.1 and σ set to half the spatial dimensions of the map. The learning rate and σ were annealed using LRb = 50000 and σb = 11111. The SOMTI was trained using the same settings, except for 1.2 million iterations using LRb = 100000 and σb = 22222. Since the SOMTI processes an input which has double the height and width of its filters, it was thought best to double the number of training iterations, and both LRb and σb were doubled accordingly. Only 3 × 3 filter sizes were used. Experiments were performed for a variety of SOM sizes from 10 × 10 to 30 × 30. The results of the experiments can be seen in Table 6.7. The p-value compares the accuracy between the two methodologies (CNNNIN+SOM vs CNNNIN+SOMTI) for each SOM size.

The results show a significant improvement (p < 0.01) when using the SOMTI filters. Some of this improvement can be attributed to the use of batch normalisation, as this demonstrated an improvement when applied to the baseline (Table 6.4). However, the improvement can also be considered due to the SOMTI learning improved translation-invariant features. Since the filters are applied via convolution when used as first layer CNN filters, shifted filters do not provide additional information. Therefore, SOMTI filters provide improved representations of the input by removing this redundancy.

Table 6.7: 2D CNNNIN+SOM accuracy on CIFAR-10

SOM size    CNNNIN+SOM Accuracy (%)    CNNNIN+SOMTI Accuracy (%)    p-value
10 × 10     86.34±0.13                 89.40±0.15                   0.0001
12 × 12     86.47±0.28                 89.32±0.12                   0.0001
14 × 14     86.17±0.23                 88.51±0.21                   0.0002
15 × 15     87.54±0.26                 89.44±0.15                   0.0004
16 × 16     86.46±0.13                 89.58±0.10                   0.0001
18 × 18     87.01±0.08                 89.44±0.26                   0.0001
20 × 20     86.16±0.09                 89.69±0.10                   0.0001
25 × 25     87.13±0.09                 89.34±0.14                   0.0001
30 × 30     86.31±0.08                 89.47±0.14                   0.0001

6.4.2.3 Comparing Different Ratios of Supervised and Unsupervised Training Data

To evaluate how differing ratios of supervised and unsupervised training data affected the accuracy, a set of experiments was carried out. Specifically, the number of unsupervised examples available to sample from was fixed at the full size of the respective dataset's training set and the number of supervised samples was varied. Experiments were performed on both CIFAR-10 and CIFAR-100. For the CIFAR-10 experiments, batch normalisation was not used on either the baseline CNN or the CNN+SOM experiments, whereas for the CIFAR-100 results, batch normalisation was applied to both the baseline CNN and the CNN+SOMTI experiments.

To evaluate the ability of the trained SOM filters to generalise to different datasets, for the CIFAR-100 experiments the filters were trained on CIFAR-10 only. Whilst these datasets are similar, there is no overlap in terms of examples, as the categories used are different. In addition, the sizes of the datasets are exactly the same, making it straightforward to use the same experimental protocols. The CNNs were trained in exactly the same way as for the baseline experiments and the SOM filters were scaled as in Section 6.4.2.1.

The SOM filters were trained using an 8 × 8 SOM, for 600000 iterations, on a set of 900000 whitened patches, with initial values of 0.1 and 1.0 for the learning rate and σ, respectively. The learning rate and σ were annealed using LRb = 50000 and

σb = 11111. The SOMTI filters were trained using the same settings, except for 1.2 million iterations using LRb = 100000 and σb = 22222. Only 3 × 3 filter sizes were used. Experiments were performed for five different ratios from 1:1 to 1:25. For the

CIFAR-100 experiments, further ratios between 1:5/2 and 1:25 were tried. The results for the CIFAR-10 and CIFAR-100 experiments can be found in Table 6.8 and Table 6.9, respectively. The p-value compares the accuracy between the baseline CNN and CNN+SOM/CNN+SOMTI for each corresponding ratio. As can be seen from the results, when the ratio is set to 1:25, the network hardly performs any better than random choice (10% and 1% for CIFAR-10 and CIFAR-100, respectively). Focussing on the CIFAR-10 results, it is clear that with a high ratio of supervised to unsupervised training examples the baseline CNN performs better than CNN+SOM. However, as the ratio decreases, the CNN+SOM accuracy increases over the baseline, although the difference is not significant. The results for CIFAR-100 better illustrate this point. Again the baseline CNN achieves higher accuracies for higher ratios, although the difference between the two techniques is less pronounced,

and actually not statistically significant for both the 1:5/2 and 1:25/8 ratios. For lower ratios,

CNN+SOMTI significantly outperforms the baseline at the 5% level or less for all but the 1:25/6 ratio.
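The per-class subsetting used in these experiments could be realised with a simple sampler such as the sketch below (NumPy, illustrative names); the full training set remains available as unlabelled data for the SOM, while only the returned subset carries labels for supervised training.

import numpy as np

def subsample_per_class(X, y, examples_per_class, seed=0):
    # Keep a fixed number of labelled examples per class.
    rng = np.random.RandomState(seed)
    keep = []
    for c in np.unique(y):
        idx = np.flatnonzero(y == c)
        keep.append(rng.choice(idx, size=examples_per_class, replace=False))
    keep = np.concatenate(keep)
    return X[keep], y[keep]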

6.4.2.4 Evaluating SOM Size and Filter Size

In this section, different size SOMs as well as different size SOM filters are explored. Specifically, for CIFAR-10, SOM sizes of M = 10 × 10, 20 × 20, 30 × 30, 40 × 40 and 50 × 50 are investigated. The same size SOMs are also explored for CIFAR-100, in addition to varying the filter size using s = 3, 5 and 7.

Table 6.8: 2D baseline CNN vs CNN+SOM subset accuracy on CIFAR-10

Supervised training examples    Supervised size per class    Supervised to unsupervised ratio    Baseline CNN Accuracy (%)    CNN+SOM Accuracy (%)    p-value
50000    5000    1:1      88.39±0.17    87.07±0.08    0.0003
20000    2000    1:5/2    83.06±0.13    81.63±0.09    0.0001
10000    1000    1:5      75.41±0.28    73.56±0.42    0.0032
5000     500     1:10     42.11±3.85    44.97±0.83    0.2769
2000     200     1:25     10.00±0.00    10.45±0.62    -

Table 6.9: 2D baseline CNN vs CNN+SOMTI subset accuracy on CIFAR-100

Supervised training examples    Supervised size per class    Supervised to unsupervised ratio    Baseline CNN Accuracy (%)    CNN+SOMTI Accuracy (%)    p-value
50000    500    1:1       60.22±0.19    59.52±0.27    0.0213
20000    200    1:5/2     47.28±0.22    47.14±0.47    0.6646
18000    180    1:25/9    45.67±0.10    44.88±0.26    0.0080
16000    160    1:25/8    43.96±0.18    43.62±0.48    0.3147
14000    140    1:25/7    40.45±0.16    40.45±0.20    1.0000
12000    120    1:25/6    37.61±0.25    37.84±0.10    0.2131
10000    100    1:5       30.43±0.60    32.20±0.53    0.0186
9000     90     1:50/9    27.16±1.16    29.81±0.29    0.0185
8000     80     1:25/4    19.00±2.05    25.33±0.17    0.0060
7000     70     1:50/7    14.77±2.23    21.60±0.79    0.0075
6000     60     1:25/3    5.09±0.20     12.52±0.55    0.0001
5000     50     1:10      4.11±0.09     7.11±0.73     0.0021
2000     20     1:25      1.00±0.00     1.00±0.00     -

For the CIFAR-100 experiments, again, SOMTI filters trained on CIFAR-10 are used. Batch normalisation is applied to the CNNs and they are trained in exactly the same way as for the baseline experiments. The SOM filters were scaled as in Section 6.4.2.1.

Table 6.10 shows the results for CIFAR-10. The results indicate that altering the SOM size has no real impact on the accuracy on CIFAR-10 and there is no trend with increasing SOM sizes. A SOM size of M = 20 × 20 was found to have the greatest accuracy, but this was not statistically greater than the baseline. Additionally, altering the SOM size resulted in no statistically significant difference in accuracy versus the baseline for any of the SOM sizes used. The filters for the optimal SOM size are displayed in Figure 6.1.

Tables 6.11 - 6.13 show the results for CIFAR-100 using filter sizes s = 3, 5 and 7, respectively. Using a filter size of s = 3, CNNNIN+SOMTI was found to improve accuracy versus CNN with M = 20 × 20 and 30 × 30, although this difference was not statistically significant for either result. By increasing the SOM size with

CNNNIN+SOMTI with s = 3, the optimum result was found using a SOM size of M = 20 × 20, after which the accuracy decreases. Using a filter size of s = 5, CNNNIN+SOMTI was found to improve accuracy versus CNN with M = 20 × 20, 30 × 30 and 40 × 40, and in this case the results were statistically significant at the 5% level for all three SOM sizes. By increasing the SOM size with CNNNIN+SOMTI with s = 5, the accuracy increases to the optimum result, which was found using a SOM size of M = 30 × 30, after which the accuracy decreases. Using a filter size of s = 7, CNNNIN+SOMTI was not found to improve accuracy versus CNN with any of the SOM sizes tested. By increasing the SOM size with CNNNIN+SOMTI with s = 7, accuracy increased to an optimum result with a SOM size of M = 30 × 30; with bigger SOM sizes there are small fluctuations in the accuracy that are not statistically different versus the optimum result for this filter size. The filters for the optimal

Table 6.10: 2D CNNNIN+SOMTI accuracy on CIFAR-10 (p-values represent the comparison between each method with the method labelled not applicable (N/A) in each column).

Type             SOM size    Initial σ    Accuracy (%)    p-value    p-value
CNN              -           -            89.67±0.22      N/A        0.8929
CNNNIN+SOMTI     10 × 10     5.0          89.40±0.15      0.1539     0.0495
CNNNIN+SOMTI     20 × 20     10.0         89.69±0.10      0.8929     N/A
CNNNIN+SOMTI     30 × 30     15.0         89.47±0.14      0.2548     0.0911
CNNNIN+SOMTI     40 × 40     20.0         89.55±0.16      0.4847     0.2681
CNNNIN+SOMTI     50 × 50     25.0         89.61±0.15      0.7162     0.4850

Table 6.11: 2D CNNNIN+SOMTI using 3×3 filters accuracy on CIFAR-100 (p-values represent the comparison between each method with the method labelled not applicable (N/A) in each column).

Type             SOM size    Initial σ    Accuracy (%)    p-value    p-value
CNN              -           -            60.22±0.19      N/A        0.1444
CNNNIN+SOMTI     10 × 10     5.0          60.02±0.30      0.3845     0.0790
CNNNIN+SOMTI     20 × 20     10.0         60.54±0.24      0.1444     N/A
CNNNIN+SOMTI     30 × 30     15.0         60.45±0.31      0.3348     0.7112
CNNNIN+SOMTI     40 × 40     20.0         60.17±0.22      0.7806     0.1204
CNNNIN+SOMTI     50 × 50     25.0         59.88±0.18      0.0876     0.0189

SOM sizes for s = 3, 5 and 7 are displayed in Figures 6.1–6.3. Overall, as can be seen in the above tables, there was a suggested trend between the filter size and the SOM size at which the peak accuracy occurred. Indeed, as the filter size increases the optimum SOM size gets larger (i.e. the peak accuracy using s = 3 was found using a SOM size of M = 20 × 20, and the peak accuracy using s = 5 was found using a SOM size of M = 30 × 30). Although the peak accuracy using s = 7 was also found using a SOM size of M = 30 × 30, the trend in the data suggests that perhaps the optimum SOM size has not been identified. This trend is further illustrated in Fig. 6.4, which shows a comparison of the results obtained in the above experiments.

Table 6.12: 2D CNNNIN+SOMTI using 5×5 filters accuracy on CIFAR-100 (p-values represent the comparison between each method with the method labelled not applicable (N/A) in each column).

Type             SOM size    Initial σ    Accuracy (%)    p-value    p-value
CNN              -           -            60.08±0.17      N/A        0.0133
CNNNIN+SOMTI     10 × 10     5.0          59.94±0.46      0.6469     0.0495
CNNNIN+SOMTI     20 × 20     10.0         60.58±0.19      0.0274     0.3439
CNNNIN+SOMTI     30 × 30     15.0         60.76±0.22      0.0133     N/A
CNNNIN+SOMTI     40 × 40     20.0         60.73±0.35      0.0444     0.9060
CNNNIN+SOMTI     50 × 50     25.0         59.92±0.19      0.3382     0.0075

Table 6.13: 2D CNNNIN+SOMTI using 7×7 filters accuracy on CIFAR-100 (p-values represent the comparison between each method with the method labelled not applicable (N/A) in each column).

Type             SOM size    Initial σ    Accuracy (%)    p-value    p-value
CNN              -           -            60.23±0.31      N/A        0.5389
CNNNIN+SOMTI     10 × 10     5.0          57.57±0.13      0.0002     0.0004
CNNNIN+SOMTI     20 × 20     10.0         59.79±0.12      0.0836     0.3383
CNNNIN+SOMTI     30 × 30     15.0         60.04±0.38      0.5389     N/A
CNNNIN+SOMTI     40 × 40     20.0         59.77±0.25      0.1160     0.3620
CNNNIN+SOMTI     50 × 50     25.0         59.92±0.03      0.1598     0.6145

Figure 6.1: Learned SOMTI filters on CIFAR-10 using M = 20 × 20 and s = 3.

Figure 6.2: Learned SOMTI filters on CIFAR-10 using M = 30 × 30 and s = 5.

Figure 6.3: Learned SOMTI filters on CIFAR-10 using M = 30 × 30 and s = 7.

Figure 6.4: Accuracy of different SOM sizes (M) for CNNNIN+SOMTI on CIFAR-100.

6.4.2.5 Discussion and Comparison to State-of-the-Art

Table 6.14 compares the best results from the experiments described in this chapter with the best baseline, state-of-the-art results, and the results published by Peng, Hankins and Yin [144]. Also included in this table are the parameter counts for each proposed methodology. For the results produced in this chapter, as well as the paper by Peng, Hankins and Yin [144], only the supervised parameters are counted. Therefore, where SOM-derived filters are used, these filter banks are not included in the total number of parameters, although, in most cases, since only the first layer is replaced, these exclusions are minimal.

The results shown here for CIFAR-10 indicate that the proposed method achieves similar results to the baseline. When compared with results from Peng, Hankins and Yin [144], the proposed method shows comparable accuracy to one of the methods tested previously, but is statistically inferior to the other result. However, both methods from Peng, Hankins and Yin [144] use twice as many parameters as the method proposed in this study. In terms of other state-of-the-art results, the proposed approach offers comparable accuracy, even against methods that utilise more parameters, such as Maxout [61]. However, the performance is not as good as RCNN [115] and DenseNet [82], which use far greater numbers of layers. In addition, all state-of-the-art methods mentioned use supervised learning, compared with the proposed method and the methods used in Peng, Hankins and Yin [144], which use a combination of supervised, unsupervised and/or generative approaches.

The results shown here for CIFAR-100 suggest that the proposed method improves upon the best baseline, and this is statistically significant at the 10% level. However, in comparison with the results from Peng, Hankins and Yin [144], accuracy is significantly decreased. This is also true when compared with the state-of-the-art methods, apart from Stochastic Pooling, which used six times fewer parameters [195].

However, it is worth noting that the CIFAR-100 SOM filters were not trained on the target dataset; instead they were trained on CIFAR-10, which may explain the lower accuracy of the current approach. The experiments in this section have demonstrated that the first layer filters of a 2D CNN can be successfully replaced with unsupervisedly trained filters. Simply replacing the filters like-for-like shows a small decrease in performance; however, experiments performed on different ratios of supervised and unsupervised data show the advantages of using unsupervised approaches when labelled training data is scarce. These results are as expected, since the SOM-trained filters provide the CNN with a better initialisation point than training from scratch. The generalisation of the unsupervisedly trained filters is further explored through the experiments on CIFAR-100. Remarkably, the proposed approach improves on the baseline even when the first layer filters were trained on a separate dataset, perhaps indicating that the baseline is prone to overfitting. CIFAR-100 is more complex than CIFAR-10, since it has 10 times the number of classes whilst featuring the same number of training examples and therefore fewer examples per class, which suggests that more generalised filters could provide improved performance. When experiments are performed using larger SOM sizes the results are similar, if not better, than the baselines. This suggests that there are advantages to approaches that combine different learning paradigms over standard supervised-only methodologies. In terms of the SOM size, evidence suggests that increasing the filter size leads to larger optimum SOM sizes. This is understandable, given that smaller filter sizes have fewer degrees of freedom and therefore fewer potential states to occupy.

Table 6.14: Accuracy on CIFAR-10 and CIFAR-100 (p-values represent the comparison between each method with the method labelled not applicable (N/A) for each section). The table lists, for each method, the number of parameters and the accuracy (%) and p-value on CIFAR-10 and on CIFAR-100; the methods compared include Stochastic Pooling [195], Maxout [61], DropConnect [181], NIN + Dropout [116], ALL-CNN [166], RCNN-96 and RCNN-128 [115], DenseNet-BC-100 [82], the SOM-based CNNs of Peng, Hankins and Yin [144], the baseline CNNs and the proposed CNNNIN+SOMTI.

6.5 Action Classification Experiments and Discussion on the UCF-50 Dataset

6.5.1 Convolutional Neural Networks

The baseline CNN was trained with stochastic gradient descent with Nesterov momentum, using a small mini-batch of 32. The learning rate was set to 0.003 and was annealed at a rate of 0.989 every epoch, with momentum set to 0.9. To further combat over-fitting, L2 weight decay was applied with a λ of 0.0005. During training, all video frames were first resized to 73 × 98 from the original 240 × 320. Spatial and temporal augmentation was performed by cropping 3 × 16 × 64 × 64 video clips from the videos to form a training batch. Horizontal flipping with 50% probability was also optionally applied as additional augmentation. Due to the sampling process, an epoch was defined as selecting a clip from each video roughly once (the number of videos does not divide evenly by the mini-batch size). Training from scratch was performed until convergence for 300 epochs. Weights for all layers were initialised as W ∼ U[−a, a], where a = √(12/(fan_in + fan_out)) [57]. In terms of pre-processing, each batch was normalised by standardising the inputs by the batch mean and standard deviation. Lastly, in keeping with the 2D experiments, batch normalisation was applied to every layer of the network apart from the first one. At test time, for each video in the test set, ten clips were extracted at random temporal locations. For each of these clips, only the central 64 × 64 crop was taken. The overall prediction rate reported was the average over all clips.

The results in Table 6.15 show the accuracy on UCF-50, with and without horizontal flipping, using both 3 × 3 and 5 × 5 filters in the first layer, across 5 folds. The authors of UCF-50 recommend using 25-fold cross validation; however, due to time constraints, 5-fold cross validation is used instead. The results indicated that the use of horizontal flipping improves performance by 2% for 3 × 3 and 1.5% for 5 × 5 (significant at the 10% and 5% level, respectively). In addition, it appears that using 3 × 3 filters in the first layer is optimal, which concurs with previous work [173].

Table 6.15: Baseline 3D CNN accuracy on UCF-50 (p-values represent the comparison between each method with the method labelled not applicable (N/A) for each section in each column). The table reports, for first layer filter sizes of 3 × 3 and 5 × 5 with and without horizontal flipping, the accuracy (%) on each of the five folds, the overall accuracy and the associated p-values.
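A minimal NumPy sketch of the clip sampling protocol described above is given below; the function names are illustrative, and videos are assumed to have already been resized to 73 × 98 and stored as arrays of shape (3, T, 73, 98).

import numpy as np

def sample_training_clip(video, n_frames=16, crop=64, flip_prob=0.5):
    # Random 16-frame window, random 64 x 64 spatial crop and optional horizontal flip.
    d, T, H, W = video.shape
    t = np.random.randint(T - n_frames + 1)
    y = np.random.randint(H - crop + 1)
    x = np.random.randint(W - crop + 1)
    clip = video[:, t:t + n_frames, y:y + crop, x:x + crop]
    if np.random.rand() < flip_prob:
        clip = clip[:, :, :, ::-1]
    return clip

def sample_test_clips(video, n_clips=10, n_frames=16, crop=64):
    # Ten clips at random temporal locations, central crop only; predictions are averaged.
    d, T, H, W = video.shape
    y, x = (H - crop) // 2, (W - crop) // 2
    starts = np.random.randint(T - n_frames + 1, size=n_clips)
    return np.stack([video[:, t:t + n_frames, y:y + crop, x:x + crop] for t in starts])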

6.5.2 Filter Replacement with Self-Organising Maps

6.5.2.1 Optimising SOM Parameters

As with the 2D work described previously, experiments were performed here to ascertain the best parameters for the SOM. The CNN for these experiments was trained in exactly the same way as for the 3D baseline experiments, except for the SOM filter banks, which were scaled as in Section 6.4.2.1. All experiments were performed using the first fold of the dataset with no augmentation in the form of horizontal flipping.

Based on the results from the 2D experiments, the SOMTI was first trained using 900000 patches for 1200000 iterations, with initial values of 0.1 and 1.0 for the learning rate and σ, respectively. The learning rate and neighbourhood were annealed using LRb = 100000 and σb = 22222. However, given that the data space is considerably larger, the SOMTI had difficulties converging. To aid convergence, the number of patches was reduced and further experiments were performed to fine-tune the parameters of the SOM. Table 6.16 shows the results for these experiments. Firstly, the number of patches was reduced by half, from 900000 to 450000, and the iterations, LRb and σb were kept constant at 1200000, 100000 and 22222, respectively. Secondly, the number of patches was halved again to 225000. Thirdly, to compensate for the reduced number of patches, the SOM parameters were also halved. Lastly, 112500 patches were tested using the same parameters as the third set of experiments. In addition to the settings listed in the table, the initial learning rate and σ were 0.1 and 1.0, respectively. All experiments were performed on both whitened and unwhitened patches.

The results in Table 6.16 show that using 225000 patches and training for 600000 iterations using LRb = 50000 and σb = 11111 achieved the best results, regardless of whether the patches were whitened, although overall the whitened patches produced the superior performance. It is interesting to note that the SOMs trained on whitened

Table 6.16: 3D CNN+SOMTI accuracy on UCF-50

Patch number    Patch whitening    Iterations    LRb       σb       Accuracy (%)
450000          No                 1200000       100000    22222    48.79±1.25
450000          Yes                1200000       100000    22222    50.91±1.35
225000          No                 1200000       100000    22222    48.83±0.27
225000          Yes                1200000       100000    22222    48.24±1.32
225000          No                 600000        50000     11111    50.57±1.06
225000          Yes                600000        50000     11111    51.46±0.77
112500          No                 600000        50000     11111    49.08±1.39
112500          Yes                600000        50000     11111    50.51±1.11

patches generally achieve better accuracy, even though no whitening is applied to the input to the CNN.

6.5.2.2 Evaluating Filter Size

In this section, different size filters in the first layer were experimented with. Experiments were first carried out on the first split to ascertain the best filter size. The best performing size was then used for the remaining four splits. 8 × 8 SOMTIs were trained using 225000 patches for 600000 iterations. The initial learning rate and neighbourhood size were set at 0.1 and 1, respectively, and were annealed using LRb = 50000 and

σb = 11111. Again the CNN was trained in exactly the same way as for the baseline experiments and the SOM filters were scaled as in Section 6.4.2.1. In order to directly compare against the baseline, augmentation, in the form of horizontal flipping, was also evaluated.

The results in Table 6.17 show that, using the proposed method and testing on the first fold, s = 3 gives comparable accuracy to the baseline. Interestingly, for this case, augmentation for the first fold showed a small, non-significant decrease in performance (p = 0.3409); however, since the filters are trained unsupervisedly, perhaps augmentation has less of an effect on accuracy. Furthermore, when using the proposed method, the 5 × 5 results are substantially worse than the 3 × 3 results, with or without augmentation (p < 0.001 for both). The 2D experiments demonstrated that larger filter sizes required larger SOMs in order to cover a wider variety of features. The results here appear to follow a similar trend, given that the results for 5 × 5 filters were 5–8% lower than the 3 × 3 results. In fact, it appears that the effect is amplified given the added temporal dimension. Additionally, with regard to augmentation for the 5 × 5 case, this appears to have a greater influence than for 3 × 3; however, this could be due to the filters not providing a thorough representation of the input space given the small SOM size. For these reasons, the 5 × 5 experiments were not extended to all five splits.

Overall, when the 3 × 3 experiments using the proposed approach were carried out over the five folds, the mean accuracy was found not to improve versus the baseline results. Augmentation of the 3 × 3 experiments using the proposed method resulted in a small increase in accuracy; however, this was not statistically significant (p = 0.1701). Additionally, in terms of augmentation for the proposed method, while a small increase is observed, it is less significant than the difference observed with augmentation of the baseline results, for the same reasons stated above. The SOMTI filters for td = s = 3 are displayed in Figure 6.5.

Table 6.17: 3D CNN+SOMTI accuracy on UCF-50 (p-values represent the comparison between each method with the method labelled not applicable (N/A) for each section). The table reports, for the baseline CNN and CNN+SOMTI with first layer filter sizes of 3 × 3 and 5 × 5, with and without horizontal flipping, the accuracy (%) on each of the five folds, the overall accuracy and the associated p-values.

6.5.2.3 Discussion and Comparison to State-of-the-Art

Table 6.18 shows the best results from this section, along with the best baseline and other state-of-the-art approaches. For clarity, the results for Gist [139] and Laptev et al. [105] are those reported by Kuehne et al. [102]. In this chapter, five-fold cross validation is used to evaluate performance, as also used by [102, 154]. Most of the studies on UCF-50 follow the authors' [149] suggested experimental setup and use 25-fold cross validation. Therefore, whilst methods using this split are included in the table, it would be unfair to make direct comparisons, since the training and testing sets are scaled by ×1.2 and ×0.2, respectively. This explains the generally higher accuracies reported for these methods [149, 163, 182, 185]. Other methods [175] which choose arbitrary train and test splits risk the training set bleeding into the testing set, as video clips within each group come from the same long video. Therefore, the results of these methods are not reported here.

Among the methods that use the same split as the work in this chapter, Gist [139] (from [102]) extracted coarse low-level representations based on orientations computed over three frames, Laptev et al. [105] (from [102]) used a combination of HOG and HOF descriptors from spatio-temporal interest points, and Action Bank [154] proposed hand-crafted high-level semantic action representations. The results for the proposed method indicate that it does not improve on the baseline performance. However, it achieves superior performance to both Gist and Laptev et al. Yet, the performance is not as good as Action Bank, which achieves the best five-fold cross validation accuracy of 57.90%.

Comparisons to more similar approaches are not possible, since most convolutional neural network-based approaches tend to opt for the larger UCF-101 dataset. Since that dataset is larger, it is more suitable for training from scratch, although generally models are pre-trained on larger datasets, such as Sports-1M [173]. Although it was planned

(a) td = 0

(b) td = 1

(c) td = 2

Figure 6.5: Learned SOMTI filters on UCF-50 using M = 8 × 8 and td = s = 3 (therefore each filter is of size 3 × 3 × 3). (a)–(c) represent each slice of td, where td is the temporal depth or the number of frames.

Table 6.18: Accuracy on UCF-50

Method                              Folds    Accuracy (%)
Solmaz et al. [163]                 25       73.70
Reddy and Shah [149]                25       76.90
Wang et al. [185]                   25       85.7
Wang and Schmid [182]               25       91.2
Gist [139] (from [102])             5        38.8
Laptev et al. [105] (from [102])    5        47.9
Action Bank [154]                   5        57.90
CNN (s = 3)                         5        53.89±0.73
CNN+SOMTI (s = 3)                   5        51.91±0.41

to extend the experiments in this section to UCF-101, since the results on UCF-50 were inferior to the baseline, these proposed experiments were not carried out.

The experiments in this section have not been able to improve on the baseline results. However, experiments have only been conducted with CNN+SOMTI. As evidenced by the 2D results, only CNNNIN+SOMTI showed superior accuracy to the baseline. Therefore, it is reasonable to presume that similar results could be expected if the same were carried out on the 3D dataset. In terms of filter size, the results show similar trends to the 2D results, with smaller filters providing superior results compared to larger filters for small SOM sizes. Furthermore, the results suggest that augmentation has a smaller effect when the filters are trained unsupervisedly.

6.6 Conclusion and Future Work

The experiments on 2D data showed comparable, if not improved, results over the baselines and other state-of-the-art methods of similar scale. Whilst improvements were not demonstrated for all scenarios, the results suggest that, with additional fine-tuning of parameters, more improvement could be made. In addition, these results were achieved using unsupervisedly trained first layer filters, which require no costly annotations and, as shown, can be trained on limited data. Furthermore, the advantages of unsupervised filter training were further outlined by the results of the transfer learning experiments conducted on CIFAR-100. The same increase in performance was not observed on the 3D data; however, larger SOM maps were not explored in this regard. Nonetheless, the results are promising and demonstrate scope for future improvement. The fact that hand-crafted solutions are still producing superior results on these smaller-scale datasets shows that more work is needed on the application of 3D deep learning based methods. In general, the work in this chapter has demonstrated that a combination of unsupervised and supervised filter learning can be successful and can lead to significant improvements in terms of accuracy, showing greater improvements when data is scarce and weakly labelled.

However, so far filter replacement has been restricted to the first layer. Since the filters are trained independently, it could be problematic to extend this idea to subsequent layers. Thus, an alternative implementation should be sought. With this in mind, it could be beneficial to explore jointly training supervised and unsupervised approaches. One potential avenue of exploration could be combining a CNN with a convolutional stacked autoencoder. Since autoencoders are trained using backpropagation, this may not provide an efficient means of learning large numbers of features in a single layer, as done here. However, it would provide the means to efficiently extend this method to additional layers and therefore recover what has been lost in terms of width by extending the depth. Yet, other methods explored here, such as the use of a 1 × 1 convolutional layer, could still be investigated to allow the CNN to learn additional auxiliary features. In addition, experiments should be conducted with different CNN architectures, such as ResNet and DenseNet. For further details, and some initial findings, please see Section 7.2.3.

In terms of other potential possibilities of combining CNNs with SOM, there are a number of interesting lines of enquiry that could be explored. For example, SOM clustering could be used to provide weakly labelled data to train a standard CNN. Specifically, a SOM could be applied to the output features of a CNN and the resultant cluster assignments could provide a weakly supervised signal to update the weights of the CNN. Alternatively, SOM could be considered as a way of reducing redundancy in the feature space, by applying SOMs to existing supervisedly trained features, in a process known as pruning. This could assist in reducing the number of features; however, connections between layers would need further training, similar to Dundar et al. [45].

Chapter 7

Conclusions and Future Work

7.1 Conclusions

Overall, the results presented in this thesis suggest that it is worthwhile using unsupervised feature learning approaches in the task of image and video classification. Although unsupervised approaches have fallen out of favour in the deep learning era, which is dominated by supervised methodologies trained on large datasets, the work carried out in this thesis demonstrates that unsupervised approaches are still useful and worth exploring further. In addition, this work shows that, in particular, SOM-based unsupervised learning, which is a popular method in the unsupervised community, can still provide comparable results to other state-of-the-art approaches, when used independently or in conjunction with other techniques. Indeed, the variety of applications employing SOM-based feature learning shown here demonstrates their robustness and versatility.

In Chapter 4, a simple SOM-based multi-layer convolutional architecture was proposed for image classification. This novel method demonstrated comparable or improved performance against the PCANet and DCTNet baselines when tested on two datasets

or one dataset, respectively. Although based on PCANet and DCTNet, this new approach demonstrated a modified encoding which reduced the dimensionality of the feature vector, enabling greater numbers of features to be utilised. Furthermore, the use of SOM meant that the number of filters learned was not constrained by the filter size, as is the case with PCANet- and DCTNet-derived filters. Therefore, the new method proposed here, which used more filters, resulted in a potentially better representation of the input and thus increased performance in some cases. SOMNet's use of an alternative encoding scheme dramatically reduced the final output feature vector. For the proposed approach, this was taken advantage of by increasing the number of filters in the first and second layers, whilst keeping the output feature vector the same size as with PCANet and DCTNet. However, the encoding scheme could also be exploited by increasing architecture depth (i.e. number of layers) instead of width (i.e. number of filters). In general, more traditional convolutional deep learning approaches trained via backpropagation can be difficult to understand and are often used as a 'black box'. However, a benefit of the proposed SOMNet methodology is that it provides a simpler, more transparent approach, which is easier to understand and modify; attributes which could make it more suitable for industrial applications.

The importance of encoding was made clear in Chapter 4 (where compact encoding allowed the use of more filters), and the work of others has also highlighted the importance of encoding [28]. In Chapter 5, this was built upon by exploring the use of channel aggregation techniques between layers of a two-layer SOMNet architecture. A novel extension of SOMNet was proposed in an attempt to learn hierarchical features using simple pooling mechanisms. A variety of pooling mechanisms were proposed in order to learn complex representations based on combinations of features from previous layers, which required no additional parameter learning over the previously proposed SOMNet. Results on MNIST suggested that combining greater numbers of features (up to a limit) using FAC and SAC statistically improved performance, although, when experiments were replicated on a larger architecture with an increased number of features, only minor, statistically insignificant improvements against the baseline were demonstrated. This suggests that the choice of encoding between layers is less important than the total number of filters employed. Out of all of the pooling methods tested, GAC provided the worst results for MNIST, indicating that grouping filters by map position may not aid performance. In contrast, for the more complex CIFAR-10, GAC pooling with fewer filters provided the best results, although, given that the features were more independent, the neighbourhood of the map may have had less influence on the pooling groupings. However, none of the proposed methods showed improved performance against the baseline, apart from when experiments were conducted on a subset of the training data, indicating that the improvement could possibly be due to reduced redundancy and noise, leading to better generalisation of the filters. However, statistically, no strong conclusions could be made from the CIFAR-10 results.
Overall, results indicated that these pooling techniques may be beneficial for simple datasets when utilising limited numbers of filters, but did not show improvements when more complex datasets were used. Translating to real-world applications, this could mean that the technique may be beneficial for simple tasks with constraints on model size due to memory limitations, where architectures with limited numbers of filters could be useful. However, use of the proposed methodologies on more complex datasets requires further exploration.

In Chapter 6, SOM was combined with convolutional neural networks in order to examine whether low-level features could be replaced by unsupervisedly trained alternatives to improve performance. Specifically, the first-layer filters of deep convolutional neural networks were replaced by efficiently trained SOM-based filters, which can take advantage of unlabelled data. Extensive experiments were conducted on both 2D and 3D datasets. Results on 2D image classification indicated that comparable, if not improved, performance versus the baseline and other similarly scaled state-of-the-art approaches was achieved when using larger maps via CNNNIN+SOMTI. Furthermore, since the filters were trained using unsupervised learning, it could be suggested that the proposed approach could achieve superior performance with fewer supervised training examples than other purely supervised state-of-the-art approaches. This is evidenced by the improved performance realised against the baseline when the number of supervised training examples was systematically decreased. In addition, the ability of unsupervisedly trained filters to generalise to other datasets was further outlined by experiments conducted on CIFAR-100, for which the SOM-derived filters were trained on CIFAR-10, leading to significant increases over the baseline. Whilst these new methods did not provide comparable results to the state-of-the-art, they are promising given that the first-layer filters were trained on an unrelated dataset. However, results on 3D action recognition were not as successful, outlining the increased inherent complexity of 3D trained models. Yet, unlike in the 2D case, larger SOM maps were not explored, which could lead to improved performance. Nonetheless, the results are promising and demonstrate scope for future improvement. In general, results demonstrated that SOM-based filters could be trained on limited amounts of unlabelled data and be successfully inserted into supervised convolutional neural networks, showing improved or comparable accuracy in most cases. However, more work is required to explore whether this approach can be extended to further layers and whether additional improvements can be made whilst utilising fewer labelled training examples.

Across all of the work presented here, SOM was used as the main feature learning methodology. SOM [98] uses competitive learning in order to quantise an input space whilst preserving the topological structure of the data. Although the choice of unsupervised learning method was evaluated in Chapter 4 (SOM vs. PCA vs. k-means), and it was found that the choice of feature learning method did not majorly affect performance, SOM offers additional benefits over the other methods. For instance, SOM is not limited by the PCA covariance matrix constraints or constrained to find orthogonal principal directions.
In addition, SOM is more immune to initialisation and outliers than k-means. Whilst other, more complicated approaches such as restricted Boltzmann machines (RBMs) and autoencoders have been proposed, they can be inferior to simple vector-quantisation-based approaches when using shallow structures [27]. Yet, for satisfactory performance, the use of SOM requires the inputs to have adequate pre-processing, such as PCA whitening. This was evidenced by comparing the results of Chapter 5 with those of Chapter 4, where PCA whitening was used to improve performance. Whilst whitening involves the use of PCA, it can be performed effectively on small patches and is therefore not computationally expensive. However, there is no natural extension of SOM to multi-layer architectures. The results in Chapter 4 and Chapter 5 demonstrated that a multi-layer architecture could be trained greedily; however, the filters are of a single depth. When vector quantisation techniques have been applied to high-dimensional inputs, results have been poor [29]. Therefore, for a more elaborate architecture, such as those employed in convolutional neural networks, which feature full connections between layers, an alternative unsupervised approach may be beneficial to extend the work described in Chapter 6. Overall, the results presented here demonstrate that simple SOM-based features can provide good representations which are useful for data classification under a wide variety of scenarios. Although these simple approaches are generally ignored by the deep-learning community, this work shows that efficient approaches to unsupervised learning, such as SOM, can still be of use in this field.

The use of unsupervised learning has fallen out of favour in recent years and has been replaced by supervised approaches, such as the convolutional neural network [101]. Techniques such as transfer learning [172] enable the transfer of knowledge between applications, which has replaced the need for unsupervised pre-training. However, transfer learning still requires a large labelled dataset to train on, which requires significant human effort in terms of curation and annotation. At the same time, hand-crafted solutions have been penalised due to their reliance on human knowledge. Yet, supervised deep learning approaches merely shift the focus of human know-how from model design to data design. Whilst recently there has been more progress on architectural advances, these are only made on the back of efforts into dataset curation. When we consider other areas such as 3D data, advances have been limited because the available large-scale datasets are only weakly labelled. This reliance on strongly labelled large-scale datasets is problematic, and therefore methods that make use of alternative strategies, such as unsupervised learning or approaches that combine supervised and unsupervised learning, should not be dismissed.
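Relating to the patch whitening and SOM training discussed above, the following is a minimal illustrative sketch of the generic pipeline: PCA (ZCA-style) whitening fitted on small flattened patches, followed by standard Kohonen updates on the whitened patches. The map size, annealing schedule and regularisation constant are assumptions rather than the exact settings used in this thesis.

import numpy as np

def whiten_patches(patches, eps=1e-2):
    # patches: (N, D) flattened image patches, e.g. D = 5*5*3 = 75, so the covariance is tiny
    mean = patches.mean(axis=0)
    centred = patches - mean
    cov = centred.T @ centred / len(patches)
    eigval, eigvec = np.linalg.eigh(cov)
    W = eigvec @ np.diag(1.0 / np.sqrt(eigval + eps)) @ eigvec.T   # ZCA-style whitening matrix
    return centred @ W, mean, W

def train_som(data, rows=8, cols=8, iters=20000, lr0=0.5, sigma0=3.0, seed=0):
    # Standard Kohonen rule: competitive step plus a Gaussian neighbourhood on the 2D grid
    rng = np.random.default_rng(seed)
    weights = rng.standard_normal((rows * cols, data.shape[1])) * 0.01
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    for t in range(iters):
        x = data[rng.integers(len(data))]
        bmu = np.argmin(((weights - x) ** 2).sum(axis=1))          # best-matching unit
        frac = t / iters
        lr, sigma = lr0 * np.exp(-3 * frac), sigma0 * np.exp(-3 * frac)
        h = np.exp(-((grid - grid[bmu]) ** 2).sum(axis=1) / (2 * sigma ** 2))
        weights += lr * h[:, None] * (x - weights)                 # topology-preserving update
    return weights   # each row can be reshaped into a convolutional filter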

7.2 Future Work

Future work could be explored in a few different areas. Firstly, further extensions could be made to the SOMNet architecture introduced in Chapter 4 by employing deeper architectures and investigating different encoding strategies. Secondly, the pooling structures introduced in Chapter 5 could be investigated with a supervised CNN in order to improve the generalisation of features through reduced redundancy of the feature space. Thirdly, whilst CNN features were successfully replaced with SOM-based features in Chapter 6, other unsupervised algorithms could be investigated, as well as extending the approach to further layers. Lastly, in terms of action recognition specifically, temporal information was implicitly encoded in the form of 3D spatio-temporal filters and therefore explicit models were not explored. Thus, the application of explicit temporal models could be investigated further. The following sections will discuss these potential avenues of exploration in more detail.

7.2.1 SOMNet

In Chapter 4, a multi-layer, unsupervised architecture named SOMNet was proposed; however, for accurate comparison to other similar methods, the architecture was restricted to two layers. In the future, this architecture could be extended to additional layers in order to investigate how network depth affects the classification accuracy. The compact encoding introduced by SOMNet allows for additional filters to be used here, yet this more compact representation could instead be used to add depth to the architecture. Currently, the encoding groups output activations into fours (a simple illustration is sketched at the end of this subsection); however, different group sizes could also be explored. Furthermore, alternative encodings could be investigated, such as orthogonal matching pursuit [117]. A truly hierarchical SOM could be a useful and worthwhile pursuit as a simple alternative to more complex hierarchical models, such as autoencoders.

Further fine-tuning of the SOMNet approach proposed in this thesis could be tackled in several ways. Firstly, although different unsupervised learning techniques were experimented with (k-means, PCA), other SOM-based techniques could be considered, such as Neural Gas or Growing Neural Gas. Secondly, in terms of generative filters, MRF and DCT filters could be more extensively investigated. The use of MRF and DCT filters in this work was restricted to an experiment which replicated filters over both layers; therefore, experiments could be conducted in which the second layer learns filters from MRF and DCT activations. In Chapter 4, learning filters in the second layer was demonstrated to show improved results versus learning in the first layer only and replicating them over both layers; therefore, this approach could also be applied to MRF and DCT filters. Thirdly, experiments with different sized filters could be conducted.
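As a simple illustration of the grouping-into-fours idea mentioned above, one plausible reading (not necessarily the exact SOMNet scheme) is to binarise the activation maps and pack each group of four into a single 4-bit code map, shrinking the number of output maps, and hence the final feature vector, by a factor of four.

import numpy as np

def group_of_four_encoding(maps):
    # maps: (F, H, W) real-valued activation maps, with F divisible by 4
    binary = (maps > 0).astype(np.int32)          # simple sign binarisation (an assumption)
    f, h, w = binary.shape
    grouped = binary.reshape(f // 4, 4, h, w)
    weights = np.array([1, 2, 4, 8]).reshape(1, 4, 1, 1)
    return (grouped * weights).sum(axis=1)        # (F/4, H, W), integer codes in 0..15

Changing the 4 to another group size directly realises the "different sized groups" suggested above, at the cost of a different code range.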

7.2.2 Supervised Channel Pooling

In Chapter 5, multiple channel aggregation layers were proposed and thoroughly investigated on two benchmark datasets. Whilst the results demonstrated improvements on MNIST, results were inferior when applied to the more complex dataset of CIFAR-10. Yet, some of the results on the subset of CIFAR-10 showed promise, namely with the SAC and GAC aggregation layers. Therefore, further refinements could be made to these layers, such as introducing linearly weighted combinations conditioned on the activations.

Since the improvement demonstrated on the subset was attributed to reduced redundancy in the feature space, thus improving the generalisation of the filters, it could be worthwhile pursuing these techniques in a supervised context. Specifically, experiments with these aggregation layers could be conducted with a supervised CNN. Much recent work on supervised CNNs is concerned with dimensionality reduction techniques in order to improve feature learning [116, 170]. Therefore, it is justifiable to consider the application of the pooling methods proposed in this thesis to supervised paradigms. Whilst similar work to that proposed here has already been carried out [61], there are notable differences which would still make this proposal novel. Specifically, the SAC methodology or similar should be considered, as it could provide input-based regularisation in a similar respect to self-supervised approaches.
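A generic sketch of what such an aggregation layer could look like inside a supervised CNN is given below: a fixed variant that simply sums each group of adjacent feature maps, and a learnable variant realised as a grouped 1 × 1 convolution, i.e. a linearly weighted combination per group. The FAC, SAC and GAC layers of Chapter 5 form their groups differently, so this is illustrative only.

import torch
import torch.nn as nn

class SumAggregation(nn.Module):
    def __init__(self, group_size):
        super().__init__()
        self.g = group_size
    def forward(self, x):                       # x: (N, C, H, W), C divisible by group_size
        n, c, h, w = x.shape
        return x.view(n, c // self.g, self.g, h, w).sum(dim=2)

class WeightedAggregation(nn.Module):
    def __init__(self, channels, group_size):
        super().__init__()
        # one 1x1 filter per group: a learnable weighted combination of its member maps
        self.mix = nn.Conv2d(channels, channels // group_size, kernel_size=1,
                             groups=channels // group_size, bias=False)
    def forward(self, x):
        return self.mix(x)

Making the group weights depend on the input activations (e.g. via a small gating branch) would be one way to realise the conditioned, linearly weighted combinations suggested above.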

7.2.3 Combining Supervised and Unsupervised Learning

In Chapter 6, unsupervised feature learning was successfully applied to the first layer of a CNN. A natural extension to this work would be to apply the proposed methodology to additional layers. Yet, given that the filters are trained independently in a greedy manner, this would be computationally expensive. With this in mind, it could be beneficial to explore jointly training supervised and unsupervised approaches.

In pursuit of this, a small experiment was conducted on CIFAR-100 to explore the joint learning of CNN and SOM. Specifically, during each forward pass of the CNN, a SOM was used to train the first-layer filters only. Backpropagation was performed as usual during the backward pass. Initial results indicated an average accuracy of

59.22%, which compares favourably to the result of 59.52% for CNN+SOMTI (Table 6.9), for which the SOM weights were trained separately prior to being employed as first-layer features of the CNN. The SOM used in the jointly trained approach was annealed so that the parameters achieved the same final values as the SOM used in the

CNN+SOMTI. The parameters for the CNN remained the same.

Alternatively, a more promising line of enquiry could involve using autoencoders instead of SOM; an encoder is used to map high-dimensional data to a low-dimensional code and a corresponding decoder is used to recover the original data from the code. The entire network is trained to minimise the difference between the original data and the reconstruction. Autoencoders can be considered a non-linear generalisation of PCA [108]. For the proposed methodology, for L unsupervised layers, a convolutional autoencoder with L convolutional encoders and decoders would be employed. The output from the last encoder convolutional layer would be fed to a traditional supervised CNN structure, and the objective function would be modified so that it minimises both the reconstruction loss of the autoencoder and the predictive loss of the CNN. Thus, the modified autoencoder should learn features that can both reconstruct the input and discriminate it. Since autoencoders can be trained via backpropagation, optimisation of multi-layer networks is straightforward. While the use of backpropagation will not provide an efficient means for learning larger numbers of features in a single layer, as SOM does, by extending to multiple layers, information lost in terms of width can be recovered in terms of depth. Recent studies have begun to explore this area [108, 197]; however, 3D applications, which arguably have greater potential due to difficulties with large-scale data collection, have not been explored.
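A minimal sketch of the combined objective described above is given below; the architecture (layer widths, kernel sizes, weighting factor) is an illustrative assumption rather than the configuration used in the preliminary experiments.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SupervisedConvAE(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.encoder = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
                                     nn.Conv2d(64, 64, 3, padding=1), nn.ReLU())
        self.decoder = nn.Sequential(nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
                                     nn.Conv2d(64, 3, 3, padding=1))
        self.classifier = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                        nn.Linear(64, num_classes))
    def forward(self, x):
        code = self.encoder(x)               # shared unsupervised/supervised features
        return self.decoder(code), self.classifier(code)

def joint_loss(model, images, labels, lam=0.1):
    recon, logits = model(images)
    # predictive loss of the CNN plus weighted reconstruction loss of the autoencoder
    return F.cross_entropy(logits, labels) + lam * F.mse_loss(recon, images)

Because both terms are differentiated through the shared encoder, the same filters are pushed to be both reconstructive and discriminative, which is the behaviour argued for above.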

Preliminary experiments in this area have been performed, and mean accuracies of 52.22% and 53.23% were achieved for one-layer and two-layer (first and second layer) replacement respectively, against a baseline of 51.12% on the first split of

UCF-50 (Table 6.17). For further reference, CNN+SOMTI achieved 51.46% for this split (Table 6.17). However, the approach uses slightly different parameters during training compared to the baseline and CNN+SOMTI. Namely, the batch size is decreased from 32 to 16, and no weight decay is applied. Nonetheless, the results are encouraging and therefore warrant further investigation.

An extension of this could include the use of 1 × 1 convolutions to create an auxiliary network for which the autoencoder-trained features function as anchors, which the CNN can use as building blocks for more discriminatively driven features. The use of hand-crafted features as anchors has been explored previously [91]; however, to my knowledge it has not yet been explored with autoencoders. Both 2D and 3D convolutional autoencoders could be explored, with 1 × 1 layers used to combine the resultant unsupervised features into discriminative features of the necessary width and depth for each layer. For example, multiple 3D filters could be combined into alternative 3D filters. Alternatively, 2D filters could also be combined to produce 3D filters, with additional 1 × 1 layers mapping to each temporal dimension.
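The anchor-plus-1 × 1 idea can be sketched as follows; this is only an illustration in the spirit of [91], with random placeholder weights standing in for unsupervisedly learned anchor filters, and all names (AnchoredConv, load_anchor_filters) hypothetical.

import torch
import torch.nn as nn

class AnchoredConv(nn.Module):
    def __init__(self, in_ch=3, n_anchors=64, out_ch=32, k=5):
        super().__init__()
        self.anchors = nn.Conv2d(in_ch, n_anchors, k, padding=k // 2, bias=False)
        self.anchors.weight.requires_grad_(False)   # frozen, e.g. SOM- or autoencoder-derived
        self.mix = nn.Conv2d(n_anchors, out_ch, kernel_size=1)  # only the 1x1 mixing is learned
    def forward(self, x):
        return self.mix(torch.relu(self.anchors(x)))

def load_anchor_filters(layer, filters):
    # filters: tensor of shape (n_anchors, in_ch, k, k) obtained from unsupervised training
    with torch.no_grad():
        layer.anchors.weight.copy_(filters)

The same pattern extends to 3D by swapping nn.Conv2d for nn.Conv3d, which is how 2D anchors could be recombined into 3D filters as suggested above.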

7.2.4 Temporal Models

For action recognition, temporal information is currently encoded implicitly by a 3D CNN; however, it is evident from the literature that the adoption of more explicit temporal methods can aid the classification of sequential information [22, 161]. RNNs have achieved state-of-the-art results in many challenging sequential learning problems such as speech [64] and handwriting [65] recognition. Traditional RNNs use gradient-based optimisation; however, training can prove problematic since backpropagation gradients can vanish or explode [110], which leads to difficulties learning long-term dependencies [194]. Recent advances such as the LSTM have mitigated these problems; however, its application to action recognition does not always demonstrate significant improvements when more traditional temporal representations such as optical flow are used [194]. However, more significant improvements have been demonstrated when combined with 3D CNNs [186], and therefore more work in this area could prove beneficial.

Another alternative to training RNNs is the Reservoir Computing (RC) paradigm, in which the recurrent connections (named the reservoir) are assigned randomly and only a linear output layer is learned. Since training is restricted to a linear readout, which can be performed using simple linear regression, training can be carried out with comparative ease. This allows for very large numbers of internal recurrent connections. One of the common types of RC is the architecture known as an Echo State Network (ESN) [87]. Echo State Networks have traditionally been applied to a wide variety of tasks such as non-linear control and chaotic attractors [87]; however, their application to action recognition [127], and their leveraging of deep features, is limited [90]. Hence this could be a worthwhile avenue for future work.
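A minimal sketch of an echo state network in this spirit is given below: the reservoir is fixed after random initialisation and only the linear readout is fitted, here by ridge regression. The reservoir size, spectral radius and leak rate are illustrative assumptions; the input sequence could, for example, be per-frame CNN features.

import numpy as np

class ESN:
    def __init__(self, n_in, n_res=500, spectral_radius=0.9, leak=0.3, seed=0):
        rng = np.random.default_rng(seed)
        self.w_in = rng.uniform(-0.5, 0.5, (n_res, n_in))
        w = rng.uniform(-0.5, 0.5, (n_res, n_res))
        # rescale so the largest eigenvalue magnitude equals the chosen spectral radius
        w *= spectral_radius / np.max(np.abs(np.linalg.eigvals(w)))
        self.w_res, self.leak = w, leak
    def states(self, sequence):                  # sequence: (T, n_in)
        x = np.zeros(self.w_res.shape[0])
        out = []
        for u in sequence:
            pre = np.tanh(self.w_in @ u + self.w_res @ x)
            x = (1 - self.leak) * x + self.leak * pre   # leaky-integrator state update
            out.append(x.copy())
        return np.array(out)

def fit_readout(states, targets, ridge=1e-2):
    # ridge-regression readout: states (N, n_res), targets (N, n_classes) one-hot
    A = states.T @ states + ridge * np.eye(states.shape[1])
    return np.linalg.solve(A, states.T @ targets)   # readout weights (n_res, n_classes)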

Bibliography

[1] Sami Abu-El-Haija, Nisarg Kothari, Joonseok Lee, Paul Natsev, George Toderici, Balakrishnan Varadarajan, and Sudheendra Vijayanarasimhan. Youtube-8m: A large-scale video classification benchmark. arXiv preprint arXiv:1609.08675, 2016.

[2] Nasir Ahmed, T Natarajan, and Kamisetty R Rao. Discrete cosine transform. Transactions on Computers, 100(1):90–93, 1974.

[3] Timo Ahonen, Abdenour Hadid, and Matti Pietikainen. Face description with local binary patterns: Application to face recognition. Transactions on Pattern Analysis & Machine Intelligence, 28(12):2037–2041, 2006.

[4] Saleh Aly. Learning invariant local image descriptor using convolutional maha- lanobis self-organising map. Neurocomputing, 142:239–247, 2014.

[5] Saleh Aly, Naoyuki Tsuruta, Rin-Ichiro Taniguchi, and Atsushi Shimada. Vi- sual feature extraction using variable map-dimension hypercolumn model. In International Joint Conference on Neural Networks (IJCNN), pages 845–851. IEEE, 2008.

[6] Moez Baccouche, Franck Mamalet, Christian Wolf, Christophe Garcia, and


Atilla Baskurt. Sequential deep learning for human action recognition. In Inter- national Workshop on Human Behavior Understanding, pages 29–39. Springer, 2011.

[7] Pierre Baldi, Peter Sadowski, and Daniel Whiteson. Searching for exotic parti- cles in high-energy physics with deep learning. Nature communications, 5:4308, 2014.

[8] Herbert Bay, Tinne Tuytelaars, and Luc Van Gool. Surf: Speeded up robust features. In European Conference on Computer Vision (ECCV), pages 404–417. Springer, 2006.

[9] Yoshua Bengio. Practical recommendations for gradient-based training of deep architectures. In Neural networks: Tricks of the trade, pages 437–478. Springer, 2012.

[10] Yoshua Bengio, Nicolas Boulanger-Lewandowski, and Razvan Pascanu. Ad- vances in optimizing recurrent networks. In International Conference on Acous- tics, Speech and Signal Processing (ICASSP), pages 8624–8628. IEEE, 2013.

[11] Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798–1828, 2013.

[12] Yoshua Bengio, Pascal Lamblin, Dan Popovici, and Hugo Larochelle. Greedy layer-wise training of deep networks. In Advances in Neural Information Pro- cessing Systems (NIPS), pages 153–160. MIT Press, 2007.

[13] Julian Besag. Spatial interaction and the statistical analysis of lattice systems. Journal of the Royal Statistical Society. Series B (Methodological), 36(2):192–236, 1974.

[14] Christopher Bishop. Pattern recognition and machine learning. Springer, 2006.

[15] Aaron F. Bobick and James W. Davis. The recognition of human movement using temporal templates. Transactions on Pattern Analysis and Machine Intel- ligence, 23(3):257–267, 2001.

[16] Bernhard E Boser, Isabelle M Guyon, and Vladimir N Vapnik. A training al- gorithm for optimal margin classifiers. In Annual Workshop on Computational Learning Theory (COLT), pages 144–152. ACM, 1992.

[17] Raphael C Brito and Hansenclever F Bassani. Self-organizing maps with vari- able input length for motif discovery and word segmentation. In International Joint Conference on Neural Networks (IJCNN), pages 1–8. IEEE, 2018.

[18] Joan Bruna and Stéphane Mallat. Invariant scattering convolution networks. Transactions on Pattern Analysis and Machine Intelligence, 35(8):1872–1886, 2013.

[19] Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 6299–6308. IEEE, 2017.

[20] Tsung-Han Chan, Kui Jia, Shenghua Gao, Jiwen Lu, Zinan Zeng, and Yi Ma. PCANet: A simple deep learning baseline for image classification? Transac- tions on Image Processing, 24(12):5017–5032, 2015.

[21] Hsuan-Sheng Chen, Hua-Tsung Chen, Yi-Wen Chen, and Suh-Yin Lee. Hu- man action recognition using star skeleton. In International Workshop on Video Surveillance and Sensor Networks (VSSN), pages 171–178. ACM, 2006.

[22] Guilhem Chéron, Ivan Laptev, and Cordelia Schmid. P-cnn: Pose-based cnn

features for action recognition. In International Conference on Computer Vision (ICCV), pages 3218–3226. IEEE, 2015.

[23] Anna Choromanska, Mikael Henaff, Michael Mathieu, Gérard Ben Arous, and Yann LeCun. The loss surfaces of multilayer networks. Journal of Machine Learning Research, 38:192–204, 2015.

[24] Dan Cireşan, Ueli Meier, Luca Maria Gambardella, and Jürgen Schmidhuber. Deep big simple neural nets excel on handwritten digit recognition. Neural Computation, 22(12):3207–3220, 2010.

[25] Dan Cireşan, Ueli Meier, Jonathan Masci, and Jürgen Schmidhuber. A committee of neural networks for traffic sign classification. In International Joint Conference on Neural Networks (IJCNN), pages 1918–1921. IEEE, 2011.

[26] Peter Clifford. Markov random fields in statistics. Disorder in physical systems: A volume in honour of John M. Hammersley, 19, 1990.

[27] Adam Coates, Andrew Y Ng, and Honglak Lee. An analysis of single-layer networks in unsupervised feature learning. In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 215–223. JMLR, 2011.

[28] Adam Coates and Andrew Y Ng. The importance of encoding versus training with sparse coding and vector quantization. In International Conference on Machine Learning (ICML), pages 921–928. ACM, 2011.

[29] Adam Coates and Andrew Y Ng. Selecting receptive fields in deep networks. In Advances in Neural Information Processing Systems (NIPS), pages 2528–2536. Curran Associates, Inc., 2011.

[30] Macario O Cordel, Arren Matthew C Antioquia, and Arnulfo P Azcarraga.

Self-organizing maps as feature detectors for supervised neural network pat- tern recognition. In International Conference on Neural Information Processing (ICONIP), pages 618–625. Springer, 2016.

[31] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine learn- ing, 20(3):273–297, 1995.

[32] Camille Couprie, Clément Farabet, Laurent Najman, and Yann Lecun. Indoor semantic segmentation using depth information. In International Conference on Learning Representations (ICLR), pages 1–8, 2013.

[33] George R Cross and Anil K Jain. Markov random field texture models. Trans- actions on Pattern Analysis and Machine Intelligence, (1):25–39, 1983.

[34] Gabriella Csurka, Christopher Dance, Lixin Fan, Jutta Willamowski, and Cédric Bray. Visual categorization with bags of keypoints. In Workshop on Statistical Learning in Computer Vision, European Conference on Computer Vision

(ECCV), pages 1–22. Springer, 2004.

[35] Eugenio Culurciello, Jonghoon Jin, Aysegul Dundar, and Jordan Bates. An anal- ysis of the connections between layers of deep neural networks. arXiv preprint arXiv:1306.0152, 2013.

[36] George Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4):303–314, 1989.

[37] George E Dahl, Navdeep Jaitly, and Ruslan Salakhutdinov. Multi-task neural networks for qsar predictions. arXiv preprint arXiv:1406.1231, 2014.

[38] Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human detection. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 886–893. IEEE, 2005.

[39] Óscar Déniz, Gloria Bueno, Jesús Salido, and Fernando De la Torre. Face recognition using histograms of oriented gradients. Pattern Recognition Letters, 32(12):1598–1603, 2011.

[40] Piotr Dollár, Vincent Rabaud, Garrison Cottrell, and Serge Belongie. Behavior recognition via sparse spatio-temporal features. In International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance

(PETS), pages 65–72. IEEE, 2005.

[41] Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, and Trevor Darrell. Decaf: A deep convolutional activation feature for generic visual recognition. In International Conference on Machine Learning (ICML), pages 647–655. ACM, 2014.

[42] Le Dong, Ling He, Gaipeng Kong, Qianni Zhang, Xiaochun Cao, and Ebroul Izquierdo. CUNet: A compact unsupervised network for image classification. arXiv preprint arXiv:1607.01577, 2016.

[43] Stephan Dreiseitl and Lucila Ohno-Machado. Logistic regression and artifi- cial neural network classification models: a methodology review. Journal of Biomedical Informatics, 35(5-6):352–359, 2002.

[44] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.

[45] Aysegul Dundar, Jonghoon Jin, and Eugenio Culurciello. Convolutional clus- tering for unsupervised learning. arXiv preprint arXiv:1511.06241, 2015.

[46] Alexei A Efros, Alexander C Berg, Greg Mori, and Jitendra Malik. Recognizing

action at a distance. In International Conference on Computer Vision (ICCV), pages 726–733. IEEE, 2003.

[47] Dumitru Erhan, Yoshua Bengio, Aaron Courville, Pierre-Antoine Manzagol, Pascal Vincent, and Samy Bengio. Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research, 11(Feb):625–660, 2010.

[48] Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. Liblinear: A library for large linear classification. Journal of Machine Learning Research, 9(Aug):1871–1874, 2008.

[49] Li Fei-Fei and Pietro Perona. A bayesian hierarchical model for learning natural scene categories. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 524–531. IEEE, 2005.

[50] Pedro F Felzenszwalb, Ross B Girshick, David McAllester, and Deva Ramanan. Object detection with discriminatively trained part-based models. Transactions on Pattern Analysis and Machine Intelligence, 32(9):1627–1645, 2010.

[51] Bernd Fritzke. A growing neural gas network learns topologies. In Advances in Neural Information Processing Systems (NIPS), pages 625–632, 1995.

[52] Kunihiko Fukushima. Cognitron: A self-organizing multilayered neural net- work. Biological Cybernetics, 20(3-4):121–136, 1975.

[53] Kunihiko Fukushima and Sei Miyake. Neocognitron: A self-organizing neural network model for a mechanism of visual pattern recognition. In Competition and Cooperation in Neural Nets, pages 267–285. Springer, 1982.

[54] Stuart Geman and Donald Geman. Stochastic relaxation, gibbs distributions, and the bayesian restoration of images. Transactions on Pattern Analysis and Machine Intelligence, (6):721–741, 1984.

[55] Martin A Giese and Tomaso Poggio. Cognitive neuroscience: Neural mech- anisms for the recognition of biological movements. Nature Reviews Neuro- science, 4(3):179–192, 2003.

[56] Georgia Gkioxari and Jitendra Malik. Finding action tubes. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 759–768. IEEE, 2015.

[57] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In International Conference on Artificial Intelli- gence and Statistics (AISTATS), pages 249–256. JMLR, 2010.

[58] Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep sparse rectifier neural networks. In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 315–323. JMLR, 2011.

[59] Melvyn A Goodale and A David Milner. Separate visual pathways for percep- tion and action. Trends in neurosciences, 15(1):20–25, 1992.

[60] Ian Goodfellow, Yoshua Bengio, Aaron Courville, and Yoshua Bengio. Deep learning. MIT press Cambridge, 2016.

[61] Ian J Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron Courville, and Yoshua Bengio. Maxout networks. arXiv preprint arXiv:1302.4389, 2013.

[62] Lena Gorelick, Moshe Blank, Eli Shechtman, Michal Irani, and Ronen Basri. Actions as space-time shapes. Transactions on Pattern Analysis and Machine Intelligence, 29(12):2247–2253, 2007.

[63] Kristen Grauman and Trevor Darrell. The pyramid match kernel: discriminative classification with sets of image features. In International Conference on Computer Vision (ICCV), pages 1458–1465. IEEE, 2005.

[64] Alex Graves and Navdeep Jaitly. Towards end-to-end speech recognition with recurrent neural networks. In International Conference on Machine Learning (ICML), pages 1764–1772. ACM, 2014.

[65] Alex Graves, Marcus Liwicki, Santiago Fernández, Roman Bertolami, Horst Bunke, and Jürgen Schmidhuber. A novel connectionist system for unconstrained handwriting recognition. Transactions on Pattern Analysis and Machine Intelligence, 31(5):855–868, 2009.

[66] Ziad M Hafed and Martin D Levine. Face recognition using the discrete cosine transform. International Journal of Computer Vision, 43(3):167–188, 2001.

[67] John M Hammersley and Peter Clifford. Markov fields on finite graphs and lattices. 1971.

[68] Richard Hankins, Yao Peng, and Hujun Yin. SOMNet: unsupervised feature learning networks for image classification. In International Joint Conference on Neural Networks (IJCNN), pages 1221–1228. IEEE, 2018.

[69] Richard Hankins, Yao Peng, and Hujun Yin. Towards complex features: com- petitive receptive fields in unsupervised deep networks. In International Confer- ence on Intelligent Data Engineering and Automated Learning (IDEAL), pages 838–848. Springer, 2018.

[70] Hedi Harzallah, Frédéric Jurie, and Cordelia Schmid. Combining efficient object localization and image classification. In International Conference on Computer Vision (ICCV), pages 237–244. IEEE, 2009.

[71] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In International Conference on Computer Vision (ICCV), pages 2961–2969. IEEE, 2017.

[72] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. Transactions on Pattern Analysis and Machine Intelligence, 37(9):1904–1916, 2015.

[73] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778. IEEE, 2016.

[74] G. E. Hinton, J. L. McClelland, and D. E. Rumelhart. Parallel distributed pro- cessing: Explorations in the microstructure of cognition, vol. 1. chapter Dis- tributed Representations, pages 77–109. MIT Press, 1986.

[75] Geoffrey E Hinton. To recognize shapes, first learn to generate images. Progress in Brain Research, 165:535–547, 2007.

[76] Geoffrey E Hinton, Simon Osindero, and Yee-Whye Teh. A fast learning algo- rithm for deep belief nets. Neural computation, 18(7):1527–1554, 2006.

[77] Geoffrey E Hinton and Ruslan R Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.

[78] Geoffrey E Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Rus- lan R Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.

[79] Sepp Hochreiter and Jurgen¨ Schmidhuber. LSTM can solve hard long time lag problems. In Advances in Neural Information Processing Systems (NIPS), pages 473–479, 1997.

[80] Harold Hotelling. Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology, 24(6):417, 1933.

[81] Chih-Wei Hsu, Chih-Chung Chang, Chih-Jen Lin, et al. A practical guide to support vector classification. 2003.

[82] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In Conference on Computer Vision and Pattern Recognition, pages 4700–4708. IEEE, 2017.

[83] David H Hubel. Eye, brain, and vision. Scientific American Library/Scientific American Books, 1995.

[84] David H Hubel and Torsten N Wiesel. Receptive fields, binocular interaction and functional architecture in the cat’s visual cortex. The Journal of Physiology, 160(1):106–154, 1962.

[85] David H Hubel and Torsten N Wiesel. Receptive fields and functional archi- tecture of monkey striate cortex. The Journal of Physiology, 195(1):215–243, 1968.

[86] Sergey Ioffe and Christian Szegedy. Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.

[87] Herbert Jaeger. The echo state approach to analysing and training recurrent neu- ral networks-with an erratum note. Bonn, Germany: German National Research Center for Information Technology GMD Technical Report, 148(34):13, 2001.

[88] Kevin Jarrett, Koray Kavukcuoglu, Yann LeCun, et al. What is the best multi- stage architecture for object recognition? In International Conference on Com- puter Vision (ICCV), pages 2146–2153. IEEE, 2009.

[89] Shuiwang Ji, Wei Xu, Ming Yang, and Kai Yu. 3d convolutional neural networks

for human action recognition. Transactions on Pattern Analysis and Machine Intelligence, 35(1):221–231, 2013.

[90] Doreen Jirak, Pablo Barros, and Stefan Wermter. Dynamic gesture recognition using echo state networks. In European Symposium on Artificial Neural Net- works (ESANN), pages 475–480, 2015.

[91] Felix Juefei-Xu, Vishnu Naresh Boddeti, and Marios Savvides. Local binary convolutional neural networks. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 19–28. IEEE, 2017.

[92] J-K Kamarainen, Ville Kyrki, and Heikki Kalviainen. Invariance properties of gabor filter-based features-overview and applications. Transactions on Image Processing, 15(5):1088–1099, 2006.

[93] Juho Kannala and Esa Rahtu. Bsif: Binarized statistical image features. In Inter- national Conference on Pattern Recognition (ICPR), pages 1363–1366. IEEE, 2012.

[94] Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Suk- thankar, and Li Fei-Fei. Large-scale video classification with convolutional neural networks. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 1725–1732. IEEE, 2014.

[95] Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheen- dra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017.

[96] Ho-Joon Kim, Joseph S Lee, and Hyun-Seung Yang. Human action recognition

using a modified convolutional neural network. In International Symposium on Neural Networks, pages 715–723. Springer, 2007.

[97] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimiza- tion. arXiv preprint arXiv:1412.6980, 2014.

[98] Teuvo Kohonen. Self-organized formation of topologically correct feature maps. Biological cybernetics, 43(1):59–69, 1982.

[99] Philipp Krähenbühl, Carl Doersch, Jeff Donahue, and Trevor Darrell. Data-dependent initializations of convolutional neural networks. arXiv preprint arXiv:1511.06856, 2015.

[100] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.

[101] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NIPS), pages 1097–1105. Curran Associates, Inc., 2012.

[102] Hildegard Kuehne, Hueihan Jhuang, Estíbaliz Garrote, Tomaso Poggio, and Thomas Serre. Hmdb: a large video database for human motion recognition. In International Conference on Computer Vision (ICCV), pages 2556–2563. IEEE, 2011.

[103] Tian Lan, Yang Wang, and Greg Mori. Discriminative figure-centric models for joint action localization and recognition. In International Conference on Computer Vision (ICCV), pages 2003–2010. IEEE, 2011.

[104] Ivan Laptev. On space-time interest points. International Journal of Computer Vision, 64(2-3):107–123, 2005.

[105] Ivan Laptev, Marcin Marszalek, Cordelia Schmid, and Benjamin Rozenfeld. Learning realistic human actions from movies. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–8. IEEE, 2008.

[106] Fabien Lauer, Ching Y Suen, and Gérard Bloch. A trainable feature extractor for handwritten digit recognition. Pattern Recognition, 40(6):1816–1824, 2007.

[107] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 2169–2178. IEEE, 2006.

[108] Lei Le, Andrew Patterson, and Martha White. Supervised autoencoders: Im- proving generalization performance with unsupervised regularizers. In Ad- vances in Neural Information Processing Systems (NIPS), pages 107–117, 2018.

[109] Quoc V Le, Will Y Zou, Serena Y Yeung, and Andrew Y Ng. Learning hierar- chical invariant spatio-temporal features for action recognition with independent subspace analysis. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 3361–3368. IEEE, 2011.

[110] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436, 2015.

[111] Yann LeCun, Bernhard E Boser, John S Denker, Donnie Henderson, Richard E Howard, Wayne E Hubbard, and Lawrence D Jackel. Handwritten digit recog- nition with a back-propagation network. In Advances in Neural Information Processing Systems (NIPS), pages 396–404. Morgan-Kaufmann, 1990.

[112] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

[113] Honglak Lee, Roger Grosse, Rajesh Ranganath, and Andrew Y Ng. Convo- lutional deep belief networks for scalable unsupervised learning of hierarchi- cal representations. In International Conference on Machine Learning (ICML), pages 609–616. ACM, 2009.

[114] Breiman Leo, Jerome H Friedman, Richard A Olshen, and Charles J Stone. Classification and regression trees. Chapman and Hall/CRC, 1984.

[115] Ming Liang and Xiaolin Hu. Recurrent convolutional neural network for ob- ject recognition. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 3367–3375. IEEE, 2015.

[116] Min Lin, Qiang Chen, and Shuicheng Yan. Network in network. arXiv preprint arXiv:1312.4400, 2013.

[117] Tsung-Han Lin and HT Kung. Stable and efficient representation learning with nonnegativity constraints. In International Conference on Machine Learning (ICML), pages 1323–1331. JMLR, 2014.

[118] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: common objects in context. In European Conference on Computer Vision (ECCV), pages 740–755. Springer, 2014.

[119] Dimitri A Lisin, Marwan A Mattar, Matthew B Blaschko, Erik G Learned-Miller, and Mark C Benfield. Combining local and global image features for object class recognition. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 47–47. IEEE, 2005.

[120] Jun Liu, Gang Wang, Ping Hu, Ling-Yu Duan, and Alex C Kot. Global context- aware attention lstm networks for 3d action recognition. In Conference on Com- puter Vision and Pattern Recognition (CVPR), pages 1647–1656. IEEE, 2017.

[121] David G Lowe et al. Object recognition from local scale-invariant features. In International Conference on Computer Vision (ICCV), pages 1150–1157. IEEE, 1999.

[122] Shangzhen Luan, Chen Chen, Baochang Zhang, Jungong Han, and Jianzhuang Liu. Gabor convolutional networks. Transactions on Image Processing, 27(9):4357–4366, 2018.

[123] S. Marčelja. Mathematical description of the responses of simple cortical cells. JOSA, 70(11):1297–1300, 1980.

[124] Stephen Marsland, Jonathan Shapiro, and Ulrich Nehmzow. A self-organising network that grows when required. Neural networks, 15(8-9):1041–1058, 2002.

[125] Thomas Martinetz and Klaus Schulten. A “neural-gas” network learns topolo- gies. In International Conference on Artificial Neural Networks (ICANN), pages 397–402, 1991.

[126] Warren S McCulloch and Walter Pitts. A logical calculus of the ideas imma- nent in nervous activity. The bulletin of mathematical biophysics, 5(4):115–133, 1943.

[127] Luiza Mici, Xavier Hinaut, and Stefan Wermter. Activity recognition with echo state networks using 3d body joints and objects category. In European Sympo- sium on Artificial Neural Networks, Computational Intelligence and Machine

Learning (ESANN), pages 465–470, 2016.

[128] Luiza Mici, German I Parisi, and Stefan Wermter. Recognition and prediction of human-object interactions with a self-organizing architecture. In International Joint Conference on Neural Networks (IJCNN), pages 1–8. IEEE, 2018.

[129] Marvin Minsky and Seymour Papert. Perceptrons: An introduction to computa- tional geometry. MIT Press, 1969.

[130] Ishan Misra, C Lawrence Zitnick, and Martial Hebert. Shuffle and learn: unsu- pervised learning using temporal order verification. In European Conference on Computer Vision (ECCV), pages 527–544. Springer, 2016.

[131] Ehsan Mohebi and Adil Bagirov. A convolutional recursive modified self orga- nizing map for handwritten digits recognition. Neural Networks, 60:104–118, 2014.

[132] Kevin P Murphy. Machine Learning: A Probabilistic Perspective. MIT Press, 2012.

[133] Natalia Neverova, Christian Wolf, Graham Taylor, and Florian Nebout. Mod- drop: adaptive multi-modal gesture recognition. Transactions on Pattern Anal- ysis and Machine Intelligence, 38(8):1692–1706, 2016.

[134] Cong Jie Ng and Andrew Beng Jin Teoh. Dctnet: a simple learning-free ap- proach for face recognition. In Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA), pages 761–768. IEEE, 2015.

[135] Jiquan Ngiam, Zhenghao Chen, Daniel Chia, Pang W Koh, Quoc V Le, and Andrew Y Ng. Tiled convolutional neural networks. In Advances in Neural Information Processing Systems (NIPS), pages 1279–1287. Curran Associates, Inc., 2010.

[136] Anh Nguyen, Jason Yosinski, and Jeff Clune. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 427–436. IEEE, 2015.

[137] Timo Ojala, Matti Pietikäinen, and David Harwood. A comparative study of texture measures with classification based on featured distributions. Pattern recognition, 29(1):51–59, 1996.

[138] Timo Ojala, Matti Pietikäinen, and Topi Mäenpää. Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. Transactions on Pattern Analysis and Machine Intelligence, (7):971–987, 2002.

[139] Aude Oliva and Antonio Torralba. Modeling the shape of the scene: A holis- tic representation of the spatial envelope. International Journal of Computer Vision, 42(3):145–175, 2001.

[140] Maxime Oquab, Leon Bottou, Ivan Laptev, and Josef Sivic. Learning and trans- ferring mid-level image representations using convolutional neural networks. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 1717– 1724. IEEE, 2014.

[141] Tom Le Paine, Pooya Khorrami, Wei Han, and Thomas S Huang. An anal- ysis of unsupervised pre-training in light of recent advances. arXiv preprint arXiv:1412.6597, 2014.

[142] Sinno Jialin Pan, Qiang Yang, et al. A survey on transfer learning. Transactions on Knowledge and Data Engineering, 22(10):1345–1359, 2010.

[143] German Ignacio Parisi and Stefan Wermter. Hierarchical som-based detection

of novel behavior for 3d human tracking. In International Joint Conference on Neural Networks (IJCNN), pages 1–8. IEEE, 2013.

[144] Yao Peng, Richard Hankins, and Hujun Yin. Data-independent feature learning with markov random fields in convolutional neural networks. Neurocomputing, In press 2019.

[145] Yao Peng and Hujun Yin. Markov random field based convolutional neural networks for image classification. In International Conference on Intelligent Data Engineering and Automated Learning (IDEAL), pages 387–396. Springer, 2017.

[146] Ramprasad Polana and Randal Nelson. Low level recognition of human motion (or how to get your man without finding his body parts). In Workshop on Motion of Non-Rigid and Articulated Objects, pages 77–82. IEEE, 1994.

[147] Ronald Poppe. A survey on vision-based human action recognition. Image and Vision Computing, 28(6):976–990, 2010.

[148] Marc’Aurelio Ranzato, Christopher Poultney, Sumit Chopra, and Yann LeCun. Efficient learning of sparse representations with an energy-based model. In Advances in Neural Information Processing Systems (NIPS), pages 1137–1144, 2007.

[149] Kishore K Reddy and Mubarak Shah. Recognizing 50 human action categories of web videos. Machine Vision and Applications, 24(5):971–981, 2013.

[150] Mikel D Rodriguez, Javed Ahmed, and Mubarak Shah. Action mach a spatio-temporal maximum average correlation height filter for action recognition. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–8. IEEE, 2008.

[151] Frank Rosenblatt. The perceptron: a probabilistic model for information storage and organization in the brain. Psychological review, 65(6):386, 1958.

[152] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning rep- resentations by back-propagating errors. Nature, 323(6088):533, 1986.

[153] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.

[154] Sreemanananth Sadanand and Jason J Corso. Action bank: A high-level repre- sentation of activity in video. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 1234–1241. IEEE, 2012.

[155] Wojciech Samek, Alexander Binder, Grégoire Montavon, Sebastian Lapuschkin, and Klaus-Robert Müller. Evaluating the visualization of what a deep neural network has learned. Transactions on Neural Networks and Learning Systems, 28(11):2660–2673, 2017.

[156] Syed Shakib Sarwar, Priyadarshini Panda, and Kaushik Roy. Gabor filter as- sisted energy efficient fast learning convolutional neural networks. In Interna- tional Symposium on Low Power Electronics and Design (ISLPED), pages 1–6. IEEE, 2017.

[157] Pierre Sermanet, David Eigen, Xiang Zhang, Michaël Mathieu, Rob Fergus, and Yann LeCun. Overfeat: Integrated recognition, localization and detection using convolutional networks. arXiv preprint arXiv:1312.6229, 2013.

[158] Ali Sharif Razavian, Hossein Azizpour, Josephine Sullivan, and Stefan Carls- son. Cnn features off-the-shelf: an astounding baseline for recognition. In Con- ference on Computer Vision and Pattern Recognition Workshops, pages 806– 813. IEEE, 2014.

[159] Laurent Sifre and Stéphane Mallat. Rotation, scaling and deformation invariant scattering for texture discrimination. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 1233–1240, 2013.

[160] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneer- shelvam, Marc Lanctot, et al. Mastering the game of go with deep neural net- works and tree search. Nature, 529(7587):484, 2016.

[161] Karen Simonyan and Andrew Zisserman. Two-stream convolutional networks for action recognition in videos. In Advances in Neural Information Processing Systems (NIPS), pages 568–576, 2014.

[162] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

[163] Berkan Solmaz, Shayan Modiri Assari, and Mubarak Shah. Classifying web videos using a global video descriptor. Machine Vision and Applications, 24(7):1473–1485, 2013.

[164] Khurram Soomro and Amir R Zamir. Action recognition in realistic sports videos. In Computer Vision in Sports, pages 181–208. Springer, 2014.

[165] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. UCF101: a dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.

[166] Jost Tobias Springenberg, Alexey Dosovitskiy, Thomas Brox, and Martin Ried- miller. Striving for simplicity: The all convolutional net. arXiv preprint arXiv:1412.6806, 2014.

[167] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Rus- lan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014.

[168] Kartick Subramanian and Sundaram Suresh. Human action recognition using meta-cognitive neuro-fuzzy inference system. International Journal of Neural Systems, 22(06):1250028, 2012.

[169] Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization and momentum in deep learning. In International Conference on Machine Learning (ICML), pages 1139–1147. ACM, 2013.

[170] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabi- novich. Going deeper with convolutions. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–9. IEEE, 2015.

[171] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Er- han, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.

[172] Lisa Torrey and Jude Shavlik. Transfer learning. In Handbook of Research on Machine Learning Applications and Trends: Algorithms, Methods, and Tech-

niques, pages 242–264. IGI Global, 2010.

[173] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar

Paluri. Learning spatiotemporal features with 3d convolutional networks. In International Conference on Computer Vision, pages 4489–4497. IEEE, 2015.

[174] Du Tran, Jamie Ray, Zheng Shou, Shih-Fu Chang, and Manohar Paluri. Con- vnet architecture search for spatiotemporal feature learning. arXiv preprint arXiv:1708.05038, 2017.

[175] Md Azher Uddin, Joolekha Bibi Joolee, Aftab Alam, and Young-Koo Lee. Hu- man action recognition using adaptive local motion descriptor in spark. IEEE Access, 5:21157–21167, 2017.

[176] Eladio Alvarez Valle and Oleg Starostenko. Recognition of human walking/run- ning actions based on neural network. In International Conference on Electri- cal Engineering, Computing Science and Automatic Control (CCE), pages 239– 244. IEEE, 2013.

[177] Koen EA Van de Sande, Jasper RR Uijlings, Theo Gevers, Arnold WM Smeul- ders, et al. Segmentation as selective search for object recognition. In Interna- tional Conference on Computer Vision (ICCV), pages 1879–1886, 2011.

[178] Joost Van Doorn. Analysis of deep convolutional neural network architectures. In Twente Student Conference on IT, Enschede, The Netherlands, pages 1–7, 2014.

[179] Paul Viola and Michael Jones. Rapid object detection using a boosted cascade of simple features. Conference on Computer Vision and Pattern Recognition (CVPR), pages 511–518, 2001.

[180] Raimar Wagner, Markus Thom, Roland Schweiger, Gunther Palm, and Albrecht

Rothermel. Learning convolutional neural networks from few samples. In In- ternational Joint Conference on Neural Networks (IJCNN), pages 1–7. IEEE, 2013.

[181] Li Wan, Matthew Zeiler, Sixin Zhang, Yann Le Cun, and Rob Fergus. Regular- ization of neural networks using dropconnect. In International Conference on Machine Learning (ICML), pages 1058–1066. ACM, 2013.

[182] Heng Wang and Cordelia Schmid. Action recognition with improved trajec- tories. In International Conference on Computer Vision (ICCV), pages 3551– 3558. IEEE, 2013.

[183] Heng Wang, Muhammad Muneeb Ullah, Alexander Klaser, Ivan Laptev, and Cordelia Schmid. Evaluation of local spatio-temporal features for action recog- nition. In British Machine Vision Conference (BMVC), pages 124–1. BMVA Press, 2009.

[184] Jian-Gang Wang, Jun Li, Chong Yee Lee, and Wei-Yun Yau. Dense sift and gabor descriptors-based face representation with applications to gender recog- nition. In International Conference on Control Automation Robotics and Vision, pages 1860–1864. IEEE, 2010.

[185] LiMin Wang, Yu Qiao, and Xiaoou Tang. Mining motion atoms and phrases for complex action recognition. In International Conference on Computer Vision (ICCV), pages 2680–2687. IEEE, 2013.

[186] Xuanhan Wang, Lianli Gao, Jingkuan Song, and Hengtao Shen. Beyond frame-level cnn: saliency-aware 3-d cnn with lstm for video action recognition. Signal Processing Letters, 24(4):510–514, 2017.

[187] Yunchao Wei, Wei Xia, Junshi Huang, Bingbing Ni, Jian Dong, Yao Zhao, and Shuicheng Yan. Cnn: single-label to multi-label. arXiv preprint arXiv:1406.5726, 2014.

[188] Daniel Weinland, Remi Ronfard, and Edmond Boyer. A survey of vision-based methods for action representation, segmentation and recognition. Computer Vision and Image Understanding, 115(2):224–241, 2011.

[189] Junji Yamato, Jun Ohya, and Kenichiro Ishii. Recognizing human action in time-sequential images using hidden markov model. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 379–385. IEEE, 1992.

[190] Jianchao Yang, Kai Yu, Yihong Gong, and Thomas Huang. Linear spatial pyra- mid matching using sparse coding for image classification. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 1794–1801. IEEE, 2009.

[191] Hu Yao, Li Chuyi, Hu Dan, and Yu Weiyu. Gabor feature based convolutional neural network for object recognition in natural scene. In International Con- ference on Information Science and Control Engineering (ICISCE), pages 386– 390. IEEE, 2016.

[192] Hujun Yin. The self-organizing maps: background, theories, extensions and applications. In Computational intelligence: A compendium, pages 715–762. Springer, 2008.

[193] Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. How transferable are features in deep neural networks? In Advances in Neural Information Processing Systems (NIPS), pages 3320–3328, 2014.

[194] Joe Yue-Hei Ng, Matthew Hausknecht, Sudheendra Vijayanarasimhan, Oriol Vinyals, Rajat Monga, and George Toderici. Beyond short snippets: Deep net- works for video classification. In Conference on Computer vision and Pattern Recognition (CVPR), pages 4694–4702. IEEE, 2015.

[195] Matthew D Zeiler and Rob Fergus. Stochastic pooling for regularization of deep convolutional neural networks. arXiv preprint arXiv:1301.3557, 2013.

[196] Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In European Conference on Computer Vision (ECCV), pages 818– 833. Springer, 2014.

[197] Yuting Zhang, Kibok Lee, and Honglak Lee. Augmenting supervised neural networks with unsupervised objectives for large-scale image classification. In International Conference on Machine Learning (ICML), pages 612–621. ACM, 2016.

[198] Bolei Zhou, Agata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Tor- ralba. Places: A 10 million image database for scene recognition. Transactions on Pattern Analysis and Machine Intelligence, 40(6):1452–1464, 2017.