Comparing Deep Neural Networks and Traditional Vision Algorithms in Mobile Robotics

Andy Lee
Swarthmore College
[email protected]

ABSTRACT
We consider the problem of object detection on a mobile robot by comparing and contrasting two types of algorithms for computer vision. The first approach is coined "traditional computer vision" and refers to using commonly known feature descriptors (SIFT, SURF, BRIEF, etc.) for object detection. The second approach uses deep neural networks for object detection. We show that deep neural networks perform better than traditional algorithms, but we discuss major trade-offs surrounding performance and training time.

Author Keywords
Computer Vision; Deep Neural Network; SIFT; SURF; Caffe; ROS; CS81; Adaptive Robotics

INTRODUCTION
Computer vision is an integral part of many robotic applications. In the 90s, we saw the rise of feature descriptors (SIFT, SURF) as the primary technique used to solve a host of computer vision problems (image classification, object detection, face recognition). Often these feature descriptors are combined with traditional machine learning classification algorithms such as Support Vector Machines and K-Nearest Neighbors to solve the aforementioned computer vision problems.

Recently there has been an explosion of interest in deep neural networks. Various deep neural network architectures have been widely adopted for computer vision because of the empirical success these architectures have seen on various image classification tasks. In fact, the seminal paper ImageNet Classification with Deep Convolutional Neural Networks has been cited over 3000 times [2]. This is also the model used as the standard out-of-the-box model for a popular deep neural network framework called Caffe.

In this paper, we consider the problem of object detection on a mobile robot. Specifically, we compare and contrast two types of algorithms for object detection. The first approach is coined "traditional computer vision" and refers to using commonly known feature descriptors (SIFT, SURF, BRIEF, etc.) for object detection alongside common machine learning algorithms (Support Vector Machine, K-Nearest Neighbor) for prediction. In contrast, the second approach uses deep neural network architectures.

Our emphasis will be on analyzing the performance of the two approaches in a mobile-robotic setting characterized by a real-time environment that constantly changes (lighting varies, nearby objects are constantly moving). Specifically, we wish to detect objects while the mobile agent is moving, which differs from still-image tasks, the usual focus of object detection papers.

Our paper is broken up as follows:

• Background knowledge on traditional vs. deep neural network approaches
• Survey of relevant literature in computer vision
• Discussion of experiments
• Appendix: Implementation details for mobile robotics (navigation, ROS, hardware)

COMPUTER VISION BACKGROUND
Computer vision can be succinctly described as finding telling features from images to help discriminate objects and/or classes of objects. While humans are innately endowed with incredible vision faculties, it is not clear which features humans use to get such strong performance on vision tasks. Computer vision algorithms in general work by extracting feature vectors from images and using these feature vectors to classify images.

Our experiment considers two different approaches to computer vision. On the one hand are traditional approaches to computer vision.
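The traditional pipeline just described (extract a feature vector from each image, then predict with a conventional classifier such as K-Nearest Neighbors) can be sketched in a few lines of Python. This is a minimal illustration: the intensity-histogram feature and the toy "dataset" below are hypothetical stand-ins chosen for brevity, not the SIFT, SURF, or BRIEF descriptors this paper discusses.

```python
from collections import Counter
import math

def histogram_feature(pixels, bins=4):
    """Toy feature extractor: a normalized intensity histogram.
    `pixels` is a flat list of grayscale values in [0, 255]."""
    hist = [0] * bins
    for p in pixels:
        hist[min(p * bins // 256, bins - 1)] += 1
    total = len(pixels)
    return [h / total for h in hist]

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among the k nearest
    training feature vectors (Euclidean distance)."""
    dists = sorted((math.dist(vec, query), label) for vec, label in train)
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Tiny illustrative training set: dark images vs. bright images.
train = [
    (histogram_feature([10, 20, 30, 40]), "dark"),
    (histogram_feature([15, 25, 35, 45]), "dark"),
    (histogram_feature([200, 210, 220, 230]), "bright"),
    (histogram_feature([190, 205, 215, 225]), "bright"),
]
print(knn_predict(train, histogram_feature([12, 22, 32, 42])))  # prints "dark"
```

A real system would swap `histogram_feature` for a descriptor such as SIFT and the voting loop for a tuned SVM or KNN from a library, but the two-stage structure (hand-engineered features, then a classical learner) is the same.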
These approaches date back 10-20 years and are characterized by extracting human-engineered features (edges, corners, color) deemed to be relevant in vision tasks. One can say these techniques lean towards a human-driven approach.

On the other end of the spectrum are techniques stemming from deep neural networks, which are quickly gaining popularity for their success in various computer vision tasks. Once again the goal is to extract features that help discriminate objects; however, this time these features are learned via an automated procedure using non-linear statistical models (deep nets). The general idea of deep neural networks is to learn a denser and denser (more abstract) representation of the training image as you proceed up the architecture. In this section, we provide a brief but more detailed introduction to these techniques.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, please contact the author.

Traditional Vision
We now describe some techniques used in traditional vision. The idea behind each of these techniques is to formulate some way of representing the image by encoding the presence of various features. These features can be corners, color schemes, image texture, etc.

SIFT (Scale-Invariant Feature Transform)
The SIFT algorithm deals with the problem that certain image features like edges and corners are not scale-invariant. In other words, there are times when a corner looks like a corner, but looks like a completely different item when the image is blown up by a few factors. The SIFT algorithm uses a series of mathematical approximations to learn a representation of the image that is scale-invariant. In effect, it tries to standardize all images (if the image is blown up, SIFT shrinks it; if the image is shrunk, SIFT enlarges it). This corresponds to the idea that if some feature (say a corner) can be detected in an image using a square window of dimension σ across the pixels, then if the image were scaled to be larger we would need a larger window of dimension kσ to capture the same corner (see Figure 1). The mathematical details of SIFT are skipped here, but the general idea is that SIFT standardizes the scale of the image and then detects important key features. The presence of these features is subsequently encoded into a vector used to represent the image. Exactly what constitutes an important feature is beyond the scope of this paper.

Figure 1. Example showing how scaling changes detection of a corner.

Figure 2. SIFT feature keypoints visualized with OpenCV.

SURF (Speeded-Up Robust Features)
The problem with SIFT is that the algorithm itself uses a series of approximations based on the difference of Gaussians to standardize the scale. Unfortunately, this approximation scheme is slow. SURF is a speeded-up version of SIFT: it finds a quick-and-dirty approximation to the difference of Gaussians using a technique called box blur. A box blur is the average of all the image values in a given rectangle, and it can be computed efficiently. [1, 5]

BRIEF (Binary Robust Independent Elementary Features)
In practice, SIFT uses a 128-dimensional vector for its feature descriptors and SURF uses a vector of at least 64 dimensions. Traditionally, these vectors are reduced using some dimensionality-reduction method such as PCA (Principal Components Analysis) or ICA (Independent Components Analysis). BRIEF takes a shortcut by avoiding the computation of the feature descriptors that both SIFT and SURF rely on. Instead, it uses a procedure to select n patches of pixels and computes a pixel-intensity metric for each. These n patches and their corresponding intensities are used to represent the image. In practice this is faster than computing a feature descriptor and trying to find features.

Deep Neural Networks
The deep networks we examine in this paper are convolutional neural networks. For those familiar with artificial neural networks, these are simply multi-level neural networks with a few special properties in place (pooling, convolution, etc.). The basic idea is that we take a raw RGB image and perform a series of transformations on it. With each transformation, we learn a denser representation of the image. We then take this denser representation, apply the next transformation, and learn an even denser representation. It turns out that by using this procedure we learn more and more abstract features with which to represent the original image. At the end, we can use these abstract features to predict with some traditional classification method (and it works surprisingly well).

We now discuss the architecture of the convolutional neural network. Notice that each of the layers we describe below transforms the input in some fashion and allows us to reach for more abstract features.

Figure 3. CNN training process.

The implementation of the CNN and some of its techniques are discussed in the section below. Because of space constraints, we skip a detailed discussion of the various layers (convolution, rectified linear units, max-pooling), as these were thoroughly explained in class already.

RELEVANT LITERATURE
In this section, we survey two related pieces of literature.

multiple representations of the same image, thereby encouraging generalization. While the exact details are given in the paper, we simply state that the CNN described in the paper achieves an error rate of 37.5%, as compared to a more traditional approach using SIFT + FV features, which achieves an error rate of 45.7%.

CNN Features off-the-shelf: an Astounding Baseline for Recognition
The authors take a different approach to deep learning [4]. Rather than use a deep neural network for prediction, the authors extract features from the first fully-connected layer of the OverFeat network, which was trained to perform object classification on ILSVRC13. In other words, the deep net step tells us which features are important to consider when doing learning on image data; think of this step as dimensionality reduction to learn condensed representations of the image. The next step is to determine how effective this con-