Gist: A Mobile Robotics Application of Context-Based Vision in Outdoor Environment

Christian Siagian and Laurent Itti
Department of Computer Science, University of Southern California, Los Angeles, CA 90089

Abstract

We present context-based scene recognition for mobile robotics applications. Our classifier differentiates outdoor scenes relatively well, without temporal filtering, from a variety of locations on a college campus, using a set of features that together capture the “gist” of the scene. We evaluate classification accuracy on 1551 frames filmed outdoors along a path, dividing them into four and twelve legs, and obtain classification rates of 67.96 percent and 48.61 percent, respectively. We also tested the scalability of the features by comparing the classification results of the four-leg path with those of a longer path with eleven legs, obtaining a classification rate of 55.08 percent. In the end, some ideas are put forth to improve the theoretical strength of the gist features.

1. Introduction

The field of mobile robotics, which hinges on solving tasks such as localization, mapping, and navigation, can be summarized into one central question: where are we? This fundamental problem can be approached from several angles. A significant number of mobile robotics implementations use sonar, laser, or other range sensors [3, 7, 23]. Outdoors, these sensors become less effective at solving the data-association problem because, unlike in indoor environments, spatial simplifications such as flat walls and narrow corridors cannot be assumed. In contrast to the regularity of indoor scenes, the surface shape of the outdoors varies tremendously, especially when considering environments with different terrains. It is very hard to predict the sensor input given all the protrusions and surface irregularities [8]. A slight pose change can result in a large jump in range readings because of tree trunks and moving branches and leaves. In addition, a flat surface for the robot to travel on is an absolute must; otherwise we introduce a third (and mostly unaccountable) dimension.

Another way to obtain navigational input is to use the primary sensory modality in humans: vision. Within the computer vision field, a large portion of the approaches to scene recognition are object-based [1, 24, 9]. That is, a physical location is recognized by identifying surrounding objects (and their configuration) as landmarks. This approach involves intermediate steps such as segmentation, feature grouping, and object detection. The last step can be complex because of the need to match objects across multiple views [11, 10]. It should be pointed out that this approach is mostly used indoors, where it is simple to select a small set of anchor objects that keep re-occurring in the environment. In the generally spacious outdoors, objects tend to be farther away from each other. Because of inherently noisy camera data - objects are more obscured and smaller relative to the image size - such a layered approach produces errors that are carried over and amplified along the processing stream.

A different set of approaches within the vision domain looks for regions and their relationships in an image as a signature of a location. Katsura [6] utilizes top-down knowledge to recognize regions, such as the sky being at the top of an image and buildings at the east and west of the frame. A scalability problem arises from the inability to characterize a region by more than its centroid. Matsumoto [10] uses template matching to recognize a set of regions and, to a degree, encapsulates a richer set of features. However, the need for pixel-wise comparison may be too specific to allow flexibility across different views of the same location. Murrieta-Cid [11] goes a step further in using regions, particularly isolated-blob regions, as an intermediate step to locate a predetermined set of anchor landmarks. Here there is a need for a robust segmentation phase as well as for identification of landmarks from multiple views.

The context-based approach, on the other hand, bypasses the traditional processing steps. Instead, it analyzes the scene as a whole and extracts a low-dimensional signature for it. This signature, in the form of a scene-level feature vector, embodies the cognitive psychologists’ idea of the “gist” of a scene [17]. The hypothesis at the foundation of this technique is that it should produce a more robust solution, because random noise that may be locally catastrophic tends to average out globally. By identifying scenes, and not objects, we do not have to deal with noise in isolated regions. Our approach, which is biologically inspired, mimics the ability of human vision to collect coarse yet concise contextual information about an image in a short amount of time [16, 27, 22]. This gist information includes a rough spatial layout [12], but often lacks fine-scale detail about specific objects [13]. Gist can be identified in as little as 45-135 ms [19], faster than a single saccadic eye movement. Such a quick turnaround time is remarkable considering that gist extracts quintessential characteristics of an image which can be useful for tasks such as semantic scene categorization (e.g., indoors vs. outdoors; beach vs. city), scale selection, region priming, and layout recognition [25].

The challenge of discovering a compact and holistic representation has been the subject of various works. Renniger and Malik [18] use a set of texture features and track them with a histogram to create an overall profile of an image. Ulrich and Nourbakhsh [28] build color histograms and perform matching using a voting procedure. Although an approach by Takeuchi [21] is meant to look for red buildings, the actual implementation uses a histogram of red texture pixels. In contrast to the previous approaches, which do not encode spatial information, Oliva and Torralba [15] perform Fourier transform analysis on individual sub-regions delineated by a regularly spaced grid, which are then correlated with several scene categories. Later, Torralba [26] uses a steerable wavelet pyramid in the same manner. The core of our present research focuses on a process of extracting the gist features of an image from several domains that does not focus on specific locations in the image but still takes coarse spatial information into account.

2. Design and Implementation

Part of our contribution is that we present a more biologically plausible model which utilizes the rich image features of the visual cortex. The gist model is built using the Vision toolkit developed by Itti et al. [4], which features the Saliency model and can be freely downloaded on the web. In the Saliency model, the image is processed through a number of low-level visual “channels” at multiple spatial scales. Within each channel, the model performs center-surround operations between different scales to detect conspicuous regions for that channel. These results are then linearly combined to yield a saliency map. One of our goals here is to re-use the same intermediate maps, so that gist is computed almost for free once we have all the maps for attention. Our gist model makes use of the already available orientation, color, and intensity opponency features. We incorporate information from the orientation channel, which uses Gabor filters at four different angles and four spatial scales, for a subtotal of sixteen sub-channels. The color channel (two color opponencies: red-green and blue-yellow) is composed of twelve different center-surround scale combinations, while the intensity channel (dark-bright opponency) is composed of six different center-surround scale combinations. That is a total of thirty-four sub-channels altogether. Figure 1 illustrates the features used by the gist model.

[Figure 1: Visual Feature Channels Used in the Gist Model. Pipeline: input image → linear filtering at 8 spatial scales (colors R, G, B, Y; intensity ON, OFF; orientations 0, 45, 90, 135 degrees) → center-surround differences and normalization (12 RG/BY, 6 on-off, and 16 Gabor feature vectors) → gist feature vectors → PCA/ICA dimension reduction → place classifier → most likely location.]

For each of the thirty-four sub-channels, we extract a corresponding gist vector of 21 features from its filter output/feature map. We hypothesize that, because of the speed with which gist is calculated, the features come from a series of simple feedforward operations. We thought that accumulation or averaging operations (as opposed to competition) are more plausible for extracting general information from a set of values. Our 21 gist features can be visualized as a pyramid of sub-section average outputs (figure 2). The first, the tip of the pyramid, is the mean of the entire feature map. The next four values are the means of the quarter sub-regions of the map, delineated by an evenly spaced two-by-two grid. The last sixteen values are the means of the sub-regions delineated by a four-by-four grid. This approach is similar to Torralba's [26] use of a wavelet pyramid. The number of features could be increased by adding other domains such as motion and stereo, or by using even smaller sub-regions (the next level would be an eight-by-eight grid: an additional 64 features) for more localized gist information.
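The pyramid of sub-section averages is a straightforward computation. The following is a minimal sketch in plain Python (the helper names `grid_means` and `gist_vector` are ours, not the toolkit's), assuming the feature map's dimensions are divisible by four:

```python
def grid_means(fmap, n):
    """Mean of each cell of an n-by-n grid laid over the 2-D map `fmap`."""
    rows, cols = len(fmap), len(fmap[0])
    means = []
    for gy in range(n):
        for gx in range(n):
            cells = [fmap[y][x]
                     for y in range(gy * rows // n, (gy + 1) * rows // n)
                     for x in range(gx * cols // n, (gx + 1) * cols // n)]
            means.append(sum(cells) / len(cells))
    return means

def gist_vector(fmap):
    """21 features: whole-map mean, 2x2 quadrant means, 4x4 cell means."""
    return grid_means(fmap, 1) + grid_means(fmap, 2) + grid_means(fmap, 4)

# Example: an 8x8 gradient map; the vector has 1 + 4 + 16 = 21 entries.
fmap = [[float(x + y) for x in range(8)] for y in range(8)]
features = gist_vector(fmap)
print(len(features))  # 21
```

The first entry is the global mean; coarser grid levels come first so that truncating the vector degrades spatial resolution gracefully.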
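The Saliency model's center-surround operation, described above as a comparison between maps at different scales, is commonly formulated as an absolute across-scale difference. This sketch is our simplification, not the toolkit's exact code: it upsamples the coarser "surround" map by pixel replication before subtracting, whereas the actual interpolation scheme may differ.

```python
def upsample(coarse, factor):
    """Nearest-neighbour upsampling by pixel replication."""
    return [[coarse[y // factor][x // factor]
             for x in range(len(coarse[0]) * factor)]
            for y in range(len(coarse) * factor)]

def center_surround(center, surround, factor):
    """Absolute across-scale difference: |center - upsampled surround|."""
    up = upsample(surround, factor)
    return [[abs(c - u) for c, u in zip(crow, urow)]
            for crow, urow in zip(center, up)]

center = [[5.0] * 4 for _ in range(4)]    # fine-scale map
surround = [[2.0] * 2 for _ in range(2)]  # coarser map at half resolution
cs = center_surround(center, surround, 2)
print(cs[0][0])  # 3.0
```

Each such map (one per scale pair, per sub-channel) then feeds the 21-feature averaging step rather than the winner-take-all competition used for saliency.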
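To keep the bookkeeping straight, the sub-channel and feature counts stated above can be tallied directly; the breakdown of the color channel's twelve sub-channels into six center-surround pairs per opponency is our inference from the stated totals, not spelled out in the text.

```python
orientations = 4                # Gabor angles: 0, 45, 90, 135 degrees
orientation_scales = 4          # spatial scales per angle
orientation_subchannels = orientations * orientation_scales       # 16

color_opponencies = 2           # red-green and blue-yellow
cs_pairs_per_opponency = 6      # assumed split of the stated 12
color_subchannels = color_opponencies * cs_pairs_per_opponency    # 12

intensity_subchannels = 6       # dark-bright center-surround combinations

total_subchannels = (orientation_subchannels + color_subchannels
                     + intensity_subchannels)
print(total_subchannels)        # 34

features_per_subchannel = 21    # 1 + 4 + 16 pyramid averages
raw_gist_dimension = total_subchannels * features_per_subchannel
print(raw_gist_dimension)       # 714, before PCA/ICA dimension reduction
```

The 714-dimensional raw vector is what the PCA/ICA stage in figure 1 reduces before classification.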
(SAL).” The path takes place in an outdoor environment that spans several city blocks.