An Examination of Feature Detection for Real-time Visual Odometry in Untextured Natural Terrain

Kyohei Otsu1, Masatsugu Otsuki2, Genya Ishigami2, and Takashi Kubota2

1 The University of Tokyo, 7-3-1 Hongo, Bunkyo, Tokyo, Japan, [email protected] 2 ISAS/JAXA, 3-1-1 Yoshinodai, Chuo, Sagamihara, Kanagawa, Japan

Abstract. Estimating its position is an essential requirement for an autonomous mobile robot. Visual Odometry is a promising localization method for slippery natural terrain, which drastically degrades the accuracy of Wheel Odometry, while relying neither on external infrastructure nor on any prior knowledge. Visual Odometry, however, suffers from unstable feature extraction on untextured natural terrain. To date, a number of feature detectors have been proposed for stable feature detection. This paper compares commonly used detectors in terms of robustness, localization accuracy, and computational efficiency, and points out the trade-offs among these criteria. To address the trade-offs, a hybrid algorithm is proposed which dynamically switches between multiple detectors according to the texture of the terrain. The validity of the algorithm is demonstrated in simulation using a dataset collected in volcanic areas in Japan.

Keywords: Visual odometry, Outdoor environment, Feature detection

1 Introduction

Exploring extreme environments such as planetary surfaces and the deep sea is a challenging but beneficial task for humankind. Several missions have tackled such environments, e.g., the Mars Science Laboratory (MSL) program by NASA3. Due to the severe conditions of these environments, autonomous mobile robots are regarded as an effective means for such missions. One of the essential capabilities of an autonomous mobile robot is self-localization. Especially in such environments, robots have to estimate their position without any external infrastructure (such as GPS satellites) or prior knowledge about the location. To date, numerous localization methods have been proposed and implemented for mobile ground vehicles. The most popular are Wheel Odometry (WO), Inertial Measurement Units (IMU), or a combination of the two. These methods offer high resolution with low-cost sensors. Even so, they have several weaknesses: WO is vulnerable to wheel slips, and inertial sensors are prone to drift. These shortcomings can be crucial when exploring environments containing loose terrain and steep slopes. Doppler sensors are used as velocity sensors insensitive to wheel slips, but they are applicable only to fast-moving robots. Active ranging sensors, such as ultrasonic sensors and Laser Range Finders (LRF), are also commonly used to localize robots. These sensors measure the distance from the robot to objects, and the robot estimates its current position from the physical relationship to those objects. However, such active sensors consume considerable electric power, so they are not feasible in energy-limited environments.

Recently, attention has turned to another powerful sensor, the vision sensor, which consumes little energy yet provides rich information about the environment. The technique of estimating motion from visual input is called Visual Odometry (VO). It is regarded as a promising localization method, helped by the rapid improvement of computational resources in recent years. The basic principle of VO is to iteratively estimate the relative camera pose by finding feature point correspondences between images, which is a key technique of the two-view Structure from Motion (SfM) problem. VO is immune to wheel slips and also more stable against drift error, since the drift can be canceled by vision-based approaches (e.g., Bundle Adjustment [1] or the loop-closing technique of the Simultaneous Localization and Mapping (SLAM) problem [2]). VO can also be easily installed, since vision sensors are nowadays mounted on most robots because of their many possible uses. SLAM, on the other hand, is another powerful localization method actively researched in the robotics field. Radio sensors and vision sensors tend to be used as input to SLAM algorithms.

3 http://mars.jpl.nasa.gov/msl/
The method is very useful for robot navigation since it builds a map while localizing the robot. However, the algorithm is complex and requires substantial computational resources, which may make it difficult to install on low-performance onboard computers. From this viewpoint, VO focuses on computing the robot trajectory alone, which requires far less computational power and allows easy installation on a system. VO is becoming more and more popular thanks to these advantages. Still, it faces several challenges arising from the properties of vision:

1. Stability: feature point tracking should be robust to the terrain appearance. VO becomes stable if every pair of images yields adequate feature correspondences.
2. Accuracy: the robot should be accurately localized even if the algorithm uses error-prone images. The accuracy can be improved with statistical methods.
3. Computational efficiency: most onboard computers on mobile robots are not computationally powerful. For real-time VO, algorithmic efficiency is a serious concern.

Generally speaking, these criteria depend on the appearance of the ground and cannot be fully estimated beforehand. In addition, they are often in trade-off. Several implementations of real-time VO have been presented (e.g., [3–5]). Despite these successful results, VO in outdoor environments suffers from a crucial problem: detecting feature points on untextured terrain. VO assumes that the terrain exhibits rich texture so that feature points are easily tracked. However, in contrast to indoor environments, certain outdoor scenery makes point tracking difficult. In fact, VO localization in the Mars Exploration Rover (MER) mission by NASA/JPL revealed that the rovers encountered many areas with few visual features on the ground surface in a real extreme environment [6]. To address the challenge of VO in untextured terrain, roughly two approaches have been proposed.
One simple but effective approach is to use a proper feature detection algorithm. A number of detectors have been proposed to detect points with intended properties; the common detectors are discussed in detail in Sect. 2. Since these detectors focus on different characteristics of the image, the proper detector for a given scene depends on the terrain appearance and the intended properties. The other approach is to divide images into several blocks and find the most characteristic point in each region [7, 8]. This method enables feature detection even on feature-less terrain. However, it has several weaknesses; e.g., forcing extraction from extremely low-textured regions lowers the matching rate, because the extracted points are only weakly characteristic. The proposed method adopts the former approach, i.e., selecting an effective feature detector. The rest of the paper is organized as follows: Sect. 2 discusses the commonly used detectors and evaluates them on a dataset from volcanic fields. Sect. 3 introduces a hybrid algorithm combining several detectors, designed to overcome the weaknesses of the individual detectors. Sect. 4 presents a comparative study of the conventional and proposed detectors. Finally, Sect. 5 concludes the paper.

2 Conventional Feature Detectors

Detector Description The main focus of this paper is detecting stable features on smooth terrain, where feature tracking is difficult. A large number of feature detectors have been proposed. Generally speaking, they can be divided into two groups:

Corner detectors These detectors find corners in given images, since corners tend to be invariant to changes of view. This group includes the Harris [9], Shi-Tomasi [10], and FAST [11, 12] detectors.

Scale-space feature detectors These detectors obtain scale-invariant features. This characteristic benefits VO, as it is robust to scale changes and enables longer tracking. However, the invariance may degrade computational efficiency to some extent. SIFT [13], SURF [14], and STAR, which is based on CenSurE [15], belong to this group.

Fig. 1. The average number of correspondences between features

Fig. 2. The percentage of correct matches among all extracted features

Table 1. Average runtime per frame (320x240 grayscale images on an Intel Core 2 Quad 2667 MHz CPU)

Detector          Harris  Shi-Tomasi  FAST  SIFT   SURF   STAR
Ave. runtime [ms] 11.87   14.36       1.32  54.98  27.90  9.86

The algorithms mentioned above are implemented in the OpenCV Library [16] and widely used in various applications including VO. Typically, corner detectors such as Harris and FAST are used in VO, since they are high-speed and accurately localized. Yet, such corners are sometimes difficult to find in untextured natural terrain. The number of features can be increased by changing parameters such as the threshold, but this tends to introduce noise and outliers. SIFT and SURF are also used when scale change is a major concern. These algorithms require considerable computational time and lose pixel-level accuracy in exchange for scale invariance. Agrawal et al. [15] proposed a novel detector called CenSurE, which is scale invariant but computationally cheaper, as the feature detector in their real-time VO implementation [4]. The STAR detector is an implementation based on CenSurE.
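The corner detectors in the first group all build on a corner response of this kind. Below is a minimal sketch of the Harris response R = det(M) - k·trace(M)², not the OpenCV implementation: it uses a uniform 3x3 window in place of the usual Gaussian weighting, and the image and function names are illustrative.

```python
import numpy as np

def harris_response(img, k=0.04):
    """Harris corner response R = det(M) - k * trace(M)**2 per pixel,
    where M is the structure tensor smoothed over a 3x3 window."""
    img = img.astype(float)
    Iy, Ix = np.gradient(img)  # row and column image gradients

    def box3(a):
        # uniform 3x3 smoothing (edge-padded), standing in for a Gaussian
        p = np.pad(a, 1, mode="edge")
        h, w = a.shape
        return sum(p[i:i + h, j:j + w] for i in range(3) for j in range(3)) / 9.0

    Ixx, Iyy, Ixy = box3(Ix * Ix), box3(Iy * Iy), box3(Ix * Iy)
    det = Ixx * Iyy - Ixy ** 2
    trace = Ixx + Iyy
    return det - k * trace ** 2

# A bright square on a dark background: R peaks near the square's corners,
# is negative along its edges, and zero in flat regions.
img = np.zeros((16, 16))
img[4:12, 4:12] = 1.0
R = harris_response(img)
corner = np.unravel_index(np.argmax(R), R.shape)
```

Thresholding R and keeping local maxima yields the corner list; Shi-Tomasi differs only in scoring each pixel by the smaller eigenvalue of M rather than by det(M) - k·trace(M)².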

Performance test A performance test of these detectors was conducted using the dataset, which includes more than 900 stereo image pairs of volcanic areas (see examples in Fig. 3). Statistical results are shown in Figs. 1 and 2. Accurate VO localization requires a certain number of persistent feature correspondences in order to compensate errors statistically; more than 20-30 correct matches are typically regarded as sufficient for estimation. The Harris corner detector performs better than the others. However, the Harris and the similar Shi-Tomasi detector cannot achieve a high matching rate in the matching process, which can affect the matching efficiency and accuracy. The timing results on a laptop machine are presented in Table 1. In terms of average detection time per image, the corner detectors are superior to the scale-space feature detectors, with the exception of the STAR detector. The STAR detector shows high computational efficiency, but it is not stable, at least for this parameter setting and dataset. The FAST detector is the most efficient of all: its detection is fast, and its repeatability and distinctiveness are high. Even so, its performance depends on the terrain; i.e., the FAST detector is not robust to all kinds of terrain.

Fig. 3. Examples of terrain types (left: feature-less, right: feature-rich)

[Diagram for Fig. 4: a per-frame texture assessment (feature-rich = T, feature-poor = F) feeds a detector switcher that moves between Detector A (high speed, less stability) and Detector B (low speed, more stability).]

Fig. 4. Saturating counter with four states

These results suggest the following: the Harris detector offers stability as well as localization accuracy, while the FAST detector is highly efficient but not robust. This trade-off can be resolved by a hybrid method that simultaneously exploits both of these advantages.
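As a concrete illustration of why FAST is so cheap, here is a toy sketch of its segment test (pure Python, illustrative names; the real detector additionally uses a machine-learned decision tree [12] and non-maximum suppression): a pixel is declared a corner if at least n contiguous pixels on the 16-pixel Bresenham circle of radius 3 are all brighter, or all darker, than the center by a threshold t.

```python
# 16 offsets of the radius-3 Bresenham circle used by FAST
CIRCLE = [(-3, 0), (-3, 1), (-2, 2), (-1, 3), (0, 3), (1, 3), (2, 2), (3, 1),
          (3, 0), (3, -1), (2, -2), (1, -3), (0, -3), (-1, -3), (-2, -2), (-3, -1)]

def is_fast_corner(img, r, c, t=0.2, n=9):
    """FAST-n segment test at pixel (r, c): True if n contiguous circle
    pixels are all brighter than img[r][c] + t or all darker than it - t."""
    center = img[r][c]
    ring = [img[r + dr][c + dc] for dr, dc in CIRCLE]
    for sign in (+1, -1):  # check the brighter arc, then the darker arc
        flags = [sign * (v - center) > t for v in ring]
        run = 0
        for f in flags + flags:  # doubled so a run may wrap around
            run = run + 1 if f else 0
            if run >= n:
                return True
    return False

# Same synthetic scene as before: a bright square (rows/cols 4..11) on dark ground
img = [[1.0 if 4 <= r < 12 and 4 <= c < 12 else 0.0 for c in range(16)]
       for r in range(16)]
```

On this image the test fires at the square's corners (e.g., at (4, 4), where 11 contiguous circle pixels are darker than the center) but rejects edge midpoints such as (4, 7), whose darker arc is only 7 pixels long. The test involves only comparisons and no gradient computation, which is the source of FAST's speed.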

3 Adaptive Detector Selection

The proposed algorithm switches feature detectors so as to improve stability and accuracy while maintaining computational efficiency. The switching rule is described in this section.

In the proposed scheme, an appropriate feature detector is selected according to the texture of the ground. Examples of different textures in natural terrain are shown in Fig. 3. For terrain with large features on the ground surface (referred to as ROUGH), high-speed feature detectors are preferred in order to obtain overall efficiency. On the other hand, if the robot is on terrain with few features (referred to as SMOOTH), the detector should be sensitive and stable. The terrain type is estimated from the result of feature detection and tracking. For instance, the number of detected features and/or the percentage of successful tracking can serve as the switching condition.

The proposed method switches among multiple detectors with different properties. These detectors are selected from the conventional detectors based on their suitability for different terrain: one should be high-speed to improve the overall performance, and another should be sensitive enough to detect features even on SMOOTH terrain. Two or three detectors should be used, so as not to increase the cost of switching.

The cost of switching is explained as follows. In the developed VO system, features are matched by computing the normalized correlation between the feature points detected in every pair of succeeding frames. Feature matching cannot be performed if the two frames use different feature detectors, since each detector focuses on different characteristics. Therefore, if the selected detector differs between succeeding frame pairs, an extra detection pass is required, which certainly deteriorates the computational performance.
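The normalized correlation used for matching can be sketched as below; this is a minimal zero-mean NCC over flattened intensity patches, with names of our own choosing (the paper does not specify the patch size or exact normalization):

```python
import math

def ncc(patch_a, patch_b):
    """Zero-mean normalized cross-correlation of two equal-length
    flattened patches: +1 for identical patterns (up to gain/offset),
    -1 for inverted ones, 0 when either patch is textureless."""
    ma = sum(patch_a) / len(patch_a)
    mb = sum(patch_b) / len(patch_b)
    da = [a - ma for a in patch_a]
    db = [b - mb for b in patch_b]
    num = sum(x * y for x, y in zip(da, db))
    den = math.sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    return num / den if den else 0.0

# The same pattern under doubled gain still correlates perfectly
score = ncc([1, 2, 3, 4], [2, 4, 6, 8])  # -> 1.0
```

A feature in one frame is matched to the candidate in the next frame with the highest score. This comparison is only meaningful when both frames used the same detector, which is exactly the switching cost discussed above.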
The simplest switching method may be to select the high-speed detector on all ROUGH terrain and the sensitive, stable detector on all SMOOTH terrain. However, this strategy performs poorly on certain terrain, e.g., terrain intermediate between SMOOTH and ROUGH. Such an environment causes frequent switching between the detectors, and efficiency decreases due to the excess computation. One approach to avoiding excessive switching is a saturating counter, as used for branch prediction in the field of computer architecture (Fig. 4). This technique is known to be simple yet quite effective for predicting future branches. The saturating counter is a state machine with several states, and the number of states can be tuned to the type of environment explored. This counter mitigates the excess cost of switching.
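The switching logic can be sketched with a small state machine. This is a hedged illustration using a 4-state counter, where the detector labels and the half-way threshold are our own assumptions (the paper leaves the state count configurable):

```python
class SaturatingCounter:
    """N-state saturating counter, as in branch prediction: the lower half
    of the states selects the high-speed detector, the upper half the
    sensitive one, so one contrary frame cannot flip a saturated choice."""

    def __init__(self, n_states=4):
        self.n = n_states
        self.state = 0  # start saturated on the high-speed side

    def update(self, terrain_is_smooth):
        """Feed one per-frame observation (True = SMOOTH, i.e. few features)."""
        if terrain_is_smooth:
            self.state = min(self.state + 1, self.n - 1)
        else:
            self.state = max(self.state - 1, 0)

    @property
    def detector(self):
        return "sensitive" if self.state >= self.n // 2 else "fast"

# Mostly SMOOTH terrain with one ROUGH frame in the middle: the single
# contrary observation does not switch the detector back.
ctr = SaturatingCounter()
choices = []
for smooth in [True, True, True, False, True]:
    ctr.update(smooth)
    choices.append(ctr.detector)
# choices == ["fast", "sensitive", "sensitive", "sensitive", "sensitive"]
```

With only two states, this degenerates to the naive immediate-switching rule, which corresponds to the 2-state baseline evaluated in Sect. 4.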

4 Experiments

4.1 Experimental Setup

Field experiments were conducted in two off-road environments in Japan: the Ura-Sabaku desert on Izu-Oshima and the Aso volcano. These sites are covered with volcanic products as well as scattered rocks and pebbles. Two experimental rovers (Fig. 5) developed by JAXA were used to collect the dataset. The camera specifications of the rovers are shown in Table 2. More than 900 stereo pairs were collected in total. The frame rates in Table 2 may seem too slow; this is due to hardware constraints related to communication and scheduling. However, the traversal speed of the rovers is not high (approximately 0.1 m/sec), so the low frame rate does not cause a serious problem.

4.2 Adaptive Detector Selection

Detector combination To improve the performance of the proposed hybrid detector, finding a proper combination of detectors is essential. Following the guideline for selecting detectors in the previous section, three detectors (Harris, FAST, and SIFT) are chosen and combined to form hybrid detectors.

Fig. 5. Appearance of the experimental rovers: (a) Micro-6, (b) Cuatro

Table 2. Camera specifications of the experimental rovers

                        Micro-6  Cuatro
FOV (degree)            40x30    87x65
Resolution              320x240  640x400
Frame rate (Hz)         0.25     0.69
Baseline (m)            0.270    0.475
Height from ground (m)  1.450    0.770

The performance is compared in Fig. 6 in terms of the effective percentage of correct matches and the processing time including switching cost. Figure 6(a) shows a statistical result on repeatability and distinctiveness, presenting the sorted percentage of correct matches for every frame with more than 20 matches. Figure 6(b) shows the average processing time per frame. Clearly, the combination of Harris and FAST is effective: it is about 4 times faster than the alternatives, highly stable, and has a higher matching rate. This combination is therefore used for this dataset.

The saturating counter The performance with respect to the number of states in the saturating counter is evaluated in Fig. 7, with the number of states varied from 2 to 8. Note that the saturating counter with 2 states corresponds to the simplest method, which uses the high-speed detector on all ROUGH terrain and the sensitive detector on all SMOOTH terrain. The result in Fig. 7 shows that counters with 4 or more states are slightly better in stability and 1.5 times better in efficiency. For simplicity, the saturating counter with 4 states is adopted.

4.3 Comparison with the conventional detectors

The proposed method is compared with the Harris, Shi-Tomasi, FAST, SIFT, SURF, and STAR detectors. Figure 8 presents how many of the detected points are correctly matched for the frames with more than 0, 10, 20, and 30 matches, respectively. In general, 20-30 matches are enough to reduce errors by statistical methods. A good detector should exhibit a high matching rate for the frames with the required number of matches. The proposed method successfully obtains the benefits of both combined detectors.


Fig. 6. Performance evaluation over detector combinations (Harris-FAST, SIFT-FAST, SIFT-Harris): (a) effective percentage of correct matches, (b) processing time


Fig. 7. Performance evaluation over the number of states in the saturating counter: (a) effective percentage of correct matches, (b) processing time

Figure 9 shows the percentage of frames with fewer than N correct matches, i.e., how many frames would fail if a threshold on the number of inliers were imposed to assure reliable motion estimation. Harris, Shi-Tomasi, and the proposed detector miss fewer than 10% of frames with a minimum of 20 inliers, even on this extremely untextured dataset. The processing-time result also supports the superiority of the proposed method: in the same setup as Table 1, the proposed method detects features in an image in 9.63 msec on average. This can vary with the dataset; for example, the proposed method would perform even better on a dataset containing a high proportion of ROUGH images, since it would choose the high-speed detector for most images. Finally, a comprehensive analysis is given in Table 3, comparing the detectors in terms of stability, accuracy, and efficiency. The table shows the effectiveness of the proposed method on this dataset of more than 900 images from volcanic areas.

5 Conclusions

This paper compares the conventional feature detectors on untextured natural terrain in terms of stability, localization accuracy, and computational efficiency. In order to address the trade-off problems clarified by the examination,


Fig. 8. Statistical results on stability and accuracy: the sorted percentage of correct matches for frames with a minimum of N matches; (a) N=0, (b) N=10, (c) N=20, (d) N=30


Fig. 9. Missed frames: the percentage of frames as a function of the number of correct matches

a new hybrid detector is proposed. The detector switches among multiple detectors according to the texture of the terrain. In this dynamic switching process, a saturating counter of the kind used for branch prediction is adopted in order to mitigate the excess cost of switching.

The proposed algorithm has been verified using datasets collected in volcanic areas covered with feature-less volcanic products and a few rocks. Through the comprehensive analysis, the method is validated as a robust and efficient algorithm.

Table 3. Comprehensive evaluation

Detector    Type         Stability  Accuracy  Efficiency
Harris      Corner       +++        +++       ++
Shi-Tomasi  Corner       +++        +         ++
FAST        Corner       +          ++        ++++
SIFT        Scale-space  ++         +++       +
SURF        Scale-space  +          ++        ++
STAR        Scale-space  +          ++        +++
Proposed    Corner       +++        +++       +++

References

1. B. Triggs, P. McLauchlan, R. Hartley, and A. Fitzgibbon, “Bundle adjustment – a modern synthesis,” Vision Algorithms: Theory and Practice, pp. 153–177, 2000.
2. Open source software for SLAM and loop-closing: http://openslam.org
3. D. Nister, O. Naroditsky, and J. Bergen, “Visual odometry for ground vehicle applications,” J. of Field Robotics, vol. 23, pp. 3–20, 2006.
4. K. Konolige and M. Agrawal, “Large-scale visual odometry for rough terrain,” in International Symposium on Research in Robotics (ISRR ’07), vol. 66, 2007.
5. A. Howard, “Real-time stereo visual odometry for autonomous ground vehicles,” in IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (IROS ’08), pp. 3946–3952, IEEE, 2008.
6. M. Maimone, Y. Cheng, and L. Matthies, “Two years of visual odometry on the Mars Exploration Rovers,” J. of Field Robotics, vol. 24, no. 3, pp. 169–186, 2007.
7. A. E. Johnson, S. B. Goldberg, Y. Cheng, and L. H. Matthies, “Robust and efficient stereo feature tracking for visual odometry,” in IEEE International Conference on Robotics and Automation (ICRA ’08), pp. 39–46, 2008.
8. Y. Tamura, M. Suzuki, A. Ishii, and Y. Kuroda, “Visual odometry with effective feature sampling for untextured outdoor environment,” in IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (IROS ’09), pp. 3492–3497, IEEE, 2009.
9. C. Harris and M. Stephens, “A combined corner and edge detector,” in Alvey Vision Conference, vol. 15, pp. 147–151, Manchester, UK, 1988.
10. J. Shi and C. Tomasi, “Good features to track,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR ’94), pp. 593–600, IEEE, 1994.
11. E. Rosten and T. Drummond, “Fusing points and lines for high performance tracking,” in IEEE International Conference on Computer Vision (ICCV ’05), vol. 2, pp. 1508–1515, IEEE, 2005.
12. E. Rosten and T. Drummond, “Machine learning for high-speed corner detection,” in European Conference on Computer Vision (ECCV ’06), pp. 430–443, 2006.
13. D. Lowe, “Distinctive image features from scale-invariant keypoints,” Int. J. of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004.
14. H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool, “SURF: Speeded-up robust features,” Computer Vision and Image Understanding, vol. 110, no. 3, pp. 346–359, 2008.
15. M. Agrawal, K. Konolige, and M. Blas, “CenSurE: Center surround extremas for realtime feature detection and matching,” in European Conference on Computer Vision (ECCV ’08), pp. 102–115, Springer, 2008.
16. G. Bradski, “The OpenCV Library,” Dr. Dobb’s Journal of Software Tools, 2000.