REALTIME MAPPING AND SCENE RECONSTRUCTION BASED ON MID-LEVEL GEOMETRIC FEATURES

A Dissertation Submitted to the Temple University Graduate Board

In Partial Fulfillment of the Requirements for the Degree of DOCTOR OF PHILOSOPHY

by Kristiyan Georgiev August 2014

Examining committee members:

Dr. Rolf Lakaemper, Advisory Chair, Dept. of Computer and Information Sciences
Dr. Alexander Yates, Department of Computer and Information Sciences
Dr. Longin Jan Latecki, Department of Computer and Information Sciences
Dr. M. Ani Hsieh, External Member, Drexel University

Copyright © 2014 by Kristiyan Georgiev

ABSTRACT

Robot mapping is a major field of research in robotics. Its basic task is to combine (register) spatial data, usually gained from range devices, into a single data set. This data set is called the global map and represents the environment observed from different locations, usually without knowledge of their positions. The various approaches can be classified into groups based on the type of sensor, e.g. laser scanners, Microsoft Kinect, or stereo image pairs. A major disadvantage of current methods is that they are derived from poorly scalable 2D approaches that operate on a small amount of data. 3D sensing, however, yields a large amount of data in each 3D scan. Autonomous mobile robots have limited computational power, which makes it harder to run 3D robot mapping algorithms in real time. To remedy this limitation, the proposed research uses mid-level geometric features (lines and ellipses) to construct 3D geometric primitives (planar patches, cylinders, spheres and cones) from 3D point data. Such 3D primitives can serve as distinct features for faster registration, allowing real-time performance on a mobile robot. The approach runs in real time, e.g. detecting planes in Microsoft Kinect data at 30 frames per second, where previous approaches show insufficient performance. At its core, the algorithm performs a fast model fitting with a model update in constant time (O(1)) for each new data point added to the model, using a three-stage approach. The first stage inspects 1.5D subspaces to find lines and ellipses. The next

stage uses these lines and ellipses as input and examines their neighborhood structure to form sets of candidates for the 3D geometric primitives. Finally, the candidates are fitted to the geometric primitives. The complexity of the point processing is O(n); additional time of lower order is needed for working on the significantly smaller number of mid-level objects. The real-time performance suggests this approach as a pre-processing step for 3D real-time higher-level tasks in robotics, like tracking or feature-based mapping. In this thesis, I will show how these features are derived and used for scene registration. Optimal registration is determined by finding plane-feature correspondences based on mutual similarity and geometric constraints. Our approach determines the plane correspondence in three steps. The first step computes the distance between all pairs of planes from the first scan and all pairs of planes from the second scan. The distance function captures angular, distance and co-planarity differences. The resulting distances are accumulated in a distance matrix. The next step uses the distance matrix to compute the correlation matrix between planes from the first and second scan. Finally, the plane correspondence is found by computing the globally optimal assignment from the correlation matrix. After finding the plane correspondence, an optimal pose registration is computed. In addition, I will provide a comparison to existing state-of-the-art algorithms. This work is part of an industry collaboration effort sponsored by the National Institute of Standards and Technology (NIST), aiming at performance evaluation and modeling of autonomous navigation in unstructured and dynamic environments. Additional field work, in the form of an evaluation of real robotic systems in a robot test arena, was performed.

To my parents Rosen and Rositsa. To my grandparents Ivanka, Katya and Penko. To Winnie.

This dissertation is a result of their continuous love and support.

In loving memory of my grandfather Kostadin Georgiev.

ACKNOWLEDGEMENTS

I am sincerely thankful to my advisor Dr. Rolf Lakaemper, whose encouragement, guidance and support from the initial to the final level enabled me to gain understanding of the subject and develop as a researcher. His enormous knowledge, enthusiasm and extraordinary attention to detail were a great inspiration throughout my studies and had a major influence in shaping my critical thinking in scientific research. During my graduate studies I was fortunate to work as a teaching assistant for Prof. Rolf Lakaemper, Prof. Ola Ajaj and Prof. Pei Wang. Their experience and enthusiasm about the topics being taught were a great inspiration throughout my years of teaching. I am also thankful to all my students at the Department of Computer and Information Sciences. I would also like to thank Dr. Alexander Yates, Dr. Longin Jan Latecki and Dr. M. Ani Hsieh for serving on my dissertation committee and for providing their invaluable feedback. I am very thankful for the opportunity to work closely with my lab colleagues Motaz AlHami and Ross Creed. Special thanks to Dr. Raj Madhaven for his scientific pursuit and dedication during our time at the Robotics Technology Park (RTP) near Huntsville, AL. Special thanks to Jan Elseberg and Dorit Borman for clarifying the use of the SLAM6D ICP implementation, as well as Dirk Holtz and Moataz Elmasry for their PCL and

ROS compatible implementation of my plane extraction algorithm. I thank Dr. Maurice Wright and Dr. Rolf Lakaemper for allowing me to help with and experience the production of the robot opera GALATEA RESET. The author gratefully acknowledges the financial support from Temple University and ARRA-NIST-10D012 of the National Institute of Standards and Technology (NIST). Last, but not least, a big thank you to all my friends through all these years.

TABLE OF CONTENTS

ABSTRACT iii

ACKNOWLEDGEMENTS vi

LIST OF TABLES xi

LIST OF FIGURES xiii

1 INTRODUCTION 1

1.1 Mobile Robots...... 1

1.2 Robot Sensors...... 2

1.2.1 3D Sensors...... 2

1.3 Robot Mapping...... 10

1.3.1 3D Data Acquisition...... 16

2 MID-LEVEL GEOMETRY FEATURE (MLGF) REPRESENTATION 21

2.1 Introduction...... 21

2.2 Related Work...... 28

2.3 Points to Line-Segments to Planes...... 31

2.3.1 Line Segments...... 31

2.3.2 Segment Neighborhood N and Connected Components.... 33

2.3.3 Planes...... 35

2.3.4 Plane Update...... 38

2.3.5 Region Growing...... 39

2.3.6 Line Segment Extraction...... 40

2.4 Ellipse Extraction...... 41

2.4.1 Motivation...... 41

2.4.2 System Limitations...... 44

2.5 3D Object Extraction...... 46

2.6 Experiments...... 48

2.6.1 Improving the System using Kalman Filtering...... 49

2.6.2 Basic Test: Object detection, detection speed...... 50

2.6.3 Comparison to RANSAC...... 50

2.6.4 Comparison to RANSAC for Ellipse Fitting...... 53

2.6.5 Robustness of Mid-Level Geometric Features to Noise..... 54

2.6.6 Accuracy...... 57

2.6.7 Occlusion...... 58

2.6.8 Object Tracking...... 59

2.7 Conclusion and Future Work...... 59

3 CORRESPONDENCES BASED ON MID-LEVEL GEOMETRIC FEATURES 62

3.1 Introduction...... 62

3.2 Related Work...... 65

3.3 Methodology...... 68

3.3.1 Plane correspondence...... 69

3.3.2 Pose-registration using planar patches...... 72

3.4 Plane-Correspondence Experiments...... 73

3.4.1 Comparison to ICP...... 73

3.4.2 Comparison to PRRUS...... 74

3.5 Conclusion...... 78

4 ADDITIONAL PRACTICAL EXPERIENCE 79

4.1 Benchmarking and Standardization...... 79

4.1.1 NIST Test Arena...... 82

4.2 Robot Opera...... 84

5 CONCLUSION 92

BIBLIOGRAPHY 93

LIST OF TABLES

1.1 Range sensors overview...... 4

2.1 Overall segmentation results on all 30 test images assuming 80% pixel overlap as described by [23]...... 53

2.2 Accuracy test. Left table: without Kalman filter. DG: ground truth distance, DM : mean measured distance, Dσ: standard deviation of distance measurements, RM : mean measured radius, Rσ: standard deviation of radius measurements, N: number of measurements. Right table: with Kalman filter, labels accordingly...... 55

2.3 Occlusion experiment of a cylinder (R = 0.2m) at different distances (DG), with and without Kalman filtering. O: occlusion percentage (10-40), the algorithm does not handle occlusions ≥ 50%. R: measured radius, Rσ: standard deviation of the radius, RK, RKσ: with Kalman filter, N: number of measurements...... 55

2.4 Observations of a moving cylinder. Estimated velocity VM is computed using the observed change in distance over time and is compared to the ground truth velocity (VG), Vσ denotes standard deviation in velocity. Similar computation is shown using the Kalman filter (VK , VK σ). N: number of measurements...... 56

2.5 Observations of a moving flat sheet of cardboard. Estimated velocity VM is computed using the observed change in distance over time and is compared to the ground truth velocity (VG), Vσ denotes standard deviation in velocity. Similar computation is shown using the Kalman filter (VK , VK σ). N: number of measurements...... 56

2.6 Accuracy test. Left table: without Kalman filter. DG: ground truth distance, DM: mean measured distance, Dσ: standard deviation of distance measurements, KM: mean distance corrected with Kalman Filter, Kσ: standard deviation of Kalman corrected distance measurements, N: number of measurements...... 56

3.1 Plane-matching registration comparison between ICP (point based), PRRUS [51] and our approach based on the crashed car park data set [50]...... 78

LIST OF FIGURES

1.1 From Left to Right: Roomba robotic vacuum cleaner (Source: roomba.com). Seegrid's robotic forklift (Source: seegrid.com). Google's self driving car (Source: google.com) ...... 2

1.2 A diagram of a stereo vision system with two cameras (pinhole camera model [21]). The image of a point in 3D is the intersection point of the light ray connecting the point and the center of projection, and the image plane. The left and right cameras have image planes πl and πr and centers of projection in the left and right camera. The 3D object is seen by the two cameras at points pl and pr. In an epipolar geometry pair of cameras, any point P in 3-D space defines the epipolar plane πP, which goes through P and the centers of projection of the two cameras. ...... 5

1.3 Kinect projection pattern captured with a camera with an IR filter...... 6

1.4 Top: Kinect Sensor (Source: xbox.com) . Bottom: Device Field of View (Source: [9])...... 8

1.5 A 2D map of Stanford's Gates Building [27]...... 12

1.6 High-Resolution Self-Portrait by the Curiosity Rover Arm Camera at the end of its arm. Source: http://www.nasa.gov/missionpages/msl/multimedia ...... 12

1.7 3D Scan obtained using Dr. Faust in a classroom. Left: Top view, Right: view from the side ...... 14

1.8 3D Map of Bremen. Single scans were obtained using a Riegl VZ400 scanner. The map was produced using slam6D. Source: Jacobs University Bremen gGmbH, Germany ...... 15

1.9 Robot ’Dr. Faust’. A SICK LMS200 [SICK] 2D laser scanner is mounted on a Schunk PW70 tiltable arm to generate 3D laser scans. The white semi-transparent planes represent single 2D sub-scans. The line extraction operates on these sub-scans. 16

1.10 3D Scan Diagram ...... 20

2.1 Mid-Level Geometric Features (MLGFs) are introduced in our approach (bottom). Other approaches work directly on a point cloud to extract 3D objects...... 23

2.2 Flow diagram. 1) Input; 2) Mid-Level features generated from 1); 3) Scene description using features from 2)...... 25

2.3 Flow diagram showing: 1) Input data, 2) Result of ellipse fitting, 3) After removal of elongated ellipses, 4) Ellipse candidate sets based on proximity. The blue line shows connected components. 5) Resulting sets of ellipses forming different conical objects 6) Fitted 3D models...... 26

2.4 Results of the algorithm. Top: 3D point cloud from MS Kinect (307200 points); Middle: extracted ellipses (purple) and line segments (blue); Bottom: fitted conic objects and planar patches together with the 3D point cloud. The cylinder in the picture is a cylindrical trash can with radius=0.2m and height=0.7m. .. 27

2.5 Left: original point data of a simulated scan of a geometric object. Right: single 2D sub-scan with points and extracted line segments. Disconnection occurs, since points exceeding a distance threshold are not assigned a line segment...... 32

2.6 Black lines: the set of all lines from 2D subscans (compare fig. 2.5, right). The red lines visualize the ’vertically’ (see text) connected components with respect to the neighborhood relation N. They connect center-points of neighboring line segments. Note that single connected components consist of segments belonging to multiple planes...... 33

2.7 Left: Final result of the region growing labeling process based on the connected components of figure 2.6. Same color denotes same label. Right: related planes. The holes in the figure are a result of point regions with insufficient point density. In those regions, no line segments were approximated ,compare figure 2.5, right. 40

2.8 Illustration of conic section cases [74]...... 42

2.9 Object detection. Top to bottom: RGB image(not used); 3D point cloud from MS Kinect (307200 points); extracted ellipses after the pre-filter; to the right are the fitted conic objects; at the bottom are fitted conic objects and 3D point cloud. The cylinder in the picture is a cylindrical trash can...... 45

2.10 A tilted (rotation around x-axis, tilted towards camera) parking cone, and ellipses detected (red). Left: tilt=5 degrees, Middle: tilt=20 degrees, Right: tilt=40 degrees. In all cases, the algorithm is able to fit a cone model (not explicitly shown in figure)...... 46

2.11 The three curves represent ellipse radii (R-axis) along the center lines (Z-axis) of a cylinder (blue), cone (red) and sphere (green half circle) in (z,r) space. .... 47

2.12 A flat sheet of paper is placed in front of the Kinect sensor at a known distance to test the accuracy of the extracted planar patch. Results are shown in Table 2.6. 47

2.13 Results of the algorithm showing fitted 3D geometric models. The sphere placed on top of a cylinder is successfully recognized as a separate object...... 49

2.14 Cubicle (top left), Laser range scan (top right), RANSAC plane detection (bottom left), our algorithm (bottom right).The cubicle has a small table, and 3 posters at the back wall. Our algorithm generates a more connected, perceptually consistent plane representation of the scene than RANSAC...... 51

2.15 Result of ellipse extraction in noisy, simulated data. Blue dots are input data, with the fitted ellipse, its center (marked red) and supporting points (marked black). Starting from top to bottom, random noise is added with σ: 0.00, 1.0, 2.0; Rmajor = 40, Rminor = 20 ...... 52

2.16 Simulated data points (black) with fitted ellipse (red), supporting points (green). Top: Our Approach; Middle: RANSAC with a constraint on the radii ratio rminor/rmajor ≥ 0.2; Bottom: RANSAC with a stronger radii ratio constraint rminor/rmajor ≥ 0.2...... 57

2.17 Result of ellipse extraction in noisy, simulated data. Amount of added random noise to the data (x-axis) vs standard deviation of fitted ellipse model over 50 trials. 58

2.18 Experiment setup to measure speed of a moving cylinder mounted on a robot using a Kinect sensor...... 60

2.19 Results of tracking a ball. The blue line connects center points of spheres. ... 61

2.20 A Pioneer DX robot tracking a ball in real-time, using 3D range data from MS Kinect...... 61

3.1 Structure diagram comparison of three SLAM algorithms (a - The approach of Nuchter et. al. [47]; b - Weingarten et. al. [73]; c - our approach and Pathak et. al. [51]). The ICP step in Weingarten’s approach has been removed in our approach...... 65

3.2 Correspondence is determined in 3 steps. 1) Compute all distances between pairs of planes 2) Accumulate the distances into the plane correspondence matrix. 3) Find global optimal min assignment for it. 69

3.3 Results of a successful registration between two scans using our ap- proach. Planar patches are drawn semi-transparent to show the inter- nal details...... 70

3.4 An example of an unsuccessful registration between two scans using our approach. In a correct registration the lower right corners should overlap. This occurred because wrong plane-feature correspondences were found, which resulted in an incorrect registration...... 71

3.5 Sample test of our algorithm using Kinect data. Left: Correspondence matrix C for the two scans. Corresponding planar patches are marked by the same color. The planar patch with label 21 has low correspondence values for every plane from scan B; as a result it has no corresponding plane in the other scan...... 72

3.6 All values are computed by summing the translation and angular difference from the ground truth...... 75

3.7 Incorrect ICP result...... 76

3.8 Correct ICP result...... 77

3.9 3D scene composed of 4 planes (P1,P2,P3,P4) P1 and P4 are parallel. 77

4.1 Setting up the test arena at the Alabama Robotics Technology Park (RTP) ... 85

4.2 Schematic diagrams for the R1(easy)-R4(hard) evaluation scenarios. The robot must reach each of the goals in order marked with green circles...... 86

4.3 Testing in progress...... 87

4.4 Promotional flyer for the robotic opera...... 88

4.5 Galatea RESET on the cover of Temple University Outlook Magazine...... 89

4.6 Top: Galatea stage design. Bottom: A top view of the stage with the localized robot...... 90

4.7 Galatea control system...... 91

4.8 The entire crew of the GALATEA RESET. Front row (left to right): Dr. Wright, Dr. Lakaemper ...... 91

CHAPTER 1

INTRODUCTION

1.1 Mobile Robots

Technological progress has been the man-made extension to evolution for centuries. In the late 20th century, during the Information Age, the computer and the Internet were invented, allowing a significant increase in computational power and data storage. The rapid technological progress inspired writers like Isaac Asimov to dream and write about intelligent machines with the ability to think, act and feel like humans, which were termed robots by Karel Capek [5]. He imagined robots as human companions, helping with mundane and labor-intensive human tasks. Many people saw and still see robots as the power tools which replaced the hand tools of the industrial age. There is a strong desire to harvest the potential power of robots and automation. As a result, many attempts have been made to build robotic systems that can perform human tasks. These tasks range from mowing the lawn and vacuuming the living room to operating a forklift and driving a car, as seen in Figure 1.1. Such tasks, performed by autonomous mobile robots, have one thing in common: the robot needs an internal representation of the physical environment to be able to navigate through it efficiently. In order to experience/sense the physical environment,

mobile robots require sensors. There are different types of sensors used on mobile robots. The next section gives an overview of the most commonly used sensors.

Fig. 1.1 From Left to Right: Roomba robotic vacuum cleaner (Source: roomba.com). Seegrid's robotic forklift (Source: seegrid.com). Google's self driving car (Source: google.com)

1.2 Robot Sensors

Robots use sensors to collect information about their surroundings. Some of these sensors can be used to measure a robot's position and orientation (pose) at each point in time. One such sensor is the wheel odometer, a device similar to the odometer in a modern car, but with higher precision. Besides knowing its accurate pose, a robot has to be able to measure distances to avoid obstacles and explore unknown environments. Different sensing technologies can be used to measure distances, depending on the accuracy, range, and frequency required. The most commonly used sensors are laser range finders (LRF), also referred to as laser scanners, stereo image sensors, time-of-flight (ToF) cameras, etc. The next section gives a brief explanation of different visual sensor types and provides a technical comparison.

1.2.1 3D Sensors

Some mobile robot implementations rely on 2D sensors for creating maps and self localization. To some extent this is justified for most applications on ground robots, which inherently move in a 2D space, but in some cases an obstacle may not be

detected from the 2D data (e.g. the leg of a table), causing risks to mobile robots. 3D point clouds provide richer information about the scene than a 2D scan of the same scene. Table 1.1 lists 3D sensors that are frequently used in robotics. Laser sensors are very accurate and offer a wide field of view and extended range. In contrast, other range sensors are less accurate and offer a narrower field of view. However, they provide a much higher frame rate than laser scanners, which makes them more suitable for a different set of applications, such as motion detection, object manipulation and obstacle avoidance. In addition, these devices are more affordable than laser scanners.

Table 1.1 Range sensors overview.

                           Point Grey      Microsoft         SwissRanger   Velodyne        Raytrix R5
                           Bumblebee2      Kinect            SR4000        HDL-64E S2
  Category                 Stereo Camera   Structured Light  ToF Camera    Laser Scanner   Light Field Camera
  Field of View            97° × 72°       57° × 43°         69° × 56°     360° × 26.8°    variable
  Resolution               1024 × 768      640 × 480         176 × 144     1024 × 64       max 1 MP
  Data Acq. Rate (Mpts/s)  15.7            19.2              1.2           1.3             max 180
  Scanning Rate            20 Hz           30 Hz             50 Hz         20 Hz           180 Hz

Fig. 1.2 A diagram of a stereo vision system with two cameras (pinhole camera model [21]). The image of a point in 3D is the intersection point of the light ray connecting the point and the center of projection, and the image plane. The left and right cameras have image planes πl and πr and centers of projection in the left and right camera. The 3D object is seen by the two cameras at points pl and pr. In an epipolar geometry pair of cameras, any point P in 3-D space defines the epipolar plane πP, which goes through P and the centers of projection of the two cameras.

The Bumblebee2 [PointGrey] device by Point Grey belongs to the class of stereo cameras. Stereo cameras are two calibrated cameras with an established baseline. RGB images of the same scene are captured by both cameras at the same time. Correspondences are established using computer vision algorithms on the image pair, i.e. determining which segment of one image corresponds to which segment of the other image. A given object observed by both cameras will appear in different positions in the two images. Using the camera calibration, these positions can be transformed into a pair of angles under which the object is seen relative to each camera position. Finally, the distance of the object to the stereo camera is determined by triangulation. This is illustrated in Figure 1.2. The Microsoft Kinect sensor, a game controller for the Xbox 360 console [65], is a horizontal bar connected to a small base with a motorized pivot. The device features an RGB camera, a depth sensor and a multi-array microphone running proprietary software, which provide full-body 3D motion capture, facial recognition and

voice recognition capabilities. The depth sensor consists of an infrared laser projector combined with a monochrome CMOS sensor, which captures video data in 3D under any ambient light conditions. The Kinect sensor uses a 3D scanner system called Light Coding (also called structured light technology) and employs a variant of image-based 3D reconstruction [63]. This technology can be described as a projector that projects a known pattern onto the environment. The pattern is deformed according to the geometry of the objects in the scene. The camera image is then analyzed to determine how the pattern is deformed. The geometry from Figure 1.4 applies. Similarly, the position of an object is found by correlating projected patterns and triangulation. The Kinect uses a single infrared pattern, projected by an infrared laser diode. The sensor is designed for indoor use and will fail to acquire accurate depth information when exposed to sunlight (or other strong sources of IR light). Fig. 1.3 shows an infrared image of the laser grid the Kinect uses to calculate depth information.
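Both the stereo pair and the Kinect's projector-camera pair recover depth by triangulation. As a brief aside (the symbols below are generic textbook notation, not taken from this thesis), for a rectified camera pair with baseline B, focal length f (in pixels) and disparity d between corresponding image positions, depth and back-projection follow from

\[
Z = \frac{f\,B}{d}, \qquad X = \frac{(x - c_x)\,Z}{f}, \qquad Y = \frac{(y - c_y)\,Z}{f},
\]

where (x, y) is the pixel position and (c_x, c_y) the principal point of the camera.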

Fig. 1.3 Kinect projection pattern captured with a camera with an IR filter.

The default RGB video stream uses 8-bit VGA resolution (640 × 480 pixels). The monochrome depth sensing video stream is in VGA resolution (640 × 480 pixels) with

11-bit depth, which provides 2,048 levels of sensitivity. The Kinect sensor has a practical ranging limit of 1.2-3.5 m (3.9-11.5 ft). The sensor has an angular field of view of 57° horizontally and 43° vertically (see Fig. 1.4). The horizontal field of the Kinect sensor at the minimum viewing distance of 0.8 m (2.6 ft) is therefore 87 cm (34 in), and the vertical field is 63 cm (25 in), resulting in a resolution of just over 1.3 mm (0.051 in) per pixel. The effective field of view corresponds to the 3D measurement volume shown in Fig. 1.4. Since its release, open-source drivers have helped its adoption and spurred development in the robotics and automation community. Many autonomous robots use the Kinect as their primary vision sensor.
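The field-of-view numbers quoted above follow directly from the pinhole geometry; as a quick check (rounded):

\[
2 \cdot 0.8\,\mathrm{m} \cdot \tan(57^{\circ}/2) \approx 0.87\,\mathrm{m}, \qquad
2 \cdot 0.8\,\mathrm{m} \cdot \tan(43^{\circ}/2) \approx 0.63\,\mathrm{m}, \qquad
\frac{0.87\,\mathrm{m}}{640\ \text{pixels}} \approx 1.36\,\mathrm{mm/pixel},
\]

matching the 87 cm, 63 cm and roughly 1.3 mm per pixel stated above.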

Fig. 1.4 Top: Kinect Sensor (Source: xbox.com). Bottom: Device Field of View (Source: [9]).

The technology employed in the SwissRanger SR4000 [SwissRanger] continuously sends out intensity-modulated light, similar to a phase-based laser. The reflected light is gathered by the lens of the camera and is focused onto a light-sensitive array. The phase difference between the emitted and received signal is measured for each pixel in the array. Consequently, the distance to the target can be computed for each pixel. This is not referred to as laser scanning, because instead of the light being generated by a laser beam, the SR4000 employs simple LEDs to emit infrared

light. These cameras are affected by the same problem as the Kinect. Bright external lighting that enters the sensor will overwhelm the emitted light pulse. This leads to noisy or no range measurements in sunlight or similar conditions. It should be noted that one of the devices listed in Table 1.1, the Velodyne HDL-64E S2 [Velodyne], is a pulse-based laser scanner. It achieves its high acquisition rate with 64 discrete laser emitters and receivers that rotate continuously around the main axis of the scanner. Unlike the previous two sensors, the Velodyne has far fewer problems with sunlight. As a consequence, it is frequently used on autonomous cars [44]. Finally, the Raytrix R5 is one in a line of cameras that are commonly known as light field cameras. From the manufacturer website [Raytrix]: "Standard cameras only record a flat 2D projection of the 3D scene from only one fixed point of view. In contrast to that the raytrix cameras make use of digital cameras with high resolution to record the complete 4D lightfield of a 3D scene. The 4D lightfield consists of all lightray intensities passing through our 3D space and not only one flat 2D image-projection. By recording this 4D lightfield with only a single shot, raytrix cameras store more information of the 3D scene comparable by taking many shots at the same time but from different point of views." A more relevant aspect is that any light field camera is in effect a range sensor as well. Assuming the camera is calibrated, finding the perfect focus plane for each pixel is equivalent to finding the distance of that pixel to the camera. The range can also be computed by considering the image on the sensor as a composite of a continuum of stereo images. This technology allows for very high frame rates when acquiring high-resolution RGB/range images both in indoor and outdoor scenes. Unfortunately, the resolution of the range measurements themselves is poor in current light field cameras in comparison to the other range sensors. Modern light field cameras resolve each range measurement at only hundreds of discrete steps, instead of the thousands

of discrete steps even cheap hardware like the Kinect allows. Furthermore, light field cameras have only just emerged on the consumer market with the Lytro [Lytro], and they remain relatively expensive. Consequently, these cameras are not yet used in robotics in a widespread manner. All of the previously discussed sensors can be used on a mobile robot. In our experiments we used a laser scanner (LRF) and the Microsoft Kinect to collect data. The decision was motivated by the long range, high precision and robustness of LRF sensors. The Kinect sensor is much cheaper (about $200 USD) than the LRF (about $4000 USD) and offers a higher data rate (30 Hz).

1.3 Robot Mapping

Data sent from Earth travels approximately 13 minutes to reach Mars. Such a long delay makes it impossible to remotely control NASA's Mars rover Curiosity (Fig. 1.6) on the Martian surface. This is an example of the physical necessity of autonomous behavior, since non-autonomous or semi-autonomous (assistance) systems evidently need faster feedback times. In addition, sensors provide insufficient sensory feedback to a human operator. Autonomous robots can eliminate human-caused error by taking the human factor out of the loop. Specifically, mobile autonomous robots should (at least) be able to construct and use an internal representation of the physical environment (a map) and to localize themselves in it. Robotic mapping and self-localization is the study and application of map construction and localization by a robot. Robots use sensors to survey (scan) the environment in order to produce a map (e.g. Figure 1.5). Sensors have a limited local field of view; therefore, in order to obtain a global view, more than one observation (scan) from different positions is required. This introduces the problem of finding and aligning overlapping regions between multiple views. One solution to this problem is to use dead reckoning methods, such as tracking the number of revolutions of the wheels (robot

odometry). Such an approach is idiothetic(1) and can give the absolute position of the robot, but it is subject to cumulative error, which can grow quickly. Another approach relies on the sensors of the robot, like a camera, a microphone, laser or sonar. This approach is allothetic(2), and its problem is "perceptual aliasing". For example, in a building it is nearly impossible to determine a location solely from visual information, because all the corridors may look similar. Most robots use a combination of allothetic and idiothetic information to infer spatial information. Combining the two kinds of information can compensate for the cumulative errors resulting from the use of idiothetic (odometry) information. In addition, the combined information can be used to construct a topological or metric map [43]. The metric framework is the most common for humans and considers a two-dimensional space in which it places the objects. The objects are placed with precise coordinates. This representation is very useful, but it is sensitive to noise and it is difficult to calculate distances precisely. The topological framework only considers places and the relations between them. The map can be seen as a graph in which the nodes correspond to places and the edges represent neighborhood relationships.

(1) Idiothetic: a navigation system centered on the subject itself. (2) Allothetic: centered on cues other than oneself; an allothetic map indicates a global map not oriented or centered on the subject.
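To make the idiothetic error accumulation concrete, the following is a minimal dead-reckoning sketch for a differential-drive robot. The class and variable names are illustrative assumptions made here for explanation, not code from this thesis; the point is that every per-step encoder error is integrated into the pose, which is why odometry alone drifts.

/** Minimal dead-reckoning (idiothetic) pose update for a differential-drive robot. */
public class DeadReckoning {
    private double x, y, theta;          // pose estimate in the world frame
    private final double wheelBase;      // distance between the two wheels [m]

    public DeadReckoning(double wheelBase) { this.wheelBase = wheelBase; }

    /** dLeft/dRight: distance travelled by each wheel since the last update [m]. */
    public void update(double dLeft, double dRight) {
        double d = 0.5 * (dLeft + dRight);            // forward motion of the robot center
        double dTheta = (dRight - dLeft) / wheelBase; // change in heading
        // Integrate; any per-step encoder error accumulates in x, y and theta over time.
        x += d * Math.cos(theta + 0.5 * dTheta);
        y += d * Math.sin(theta + 0.5 * dTheta);
        theta += dTheta;
    }

    public double[] pose() { return new double[] { x, y, theta }; }
}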

Fig. 1.5 A 2D map of Stanford's Gates Building [27].

Fig. 1.6 High-Resolution Self-Portrait by the Curiosity Rover Arm Camera at the end of its arm. Source: http://www.nasa.gov/missionpages/msl/multimedia

Although there are different mapping approaches, they have one thing in common regardless of the sensor modality: the problem of rigid registration. In short, registration is the task of finding the correspondences between overlapping scans and putting them together like a (3D) jigsaw puzzle, where each scan is a piece of that puzzle. Examples of approaches addressing the registration problem are RANSAC [15], ICP [47], relaxation labeling ([29], [81], [55]), 3D shape context [33], which extends the 2D shape context ([2], [1]), graph matching [79], and particle filter approaches for building partial correspondences between shapes [38]. ICP is considered the de facto standard for scan registration, because it is fast (O(N log N)) and has an efficient open-source implementation [46]. Chapter 3 discusses related approaches in further detail and introduces our feature-based contribution to the field.
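In its standard form (a textbook formulation, not notation specific to this thesis), rigid registration of two scans P = {p_i} and Q = {q_j} searches for the rotation R and translation t that minimize the squared distances between corresponding points:

\[
(R^{*}, t^{*}) \;=\; \arg\min_{R \in SO(3),\; t \in \mathbb{R}^{3}} \;\sum_{i} \bigl\| R\,p_i + t - q_{c(i)} \bigr\|^{2},
\]

where c(i) assigns to each point of P its corresponding point in Q. Since the correspondences c are unknown, ICP alternates between estimating c (nearest neighbors) and solving for (R, t); the feature-based approach of Chapter 3 instead establishes correspondences between planar patches.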

Fig. 1.7 3D Scan obtained using Dr. Faust in a classroom. Left: Top view, Right: view from the side

Fig. 1.8 3D Map of Bremen. Single scans were obtained using a Riegl VZ400 scanner. The map was produced using slam6D. Source: Jacobs University Bremen gGmbH, Germany

Mapping is often performed in connection with localization; a difficulty arises when errors in localization are incorporated into the map. This problem is commonly referred to as Simultaneous Localization and Mapping (SLAM) [7]. In recent years, robot mapping based on range sensors has become a major field of research in robotics ([28], [64]). As previously mentioned, mapping combines scans from single observations into a 'global map'. The global map represents the environment scanned from different locations, usually without knowledge of their pose (position and heading). The basic mapping task for 2D data sets has improved in recent years to a state near 'mission accomplished'; the current focus in mapping has therefore shifted to additional real-world problems, e.g. dynamic environments and the mapping of 3D environments. Especially in the latter problem, extensions of mapping algorithms

like ICP ([4, 48]) from 2D to 3D show promising results, yet reveal a problem that is founded in the nature of 3D data sets: the amount of data increases dramatically (specifically, 3D scanning problems work with meshes of cardinality n², where n is the usual cardinality of data sets for the 2D version of the same problem). Even with fast computers, this creates runtime problems for real-time applications for any approach with a complexity greater than O(N), where N is the number of data points.

Fig. 1.9 Robot ’Dr. Faust’. A SICK LMS200 [SICK] 2D laser scanner is mounted on a Schunk PW70 tiltable arm to generate 3D laser scans. The white semi-transparent planes represent single 2D sub-scans. The line extraction operates on these sub-scans.

1.3.1 3D Data Acquisition

3D data scans can be obtained using costly 3D laser scanners (e.g. the Velodyne HDL-64E S2). However, a cheaper alternative is to use a 2D laser scanner mounted on a tilt/rotate arm to generate a 3D data set over time. Figure 1.9 shows a typical mobile robot arrangement for 3D scanning. Robot 'Dr. Faust' is equipped with a 2D laser scanner, which is mounted on a tilt/rotate arm (Schunk Power Cube PW70 [Schunk]). A single 3D laser scan is gained by vertical 'sweeps' (a nodding move of the 2D laser) of single, planar 2D scans. Our laser scanner has a 180-degree horizontal angular range and is placed 11 cm from the center of rotation; therefore, three sweeps are needed to produce a complete 3D scan, as illustrated in Figure 1.10. [48] explains different ways to gain 3D laser scans with similar arrangements. Independent of the actual physical data arrangement, it is always possible to generate a data representation consisting of an ordered sequence ('vertical order') of 2D sub-scans, which themselves consist of an ordered sequence ('horizontal order') of data points, see Figure 2.5. One increasingly popular solution is to approximate the point-based data set using mid-level geometric objects, such as planes, spheres, cylinders, etc. The main advantages of this representation are:

High data compression rate

Representing a wall, for example, as a single rectangular plane segment means storing only four corner points. This is in contrast to representing the wall by thousands of raw laser reflection points. Replacing coplanar points by plane segments can be seen as psycho-visual redundancy reduction, a computer vision approach leading to the highest rates in image compression. Assuming a coplanar structure, only the corner points enhance the perceptual information; thus all other points are redundant.

Higher geometric information level

Storing data using mid-level geometric elements simplifies further processing steps. For example, finding boxes or corners in a map is significantly simpler if the map data consists of plane segments, not of raw data points, since basic geometric information (e.g. normal directions) is already given.

Transformation of raw data to a higher representation

This is an estimate of hypothetical properties of the scanned environment; in the case of plane-segment approximation, for example, the property of piece-wise co-planarity. This pre-processing step augments the information of the data, yet assumes certain conditions of the environment. Note that the mid-level representation is an approximation of the original data set and comes with a loss of precision. The precision of the original data set often represents noise. In case the assumptions of properties (e.g., co-planarity) are justified, a precise point-wise representation then equals over-fitting of the hypothetical structures. Avoiding over-fitting justifies the aforementioned redundancy reduction. However, if the precision of the original point data set is needed, plane estimation can be used as labeling: points belonging to one plane (or another mid-level geometric object) are assigned the same label, while their original locations are kept. The motivation for the plane extraction algorithm is the assumption that the environment consists of planar pieces (often called 'patches'). This is a common assumption in tasks like indoor mapping. The typical critique is that natural and man-made environments contain non-planar objects. However, typical indoor and outdoor environments contain a significant number of important planar features, e.g. walls or floors. The remaining non-planar features can be approximated by small patches, yet admittedly such an approximation leads to further problems (small patch sizes reduce the effectiveness of the representation, planar approximation does not capture the natural properties of rounded objects, etc.). This problem can be remedied using different geometric objects (spheres, cylinders and cones), as explained in Section 2.5. Later, in Chapter 3, we focus on the use of 3D patches for

3D mapping. The goal of our work is real-time extraction of planes and conical objects from 3D data sets as the basis of 3D robot mapping. The speed is gained by incorporating certain assumptions about the data representation. In the next chapter, we specifically explain our algorithm based on data from the Kinect and 3D laser scanners. The only property used is that there is a defined neighborhood, or index order, in the data, which represents a perceptual order of points. This seemingly strong condition holds for most of the popular sensors, including 2D and 3D laser range scanners and the Microsoft Kinect. It should be noted that in general index structures are only given for single 3D scans; the structure is lost when multiple scans are combined into a 'global scan' (via alignment/registration), although, for example, kd-trees can introduce nearest-neighbor structures (in O(N log N)). We do not work on such structures, but intend our process as a tool for single scans. It aims at further processing, e.g. registration based on planar elements (see e.g. [51]).
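To make the ordered "sequence of 2D sub-scans" representation concrete, the sketch below converts a tilt-arm sweep into an index-ordered 3D point grid. It is only an illustration under simplified assumptions (a single tilt axis, fixed angular spacing, invented class and field names); it is not the acquisition code used on Dr. Faust.

/**
 * Illustrative conversion of an ordered sequence of 2D laser sub-scans, taken at
 * different tilt angles of the scanner arm, into an index-ordered 3D "depth image".
 * Row index = sub-scan (vertical order), column index = beam (horizontal order).
 */
public final class SubScanTo3D {

    /** One 2D sub-scan: ranges[j] is the distance measured by beam j at this tilt angle. */
    public static final class SubScan {
        final double tiltRad;     // tilt of the scan plane around the sensor's x-axis
        final double[] ranges;    // ordered beam readings
        final double startAngle;  // bearing of beam 0 in the scan plane [rad]
        final double angleStep;   // angular spacing between beams [rad]
        SubScan(double tiltRad, double[] ranges, double startAngle, double angleStep) {
            this.tiltRad = tiltRad; this.ranges = ranges;
            this.startAngle = startAngle; this.angleStep = angleStep;
        }
    }

    /** Returns points[row][col] = {x, y, z}; the index order given by the sensor is preserved. */
    public static double[][][] toPoints(SubScan[] sweep, double sensorOffset) {
        double[][][] pts = new double[sweep.length][][];
        for (int i = 0; i < sweep.length; i++) {
            SubScan s = sweep[i];
            pts[i] = new double[s.ranges.length][];
            for (int j = 0; j < s.ranges.length; j++) {
                double b = s.startAngle + j * s.angleStep;   // in-plane bearing of beam j
                double xs = s.ranges[j] * Math.cos(b);       // point inside the 2D scan plane
                double ys = s.ranges[j] * Math.sin(b) + sensorOffset;
                // rotate the scan plane by the tilt angle (rotation about the x-axis)
                pts[i][j] = new double[] { xs, ys * Math.cos(s.tiltRad), ys * Math.sin(s.tiltRad) };
            }
        }
        return pts;
    }
}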


Fig. 1.10 3D Scan Diagram

CHAPTER 2

MID-LEVEL GEOMETRY FEATURE (MLGF) REPRESENTATION

2.1 Introduction

In recent years, most range-sensing-based algorithms in computer vision, object recognition, and scene description have been based on a low-level representation using raw data points. The drawbacks of such a representation (high amount of data, low geometric information) limit its scalability. In 3D cases, it is more feasible to represent the data by mid-level geometric structures, such as planar patches, spheres, cylinders and cones. This chapter introduces a robust, real-time extraction algorithm for 3D geometric objects (planar patches, spheres, cylinders and cones) from 3D data sets. Such 3D objects can be used in object recognition, tracking, scene description and other 3D feature-based tasks in computer vision. The main advantages are:

• High data compression rate: representing a ball, for example, as a sphere means storing only the center point and radius. This is in contrast to representing a sphere by thousands of 3D data points.

• Higher geometric information level: storing data using mid-level geometric elements simplifies further processing steps. For example, finding boxes or corners in a map is significantly simpler if the map data consists of planar patches, not of 3D data points, since basic geometric information (e.g. normal directions) is already given.

Our algorithm finds geometric models (planes, cylinders, cones and spheres) using mid-level geometric features (MLGFs), which are lower-dimensional objects (here: 2D line segments and ellipses). They serve as an intermediate step between raw data and the targeted 3D objects. This is done in three steps (see Fig. 2.3); a minimal code sketch of the pipeline follows the list below.

1. Find all mid-level geometric features (MLGFs).

2. Form a neighborhood structure for the extracted MLGFs.

3. Perform a region growing algorithm on the feature sets to find the targeted high-level objects.
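The following skeleton mirrors the three steps listed above. The interface, type parameters and method names are assumptions made for illustration only; they do not appear in the thesis.

import java.util.List;

/** Minimal skeleton of the three-step MLGF pipeline; all names are illustrative. */
public interface MlgfPipeline<Feature, Primitive3D> {

    /** Step 1: fit mid-level features (line segments, ellipses) inside each 1.5D sub-scan. */
    List<Feature> extractFeatures(double[][][] orderedScan);   // [row][col] = {x, y, z}

    /** Step 2: link features of neighboring sub-scans into candidate sets. */
    List<List<Feature>> buildNeighborhood(List<Feature> features);

    /** Step 3: region growing over the candidate sets, then fit the 3D primitives. */
    List<Primitive3D> growAndFit(List<List<Feature>> candidates);

    default List<Primitive3D> run(double[][][] orderedScan) {
        return growAndFit(buildNeighborhood(extractFeatures(orderedScan)));
    }
}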

The guiding principle of our work is to look for MLGFs in lower dimensional subspaces and then to extend the data analysis to the missing dimensions. This is advantageous, because the number of MLGFs is typically significantly smaller than the number of raw data points. In addition, modeling of mid level elements in lower dimensions is an easier task. A main advantage of our three step solution is the immediate data reduction of raw data to MLGFs, see Fig. 2.1. The transition to 3D objects is then performed on mid-level representation (line segments and ellipses), which significantly reduces the data elements for analysis, explained in more details in Section 2.3 and 2.5. One feature of our approach is the use of the points’ natural order, as defined by the sensor, to traverse and perform model updates resulting in a total complexity of O(N) for N points. Previous approaches in the attempt to perform scene description

Fig. 2.1 Mid-Level Geometric Features (MLGFs) are introduced in our approach (bottom). Other approaches work directly on a point cloud to extract 3D objects.

require a nearest-neighbor structure, such as k-d trees, thereby adding more time complexity for N points. Specifically, these algorithms do not make any use of the data's initial order of points as defined by the 3D sensor, but assume general, unordered point clouds to generate their neighborhood structure (index-ordered 3D scans are often referred to as "depth images", versus "point clouds", which do not require a specific point order). The index order we use depends on the sensor, which seems to be a strong condition. However, most current 3D sensors do offer a certain, similar order of data. Examples of such sensors are 3D laser scanners, stereo cameras and the Microsoft Kinect. In contrast to un-indexed approaches, we take advantage of the data order to gain information about geometric properties of the data points. In general, a 3D laser scan is gained by a 'sweep' (a move of a 2D laser scanner) of single, planar 2D scans. [48] explains different ways to obtain 3D laser scans with similar arrangements. In a similar way, the output of the MS Kinect is ordered in rows and columns. Independent of the actual physical data arrangement, it is always possible to generate a data representation consisting of an ordered sequence ('vertical order') of 2D sub-scans, which themselves consist of an ordered sequence ('horizontal order') of data points. With this assumption our system implements

the straightforward approach of substituting data points in 2D sub-scans with 2D geometric features. Using the index order of raw data as given by the sensor provides a significant speed improvement and defines a neighborhood structure. However, such index structures are only given for single 3D scans; the structure is lost when multiple scans are combined into a global scan (via alignment/registration). That can be remedied by re-creating the order of points via virtual re-scanning. Assuming indexed data, features are extracted by traversing each data point once and adding it to an MLGF model in constant time O(1) until an error threshold is reached. Therefore, at the end of a 3D scan traversal, all 2D features (MLGFs) are found in O(N) time complexity. These MLGFs are then used to extract our targeted objects. For example, planes (planes = targeted objects) can be represented as a set of co-planar line segments (line segments = MLGFs), instead of a set of coplanar points (coplanar points = raw data). Hence, patches can be extracted by performing a region growing algorithm on certain candidate sets of line segments, which represent a neighborhood structure expressed by co-planarity between adjacent line segments. Similar to planes, sphere, cylinder and cone detection is performed using ellipses as MLGFs. The motivation for using ellipses to extract the targeted objects (sphere, cylinder and cone) is discussed in Section 2.4.1. Depending on the type of object, the rules differ, as described in ([18], [17]). Transformation of raw data to a higher representation can be seen as an estimate of hypothetical properties of the scanned environment and is a pre-processing step which augments the information of the data, yet assumes certain conditions of the environment. Note that the mid-level representation is an approximation of the original data set; it comes with a loss of precision. The precision of the original data set often represents noise. In case the assumptions of properties (e.g., co-planarity) are

24 Fig. 2.2 Flow diagram. 1) Input; 2) Mid-Level features generated from 1); 3) Scene description using features from 2).

justified, a precise point-wise representation can lead to over-fitting of the hypothetical structures, which justifies our approach of approximation. We use a model-fitting approach; the speed is gained by an O(1) model update and a dimensionality reduction from 2.5D to 1.5D (a 1D height-field, i.e. a single scan row of a range scan). In total, this achieves a complexity of O(N) + O(M), with N the number of data points and M the number of 2D ellipses, which on average is a low constant k times the number of scan lines, i.e. M = k√N (1). This totals to a complexity of O(N) + O(k√N) = O(N). The speed achieved by this algorithm allows for processing 30 frames per second on 320x240 Kinect data on a 2.8 GHz

(1) Assuming a square height field with √N rows and k objects per row.

25 Fig. 2.3 Flow diagram showing: 1) Input data, 2) Result of ellipse fitting, 3) After removal of elongated ellipses, 4) Ellipse candidate sets based on proximity. The blue line shows connected components. 5) Resulting sets of ellipses forming different conical objects 6) Fitted 3D models.

Fig. 2.4 Results of the algorithm. Top: 3D point cloud from MS Kinect (307200 points); Middle: extracted ellipses (purple) and line segments (blue); Bottom: fitted conic objects and planar patches together with the 3D point cloud. The cylinder in the picture is a cylindrical trash can with radius=0.2m and height=0.7m.

desktop computer (implementation in Java). The chapter is organized as follows: a comparison to related work (Sec. 2.2) is followed by line and ellipse extraction (Sec. 2.3, 2.4). Sec. 2.5 covers the computation of the neighborhood structure, followed by the 3D conic object detection. In Sec. 2.6.1, we improve the system by applying a Kalman filter that takes as input fast updates of model parameters. The applicability to higher-level robot vision tasks is demonstrated with experiments in Sec. 2.6. Section 2.7 concludes the chapter and discusses future work.
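The O(1) model update mentioned above can be realized with running sums; the sketch below shows this for a total-least-squares line fit. The class and method names are illustrative assumptions, not the thesis implementation, and the ellipse update follows the same pattern with higher-order moments.

/** Incremental (O(1) per point) total-least-squares line fit via running sums; illustrative only. */
public class IncrementalLineFit {
    private int n;
    private double sx, sy, sxx, syy, sxy;

    /** O(1) model update: fold one more point into the running sums. */
    public void addPoint(double x, double y) {
        n++; sx += x; sy += y; sxx += x * x; syy += y * y; sxy += x * y;
    }

    /** Direction angle of the fitted line (orientation of the major covariance axis). */
    public double direction() {
        double mx = sx / n, my = sy / n;
        double cxx = sxx / n - mx * mx, cyy = syy / n - my * my, cxy = sxy / n - mx * my;
        return 0.5 * Math.atan2(2.0 * cxy, cxx - cyy);
    }

    /** Mean squared orthogonal distance of the points to the fitted line (smaller eigenvalue). */
    public double meanSquaredError() {
        double mx = sx / n, my = sy / n;
        double cxx = sxx / n - mx * mx, cyy = syy / n - my * my, cxy = sxy / n - mx * my;
        return 0.5 * (cxx + cyy - Math.sqrt((cxx - cyy) * (cxx - cyy) + 4.0 * cxy * cxy));
    }
}

In such a scheme a segment is grown point by point along a sub-scan until meanSquaredError() exceeds a threshold, after which the segment is closed and a new one is started, mirroring the incremental line extraction described in Section 2.3.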

2.2 Related Work

Rusu et al. presented an approach to object fitting that works directly on 3D point data [57]; it uses surface normal vectors and forms clusters of points with similar direction. The clustering decomposition in [57] stays in the 3D space of the original data, which is a major difference from our approach. Object detection and tracking have long been performed in robot vision; many approaches process 2D RGB images. We will not compare to these approaches, but limit the related work to detection based on range data. The problem of sphere detection has been solved in various ways and, to the best of our knowledge, all previous approaches work on point clouds directly. None of these approaches make use of a two-step solution (points to ellipses, ellipses to spheres), which reduces the dimensionality of the problem from 3D to 1.5D subsets. In general, data modeling with pre-defined structures can also be solved by the Expectation Maximization (EM) approach. The system in [36] uses a split-and-merge extended version of EM to segment planar structures (not conic elements) from scans. It works on arbitrary point clouds, but it is not feasible for real-time operation. This is due to the iterative nature of EM, including a costly plane-point correspondence check in its core. An extension to conic elements would increase the

complexity even further. A popular solution to fit models to data is Random Sample Consensus (RANSAC, [12]). Being based on random sampling, it achieves near-optimal results in many applications, including linear fitting (which is the standard RANSAC example). However, RANSAC fails to model detailed local structures if applied to the entire data set, since, with small local data structures, nearly the entire point set appears as outliers to RANSAC. [72] applies RANSAC on regions created by a divide-and-conquer algorithm. The environment is split into cubes (a volume-gridding approach). If precise enough, the data inside each volume is approximated by planes, and small coplanar neighboring planes are merged. This approach has similarities to ours, in the sense that it first builds small elements to create larger regions. However, their split stays in the third dimension, while our split reduces the dimensionality from 3D to 1.5D, which makes the small-element generation, i.e. ellipse fitting, a faster operation. A more detailed comparison to RANSAC is given in Section 2.6.4. 3D versions of the Hough transform, as in [70], have the problem that, like RANSAC, local structures cannot be modeled. [70] remedies this using additional knowledge of the environment, given by ground plans of buildings. For this special application the Hough transform results in impressive models. In our case, we do not assume input from additional data, since our approach should work in applications such as mapping in a dynamic environment, or mapping of any a priori unknown region. Ellipse extraction is a crucial step in our solution. There are many different methods for detecting and fitting ellipses; [20] gives an extensive overview. These methods range from the Hough transform [76, 78], RANSAC [75], Kalman filtering [56] and fuzzy clustering [16] to least-squares approaches [35]. In principle they can be divided into two main groups: voting/clustering and optimization methods. The methods belonging to the first group (Hough transform, RANSAC, and fuzzy clustering) are robust against outliers

yet they are relatively slow, require large amounts of memory and have low accuracy. The second group of fitting methods is based on optimization of an objective function that characterizes the goodness of a particular ellipse with respect to the given set of data points. The main advantages of these methods are their speed and accuracy. On the other hand, these methods can fit only one primitive at a time (i.e. the data should be pre-segmented before the fitting). In addition, the sensitivity to outliers is higher than in the clustering methods. An analysis of the optimization approaches was done in [14]. There have been many attempts to make the fitting process computationally efficient and accurate. Fitzgibbon et al. proposed a direct least-squares-based, ellipse-specific method in [14]. Our approach to ellipse and line extraction belongs to the second group as a type of optimization, yet with the advantage of finding multiple objects at one time. It was motivated by the work of [18, 53], where a region growing algorithm on point clouds is proposed, testing an optimal fit of updated planes and conical objects. This chapter extends the work of [18] by including conical geometric models (spheres, cylinders and cones) and further improves on a numerically stable ellipse fitting approach by converting it to a fast incremental O(1) model update. Our system lays out a framework that utilizes the natural order of data points to form mid-level geometric features that are used to describe 3D geometric objects (planes and conical objects) in real time. Such a scene description can be used as a real-time pre-processing module for 3D feature-based tasks in robotics, including scene analysis, SLAM, etc., based on 3D data. The framework, which suggests the two-step approach, is currently implemented for planes and conic objects, see Figure 2.2.

2.3 Points to Line-Segments to Planes

The problem of plane detection has been solved in various ways. To the best of our knowledge, none of these approaches make use of a two-step (points to segments, segments to planes) solution, which in our opinion is an advantageous solution if the neighborhood constraints on the data set are met. All previous approaches work on point clouds directly; some of them also demand neighborhood constraints. These techniques show promising results, but contain the extra steps of segmenting the 3D space and then merging the results found in each one of these segments. However, in contrast to our approach, they do work on general 3D point clouds, since they create their own neighborhood structure. It is important to note that our approach could be extended to first create a neighborhood structure, but its main purpose, taking advantage of given (or cheap to compute) line segment structures, would be lost.

2.3.1 Line Segments

To simplify the description of the algorithm, we assume the data to come from an arrangement as shown in Figure 2.5. This arrangement is created by a sequence of single 2D sub-scans. There are numerous algorithms for line segment extraction from 2D laser scans in the literature; an overview is given in [45]. We therefore do not explain the extraction process itself, but only mention important properties of the line segments with respect to the plane segment extraction. We apply an incremental line extraction which works in O(N), similar to [67] and [45], yet implemented optimally using a line-update technique; for details please see [18]. Please note that the line extraction processes point sets from single 2D scans only (usually fewer than 1000 points), hence even algorithms with asymptotic complexity greater than O(N)

would be feasible for real-time applications. The following properties of 2D line

Fig. 2.5 Left: original point data of a simulated scan of a geometric object. Right: single 2D sub-scan with points and extracted line segments. Disconnection occurs, since points exceeding a distance threshold are not assigned a line segment.

segments are used for 3D plane extraction:

Property 1: Two non-collinear segments of a single sub-scan cannot originate from a real plane in the environment: in a sub-scan Si, non-collinear line segments are only coplanar in the plane P that constitutes the view direction, or ray direction, of the sub-scan. The plane P represents the impossible case of detection of linear objects in view direction. Trivially, two collinear segments s1, s2 of a single slice are coplanar in the real environment.

Property 2: Two 3D lines L1, L2, containing s1 and s2 respectively, are coplanar iff they are parallel or they intersect.

Based on these properties, a neighborhood relation N(s1, s2) between segments is established.

2.3.2 Segment Neighborhood N and Connected Components

We define a neighborhood relation N on segments of sub-scan Si. The neighborhood relation is based on co-planarity and proximity of segments.

Two segments s1 ∈ Si1, s2 ∈ Si2 from two sub-scans Si1, Si2 are neighbors (notation s1 ∼ s2) if all of the following properties hold:

Fig. 2.6 Black lines: the set of all lines from 2D subscans (compare fig. 2.5, right). The red lines visualize the ’vertically’ (see text) connected components with respect to the neighborhood relation N. They connect center-points of neighboring line segments. Note that single connected components consist of segments belonging to multiple planes.

Subsequent sub-scan origin

|i1 − i2| ≤ k, i.e. s1, s2 originate from sub-scans in the sequence no further away than k. In our experiments, we set k = 1, meaning we are looking at subsequent sub-scans. In special cases this can lead to disconnected planes due to outliers or small structures. Setting k to a larger value can remedy this, at the cost of a more complex neighborhood structure. Note that segments from a single sub-scan cannot be neighbors; they are either non-collinear (and therefore not co-planar, see Property 1, above), or they are not sufficiently close to define a single line (allowing for 'holes' in the scans to represent structures like doorways or windows). The subsequent sub-scan property leads to a vertical neighborhood relation ('vertical' if sub-scans are imagined as 'horizontal'), a relation between segments of (index-) neighboring sub-scans, see Figure 2.6.

Co-planarity

Assuming L1 and L2 are the lines containing the segments s1 and s2 respectively, then L1 is either parallel to L2, or L1 and L2 intersect. Due to noise, co-planarity is usually hard to define. It is not critical here, since we only pre-select co-planar candidates, i.e. we can handle co-planarity using a generous error threshold. The final co-planarity decision is made by the LSE approximation, see Section 2.3.3. N defines pairs of co-planar segments, and helps in solving our final task, which is to find sets of coplanar segments and to approximate them with a plane. N is not transitive, which means that a connected component in its entirety does not necessarily consist of coplanar segments. A connected component based on such a neighborhood structure contains only candidates for coplanar segment sets. However, if two segments are coplanar, they belong to one connected component. Hence, looking at connected components reduces the search space for mutually coplanar segments drastically (from O(n²) to O(n), n = number of segments). Given the neighborhood relation N, we compute connected components by iterating through all segments (O(n)) and checking the neighborhood properties for all segments of the previous sub-scan only. In detail, we traverse in a nested iteration

through all segments of a scan si, and inspect all segments of a scan si−1 (= property subsequent scan origin) for proximity and co-planarity. Neighboring segment pairs constitute edges in a neighborhood graph, with segments being the nodes. Strictly speaking, the creation of the neighborhood graph is an O(n²) operation. However, looking at a limited set of subsequent scans amounts to a total of (1/m)·O(n²), with m being the number of sub-scans. Observe that m is a significantly high number; in practice, for normal sized indoor/outdoor scans, n/m < 100 (note again: n is the number of segments, i.e. linear features, in a single sub-scan). Looking at single-sub-scan neighbors therefore practically reduces the connected component search to O(n). As mentioned before, connected components collect segments which are not necessarily mutually co-planar. In the follow up step, the region growing step, we examine each connected component to possibly split it up into multiple planes. Planes are built using a least squares approximation to sets of segments. Before we explain the region growing, we will describe details about optimal plane approximation. The optimal plane assignment given a set of segments is an extension of the well known LSE (orthogonal) point-plane fitting problem. During the region growing, a plane fitting test is performed for a previously computed plane Pk (based on a set {s1, .., sk} of k sufficiently co-planar segments) and a new single segment s̄, i.e. a new plane Pk+1 has to be computed for the set {s1, .., sk, s̄}. It is important that this computation can be managed by an O(1) update of Pk. The following sections will explain details.
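For illustration, the following Python sketch shows how such a bounded neighborhood search could be organized. The Segment type, its field names and the thresholds are hypothetical, introduced only for this example; they are not taken from the thesis implementation.

import numpy as np
from dataclasses import dataclass

@dataclass
class Segment:
    center: np.ndarray      # 3D center point of the line segment
    direction: np.ndarray   # unit direction vector of the containing line
    scan_idx: int           # index of the originating 2D sub-scan

def are_neighbors(s1, s2, max_gap=0.1, coplanar_tol=0.02):
    """Proximity + rough co-planarity pre-test with generous thresholds."""
    if abs(s1.scan_idx - s2.scan_idx) > 1:           # subsequent sub-scan origin (k = 1)
        return False
    if np.linalg.norm(s1.center - s2.center) > max_gap:
        return False
    cross = np.cross(s1.direction, s2.direction)
    if np.linalg.norm(cross) < 1e-6:                 # (nearly) parallel lines are candidates
        return True
    # non-parallel lines are co-planar if the connecting vector has (almost)
    # no component along the common normal of the two directions
    offset = s2.center - s1.center
    return abs(np.dot(offset, cross / np.linalg.norm(cross))) < coplanar_tol

def neighborhood_edges(sub_scans):
    """sub_scans: list of lists of Segment, ordered by sub-scan index."""
    edges = []
    for i in range(1, len(sub_scans)):
        for s in sub_scans[i]:
            for t in sub_scans[i - 1]:               # inspect the previous sub-scan only
                if are_neighbors(s, t):
                    edges.append((s, t))
    return edges

Because only the previous sub-scan is inspected, the pair test runs over a small, bounded candidate set for each segment, which is what makes the connected component search practically linear in the number of segments.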

2.3.3 Planes

We represent planes in Hesse normal form, i.e. a plane P is defined by a normal vector n and the distance d to the origin. In Hesse normal form, x · n = d for all x ∈ P. Given a set of segments, the task is to find the (orthogonal) least square error fitting plane.

The solution is well known for the related approximation problem for data points, see e.g. [53] or the linear algebra literature: given a set of data points pi, the best LSE approximating plane is defined by n · x = n · m, with n being the normal vector, given by the eigenvector to the minimal eigenvalue of the 3 × 3 covariance matrix of the pi, and m being the center of mass of the pi. In our case, we fit planes to segments rather than to points. We therefore have to integrate over the segments. The center of mass for k segments is

$$ m = \frac{\sum_{i=1}^{k} l_i c_i}{\sum_{i=1}^{k} l_i} $$

with li, ci denoting the length and center of line segment si, respectively, where each line segment is parameterized as $s_i = \lambda p_s^{[i]} + (1-\lambda)\, p_e^{[i]}$, $\lambda \in [0,1]$, with endpoints $p_s^{[i]}, p_e^{[i]}$. The covariance matrix is

$$ C = \begin{pmatrix} \Gamma(x_\lambda, x_\lambda) & \Gamma(x_\lambda, y_\lambda) & \Gamma(x_\lambda, z_\lambda) \\ \Gamma(y_\lambda, x_\lambda) & \Gamma(y_\lambda, y_\lambda) & \Gamma(y_\lambda, z_\lambda) \\ \Gamma(z_\lambda, x_\lambda) & \Gamma(z_\lambda, y_\lambda) & \Gamma(z_\lambda, z_\lambda) \end{pmatrix} \quad (2.1) $$

with

$$ \Gamma(\phi_\lambda, \psi_\lambda) = \sum_{i=1}^{k} \left( \int_0^1 \phi_\lambda^{[i]} \psi_\lambda^{[i]} \, d\lambda \;-\; m_\phi \int_0^1 \psi_\lambda^{[i]} \, d\lambda \;-\; m_\psi \int_0^1 \phi_\lambda^{[i]} \, d\lambda \;+\; \int_0^1 m_\phi m_\psi \, d\lambda \right) \quad (2.2) $$

and $\phi_\lambda^{[i]} = \lambda \phi_s^{[i]} + (1-\lambda)\phi_e^{[i]}$, where $\phi_s^{[i]}, \phi_e^{[i]}$ denote (the respective dimensions of) the start and end point of line segment si (analogously for ψ).

Plane Update Derivation

A given plane is updated by adding a line segment, represented by its start and end points. To perform this task in constant time, independent of the number of line segments already added to the plane model, we treat the line segment as an infinite number of points on a line and compute the Mean Square Error (MSE) after adding the line. Our derivation begins with the distance $D_p$ from a point p to a plane P with normal vector N and center c, $D_p = N \cdot (c - p)$; similarly, for n points, $D_{np} = \sum_{i=1}^{n} N \cdot (c - p_i)$. For the points of a line segment with start point S and end point E, the same sum is written as

$$ D_{np} = \sum_{i=1}^{n} \vec{N} \cdot \left( C - S + \frac{i}{n}(S - E) \right) \quad (2.3) $$

After adding a line segment to a plane model, the MSE is computed as

$$ \mathrm{MSE}_{line} = \frac{1}{n} \sum_{i=1}^{n} \left( \vec{N} \cdot \left( C - S + \frac{i}{n}(S - E) \right) \right)^2 \quad (2.4) $$

The next step defines the (point-to-plane) distance $T_i = \vec{N} \cdot \left( C - S + \frac{i}{n}(S - E) \right)$ for a point on the (continuous) segment [S, E]. As a result, the start point of that segment has distance $T_0 = \vec{N} \cdot (C - S)$ and the end point $T_n = \vec{N} \cdot (C - E)$. Switching to a simpler notation, let $k_1 = C - S$, $k_2 = S - E$, $k_3 = C - E$; substituting and differentiating both sides of the equation yields

$$ dT = \frac{\vec{N} \cdot k_2}{n} \, di \quad (2.5) $$

$$ \mathrm{MSE} = \frac{1}{n} \sum_{i=0}^{n} T_i^2 \quad (2.6) $$

$$ = \frac{1}{n} \int_{\vec{N} \cdot k_1}^{\vec{N} \cdot k_3} T^2 \, di \quad (2.7) $$

$$ = \frac{1}{n} \int_{\vec{N} \cdot k_1}^{\vec{N} \cdot k_3} T^2 \, \frac{n}{\vec{N} \cdot k_2} \, dT \quad (2.8) $$

$$ = \frac{1}{\vec{N} \cdot k_2} \left. \frac{T^3}{3} \right|_{\vec{N} \cdot k_1}^{\vec{N} \cdot k_3} \quad (2.9) $$

$$ = \frac{(k_3 \cdot \vec{N})^3 - (k_1 \cdot \vec{N})^3}{3 \, \vec{N} \cdot k_2} \quad (2.10) $$

$$ = \frac{\left( k_3 \cdot \vec{N} - k_1 \cdot \vec{N} \right) \left( (k_3 \cdot \vec{N})^2 + (k_3 \cdot \vec{N})(k_1 \cdot \vec{N}) + (k_1 \cdot \vec{N})^2 \right)}{3 \, \vec{N} \cdot k_2} \quad (2.11) $$

It is easy to show that $(k_3 \cdot \vec{N}) - (k_1 \cdot \vec{N}) = k_2 \cdot \vec{N}$; substitution results in

$$ \mathrm{MSE} = \frac{(k_3 \cdot \vec{N})^2 + (k_3 \cdot \vec{N})(k_1 \cdot \vec{N}) + (k_1 \cdot \vec{N})^2}{3} \quad (2.12) $$

As a result, each term in Eq. 2.12 is updatable in constant time, therefore the whole equation is updatable in constant time.
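As an illustration, Eq. 2.12 translates directly into a constant-time test; the sketch below (function and variable names are ours, not the thesis code) shows that evaluating the MSE contribution of a segment needs only two dot products, independent of the number of points the segment represents.

import numpy as np

def segment_mse(N, C, S, E):
    """Mean squared point-to-plane distance of the segment [S, E] with respect
    to a plane with unit normal N and center C (Eq. 2.12); no loop over points."""
    k1 = np.dot(N, C - S)   # signed distance of the start point
    k3 = np.dot(N, C - E)   # signed distance of the end point
    return (k3 * k3 + k3 * k1 + k1 * k1) / 3.0

# usage: check whether a candidate segment still fits the current plane hypothesis
# if segment_mse(normal, center, seg_start, seg_end) < threshold: accept the segment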

Algorithm 1 planeGrow(Segment S, Label L)

if PLabel.isEmpty then
    PLabel.addSegment(S)
else
    if ¬S.hasLabel then
        PL.addSegment(S)
        Ω ← PLabel.calcMSE()
        if Ω < E then
            S.Label ← L
        else
            PLabel.removeSegment(S)
            Label ← L + 1
        end if
    else
        if S.Label == L then
            return  {recursion STOP}
        end if
        M ← mergePlane(PL, PS.Label)
        Ω ← M.calcMSE()
        if Ω < E then
            PL ≡ PS.Label ≡ M
        else
            return  {recursion STOP}
        end if
    end if
end if

2.3.4 Plane Update

Equation 2.2 offers an important property: the computation of the center point and of the integrals is clearly separated, suggesting a simple update strategy when computing the optimal plane for the set {s1..k} ∪ {sk+1}, given the optimal approximating plane Pk for {s1..k}. This update is a segment based version of the update derived in [53], where such an optimal plane-update is described for point sets. In order to perform the update, we store the terms:

$$ T_1 = \sum_{i=1}^{k} \int_0^1 \phi_\lambda^{[i]} \psi_\lambda^{[i]} \, d\lambda \qquad T_2 = \sum_{i=1}^{k} \int_0^1 \psi_\lambda^{[i]} \, d\lambda $$

$$ T_3 = \sum_{i=1}^{k} \int_0^1 \phi_\lambda^{[i]} \, d\lambda \qquad T_4 = \sum_{i=1}^{k} \int_0^1 m_\phi m_\psi \, d\lambda $$

$$ T_5 = \sum_{i=1}^{k} l_i c_i \qquad T_6 = \sum_{i=1}^{k} l_i $$

Given these single terms, equation 2.2 can be formed. Updating the terms from k to k + 1 is done by simple addition of the appropriate integrals. The integrals are straightforward; we will not define them here. The simple O(1) update is optimal for computational speed: without it, each computation of the covariance matrix would cost O(n) every time a new segment is added. Note that in our case n denotes the (still relatively uncritical) number of segments, not data points. Hence, in contrast to [53], who reports this update as the most significant speedup step, in our case the speedup is mainly given by the principle of dividing the task into the steps of first finding segments, then planes.
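A minimal sketch of this bookkeeping follows; the class and variable names are ours (not the thesis code), the per-segment integrals are written in closed form, and the covariance matrix of Eq. 2.2 is assembled on demand from the stored sums.

import numpy as np

class IncrementalPlaneFit:
    """Accumulates per-segment integrals so that adding a segment is O(1)."""
    def __init__(self):
        self.k = 0                           # number of segments added so far
        self.sum_cross = np.zeros((3, 3))    # sum_i of integral x_lambda x_lambda^T dlambda
        self.sum_lin = np.zeros(3)           # sum_i of integral x_lambda dlambda
        self.sum_lc = np.zeros(3)            # sum_i l_i * c_i
        self.sum_l = 0.0                     # sum_i l_i

    def add_segment(self, p_start, p_end):
        ps, pe = np.asarray(p_start, float), np.asarray(p_end, float)
        # closed-form integrals over lambda in [0, 1] of the linear parametrization
        self.sum_cross += (np.outer(ps, ps) + np.outer(pe, pe)) / 3.0 \
                        + (np.outer(ps, pe) + np.outer(pe, ps)) / 6.0
        self.sum_lin += (ps + pe) / 2.0
        length = np.linalg.norm(pe - ps)
        self.sum_lc += length * (ps + pe) / 2.0
        self.sum_l += length
        self.k += 1

    def plane(self):
        m = self.sum_lc / self.sum_l                     # length-weighted center of mass
        cov = self.sum_cross \
            - np.outer(m, self.sum_lin) - np.outer(self.sum_lin, m) \
            + self.k * np.outer(m, m)                    # Eq. 2.2 assembled from the sums
        eigvals, eigvecs = np.linalg.eigh(cov)
        n = eigvecs[:, 0]                                # eigenvector of the smallest eigenvalue
        return n, float(np.dot(n, m))                    # Hesse form: normal n and distance d = n . m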

2.3.5 Region Growing

The general algorithm is a straightforward region growing approach which determines co-planar segments and their best planar approximation. It labels segments (same label = co-planar), and computes the optimal plane approximation. Two things important to this step are: a) we operate on line segments, b) we operate on a limited candidate set.

Given an initial set of k mutually coplanar segments s1..k and its approximating plane pk, the algorithm recursively checks whether a neighboring segment sk+1 is coplanar to pk. In order to do so, pk is updated to pk+1, and the least square error LSE(pk+1) is computed. If the LSE is below a certain threshold, the updated plane pk+1 is kept and the segment sk+1 is labeled accordingly; otherwise a new plane and label is assigned to sk+1. It is crucial that the LSE can be updated instead of being recomputed, analogous to the update of planes. The approach is exactly the same: the LSE is split into single, updatable terms which are kept in memory. The derivation of the segment based LSE is analogous to the one in Section 2.3.4.

The stop case of the recursion is reached when a neighboring segment is already assigned a plane pa. In this case, it has to be determined if the planes pk and pa are (sufficiently) equal, which then leads to merging. Merging is a simple re-labeling process, yet requires recomputing the plane approximating the newly merged set of segments after the re-labeling is finished (O(n)). Region growing is performed for each connected component. Each recursive region growing process terminates when all segments of the current connected component are labeled. The labeling algorithm terminates when all connected components are processed. Figure 2.7 shows the result of the region growing process: the connected components are split correctly, and only co-planar segments are labeled alike. The order of magnitude for the algorithm is O(N) for segment extraction, where N is the number of data points.

Fig. 2.7 Left: Final result of the region growing labeling process based on the connected components of figure 2.6. Same color denotes same label. Right: related planes. The holes in the figure are a result of point regions with insufficient point density. In those regions, no line segments were approximated; compare figure 2.5, right.

2.3.6 Line Segment Extraction

We apply an incremental line extraction which works in O(N), similar to [67] and [45], yet implemented optimally using a line-update technique, see Algorithm 1. Please note that the line extraction processes point-sets from single 2D scans only (usually less than 1000 points), hence even algorithms with asymptotic complexity greater than O(N) would be feasible for real-time applications.

2.4 Ellipse Extraction

2.4.1 Motivation

The intersection of the scanning plane with a cone or cylinder can form several geometric curves. One of them is a hyperbola, only formed when the intersecting plane (line of sight) is parallel to the cone's axis. Similarly, a parabola can be formed when the intersecting plane is parallel to the tangent plane of the cone (see Fig. 2.8). In all other cases the intersecting plane forms a circle or an ellipse, explained in detail by [22]. A 3D scan of a cone (facing up) has a scanning plane whose intersection contains a parabola, unless the sensor is looking down from above the top of the cone, in which case it produces a hyperbola. The latter is a very specific case, which is not the focus of this paper. The parabola intersection is approximated with an ellipse, which, as explained later in this paper, introduces an insignificant error. A circle is a general ellipse with equal radii. Therefore the task of modeling the 3D geometric primitives reduces to just finding ellipses as MLGFs. For each scan line we traverse its data points iteratively, and try to fit ellipses. We determine maximal connected subsets of each scan line which fit ellipses within a certain radius interval and under an error threshold T. This process is performed in an iterative way, updating currently found ellipses in O(1) by adding the next

candidate point. The ellipses $E_i^s = (c, r_1, r_2)$ in scan line s are stored as center point c and radii r1 and r2. Traversing each scan line therefore leaves us with a set of ellipses $E = \{E_i^s\}$. The set of all ellipses E is decomposed into subsets of connected components based on ellipse center proximity. After ellipse detection, it holds for cylinders and cones that the center points of participating ellipses are collinear, and concyclic for spheres. It is therefore sufficient to scan the connected components for being collinear or concyclic in order to find model candidates. Please note that this approach reduces the effort, again, to a simple line or ellipse fitting, yet now even in the significantly smaller space of center points. Such lines and ellipses from the center point space are again found with an iterative O(1) update approach. Cones, spheres and cylinders are characterized by the change of radius along the vertical axis. Hence, we can determine if a connected component of ellipses constitutes one of the models.

Fig. 2.8 Illustration of conic section cases [74].

A well-known non-iterative approach for fitting ellipses on segmented data (all points belong to one ellipse) is Fitzgibbon's approach described in [14]. It is an optimization approach which uses a (6x6) scatter matrix S, as shown in Eq. 2.13, to solve the problem of finding the six parameters which describe an ellipse. A more detailed explanation can be found in [14] and is out of the scope of this paper. However, it is important that S has entries $S_{x^p y^q} = \sum_{i=1}^{N} x_i^p y_i^q$, where p and q are integers. S has to be decomposed into eigenvalues and eigenvectors. The 6D vector-solution describing an ellipse is the eigenvector corresponding to the smallest eigenvalue. Though the algorithm is robust and effective, it has two major flaws. First, the computation of the eigenvalues of Eq. 2.13 is numerically unstable. Second, the localization of the optimal solution is not always correct, as described by [20].

Algorithm 2 extractEllipses(Points, THRESHOLD)

while (nextPoint = Points.next()) ≠ NULL do
    addPoint(nextPoint)
    fitModel()
    calculateError()
    if error > THRESHOLD then
        removeLastPoint()
        saveCurrentModel()
        startNewModel()
    end if
end while

To overcome the drawbacks, [20] splits S into three (3x3) matrices (S1, S2, S3), see Eq. 2.14a,b. With this decomposition, [20] achieves the same theoretical result, but with improved numerical stability and without the localization problem. However, both algorithms, [20] and [14], only find a single ellipse fit on a given set of pre-segmented points. We propose a solution based on Fitzgibbon's approach and the numerical improvement made by [20] that is able to handle non-segmented data, i.e. data containing points not necessarily belonging to an ellipse and possibly containing disjoint subsets of points belonging to different ellipses. Our approach finds multiple ellipses in a single pass. We achieve this by performing an ellipse model update in O(1) and making use of the natural point order defined by the range sensor, see Algorithm 2. To perform the O(1) update we store each component of the sums $S_{x^p y^q}$ required to form the matrices S1, S2, S3. This means that when adding a new point to the existing model, we only have to perform an addition to each of the sum terms, which is a constant time operation. For a set of N points the model update will be executed exactly N times; hence our algorithm can find multiple ellipses in non-segmented data in O(N).

$$ S = D^T D = \begin{pmatrix}
S_{x^4} & S_{x^3 y} & S_{x^2 y^2} & S_{x^3} & S_{x^2 y} & S_{x^2} \\
S_{x^3 y} & S_{x^2 y^2} & S_{x y^3} & S_{x^2 y} & S_{x y^2} & S_{x y} \\
S_{x^2 y^2} & S_{x y^3} & S_{y^4} & S_{x y^2} & S_{y^3} & S_{y^2} \\
S_{x^3} & S_{x^2 y} & S_{x y^2} & S_{x^2} & S_{x y} & S_{x} \\
S_{x^2 y} & S_{x y^2} & S_{y^3} & S_{x y} & S_{y^2} & S_{y} \\
S_{x^2} & S_{x y} & S_{y^2} & S_{x} & S_{y} & S_{1}
\end{pmatrix} \quad (2.13) $$

$$ S = \begin{pmatrix} S_1 & S_2 \\ S_2^T & S_3 \end{pmatrix} \qquad
S_1 = \begin{pmatrix} S_{x^4} & S_{x^3 y} & S_{x^2 y^2} \\ S_{x^3 y} & S_{x^2 y^2} & S_{x y^3} \\ S_{x^2 y^2} & S_{x y^3} & S_{y^4} \end{pmatrix} \quad (2.14a) $$

$$ S_2 = \begin{pmatrix} S_{x^3} & S_{x^2 y} & S_{x y^2} \\ S_{x^2 y} & S_{x y^2} & S_{y^3} \\ S_{x^2} & S_{x y} & S_{y^2} \end{pmatrix} \qquad
S_3 = \begin{pmatrix} S_{x^2} & S_{x y} & S_{x} \\ S_{x y} & S_{y^2} & S_{y} \\ S_{x} & S_{y} & S_{1} \end{pmatrix} \quad (2.14b) $$
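The O(1) update described above amounts to keeping the power sums $S_{x^p y^q}$ as running scalars. The following sketch (names are ours, not the reference implementation) shows the accumulation and the assembly of S1, S2, S3 from Eq. 2.14:

import numpy as np

class EllipseAccumulator:
    """Running power sums S_{x^p y^q}: adding or removing a point is O(1)."""
    KEYS = [(4,0),(3,1),(2,2),(1,3),(0,4),(3,0),(2,1),(1,2),(0,3),
            (2,0),(1,1),(0,2),(1,0),(0,1),(0,0)]

    def __init__(self):
        self.s = {k: 0.0 for k in self.KEYS}

    def add(self, x, y, sign=+1.0):
        for p, q in self.KEYS:
            self.s[(p, q)] += sign * (x ** p) * (y ** q)

    def remove(self, x, y):
        self.add(x, y, sign=-1.0)

    def matrices(self):
        s = self.s
        S1 = np.array([[s[(4,0)], s[(3,1)], s[(2,2)]],
                       [s[(3,1)], s[(2,2)], s[(1,3)]],
                       [s[(2,2)], s[(1,3)], s[(0,4)]]])
        S2 = np.array([[s[(3,0)], s[(2,1)], s[(1,2)]],
                       [s[(2,1)], s[(1,2)], s[(0,3)]],
                       [s[(2,0)], s[(1,1)], s[(0,2)]]])
        S3 = np.array([[s[(2,0)], s[(1,1)], s[(1,0)]],
                       [s[(1,1)], s[(0,2)], s[(0,1)]],
                       [s[(1,0)], s[(0,1)], s[(0,0)]]])
        return S1, S2, S3

Per [20], the ellipse parameters are then obtained from a small eigenproblem built from S1, S2 and S3; since every matrix entry is already a running sum, that step costs the same regardless of how many points support the current ellipse hypothesis.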

2.4.2 System Limitations

Being based on finding ellipses from the conical intersection of a scanning plane with cylinders, cones and spheres, our system is limited with respect to tilting angles that lead to non-elliptical intersection patterns. The appearance of such patterns depends on the size of the object (cylinder) and on the size and internal angle (cone). In practice, our system performs robustly on tilting angles of up to 45 degrees, see Fig. 2.9, yet is naturally limited when the angle becomes significantly larger. One remedy for this problem (for tilts resulting from rotation around the z axis) is to also use ellipses found by intersection with vertical instead of horizontal scanning planes, i.e. a 90 degree rotated sight of the scene, which turns tilts of > 45 into tilts of < 45. An example of tilted objects is illustrated in Fig. 2.9 and Fig. 2.10.

Fig. 2.9 Object detection. Top to bottom: RGB image (not used); 3D point cloud from MS Kinect (307200 points); extracted ellipses after the pre-filter; to the right are the fitted conic objects; at the bottom are fitted conic objects and 3D point cloud. The cylinder in the picture is a cylindrical trash can.

2.5 3D Object Extraction

Once all ellipses are extracted, further processing is performed on these ellipses only; the original point data is not used any longer. The ellipses are used to form 3D conical objects. In this case, the centers of supporting ellipses are collinear and lie inside an error margin φ, the maximum allowed distance between neighboring ellipses. Only the first k neighbors are compared, to eliminate outlier ellipses due to noise. This allows for selectively skipping vertically and choosing better supporting ellipses for the object.

More formally: the 1.5D ellipse extraction creates a set $E = \{E_i\}$, $E_i = (x_i, y_i, z_i, R1_i, R2_i)$ of ellipses, with center point $c_i = (x_i, y_i, z_i)$. For our target objects, conic 3D objects, we are looking for sequences $S_j = (c_{i_1}, c_{i_2}, .., c_{i_k})$ of collinear or concyclic center points $c_i$. The order of extraction imposes an order on the ellipses $E_i \in S_j \subset E$; they are ordered in the z-dimension:

$$ i_1 < i_2 \Rightarrow z_{i_1} < z_{i_2} \quad (2.15) $$

Fig. 2.10 A tilted (rotation around x-axis, tilted towards camera) parking cone, and ellipses detected (red). Left: tilt=5 degrees, Middle: tilt=20 degrees, Right: tilt=40 degrees. In all cases, the algorithm is able to fit a cone model (not explicitly shown in figure).

Due to this order, we can extract collinear/concyclic center-points using the O(|E|) line/ellipse fitting algorithm from Section 2.3, which in its core again has an O(1) update. Similarly, ellipse models fitting the center points are extracted, using the method described in Sec. 2.4.

Once sequences of collinear center-points Sj are extracted, the corresponding radii r1i, r2i can be analyzed to detect conic objects. It is sufficient to examine the radius orthogonal to the scan-lines, r2. Hence, object detection is performed in the (z, r2), or short: (z, r), space, see Fig. 2.11.

Fig. 2.11 The three curves represent ellipse radii (R-axis) along the center lines (Z-axis) of a cylinder (blue), cone (red) and sphere (green half circle) in (z,r) space.

Fig. 2.12 A flat sheet of paper is placed in front of the Kinect sensor at a known distance to test the accuracy of the extracted planar patch. Results are shown in Table 2.6.

In Fig. 2.11, a cylinder appears as the straight line parallel to the Z-axis. In practical applications, we allow for 5 degrees deviation from parallel due to noise.

For a cylinder, $f_{cylinder}(z) = r$ with r constant, see the blue line in Fig. 2.11. For a cone, $f_{cone}(z) = \alpha z$, i.e. r depends linearly on z, see the red line in Fig. 2.11. However, in the practical application, the slope of the RZ line should be greater than a threshold θ to be considered a cone (for Kinect data θ = 5 deg). Similarly, for a sphere, $f_{circle}(z) = \sqrt{r^2 - (z - z_i)^2}$, the radius describes a circle, see the green line in Fig. 2.11. The detection of objects and collinear sequences is performed in parallel, again using incremental line (and ellipse) fitting procedures, yet in different spaces: while the center-collinearity is detected by a line fitting algorithm in the (x, y, z) space, the detection of 3D objects is performed by a line fitting (for cylinders and cones) and an ellipse fitting (for spheres) in the (z, r) space. The incremental order in z reduces the collinearity finding to a 1.5D problem. Please note that the same concept of incremental line/ellipse fitting is utilized to solve tasks not only in different phases of the algorithm (first, ellipse finding in scan rows, then line/ellipse finding on ellipse center points), but also in different spaces, the decomposed 1.5D data space and the (z, r) space. Using the O(1) update principle in all cases is the main reason for the fast performance of the algorithm. In addition, the nature of the fitting procedures automatically separates objects in composed scenes, see e.g. Fig. 2.13, cylinder and sphere.
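The following is a sketch of this (z, r) classification under assumed thresholds; the function, the threshold values and the sphere fit are illustrative only and are not the thesis implementation.

import numpy as np

def classify_zr(z, r, slope_tol_deg=5.0, residual_tol=0.01):
    """Classify a connected component of ellipses by the behaviour of the
    radius r along the axis coordinate z (compare Fig. 2.11)."""
    z, r = np.asarray(z, float), np.asarray(r, float)
    a, b = np.polyfit(z, r, 1)                       # least-squares line r = a*z + b
    line_residual = np.sqrt(np.mean((a * z + b - r) ** 2))
    if line_residual < residual_tol:                 # radii lie on a line in (z, r)
        if abs(np.degrees(np.arctan(a))) < slope_tol_deg:
            return "cylinder"                        # constant radius
        return "cone"                                # radius linear in z
    # otherwise test the circle model r^2 + (z - z0)^2 = R^2 (sphere):
    # r^2 + z^2 = 2*z0*z + (R^2 - z0^2) is linear in [2z, 1]
    A = np.column_stack([2 * z, np.ones_like(z)])
    (z0, c), *_ = (np.linalg.lstsq(A, r ** 2 + z ** 2, rcond=None),)[0][:1] + ()
    R2 = c + z0 ** 2
    pred = np.sqrt(np.maximum(R2 - (z - z0) ** 2, 0.0))
    if R2 > 0 and np.sqrt(np.mean((pred - r) ** 2)) < residual_tol:
        return "sphere"
    return "unknown"

# usage: classify_zr(z_values, r2_values) for each extracted center-point sequence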

2.6 Experiments

We performed our experiments on single data sets and on sequences of data sets, the latter showing the advantages of the real time performance of our algorithm. In the sequence experiments, we also utilized Kalman filtering in an attempt to further improve the accuracy. The motivation was that the data from the Microsoft Kinect suffers in accuracy.

Fig. 2.13 Results of the algorithm showing fitted 3D geometric models. The sphere placed on top of a cylinder is successfully recognized as a separate object.

2.6.1 Improving the System using Kalman Filtering

The data from the Microsoft Kinect sensor suffers from limited accuracy. The work in [30] shows that there is a direct relation between the sensor error and the measured distance, ranging from a few millimeters (at low range, about 1m) up to 4 cm (at far range, about 5m). To improve the data accuracy, we use a Kalman filter as a supporting tool to reduce the amount of error in the measurement observations. Kalman filtering is a well known approach to reduce noise from a series of measurements observed over time for a dynamic linear system. In our experiments, we utilize the Kalman filter to increase precision and to be able to do object tracking. Although Kalman filtering is not directly related to the main contribution of this paper (model fitting), it emphasizes the aspect of real-time performance of our algorithm. Only with a fast performing algorithm is improvement based on feedback filtering (i.e. Kalman) feasible. Our system, due to its real-time behavior, is able to feed data into the Kalman filter at a high frequency (about 30Hz), which makes the combination of Kalman filtering with our approach feasible and useful.
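For illustration, a minimal scalar Kalman filter of the kind described above might look as follows. The constant-position process model and the variance values are assumptions chosen for the sketch; the measurement variance is meant to reflect the distance-dependent Kinect error reported in [30].

class ScalarKalman:
    """Minimal 1D Kalman filter, e.g. for smoothing the measured distance
    or radius of a tracked object between frames (about 30 Hz input)."""
    def __init__(self, x0, p0=1.0, process_var=1e-4, meas_var=4e-4):
        self.x = x0            # state estimate (e.g. distance in meters)
        self.p = p0            # estimate variance
        self.q = process_var   # process noise variance (assumed constant-position model)
        self.r = meas_var      # measurement noise variance (sensor dependent)

    def update(self, z):
        self.p += self.q                    # predict step
        k = self.p / (self.p + self.r)      # Kalman gain
        self.x += k * (z - self.x)          # correct with new measurement z
        self.p *= (1.0 - k)
        return self.x

# usage: kf = ScalarKalman(x0=first_measurement); smoothed = kf.update(next_measurement)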

2.6.2 Basic Test: Object detection, detection speed

We acquire 3D range data from a Microsoft Kinect sensor to demonstrate the algorithm's basic functionality and object recognition using the methods described in Sec. 2.5. The scene consists of a trash can (cylinder), a parking cone and an exercise ball (sphere) on the floor next to each other, see Fig. 2.9. The algorithm successfully finds the correct conical objects in the scene at a rate of 30fps for 320×240 Kinect depth data (implementation in Java on a Dell 2.8GHz desktop with 8GB RAM).

2.6.3 Comparison to RANSAC

A widely used technique to detect planes is RANSAC [72], being fast and simple to implement. In contrast to RANSAC, which detects planes by looking at a data set in its entirety (globally), our approach detects planar elements by looking at connected regions. Figure 2.14 shows a comparison between RANSAC (bottom left) and our algorithm (bottom right) on a data set showing a cubicle. RANSAC was used in an iterative way: after detecting a single plane, the supporting points were deleted, and RANSAC was then run again on the reduced data set. Since our algorithm takes local connectivity into account, our result appears less cluttered; only locally connected regions are represented by single planes. RANSAC does not have the connectivity information, and therefore assigns disconnected areas to the same plane, leading to a result that is less perceptually consistent. This is especially visible at the posters and the small board underneath. An interesting case is the (non-planar) curtain under the table: while our algorithm does not detect any planar structures, RANSAC tries to fit multiple cluttered planes, which are disconnected but have sufficient support due to the wavelike nature of the curtain.

Fig. 2.14 Cubicle (top left), laser range scan (top right), RANSAC plane detection (bottom left), our algorithm (bottom right). The cubicle has a small table and 3 posters at the back wall. Our algorithm generates a more connected, perceptually consistent plane representation of the scene than RANSAC.

In a recent publication, Holz et al. [23] performed an experimental comparison of our approach and other state-of-the-art plane detection algorithms. As described in [23], the authors evaluated the correctness and efficiency of different approaches using the publicly available SegComp data set (available at http://marathon.csee.usf.edu/seg-comp), for which ground truth plane segmentations are available. The data was separated into two parts: 10 training and 30 test images. For all images, ground truth segmentations and plane parameters are available. The evaluation tool averages, over all test images, correctly segmented planes, over-segmented planes, under-segmented planes and false detections (labelled as 'noise'). In addition, the tool assesses the average deviation of the estimated plane orientations from ground truth. The following comparative evaluations include segmentation results gathered by Gotardo et al. [19] and Oehler et al. [49], and results achieved using the publicly available implementations of Georgiev et al. [18], Trevor et al. [66] and Holz et al. [24], [23]. All implementations are available in the Point Cloud Library (PCL), located at http://pointclouds.org. Table 2.1 shows the experimental results on the publicly available SegComp PERCEPTRON [25] data set. The results of that comparison show that our approach [18] has lower accuracy than [23].

Fig. 2.15 Result of ellipse extraction in noisy, simulated data. Blue dots are input data; shown are the fitted ellipse, its center (marked red) and supporting points (marked black). Starting from top to bottom, random noise is added with σ: 0.00, 1.0, 2.0; Rmajor = 40, Rminor = 20.

Table 2.1 Overall segmentation results on all 30 test images, assuming 80% pixel overlap, as described by [23].

Approach               correctly detected   orientation deviation   over-segmented   under-segmented   missed (not detected)   noise (nonexistent)
Literature reference:
USF [19]               8.9 (60.9%)          2.7°                    0.4              0.0               5.3                     3.6
WSU [19]               5.9 (40.4%)          3.3°                    0.5              0.6               6.7                     4.8
UB [19]                9.6 (65.7%)          3.1°                    0.6              0.1               4.2                     2.8
UE [19]                10.0 (68.4%)         2.6°                    0.2              0.3               3.8                     2.1
UFPR [19]              11.0 (75.3%)         2.5°                    0.3              0.1               3.0                     2.5
Oehler et al. [49]     11.0 (75.3%)         2.5°                    0.3              0.1               3.0                     2.5
Implemented:
Georgiev et al. [18]   6.5 (44.2%)          3.4°                    2.3              0.03              5.8                     5.0
Trevor et al. [66]     8.3 (57.0%)          2.9°                    0.6              0.13              5.2                     2.1
Holz et al. [24]       9.6 (65.7%)          3.1°                    0.6              0.1               4.2                     2.8
Holz et al. [23]       11.0 (75.3%)         2.6°                    0.4              0.2               2.7                     0.3

2.6.4 Comparison to RANSAC for Ellipse Fitting

In this section, our approach for finding ellipses (in 2D) is compared to the popular RANSAC algorithm. We perform the test on a simulated data set that represents a 2D slice of a stretched cylinder in front of a wall. We compared the results with no prior assumptions (i.e. with fixed parameters). As a stop criterion for RANSAC, we chose the number of iterations k giving us a p = 0.95 confidence of finding the correct model. In our data set, the inlier/datapoint ratio w is w = 0.3, and n = 4 points are chosen to define an ellipse. As derived in [12], this results in a number of iterations

$$ k = \frac{\log(1-p)}{\log(1-w^n)} < 369. $$

We observed two issues with RANSAC:

1. RANSAC is significantly slower than our approach, due to the number of attempts to find inliers: although it is a k·O(n) = O(n) algorithm, the constant k is significant in this case (in our approach, k = 1).

2. In our data set, the outliers are not strictly randomly distributed, but structured (wall), which is a valid assumption for indoor and outdoor environments. Without restricting RANSAC to specific ellipses with a constrained ratio of radii (minor/major axis), it finds ellipse support in the outliers (in our case, the linear structures), leading to degenerated results, see Fig. 2.16(b), center. This is a RANSAC-inherent problem, since the approach does not take into account data point neighborhood constraints. Constraining RANSAC to choose points from smaller neighborhoods leads in turn to less robustness to noise. In comparison, our approach is able to find ellipses with minimal constraints on the ratio of radii, and is more robust to noise. Naturally, RANSAC improves when the ratio of radii is specified with stronger constraints, see Fig. 2.16(c), bottom.

In summary, the basic RANSAC approach is less feasible for the purpose of finding ellipses in data for reasons of real time performance and parameter flexibility.
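For reference, the RANSAC iteration bound used above follows directly from the standard confidence formula; a short check with the values from the experiment (p = 0.95, w = 0.3, n = 4):

import math

def ransac_iterations(p=0.95, w=0.3, n=4):
    """Number of RANSAC draws needed to sample one all-inlier minimal set
    with confidence p, given inlier ratio w and minimal-set size n."""
    return math.log(1.0 - p) / math.log(1.0 - w ** n)

print(ransac_iterations())   # about 368.3, i.e. fewer than 369 iterations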

2.6.5 Robustness of Mid-Level Geometric Features to Noise

Robustness to noise is tested by generating a data set (see Fig. 2.15) and varying the amount σ of Gaussian noise. For each σ, a series of 50 data sets is generated and analyzed by our algorithm to detect the ellipse. The results, see Fig. 2.17, show a benign behavior towards noise, even at a very high noise level. The variance in radius matches the variance in noise as expected (the figure shows the results without Kalman filtering; with Kalman filtering, the ellipses converge to the ground truth ellipse, as expected). Note that even with noise of σ = 2.0 the ellipse (Rmajor = 40, Rminor = 20) is still detected. The error threshold for each noise experiment is set proportional to the amount of noise. In real world data, this parameter can be determined by the sensor's technical specifications, and is therewith fixed.

(a) Without Kalman filter

DG    DM    Dσ        RM    Rσ        N
1.0   1.03  1.08e-3   0.20  1.08e-3   56
1.5   1.52  1.15e-3   0.17  7.76e-4   54
2.0   2.02  2.11e-3   0.16  1.25e-3   57
2.5   2.47  2.62e-3   0.16  1.29e-3   63
3.0   3.05  5.16e-3   0.15  3.00e-3   54

(b) With Kalman filter

DG    DK    DKσ       RK    RKσ       N
1.0   1.03  1.29e-4   0.20  1.47e-7   56
1.5   1.52  5.94e-5   0.17  1.27e-7   54
2.0   2.02  8.08e-5   0.16  7.43e-7   57
2.5   2.47  4.46e-4   0.16  6.84e-7   63
3.0   3.05  1.31e-3   0.16  1.91e-6   54

Table 2.2 Accuracy test. Table (a): without Kalman filter. DG: ground truth distance, DM: mean measured distance, Dσ: standard deviation of distance measurements, RM: mean measured radius, Rσ: standard deviation of radius measurements, N: number of measurements. Table (b): with Kalman filter, labels accordingly (DK, DKσ, RK, RKσ).

DG   O%   R     Rσ        RK    RKσ       N
1.0  10%  0.20  9.89e-4   0.20  1.78e-4   61
1.0  20%  0.17  7.39e-4   0.17  2.17e-4   57
1.0  30%  0.14  6.58e-4   0.14  1.11e-4   61
1.0  40%  0.11  3.16e-3   0.11  3.71e-4   61
1.5  10%  0.16  5.72e-4   0.16  1.10e-4   58
1.5  20%  0.16  4.03e-2   0.16  4.59e-3   67
1.5  30%  0.16  3.26e-2   0.17  4.21e-3   63
1.5  40%  0.13  1.27e-2   0.13  1.88e-3   62
2.0  10%  0.16  9.03e-4   0.16  4.74e-4   57
2.0  20%  0.16  7.08e-4   0.16  2.60e-4   59
2.0  30%  0.13  1.21e-2   0.14  2.88e-4   68
2.0  40%  0.13  1.77e-2   0.13  3.02e-3   52

Table 2.3 Occlusion experiment of a cylinder (R = 0.2m) at different distances (DG), with and without Kalman filtering. O: occlusion percentage (10-40%); the algorithm does not handle occlusions ≥ 50%. R: measured radius, Rσ: standard deviation of the radius, RK, RKσ: with Kalman filter, N: number of measurements.

VG    VM    Vσ        VK    VKσ       N
0.21  0.22  7.79e-2   0.21  4.17e-2   97
0.65  0.64  3.18e-1   0.61  1.81e-1   63
0.71  0.72  2.19e-1   0.72  1.23e-1   27
1.09  0.84  4.46e-1   0.93  4.08e-1   22
1.54  1.25  5.71e-1   1.45  4.69e-1   13
1.64  1.61  9.41e-1   1.36  9.25e-1   6
2.66  2.51  8.79e-1   2.38  8.35e-1   6

Table 2.4 Observations of a moving cylinder. The estimated velocity VM is computed using the observed change in distance over time and is compared to the ground truth velocity (VG); Vσ denotes the standard deviation in velocity. A similar computation is shown using the Kalman filter (VK, VKσ). N: number of measurements.

VG    VM    Vσ         VK    VKσ        N
0.16  0.16  7.86e-002  0.15  6.06e-002  165
0.49  0.45  2.20e-001  0.48  1.82e-001  50
1.09  0.93  3.78e-001  1.04  3.33e-001  22
1.20  1.03  4.34e-001  1.19  3.50e-001  22
1.44  1.22  6.35e-001  1.43  5.08e-001  18
3.00  2.10  6.85e-001  2.71  7.94e-001  8

Table 2.5 Observations of a moving flat sheet of cardboard. The estimated velocity VM is computed using the observed change in distance over time and is compared to the ground truth velocity (VG); Vσ denotes the standard deviation in velocity. A similar computation is shown using the Kalman filter (VK, VKσ). N: number of measurements.

DG    DM    Dσ         KM    Kσ         N
1.00  0.99  7.28e-004  0.99  3.39e-004  113
1.50  1.49  1.07e-003  1.49  1.36e-004  123
2.00  2.00  2.68e-004  2.00  1.13e-004  141
2.50  2.52  6.66e-004  2.52  5.51e-004  145
3.00  3.02  1.54e-001  3.01  2.07e-002  136

Table 2.6 Accuracy test for the planar patch of Fig. 2.12. DG: ground truth distance, DM: mean measured distance, Dσ: standard deviation of distance measurements, KM: mean distance corrected with Kalman filter, Kσ: standard deviation of the Kalman corrected distance measurements, N: number of measurements.

(a) Our Approach

(b) RANSAC 0.2

(c) RANSAC 0.4

Fig. 2.16 Simulated data points (black) with fitted ellipse (red) and supporting points (green). Top: our approach; Middle: RANSAC with radii ratio constraint rminor/rmajor ≥ 0.2; Bottom: RANSAC with the stronger radii ratio constraint rminor/rmajor ≥ 0.4.

2.6.6 Accuracy

The accuracy of the approach is measured by placing a cylinder (trash can) in front of the Kinect sensor at different distances. Several observations (N) were made to compute the mean and standard deviation (Dσ) of the cylinder distance to the sensor, see Table 2.2. The results in Table 2.2(a) show that the measured distance (DM) is relatively close to the ground truth distance, with a small amount of uncertainty; at the same time, as the distance from the cylinder to the Kinect sensor increases, the error in observing the radius (RM) increases. Table 2.2(b) shows the effect of Kalman filtering on the observations: the distance with Kalman filtering (DK) is close to the one obtained by observations only, but with lower variation (DKσ). The same holds for the radius. The Kalman filter in this experiment reduces the standard deviation, making the observations more certain.

Fig. 2.17 Result of ellipse extraction in noisy, simulated data. Amount of random noise added to the data (x-axis) vs. standard deviation of the fitted ellipse model over 50 trials.

2.6.7 Occlusion

A cylinder of radius r = 20cm was obstructed by different amounts at various distances to measure the robustness of this approach w.r.t. distance and occlusion, see Table 2.3. As the occlusion amount (O%) increases, the observed radius R decreases, both with and without Kalman filtering. This is due to the limited and noisy ellipse data, naturally fitting slightly smaller ellipses. In this experiment, the Kalman filter affects the results by reducing the variation of the observed radius, as expected.

2.6.8 Object Tracking

This section includes three experiments: two simple qualitative experiments and a quantitative experiment to measure object speeds. In the first qualitative experiment, we tested whether the algorithm can track a bouncing ball (radius = 10cm), see Fig. 2.19. Our system was able to track the ball's trajectory. For the second qualitative experiment, we mounted a Microsoft Kinect sensor on a Pioneer mobile robot to enable navigation towards conical objects (cylinder, cone and sphere). All computations were done on a notebook (2.3GHz, 2GB RAM) on the robot. The setup for the experiment is as follows: the robot has to track a ball and navigate towards it until the sphere is a meter away. In the first test the ball was held by a walking human. During the second test the ball was thrown over the robot, bouncing in the desired direction. In both tests the robot detected and navigated towards the ball while moving, see Fig. 2.20. In the quantitative test, we placed the Kinect perpendicular to a moving cylinder on a Pioneer robot, see Fig. 2.18 (low speed, ≤ 1m/s), and to a moving platform dragged by a human (higher speed, > 1m/s), see Table 2.4. The experiment shows promising results, with an increase in error at high velocity. The higher amount of error at high velocity is due to the low number of observations (limited range of the stationary Kinect). Using the Kalman filter, we could, naturally, reduce the uncertainty in the observations, but not necessarily improve accuracy.

2.7 Conclusion and Future Work

We presented an algorithm for real-time extraction of conical objects from 3D range data using a three step approach (points to ellipses to objects). It provides fast and robust performance for extracting all 2D best fit ellipses in O(n), where n is the number of points. The algorithm is well behaved towards noise and can aid higher level tasks, for example autonomous robot navigation, by providing robust and fast landmark features. To improve robustness, the algorithm can be augmented using a Kalman filter, which is feasible due to the algorithm's fast performance. Future work consists of improving the algorithm to predict models when conical objects are significantly occluded.

Fig. 2.18 Experiment setup to measure the speed of a moving cylinder mounted on a robot, using a Kinect sensor.

Fig. 2.19 Results of tracking a ball. The blue line connects center points of spheres.

Fig. 2.20 A Pioneer DX robot tracking a ball in real-time, using 3D range data from MS Kinect.

CHAPTER 3

CORRESPONDENCES BASED ON MID-LEVEL GEOMETRIC FEATURES

3.1 Introduction

In recent years the topic of robot mapping has increasingly gained popularity, for example with the introduction of 2D Simultaneous Localization and Mapping (SLAM) [7], and later with various attempts to extend the methodology to 3D. A major obstacle in robot mapping and its straightforward implementation based on odometry is that odometric sensors, if present, do not provide sufficient accuracy. To solve that problem, 3D mapping employs finding the real-time pose change between two successive sensor observations (scans) using visual scan-matching. Non-odometry based, visual feature-based scan-matching processes (allothetic) can be separated into two algorithmic steps. The first step determines the correspondence between features from each scan, followed by a second step, which uses the established correspondence to determine the optimal pose change (in terms of relative rotation and translation). Registering two range images with known corresponding 3D features can be done using a quaternion-based non-linear optimization method

as described in [26].

The lowest level of feature correspondence is raw point correspondence, as presented in the Iterative Closest Point (ICP) algorithm [3], [6], [80]. The ICP algorithm is often used to minimize the difference between two clouds of points, to reconstruct 2D or 3D environment representations from different scans. However, all ICP-type methods require the scans to be spatially close to each other in order for an initial set of correct closest point correspondences to be established. Generally speaking, scan matching algorithms can be broadly classified into two main classes: the first class is represented by point-based methods, e.g. ICP ([4], [47], [73]), and voxel-based 3D-NDT ([42], [8]). The second class of algorithms is based on higher-level features than points, usually geometric features representing planar patches ([10], [13], [31], [15]). In this work planar patches are used with a proper distance function, but the approach can be extended to different geometric features, e.g. conical objects (sphere, cylinder and cone), which can be detected as presented in previous chapters. However, the following will focus on planar features only. The difference of the feature based approach from the first class is that the feature correspondences do not change under transformations between the scans. This means that correspondences between features are independent of the pose differences between the scans, i.e. the scans do not have to originate from similar poses (naturally, there still must be a sufficient feature overlap between the scans). The approach described in this chapter falls into the second category. It uses planes as geometric features.

Both classes of algorithms have advantages and disadvantages. Point and voxel based algorithms do not assume the existence of any structure in the environment. Therefore they are less restrictive and can be used with data that contains very few planar structures (e.g. outdoor environments, trees, caves). The fastest algorithm in this class is ICP (O(N log N)), which has a robust open source implementation, 3DTK ([46], [47]). In comparison, our approach extracts all planar patches in O(N), where N is the number of points in the scan. Then our approach computes the registration between the two scans in $O(n_p^4)$, where $n_p$ is the number of planes, which is a low order of magnitude (usually less than 10 features lead to a robust correspondence). Therefore the first class of algorithms is computationally more expensive than our algorithm. In addition, they are more sensitive to data point outliers, and may converge to a local minimum when there is not a significant overlap between the scenes [69]. As stated in [51] and mentioned before in Chapter 1, maps created using point/voxel based approaches have several disadvantages:

1. They are composed of large number of points;

2. They have no high-level geometric information;

3. They are hard to visualize.

By contrast, planar patches offer a higher-level geometric description of the scene and are easy to visualize. In addition, planar patches offer an impressive compression of the data (typically 95%) compared to point clouds. Representing a wall, for example, as a single rectangular plane segment means storing only four corner points, in contrast to representing the wall by thousands of raw laser reflection points. Replacing coplanar points by plane segments can be seen as psycho-visual redundancy reduction, a computer vision approach leading to the highest rates in image compression. Assuming a coplanar structure, only the corner points enhance the perceptual information, thus all other points are redundant [18]. The smaller number of features allows for faster solutions of the feature-correspondence problem, as shown in Section 3.4. The main focus of the approach explained in this chapter is

1. Finding correct correspondence between planar patches

2. Computing the optimal pose prediction using the plane correspondences, marked with a rectangle in Fig. 3.1.

Fig. 3.1 Structure diagram comparison of three SLAM algorithms (a - the approach of Nuchter et al. [47]; b - Weingarten et al. [73]; c - our approach and Pathak et al. [51]). The ICP step in Weingarten's approach has been removed in our approach.

The rest of this chapter is organized as follows: Section 3.2 compares our approach to related work, Section 3.3 explains the approach in detail, and Section 3.4 presents an experimental comparison to the best algorithms from the first and second class. Finally, Section 3.5 concludes the chapter and discusses future work.

3.2 Related Work

To date, there are several registration approaches, which make different assumptions. It is often assumed that scans obtained by mobile robots are pre-aligned using the robots' odometry, and that sequential scans are taken frequently and have a significant overlap. A widely adopted solution built on these assumptions is ICP [47]. At its core, ICP matches scans by iterative attraction to a local minimum of the registration cost function. It relies on a good initial guess of the pose change, such as the vehicle odometry, for finding the correct registration. Matching two scans in the absence of any initial guess is a hard task, especially if the pose change is significant

compared to the scanned area. As a result, ICP can converge to a local minimum and fail to find the globally optimal solution. There are many SLAM algorithms using registration based on points (first class) as well as feature-based registration (second class). The feature-based 3D SLAM approach in [73] uses planar segments with an Extended Kalman Filter (EKF-SLAM) to build a three-dimensional map of the environment and track the robot's pose. The plane extraction step is common to both approaches; as previously discussed in Chapter 2, our plane extraction algorithm can be utilized here. The work of Kohlhep et al. [11], [32], [31] is also related to ours. In [32], surfaces are extracted from range images obtained by a rotating laser range finder and registered. A local module determines the correspondences and computes the transforms, and a global module detects loop-closure and distributes uncertainty using an elastic graph. For this chapter, only the local module is relevant, which is discussed again in [31]. Kohlhep et al. use feature-directions to compute the rotation between successive scans. However, for estimating the translation, [31] decomposes planar patches into points and then performs registration using ICP, while keeping the same assumptions. Other registration approaches include the 3D Normal Distribution Transform (NDT) ([42], [8]). Other solutions are based on the Random Sample Consensus algorithm (RANSAC) [53], [15] in an effort to perform registration on point based 3D data. RANSAC is an iterative method for estimating parameters of a mathematical model from a set of observed data containing outliers. It is a non-deterministic algorithm that produces a reasonable result with a certain probability. RANSAC achieves its goal by iteratively selecting a random subset of the original data. These data are hypothesized to be inliers and the hypothesis is then tested as follows: 1) A model is fitted to the sample of hypothetical inliers, i.e. all free parameters of the model are fitted from this sample. 2) All other data are then tested against the fitted model and those points that fit the estimated model well are considered part of the consensus set. 3) The estimated

model is reasonably good if a sufficient number of points have been classified as part of the consensus set. 4) Afterwards, the model may be improved by re-estimating it using all members of the consensus set. A related approach was applied by [61], who discusses the use of a "blind-hypothesis-and-test" for all combinations of feature pairs in O(M²N²), where M, N are the numbers of features in each scan. Relaxation labeling ([29], [81], [55]), a technique originating from computer vision, is used for point-based registration in [39]. The basic elements of the relaxation labeling method are a set of features (e.g. points, planes, edges etc.) belonging to an object and a set of labels. The labeling schemes used are probabilistic in that, for each feature, weights or probabilities are assigned to each label in the set, giving an estimate of the likelihood that the particular label is the correct one for that feature. Probabilistic approaches are then used to maximize (or minimize) the probabilities by iterative adjustment, taking into account the probabilities associated with neighboring features. Relaxation strategies do not necessarily guarantee convergence, and thus we may not arrive at a final labeling solution with a unique label having probability one for each feature. Klein et al. [33] extend 2D Shape Context ([2], [1]) to 3D shape correspondence by comparing histograms of 3D shape context, which could be adapted for 3D scene correspondence. They make use of the fact that "the shape context at a point captures the distribution over relative positions of other shape points and thus summarizes global shape in a rich, local descriptor". Later, Lakaemper et al. [38] presented a particle filter approach for building partial correspondence between shapes (from 2D point data). That approach can be used for 3D scene registration if a suitable plane distance measure is developed. 3D point data have a higher density in closer proximity to the sensor than in more distant regions. However, registration algorithms assume an equal point data density,

which can produce incorrect results. To remedy this problem, Latecki et al. [77] introduce synthetic points ("Ghost Points"). By contrast, our approach uses only plane-features and does not return to the domain of point-features. Similar to our work, [73] uses plane-features, but their approach requires an ICP step for the pose change prediction. The similar approach of Pathak et al. [51] uses extracted planar features to first determine the feature correspondence, then applies a matching called Plane Registration based on Rotation of a Unit Sphere (PRRUS), and finally uses "Minimal Uncertainty Max Consensus" to determine the pose change (rotation and translation) from the computed correspondences. In comparison, our approach uses a different method to determine the plane correspondence, which also considers mutual distance and coplanarity.

3.3 Methodology

Our approach finds the optimal scan registration in three steps.

1. Plane extraction, which is explained in detail in Chapter 2.

2. Finding the optimal correspondence between the two scans, discussed in detail in Section 3.3.1. A detailed diagram explaining the correspondence algorithm is given in Fig. 3.2.

3. Pose registration, achieved by using the plane correspondences found in step 2 to compute the optimal rotation and translation, explained in Section 3.3.2.

Figure 3.3 shows a successful registration between two scans using this approach, displayed with 3D polygons; the front and side views show further details. Figure 3.4 shows an example of an unsuccessful registration between another pair of scans using this approach. A detailed experimental comparison is provided in Section 3.4.

Fig. 3.2 Correspondence is determined in 3 steps: 1) compute all distances between pairs of planes; 2) accumulate the distances into the plane correspondence matrix; 3) find the global optimal minimum assignment for it.

3.3.1 Plane correspondence

Finding the correct plane correspondence is a key step before computing the optimal scan registration. Our approach determines the plane correspondence in several steps. First, the distance between all pairs of planes fij from the first scan and all pairs of planes gkl from the second scan is computed. This operation considers local dissimilarities between the two scans. It is measured by the distance function in Eq. 3.1, which compares a pair of planes fij from one scan to a pair of planes gkl from the other scan.

$$ D(f_{ij}, g_{kl}) = e^{-\frac{(C_{f_{ij}} - C_{g_{kl}})^2}{2\sigma_C^2}} \; e^{-\frac{(\alpha_{f_{ij}} - \alpha_{g_{kl}})^2}{2\sigma_\alpha^2}} \; e^{-\frac{(M_{f_{ij}} - M_{g_{kl}})^2}{2\sigma_M^2}} \quad (3.1) $$

where $C_{f_{ij}}$ is the center point distance between plane i and plane j from the same scan, and $\sigma_C$ is the standard deviation of the center point distances. $\alpha_{f_{ij}}$ is the angle between the normal vectors of plane i and plane j from the same scan, and $\sigma_\alpha$ is the standard deviation of the angle between planes. $M_{f_{ij}}$ is the MSE after merging plane i and plane j from the same scan, and $\sigma_M$ is the standard deviation of the plane MSE. The distance function contains three components, which contribute to the computed value. If a pair of planes perfectly matches another pair of planes, i.e. the angles between the planes are equal, the center point distances are the same, and the MSEs of the combined planes are equal, the value equals 1; if any of these parameters differ, the value decreases. Corresponding pair distances are accumulated and stored in a correspondence matrix $C_{n \times m}$ using Algorithm 3. This step makes use of local constraints existing between pairs of planes within a scan. The algorithmic complexity of this stage is O(N²M²) for N, M plane-features in each scan (usually N, M ≪ 20).

Fig. 3.3 Results of a successful registration between two scans using our approach. Planar patches are drawn semi-transparent to show the internal details.

Fig. 3.4 An example of an unsuccessful registration between two scans using our approach. In a correct registration the lower right corners should overlap. This occurred because wrong plane-feature correspondences were found, which resulted in an incorrect registration.

Algorithm 3 Compute the correspondence matrix Cn×m

for i = 1 to N do
    for j = i+1 to N do
        for k = 1 to M do
            for l = k+1 to M do
                W ← D(fij, gkl)
                Cik ← Cik − W
                Cil ← Cil − W
                Cjk ← Cjk − W
                Cjl ← Cjl − W
            end for
        end for
    end for
end for
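A compact sketch of these steps follows; the data layout and the SciPy assignment routine are illustrative stand-ins for the thesis implementation, which uses the Hungarian algorithm [34] on C.

import numpy as np
from scipy.optimize import linear_sum_assignment

def pair_similarity(f, g, sigmas):
    """Eq. 3.1: compare a pair of planes (f) from scan A with a pair (g) from
    scan B; f and g are tuples (center_distance, angle, merged_mse)."""
    sC, sA, sM = sigmas
    return (np.exp(-(f[0] - g[0]) ** 2 / (2 * sC ** 2)) *
            np.exp(-(f[1] - g[1]) ** 2 / (2 * sA ** 2)) *
            np.exp(-(f[2] - g[2]) ** 2 / (2 * sM ** 2)))

def correspondence(pairsA, pairsB, N, M, sigmas):
    """pairsA[(i, j)] / pairsB[(k, l)] hold the pairwise descriptors of each scan.
    Accumulate the pair similarities into C (Algorithm 3) and solve the
    assignment problem (minimum total cost = best global correspondence)."""
    C = np.zeros((N, M))
    for (i, j), f in pairsA.items():
        for (k, l), g in pairsB.items():
            w = pair_similarity(f, g, sigmas)
            C[i, k] -= w
            C[i, l] -= w
            C[j, k] -= w
            C[j, l] -= w
    rows, cols = linear_sum_assignment(C)
    return list(zip(rows, cols))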

The second step uses only the correspondence matrix to determine a minimum global optimal assignment. For that we use the Hungarian algorithm [34] on C. Without loss of generality, scan A (the first scan) is considered to have more planes than scan B (the second scan). As a result, the extra planes from the first scan will not have correspondences in the second scan, which solves the many-to-one problem (see Fig. 3.5). It is important to note that exactly three non-coplanar corresponding pairs of planes are needed to find a unique rotation and translation.

Fig. 3.5 Sample test of our algorithm using Kinect data. Left: correspondence matrix C for the two scans. Corresponding planar patches are marked by the same color. The planar patch with label 21 has low correspondence values for every plane from scan B; as a result it has no corresponding plane in the other scan.

3.3.2 Pose-registration using planar patches

The pose registration step refers to the problem of finding the optimal rotation R and translation T between two scans with given sets of corresponding (noisy) planes P1 and P2. The rigid transformation that optimally aligns the two scans also minimizes the least square distance (the angle) between the normal vectors of corresponding planes:

$$ \sum_{i=1}^{n} \left( R p_i + T - q_i \right)^2 \quad (3.2) $$

where $p_i$ is a normal vector from P1 and, similarly, $q_i$ is a normal vector from P2. This optimization problem can be solved as Wahba's problem ([71], [59]), originally formulated for estimating satellite attitude. The solution finds the optimal rotation, given the feature correspondences and their weights $w_i$, by computing the matrix B from Eq. 3.3. All weights used in our approach are equal to 1. The reason for that is that the optimal assignment used (Hungarian algorithm) gives discrete values, e.g. corresponding = 1, not corresponding = 0.

$$ B = \sum_{i=1}^{n} w_i \, p_i \, q_i^T \quad (3.3) $$

The singular value decomposition (SVD) of results in B = USV T . The rotation matrix is computed R = UFV T (3.4) where F = diag([11det(U)det(V )]. Translation is computed in a similar fashion, by minimizing the least square solution. The solution for T is shown in Eq. 3.5, where

MNx3 is the result after applying rotation Mi = Rpi and dNx1 is the distances from each center point to the projection on its corresponding plane.

T = (M^T W^2 M)^{-1} M^T W^2 d \qquad (3.5)

A detailed closed-form solution using quaternions is given in [26], [51].
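A minimal sketch of this pose computation (Eqs. 3.3–3.5) is given below, assuming the Apache Commons Math library for the linear algebra and unit weights (W = I) as in our approach. Variable and method names are illustrative; the sketch follows the equations as stated in the text rather than reproducing the actual implementation.

import org.apache.commons.math3.linear.*;

// Sketch of the pose-registration step: rotation from Wahba's problem
// (Eqs. 3.3-3.4) and translation from the least-squares solution (Eq. 3.5).
public final class PoseFromPlanes {

    /** Rotation from corresponding unit normals p_i <-> q_i (all weights w_i = 1). */
    public static RealMatrix rotation(double[][] p, double[][] q) {
        RealMatrix b = MatrixUtils.createRealMatrix(3, 3);
        for (int i = 0; i < p.length; i++) {
            RealMatrix pi = MatrixUtils.createColumnRealMatrix(p[i]);
            RealMatrix qi = MatrixUtils.createColumnRealMatrix(q[i]);
            b = b.add(pi.multiply(qi.transpose()));            // B = sum p_i q_i^T  (3.3)
        }
        SingularValueDecomposition svd = new SingularValueDecomposition(b);
        RealMatrix u = svd.getU(), v = svd.getV();
        double sign = new LUDecomposition(u).getDeterminant()
                    * new LUDecomposition(v).getDeterminant();
        RealMatrix f = MatrixUtils.createRealDiagonalMatrix(new double[] {1, 1, sign});
        return u.multiply(f).multiply(v.transpose());          // R = U F V^T        (3.4)
    }

    /** Translation from the rotated normals (rows of M) and the point-to-plane
     *  distances d, Eq. 3.5 with W = I: T = (M^T M)^{-1} M^T d. */
    public static double[] translation(RealMatrix m, double[] d) {
        RealMatrix mt = m.transpose();
        RealVector rhs = new ArrayRealVector(mt.operate(d));   // M^T d
        DecompositionSolver solver = new LUDecomposition(mt.multiply(m)).getSolver();
        return solver.solve(rhs).toArray();
    }
}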

3.4 Plane-Correspondence Experiments

We tested our approach against two state-of-the-art approaches: ICP, belonging to the point-based class, and PRRUS [51], a plane-based registration approach. The criteria for comparison were accuracy and speed.

3.4.1 Comparison to ICP

We used the open-source ICP implementation 3DTK [46]. The experimental data was collected using the robot "Dr. Faust" (Fig. 1.9). To establish a ground truth, a known rotation and translation is applied to one of the 3D scans, which is then registered against the original 3D scan using ICP. ICP is run on scans with

different amounts of rotation and translation, and the results are compared to the ground truth. The difference between the ground truth and the ICP result is computed as

D = (T_ICP − T_gt) + (α_ICP − α_gt), where T is the translation and α the rotation angle. The results are shown in Fig. 3.6. The erroneous results for scans with larger offsets are expected, since ICP converges to a local minimum and is very sensitive to the initial pre-alignment. Fig. 3.7 shows a single ICP registration on the 3D point cloud data used in this experiment, which failed to find the globally optimal translation and rotation (corresponding to the red fields in Fig. 3.6). A correct ICP registration result is shown in Fig. 3.8. By comparison, our approach found the exact correspondences, and hence the correct rotation and translation, for all rotations and translations tested with ICP. That is expected, since our approach is coordinate-system independent, i.e. the algorithm finds the same correspondences regardless of the amount of translation and rotation added to the second scan. Clearly, our approach performs better than ICP, under the assumption that (enough) planar patches can be found. On average ICP executed in 0.545 seconds (without GPU), compared to 0.014 seconds for our approach. Both algorithms were run on the same desktop computer (Dell OptiPlex 980, Windows 7).
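The evaluation protocol can be summarized in a few lines of Java; the Pose summary and the names below are illustrative, and the registration itself is performed externally by 3DTK.

// Sketch of the ground-truth evaluation: D = (T_icp - T_gt) + (alpha_icp - alpha_gt).
// Pose is an illustrative summary of a registration result (translation magnitude
// in meters, rotation angle in radians); the ICP run itself is done with 3DTK.
public final class RegistrationError {

    public static final class Pose {
        public final double translation;   // meters
        public final double angle;         // radians
        public Pose(double translation, double angle) {
            this.translation = translation;
            this.angle = angle;
        }
    }

    /** Summed translation and angular difference to the ground truth. */
    public static double error(Pose icp, Pose groundTruth) {
        return (icp.translation - groundTruth.translation)
             + (icp.angle - groundTruth.angle);
    }
}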

3.4.2 Comparison to PRRUS

PRRUS [51] finds the normal correspondences that have the most geometric consistency. First, it finds a pair of corresponding planes that have the most similar plane-pairs. That step requires O(N²M²), where N and M are the number of planes in each scan. The second step aligns the scans based on the pair of corresponding planes, leaving only one degree of freedom (rotation around the normal of one of the pair-planes). PRRUS uses mean clustering over the angles of the corresponding planes to determine the remaining rotation angle. By contrast, our approach computes the full distance (not only the angular distance) between all pairs of planes, still in O(N²M²).

Fig. 3.6 All values are computed by summing the translation and angular difference from the ground truth.

The benefit of using distance information together with angular information is that it helps in ambiguous cases. An example of such a case is shown in Fig. 3.9, where two of the planes are parallel or almost parallel to each other, i.e. both have normal vectors pointing in the same direction. To test this, we created a scene (Fig. 3.9) with 4 planes, such that two planes (P1 and P4) are parallel or close to parallel. We then changed the order of the plane-features, e.g. swapped P1 with P4, and considered that configuration as the second scan. Next we ran PRRUS using the original scan and the one with the different plane order. As a result, PRRUS finds the wrong correspondence, since it does not consider mutual plane distance. By contrast, our approach considers the plane distance and finds the correct correspondences. We were not able to obtain an implementation of PRRUS, therefore we had to implement it ourselves in MATLAB. Both approaches have O(N²M²) time complexity, but the PRRUS implementation in MATLAB is slower than our method's JAVA implementation.

Fig. 3.7 Result of registration between two 3D point clouds using ICP after 50 iterations. The second point cloud is an identical copy of the first with added rotation (60 deg) and translation (2 meters). We expect both scans to overlap, but clearly ICP fails to find the correct pose transformation, because it is stuck in a local minimum.

For that reason we consider both approaches equally fast. In a separate paper [50], Pathak et al. compared point-based ICP to their PRRUS approach. They used data from disaster scenarios mapped at the NIST 2008 Response Robot Evaluation Exercise in Disaster City, Texas¹. The data used for the comparison was obtained from a collapsed car park and contains 26 3D scans. We used the same data with our approach to produce a more comprehensive experimental comparison, see Table 3.1. Our approach found a correct registration 19 out of 25 times, whereas ICP only managed 14 out of 25, as shown in [50]. The results of these experiments demonstrate that our approach outperforms point-based ICP mapping with respect to speed and robustness. However, Pathak et al. report 25 out of 25 correct registrations at an average time of 6.07 sec, versus 0.038 sec for our approach.

¹ The data set is available at: http://robotics.jacobs-university.de/node/293

Fig. 3.8 Result of registration between two 3D point clouds using ICP. The second point cloud is an identical copy of the first with added rotation (0.3 rad) and translation (0.2 meters). As expected, both scans completely overlap, which means ICP found the correct pose transformation.

It is clear that [50] is a more accurate registration algorithm, but it is also clear that it is significantly slower than our approach.

Fig. 3.9 3D scene composed of 4 planes (P1, P2, P3, P4); P1 and P4 are parallel.

Table 3.1 Plane-matching registration comparison between ICP (point-based), PRRUS [51] and our approach, based on the collapsed car park data set [50].

3.5 Conclusion

In conclusion, we have presented a well-behaved scan registration algorithm based on 3D plane matching. We have compared our approach to state-of-the-art point-based (ICP) and feature-based registration algorithms. The experiments demonstrated that our approach remains well-behaved in cases with a high amount of ambiguity. The presented approach can be used for robot mapping.

CHAPTER 4

ADDITIONAL PRACTICAL EXPERIENCE

4.1 Benchmarking and Standardization

The current landscape for robotics standardization in the United States is characterized by promising yet isolated efforts. Although some existing initiatives are attempting to bring together scientists, engineers, and end-users of robot technologies for evaluation and standardization purposes, research and industry have not yet acknowledged the synergistic opportunities of common underlying principles to an adequate extent [41]. One common complaint about standardization is that standardizing too early hinders the use of more recent technologies and techniques that would admit desirable performance (and price, etc.). For specific technologies for devices and communication, it is easier to know when the technology is mature and should be standardized, even if later upgrades are likely (as with Ethernet and related network technologies), but for systems (for navigation, motion control, etc.) that is much harder. The other problem with standards, even if they are created at the right time, is that the complex (sub)system interactions lead to standards that are too 'thick', and therefore are

hard to apply; consequently they tend to be obsolete before being passed as a formal standard. While standards are needed for interoperability, our experience and that of others have shown that loosely coupled systems permit reuse without normative standards and thus should precede standardization. Once initial efforts gain traction and are eventually accepted within the research and industrial community, they can lead towards the establishment of de facto standards, which can then be propagated through existing standards organizations, resulting in a widely accepted 'standard'. In addition, we believe that substantiating evidence resulting from concomitant research can facilitate the acceptance of emerging standards, which in turn will expedite the worldwide acceptance of normative standards. It is not hard to see that the ability to build a world model (map) of the working environment, within or using which subsequent operations can be planned, is a desirable feature in many domains of interest. For example, AGVs can use an environmental model on factory floors for autonomous navigation, and in a disaster scenario concerning the extrication of victims, a robot-generated map will serve as an invaluable tool for the responders. Not surprisingly, the development of efficient world modeling schemes has received due attention from roboticists. A myriad of approaches have been proposed and implemented, some with greater success than others. The capabilities and limitations of these approaches vary significantly depending not only on the operational domain and onboard sensor suite limitations, but also on the requirements of the end user: Will a 2D map suffice as an approximation of a 3D environment? Is a metric map really needed, or is it enough to have a topological representation for the intended tasks, or do we need a hybrid metric-topological map? It is thus essential for both developers and consumers of robotic systems to understand the performance characteristics of the employed methodologies, which will allow them to make an informed decision. To our knowledge, there is no accepted benchmark for quantitatively evaluating

the performance of robotic mapping systems against user-defined requirements. Currently, the evaluation of robotic maps is based on qualitative analysis (i.e. visual inspection). This approach does not allow for a better understanding of which errors specific systems are prone to and which systems meet the requirements. It has become common practice in the literature to compare newly developed mapping algorithms with former methods by presenting images of generated maps. This procedure turns out to be suboptimal, particularly when applied to large-scale maps. It is our belief that end-users' requirements should drive developers and integrators such that the resulting 'intelligent' system is useful and affordable. Only by involving all three parties, viz. users, developers and integrators, in a coupled fashion can meaningful solutions be produced which can stand the ever-varying requirements imposed by: 1) tasks that are either application or environment dependent, 2) hardware and software advancements/restrictions that affect the development cycle, and 3) budgetary constraints that interrupt and hamper sustained progress. These factors hold true for manufacturing robotic and automation systems, where characterization of components at the system and sub-system levels is needed to develop and deploy fully functional systems that can work around the clock with as little downtime as possible. The so-called mean time between failures is not as important for manufacturing automation systems as reduced downtime (maintaining production uptime). In other words, a system that has periodic glitches but can recover in a few minutes is preferable to a system that fails only occasionally but needs to be disconnected from the production cycle for hours. To guarantee such requirements and to ensure reliability and robustness, repeatable and reproducible test methods are needed to experimentally verify and validate technical methodologies. To design and develop capable, dependable, and affordable robotic and automation systems, their performance must be measurable. It is our firm belief that in order to facilitate the wide acceptance of intelligent systems' technologies and to

quantify their performance, scientifically sound and statistically significant metrics, measurement, and evaluation methodologies are of paramount importance. The reasons for this are manifold: accepted standards and procedures for quantitatively measuring the performance of robotic systems against user-defined requirements will not only improve the utility of mobile robots in already established application areas but will also enable the proliferation of such technologies in other emerging markets. Currently, there is no consensus on what objective evaluation procedures need to be followed to deduce the performance of these systems. The lack of reproducible and repeatable test methods has precluded researchers working towards a common goal from exchanging and communicating results, comparing robot performance, and leveraging previous work that could otherwise avoid duplication and expedite technology transfer. It is important to develop test artifacts and measurement methodologies to capture performance data in order to focus research efforts, provide direction, and accelerate the advancement of mobile robot capabilities. Reuse of information and software interoperability, i.e. common exchange formats for both data and files, is key to providing the research community access to standardized tools, reference data sets, and an open-source library of solutions. As a result, researchers and consumers will be able to evaluate the cost and benefits associated with intelligent systems and associated technologies.

4.1.1 NIST Test Arena

Automated Guided Vehicles (AGVs)/Industrial Mobile Robots (IMRs) are robotic platforms that are widely used on factory floors and in warehousing applications to transport goods. In order to make them more versatile and useful, it is imperative that they are able to cope with dynamically changing environments shared with humans and other vehicles such as AGVs and forklifts [37].

Within the scope of this project, we are striving to answer the following questions:

1. How can we develop scientifically sound methodologies to evaluate world modeling schemes for use in manufacturing environments? It is our view that such methodologies are better developed by taking into account requirements imposed by end-users and domain specific constraints that are grounded in practicality.

2. How can we design experiments and test methods to enable performance evaluation and benchmarking towards characterizing constituent components of navigation and world modeling systems that provide statistically significant results and quantifiable performance data?

3. Can we leverage lessons learned from (1) and (2) towards standards-defining activities and bring together end-users, developers, researchers, and integrators for adoption of de facto standards?

4. How do we raise awareness and increase collaboration among industry and academia while keeping the general public interested and appreciative of the efforts of the scientific community on the role of robotics and its potential positive impact on their lives?

We have secured a testing facility at the Alabama Robotics Technology Park (RTP; http://www.alabamartp.org/) in Huntsville, AL, in collaboration with AMTEC Corp. and the State of Alabama (see Fig. 4.1). Industrial robot companies participated (see Fig. 4.3) in this testing and evaluation phase of our project to benchmark the performance of navigation systems. Testing took place in mid-July 2012. The emphasis of the developed test methods and metrics did not favor any particular existing solution(s); in other words, it ensured that the scope of testing covered a wide spectrum of situations typical of factories and

warehouses. The tests were carried out on company-owned robot platforms. Since the testing is based on task performance, no internal technical specifications were released. Fig. 4.2 shows the first four scenarios (R1 easy to R4 hard). These objectives are being realized within the scope of a project entitled World Modeling of Autonomous Navigation in Unstructured and Dynamic Environments¹, with funding from the Commerce Department's Measurement Science and Engineering Research Grants, by undertaking testing and evaluation to experimentally validate existing commercially available navigation systems of IMRs. Towards this, we have developed an evaluation framework, CLEO (Complexity Levels of Environments and Obstacles), which ranks the capabilities of navigation solutions based on different categories of industrial environments. Since CLEO aims at evaluating the navigation portion of IMRs, the performance of competencies like path planning or obstacle avoidance is determined².

4.2 Robot Opera

Robot mapping was also used in GALATEA RESET, a recent robot opera performance (see Fig. 4.4), to continuously localize all three autonomous robots during a live performance with moving actors. At its core, the localization system used line segments extracted with my line extraction algorithm, which was implemented in JAVA. The stage was designed by Dr. Lakaemper such that each robot's 2D laser scanner can always detect at least two line segments (see Fig. 4.6). In addition to real-time robot localization, the onboard computer of each robot receives PD commands used to generate sounds in real time and play them through the on-board

¹ This project is jointly carried out by Temple University and the University of Maryland, College Park, under Award #: ARRA-60NANB10DO12 to "bolster U.S. scientific and technological infrastructure, increasing our nation's ability to innovate, compete, and solve scientific and technological problems".
² For more information on our research efforts, please peruse the links on this website (http://sites.google.com/site/templearra/Welcome).

robot speaker. The entire performance is run and synchronized using a single laptop computer (PD Base Station) that also controls the video and sends commands via WiFi to each robot. A high-level system diagram is provided in Fig. 4.7. I was very fortunate to be part of this collaboration with Dr. Maurice Wright, a professor at Temple University's Boyer College of Music and Dance, and his team (see Fig. 4.8). My humble contribution was to help with the computer software and hardware related problems. All three performances of the opera were sold out; it received very positive feedback from the audience and was featured on the front page of Outlook Magazine (see Fig. 4.5).

Fig. 4.1 Setting up the test arena at the Alabama Robotics Technology Park (RTP)

Fig. 4.2 Schematic diagrams for the R1 (easy) to R4 (hard) evaluation scenarios. The robot must reach each of the goals, marked with green circles, in order.

Fig. 4.3 Testing in progress.

Fig. 4.4 Promotional flyer for the robotic opera.

Fig. 4.5 Galatea RESET on the cover of Temple University Outlook Magazine.

Fig. 4.6 Top: Galatea stage design. Bottom: A top view of the stage with the localized robot.

Fig. 4.7 Galatea control system.

Fig. 4.8 The entire crew of GALATEA RESET. Front row (left to right): Dr. Wright, Dr. Lakaemper.

CHAPTER 5

CONCLUSION

This dissertation focuses on real-time 3D robot mapping. There are two major contributions in this work. The first is the real-time extraction of mid-level geometric features, here planes, spheres, cones and cylinders. The second is a method for determining feature correspondences. Together, they form a well-behaved feature-based scan registration algorithm using 3D planes as features for the application of real-time 3D mapping. We have compared our approach to state-of-the-art algorithms and have demonstrated that our approach behaves better when ambiguity exists.

BIBLIOGRAPHY

[1] Belongie, S., Malik, J., and Puzicha, J. (2000). Shape context: A new descriptor for shape matching and object recognition. In NIPS, volume 2, page 3.

[2] Belongie, S., Malik, J., and Puzicha, J. (2002). Shape matching and object recognition using shape contexts. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 24(4):509–522.

[3] Besl, P. and McKay, N. (1992). A method for registration of 3-D shapes. IEEE PAMI, 14(2).

[4] Besl, P. J. and McKay, N. D. (1992). A method for registration of 3-d shapes. IEEE Trans. Pattern Anal. Mach. Intell., 14:239–256.

[5] Čapek, K. (2004). RUR (Rossum's universal robots). Penguin.

[6] Chen, Y. and Medioni, G. (1992). Object modelling by registration of multiple range images. Image and vision computing, 10(3):145–155.

[7] Dissanayake, G., Durrant-Whyte, H., and Bailey, T. (2000). A computationally efficient solution to the simultaneous localization and map building (SLAM) problem. ICRA 2000 Workshop on Mobile Robot Navigation and Mapping.

[8] Duckett, T. (2003). A genetic algorithm for simultaneous localization and mapping. In ICRA, pages 434–439.

[9] Dutta, T. (2012). Evaluation of the kinect sensor for 3-d kinematic measurement in the workplace. Applied ergonomics, 43(4):645–649.

[10] Faugeras, O. D. and Lustman, F. (1988). Motion and structure from motion in a piecewise planar environment. International Journal of Pattern Recognition and Artificial Intelligence, 2(03):485–508.

[11] Fischer, D. and Kohlhepp, P. (2000). 3d geometry reconstruction from multiple segmented surface descriptions using neuro-fuzzy similarity measures. Journal of Intelligent and Robotic Systems, 29(4):389–431.

[12] Fischler, M. A. and Bolles, R. C. (1981). Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6):381–395.

[13] Fisher, R. B. (1990). Geometric constraints from planar surface patch matching. Image and Vision Computing, 8(2):148–154.

[14] Fitzgibbon, A., Pilu, M., and Fisher, R. (1999). Direct least square fitting of ellipses. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 21(5):476 –480.

[15] Fontanelli, D., Ricciato, L., and Soatto, S. (2007). A fast ransac-based registration algorithm for accurate localization in unknown environments using lidar measurements. In Automation Science and Engineering, 2007. CASE 2007. IEEE International Conference on, pages 597–602. IEEE.

[16] Gath, I. and Hoory, D. (1995). Fuzzy clustering of elliptic ring-shaped clusters. Pattern Recognition Letters, 16(7):727 – 741.

[17] Georgiev, K., Al-Hami, M., and Lakaemper, R. (2013). Real-time 3d scene

description using spheres, cones and cylinders. In Proceedings of the 16th International Conference on Advanced Robotics (ICAR).

[18] Georgiev, K., Creed, R., and Lakaemper, R. (2011). Fast plane extraction in 3d range data based on line segments. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).

[19] Gotardo, P. F., Bellon, O. R. P., and Silva, L. (2003). Range image segmentation by surface extraction using an improved robust estimator. In Computer Vision and Pattern Recognition, 2003. Proceedings. 2003 IEEE Computer Society Conference on, volume 2, pages II–33. IEEE.

[20] Halíř, R. and Flusser, J. (1998). Numerically stable direct least squares fitting of ellipses. In Proc. 6th International Conference in Central Europe on Computer Graphics and Visualization. WSCG, volume 98, pages 125–132. Citeseer.

[21] Hartley, R. and Zisserman, A. (2003). Multiple view geometry in computer vision. Cambridge university press.

[22] Hilbert, D. and Cohn-Vossen, S. (1999). Geometry and the Imagination, volume 87. American Mathematical Soc.

[23] Holz, D. and Behnke, S. (2014). Approximate triangulation and region growing for efficient segmentation and smoothing of range images. Robotics and Autonomous Systems.

[24] Holz, D., Holzer, S., Rusu, R. B., and Behnke, S. (2012). Real-time plane segmentation using rgb-d cameras. In RoboCup 2011: Robot Soccer World Cup XV, pages 306–317. Springer.

[25] Hoover, A., Jean-Baptiste, G., Jiang, X., Flynn, P. J., Bunke, H., Goldgof, D. B., Bowyer, K., Eggert, D. W., Fitzgibbon, A., and Fisher, R. B. (1996).

An experimental comparison of range image segmentation algorithms. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 18(7):673–689.

[26] Horn, B. K. (1987). Closed-form solution of absolute orientation using unit quaternions. JOSA A, 4(4):629–642.

[27] Howard, A. and Roy, N. (2003). The robotics data set repository (radish).

[28] Howard, A., Wolf, D. F., and Sukhatme, G. (2004). Towards 3d mapping in large urban environments. In Intelligent Robots and Systems, 2004.(IROS 2004). Proceedings. 2004 IEEE/RSJ International Conference on, volume 1, pages 419– 424. IEEE.

[29] Hummel, R. A. and Zucker, S. W. (1983). On the foundations of relaxation labeling processes. Pattern Analysis and Machine Intelligence, IEEE Transactions on, (3):267–287.

[30] Khoshelham, K. and Elberink, S. (2012). Accuracy and resolution of kinect depth data for indoor mapping applications. Journal of sensors, 12(1):1437 – 1454.

[31] Kohlhepp, P., Bretthauer, G., Walther, M., and Dillmann, R. (2006). Using orthogonal surface directions for autonomous 3d-exploration of indoor environments. In Intelligent Robots and Systems, 2006 IEEE/RSJ International Conference on, pages 3086–3092. IEEE.

[32] Kohlhepp, P., Pozzo, P., Walther, M., and Dillmann, R. (2004). Sequential 3d-slam for mobile action planning. In Intelligent Robots and Systems, 2004. (IROS 2004). Proceedings. 2004 IEEE/RSJ International Conference on, volume 1, pages 722–729. IEEE.

[33] Körtgen, M., Park, G.-J., Novotni, M., and Klein, R. (2003). 3d shape matching with 3d shape contexts. In The 7th central European seminar on computer graphics, volume 3, pages 5–17.

[34] Kuhn, H. W. (1955). The hungarian method for the assignment problem. Naval research logistics quarterly, 2(1-2):83–97.

[35] Bookstein, F. L. (1979). Fitting conic sections to scattered data. Computer Graphics and Image Processing, 9(1):56–71.

[36] Lakaemper, R. and Latecki, L. J. (2006). Using extended em to segment planar structures in 3d. In Pattern Recognition, 2006. ICPR 2006. 18th International Conference on, volume 3, pages 1077–1082. IEEE.

[37] Lakaemper, R. and Madhavan, R. (2010). Towards evaluating world modeling for autonomous navigation in unstructured and dynamic environments. In Proceedings of the 10th Performance Metrics for Intelligent Systems Workshop, pages 355–360. ACM.

[38] Lakaemper, R. and Sobel, M. (2010). Using the particle filter approach to building partial correspondences between shapes. International journal of computer vision, 88(1):1–23.

[39] Lee, J.-H. and Won, C.-H. (2011). Topology preserving relaxation labeling for nonrigid point matching. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 33(2):427–432.

[Lytro] Lytro. https://www.lytro.com.

[41] Madhavan, R., Lakaemper, R., and Kalmár-Nagy, T. (2009). Benchmarking and standardization of intelligent robotic systems. In Advanced Robotics, 2009. ICAR 2009. International Conference on, pages 1–7. IEEE.

[42] Magnusson, M., Lilienthal, A., and Duckett, T. (2007). Scan registration for autonomous mining vehicles using 3d-ndt. Journal of Field Robotics, 24(10):803–827.

[43] Meyer, J.-A. and Filliat, D. (2003). Map-based navigation in mobile robots: II. A review of map-learning and path-planning strategies. Cognitive Systems Research, 4(4):283–317.

[44] Montemerlo, M., Becker, J., Bhat, S., Dahlkamp, H., Dolgov, D., Ettinger, S., Haehnel, D., Hilden, T., Hoffmann, G., Huhnke, B., et al. (2008). Junior: The stanford entry in the urban challenge. Journal of Field Robotics, 25(9):569–597.

[45] Nguyen, V., Gächter, S., Martinelli, A., Tomatis, N., and Siegwart, R. (2007). A comparison of line extraction algorithms using 2d range data for indoor mobile robotics. Auton. Robots, 23(2):97–111.

[46] Nüchter, A. and Lingemann, K. (2011). 3dtk–the 3d toolkit.

[47] Nuchter, A., Lingemann, K., Hertzberg, J., and Surmann, H. (2007). 6d slam-3d mapping outdoor environments. Journal of Field Robotics, 24(8-9):699–722.

[48] Nuechter, A. (2009). 3D Robotic Mapping: The Simultaneous Localization and Mapping Problem with Six Degrees of Freedom. Springer Publishing Company, Incorporated.

[49] Oehler, B., Stueckler, J., Welle, J., Schulz, D., and Behnke, S. (2011). Efficient multi-resolution plane segmentation of 3d point clouds. In Intelligent Robotics and Applications, pages 145–156. Springer.

[50] Pathak, K., Birk, A., Vaskevicius, N., Pfingsthorn, M., Schwertfeger, S., and Poppinga, J. (2010). Online three-dimensional slam by registration of large planar

surface segments and closed-form pose-graph relaxation. Journal of Field Robotics, 27(1):52–84.

[51] Pathak, K., Vaskevicius, N., Poppinga, J., Pfingsthorn, M., Schwertfeger, S., and Birk, A. (2009). Fast 3d mapping by matching planes extracted from range sensor point-clouds. In IROS’09: Proceedings of the 2009 IEEE/RSJ international conference on Intelligent robots and systems, pages 1150–1155, Piscataway, NJ, USA. IEEE Press.

[PointGrey] PointGrey. http://ww2.ptgrey.com/stereo-vision/bumblebee-2.

[53] Poppinga, J., Vaskevicius, N., Birk, A., and Pathak, K. (2008). Fast plane detection and polygonalization in noisy 3d range images. pages 3378 –3383.

[Raytrix] Raytrix. http://raytrix.de/index.php/technology.html.

[55] Rosenfeld, A., Hummel, R. A., and Zucker, S. W. (1976). Scene labeling by relaxation operations. Systems, Man and Cybernetics, IEEE Transactions on, (6):420–433.

[56] Rosin, P. and West, G. (1995). Nonparametric segmentation of curves into various representations. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 17(12):1140–1153.

[57] Rusu, R., Blodow, N., Marton, Z., and Beetz, M. (2009). Close-range scene segmentation and reconstruction of 3d point cloud maps for mobile manipulation in domestic environments. In Intelligent Robots and Systems, 2009. IROS 2009. IEEE/RSJ International Conference on, pages 1 –6.

[Schunk] Schunk. http://www.schunk.com.

[59] Shuster, M. D. (2006). The generalized wahba problem. The Journal of the Astronautical Sciences, 54(2):245–259.

[SICK] SICK. http://sicktoolbox.sourceforge.net/docs/sick-lms-technical-description.pdf.

[61] Stamos, I. and Leordeanu, M. (2003). Automated feature-based range registration of urban scenes of large scale. In Computer Vision and Pattern Recognition, 2003. Proceedings. 2003 IEEE Computer Society Conference on, volume 2, pages II–555. IEEE.

[SwissRanger] SwissRanger. http://www.mesa-imaging.ch/home.

[63] Ten, S. (2010). How kinect depth sensor works–stereo triangulation. Retrieved September, 30:2011.

[64] Thrun, S., Hahnel, D., Ferguson, D., Montemerlo, D., Triebel, R., Burgard, W., Baker, C., Omohundro, Z., Thayer, S., and Whittaker, W. (2003). A system for volumetric robotic mapping of abandoned mines. In Robotics and Automation, 2003. Proceedings. ICRA’03. IEEE International Conference on, volume 3, pages 4270–4275. IEEE.

[65] Totilo, S. (2009). Microsoft: Project natal can support multiple players, see fingers.

[66] Trevor, A., Gedikli, S., Rusu, R., and Christensen, H. (2013). Efficient organized point cloud segmentation with connected components. Semantic Perception Mapping and Exploration (SPME).

[67] Vandorpe, J., Van Brussel, H., and Xu, H. (1996). Exact dynamic map building for a mobile robot using geometrical primitives produced by a 2d range finder. volume 1, pages 901 –908 vol.1.

[Velodyne] Velodyne. http://velodynelidar.com/lidar/lidar.aspx.

[69] Viejo, D. and Cazorla, M. (2007). 3d plane-based egomotion for slam on semi-structured environment. In Intelligent Robots and Systems, 2007. IROS 2007. IEEE/RSJ International Conference on, pages 2761–2766. IEEE.

[70] Vosselman, G. and Dijkman, E. (2001). 3d building model reconstruction from point clouds and ground plans. Int. Arch. of Photogrammetry and Remote Sensing, pages 37–43.

[71] Wahba, G. (1965). A least squares estimate of satellite attitude. SIAM review, 7(3):409–409.

[72] Weingarten, J. and Gruener, G. (2003). A fast and robust 3d feature extraction algorithm for structured environment reconstruction. In 11th International Conference on Advanced Robotics (ICAR).

[73] Weingarten, J. and Siegwart, R. (2006). 3d slam using planar segments. In Intelligent Robots and Systems, 2006 IEEE/RSJ International Conference on, pages 3062–3067. IEEE.

[74] Weisstein, E. W. (2014). Conic Section. From MathWorld—A Wolfram Web Resource. http://mathworld.wolfram.com/ConicSection.html.

[75] Werman, M. and Geyzel, Z. (1995). Fitting a second degree curve in the presence of error. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 17(2):207–211.

[76] Wu, W.-Y. and Wang, M.-J. J. (1993). Elliptical object detection by using its geometric properties. Pattern Recognition, 26(10):1499 – 1509.

[77] Yang, X., Bai, X., Köknar-Tezel, S., and Latecki, L. J. (2013). Densifying distance spaces for shape and image retrieval. Journal of mathematical imaging and vision, 46(1):12–28.

[78] Yip, R. K., Tam, P. K., and Leung, D. N. (1992). Modification of hough transform for circles and ellipses detection using a 2-dimensional array. Pattern Recognition, 25(9):1007–1022.

[79] Zass, R. and Shashua, A. (2008). Probabilistic graph and hypergraph matching. In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, pages 1–8. IEEE.

[80] Zhang, Z. (1994). Iterative point matching for registration of free-form curves and surfaces. International journal of computer vision, 13(2):119–152.

[81] Zucker, S. W., Hummel, R. A., and Rosenfeld, A. (1977). An application of relaxation labeling to line and curve enhancement. Computers, IEEE Transactions on, 100(4):394–403.
