Multi-Modal Scene Understanding for Robotic Grasping

JEANNETTE BOHG

Doctoral Thesis in Robotics
Stockholm, Sweden 2011

TRITA-CSC-A 2011:17
ISSN-1653-5723
ISRN-KTH/CSC/A–11/17-SE
ISBN 978-91-7501-184-4

Kungliga Tekniska högskolan
School of Computer Science and Communication
SE-100 44 Stockholm, SWEDEN

Academic dissertation which, with due permission of Kungl Tekniska högskolan, will be presented for public examination for the degree of Doctor of Technology in Computer Science on Friday, December 16, 2011, at 10:00 in Sal D2, Kungliga Tekniska högskolan, Lindstedtsvägen 5, Stockholm.

© Jeannette Bohg, November 17, 2011

Printed by: E-Print

Abstract

Current robotics research is largely driven by the vision of creating an intelligent being that can perform dangerous, difficult or unpopular tasks. These can, for example, be exploring the surface of planet Mars or the bottom of the ocean, maintaining a furnace or assembling a car. They can also be more mundane, such as cleaning an apartment or fetching groceries. This vision has been pursued since the 1960s, when the first robots were built. Some of the tasks mentioned above, especially those in industrial manufacturing, are already frequently performed by robots. Others are still completely out of reach. In particular, household robots are far from being deployable as general purpose devices. Although advancements have been made in this research area, robots are not yet able to perform household chores robustly in unstructured and open-ended environments given unexpected events and uncertainty in perception and execution.

In this thesis, we analyze which perceptual and motor capabilities are necessary for the robot to perform common tasks in a household scenario. In that context, an essential capability is to understand the scene that the robot has to interact with. This involves separating objects from the background but also from each other. Once this is achieved, many other tasks become much easier. The configuration of objects can be determined; they can be identified or categorized; their pose can be estimated; free and occupied space in the environment can be outlined. This kind of scene model can then inform grasp planning algorithms to finally pick up objects. However, scene understanding is not a trivial problem and even state-of-the-art methods may fail. Given an incomplete, noisy and potentially erroneously segmented scene model, the questions remain how suitable grasps can be planned and how they can be executed robustly.

In this thesis, we propose to equip the robot with a set of prediction mechanisms that allow it to hypothesize about parts of the scene it has not yet observed. Additionally, the robot can also quantify how uncertain it is about this prediction, allowing it to plan actions for exploring the scene at specifically uncertain places. We consider multiple modalities including monocular and stereo vision, haptic sensing and information obtained through a human-robot dialog system. We also study several scene representations of different complexity and their applicability to a grasping scenario.

Given an improved scene model from this multi-modal exploration, grasps can be inferred for each object hypothesis. Dependent on whether the objects are known, familiar or unknown, different methodologies for grasp inference apply. In this thesis, we propose novel methods for each of these cases. Furthermore, we demonstrate the execution of these grasps both in a closed- and open-loop manner, showing the effectiveness of the proposed methods in real-world scenarios.

[Acknowledgements word cloud: thanks to family, friends, colleagues and collaborators for supervision, support, collaboration, mentoring, proof-reading and inspiration.]

Contents

1 Introduction
  1.1 An Example Scenario
  1.2 Towards Intelligent Machines - A Historical Perspective
  1.3 This Thesis
  1.4 Outline and Contributions
  1.5 Publications

2 Foundations
  2.1 Hardware
  2.2 Vision Components
  2.3 Discussion

3 Active Scene Understanding

4 Predicting and Exploring Scenes
  4.1 Occupancy Grid Maps for Grasping and Manipulation
  4.2 Height Map for Grasping and Manipulation
  4.3 Discussion

5 Enhanced Scene Understanding through Human-Robot Dialog
  5.1 System Overview
  5.2 Dialog System
  5.3 Refining the Scene Model through Human-Robot Interaction
  5.4 Experiments
  5.5 Discussion

6 Predicting Unknown Object Shape
  6.1 Symmetry for Object Shape Prediction
  6.2 Experiments
  6.3 Discussion

7 Generation of Grasp Hypotheses
  7.1 Related Work
  7.2 Grasping Known Objects
  7.3 Grasping Unknown Objects
  7.4 Grasping Familiar Objects
  7.5 Discussion

8 Grasp Execution
  8.1 Open Loop Grasp Execution
  8.2 Closed-Loop Grasp Execution
  8.3 Discussion

9 Conclusions

A Covariance Functions

Bibliography

Notation

Throughout this thesis, the following notational conventions are used.

• Scalars are denoted by italic symbols, e.g., a, b, c.

• Vectors, regardless of dimension, are denoted by bold lower-case symbols, e.g., x = (x, y)^T.

• Sets are indicated by calligraphic symbols, e.g., P, Q, R. The cardinality of these sets is denoted by capital letters such as N, M, K. As a compact notation for a set X containing N vectors x_i, we will use {x_i}_N.

• In this thesis, we frequently deal with estimates of vectors. They are indicated by a hat superscript, e.g., x̂. They can also be referred to as a posteriori estimates in contrast to a priori estimates. The latter are denoted by an additional superscript: either a minus, e.g., x̂^- for estimates in time, or a plus, e.g., x̂^+ for estimates in space.

• Functions are denoted by italic symbols followed by their arguments in parentheses, e.g., f(·), g(·), k(·,·). An exception is the normal distribution, for which we adopt the standard notation N(µ, Σ).

• Matrices, regardless of dimension, are denoted by bold-face capital letters, e.g., A, B, C.

• Frames of reference or coordinate frames are denoted by sans-serif capital letters, e.g., W, C. In this thesis, we convert coordinates mostly between the following three reference frames: the world reference frame W, the camera coordinate frame C and the image coordinate frame I. Vectors that are related to a specific reference frame are annotated with the corresponding superscript, e.g., x^W.

• Sometimes in this thesis, we have to distinguish between the left and right image of a stereo camera. We label vectors referring to the left camera system with a subscript l and analogously with r for the right camera system, e.g., x_l, x_r.


• Each finger of our robotic hand is padded with two tactile sensor matrices: one on the distal and one on the proximal phalanx. Measurements from these sensors are labeled with subscript d or p, respectively.

• The subscript t refers to a specific point in time.

• In this thesis, we develop prediction mechanisms that enable the robot to make predictions about unobserved space in its environment. We label the set of locations that has already been observed with subscript k for known. The part that is yet unknown is labeled with subscript u.

1 Introduction

The idea of creating an artificial being is quite old. Even before the word robot was coined by Karel Čapek in 1920, the concept appeared frequently in literature and other artwork. The idea can already be observed in preserved records of ancient automated mechanical artifacts [206]. In the beginning of the 20th century, advancements in technology led to the birth of the science fiction genre. Among other topics, the robot became a central figure of stories told in books and movies [217, 82]. The first physical robots had to await the development of the necessary underlying technology in the 1960s. Interestingly, already in 1964 Nam June Paik, in collaboration with Shuya Abe, built the first robot to appear in the field of new media arts: Robot K456 [61, pg. 73].

The fascination of humans with robots stems from a dichotomy in how we perceive them. On the one hand, we strive for replicating our consciousness, intelligence and physical being, but with characteristics that we associate with technology: regularity, efficiency, stability and power. Such an artificial being could then exist beyond human limitations, constraints and even responsibility [217]. In a way, this striving reflects the ancient search for immortal life. On the other hand, there is the fear that once such a robot exists we may lose control over our own creation. In fact, we would have created our own replacement [82]. Exactly this conflict is the subject of most robot stories, ranging from classic ones like R.U.R. (Rossum's Universal Robots) by Karel Čapek and Fritz Lang's Metropolis to newer ones such as James Cameron's Terminator and Alex Proyas' I, Robot. Geduld [86] divided the corpus of stories into material about "god-sanctioned" and "sacrilegious creation" to reflect the simultaneous attraction and recoil. He places the history of real automata completely aside from this.

In current robotics research, the striving for replicating human intelligence is completely embraced. Robots are celebrated as accomplishments that demonstrate human artifice. When reading the introductions of many recent books, PhD theses and articles on robotics research, the arrival of service robots in our everyday life is predicted to happen in the near future [192, 174, 121], already "tomorrow" [206], or is claimed to have already happened [177]. Rusu [192] and Prats [174] emphasize the necessity of service robotics with regard to the aging population of Europe, Japan and the USA. Machines equipped with ideal physical and cognitive capabilities are thought to care for people who have become physically and cognitively limited or disabled.

Although robotics celebrated its 50th birthday this year, the topic of roboethics has only recently been coined and a roadmap published by Veruggio [226]. In that sense, the art community is somewhat ahead of the robotics community in discussing implications of technological advancement. The question why robotics research shows only little interest in these topics remains. We argue that this is because robots are not perceived by roboticists to have even remotely achieved the level of intelligence that we believe to be necessary to replace us. In reality, we are far away from having a general purpose robot as envisioned in many science fiction stories. Instead we are at the stage of the specific purpose robot. Especially in industrial manufacturing, robots are omnipresent in assembling specific products. Other robots have been shown to be capable of autonomous driving in environments of different complexities and constraints. Regarding household scenarios, robots already vacuum-clean apartments [105]. Other tasks like folding towels [141], preparing pancakes [24] or clearing a table as in the thesis at hand have been demonstrated on research platforms. However, letting robots perform these tasks robustly in unstructured and open-ended environments given unexpected events and uncertainty in perception and execution is an ongoing research effort.

1.1 An Example Scenario

Robots are good at many things that are hard for us. This involves the aspects mentioned before like regularity, efficiency, stability and power, but also games like chess and Jeopardy! in which supercomputers, although not robots in the strict sense, have recently beaten human masters. However, robots appear to be not very good at many things that we perform effortlessly. Examples are grasping and dexterous manipulation of everyday objects or learning about new objects for recognizing them later on. In the following, we want to exemplify this with the help of a simple scenario. Figure 1.1 shows an office desk. Imagine that you ordered a brand-new robotic butler and you want it to clean that desk for you. What kind of capabilities does it need for fulfilling this task? In the following, we list a few of them.

Recognition of Human Action and Speech

Just like with a real butler who is new to your place, you need to give the robot a tour. This involves showing it all the rooms and explaining what function they have. Besides that, you may also want to point out places like cupboards in the kitchen or shoe cabinets in the corridor so that it knows where to store certain objects. For the robot to extract information from this tour, it needs to be able to understand human speech but also human actions like walking behavior or pointing gestures. Only then can it generate the appropriate behavior by itself, for example following, asking questions or pointing at things.

Figure 1.1: Left) Cluttered Office Desk to be Cleaned. Right Top) Gloves. Right Middle) Objects in a Pile. Right Bottom) Mate.

Navigation: Self-Localisation, Mapping and Path Planning

During this tour, the robot needs to build a map of the new environment and localise itself in it. It has to store this map in its memory so that it can use it later for path planning or potential further exploration.

Semantic Perception and Scene Understanding

You would expect this robot to already have some general knowledge about the world. It should for example know what a kitchen and a bathroom are. It should know what shoes or different groceries are and where they can usually be found in an apartment. During your tour, this kind of semantic information should be linked to the map that the robot is simultaneously building of your place. For this, the robot needs to possess the ability to recognize and categorize rooms, furniture and objects using its different sensors.

Learning

When the robot leaves its factory, it is probably equipped with a lot of knowledge about the world it is going to encounter. However, the probability that it is confronted with rooms of an ambiguous function and objects it has never seen before is quite high. The robot needs to detect these gaps in its world knowledge and attempt to close them. This may involve asking you to elaborate on something or building a representation of an unknown object grounded in the sensory data.

Higher-Level Reasoning and Planning

After the tour, the robot may then be expected to finally clean the desk shown in Figure 1.1. If we assume that it recognized the objects on the table, it now needs to plan what to do with them. It can for example leave them there, bring them to a designated place or throw them into the trash. To make these decisions, it has to consult its world knowledge in conjunction with the semantic map it has just built of your apartment. The cup for example may pose a conflict between its world knowledge (in which cups are usually in the kitchen) and the current map it has of the world. To resolve this conflict, it needs to form a plan to align these two. A plan may involve any kind of interaction with the world, be it asking you for assistance, grasping something or continuing to explore the environment.

Grasping and Manipulation

Once the robot has decided to pick something up to bring it somewhere else, it needs to execute this plan. Grasping and manipulation of objects demands the formation of low-level motion plans that are robust on the one hand and flexible on the other. They need to be executed while facing noise, uncertainty and potentially unforeseen dynamic changes in the environment.

All of these capabilities are in themselves the subject of ongoing research efforts in mostly separate communities. As the title of this thesis suggests, we are mainly concerned with scene understanding for robotic grasping and manipulation. Especially for scenes as depicted in Figure 1.1, this poses a challenging problem to the robot. Although a child would be able to grasp the shown objects without any problem, it has not yet been shown that a robot could perform equally well without introducing strong assumptions.

1.2 Towards Intelligent Machines - A Historical Perspective

Since the beginning of robotics, pick and place problems from a table top have been studied [77]. Although the goal remained the same over the years, the different approaches were rooted in the paradigms of Artificial Intelligence (AI) of their time. One of the most challenging problems that has been approached over and over again is to define what intelligence actually means. Recently, Pfeifer and Bongard [171] even refused to define this concept, citing what they claim to be the mere absence of a good definition. In this section, we will briefly summarize some of the major views on intelligence.


Figure 1.2: Traditional Model of an Intelligent System.

The traditional view is dominated by the physical symbol system hypothesis. It states that any system exhibiting intelligence can be proven to be a physical symbol system and that any physical symbol system of sufficient size will exhibit intelligent behavior [160, 207]. Figure 1.2 shows a schematic of the traditional model of intelligence exhibited by humans or machines. The central component is a symbolic information processor that is able to manipulate a set of symbols, potentially part of whole expressions. It can create, modify, reproduce or delete symbols. Essentially, such a symbol system can be modelled as a Turing machine. It therefore follows that intelligent behavior can be exhibited by today's computers. The symbols are thought to be amodal, i.e., completely independent of the kind of sensors and actuators available to the system. The input to the central processor is provided by perceptual modules that convert sensory input into amodal symbols. The output is generated through action modules that use amodal symbolic descriptions of commands.

Turing [221] proposed an imitation game to re-frame the question "Can machines think?" that is now known as the famous Turing Test. The proposed machines are capable of reading, writing and manipulating text without the need to understand its meaning. The problem of converting between real-world percepts and amodal symbols is thereby circumvented, as is the conversion between symbolic commands and real-world motor commands. The hope was that this would be solved at some later point. In the early days of AI research, a lot of progress was made in, for example, theorem provers, blocks-world scenarios or artificial systems that could trick humans into believing that they were humans as well. However, early approaches could not be shown to translate to more complex scenarios [191]. In [171], Brooks noted that "Turing modeled what a person does, not what a person thinks". Also the behavioral studies on which Simon's thesis [207] of physical symbol systems rests can be seen in that light: humans solving cryptarithmetic problems, memorizing or performing tasks that use natural language.

The view of the physical symbol system has been challenged on several grounds. In the 1980s, connectionist models became largely popular [191].


Figure 1.3: Model of an Intelligent System from the Viewpoint of Behavioral Robotics.

The most prominent representative is the artificial neural network, which is seen as a simple model of the human brain. Instead of a central processor, a neural network consists of a number of simple interconnected units. Intelligence is seen as emergent within this network. Connectionist models have been shown to be far less domain specific and more robust to noise. However, according to Simon [207], it has yet to be shown that complex thinking and problem solving can be modelled with connectionist approaches. Nowadays, symbol systems and connectionist approaches are seen as complementary [191].

A different case has been argued by the supporters of the Physical Symbol Grounding Hypothesis. Instead of criticizing the structure of the proposed intelligent system, it re-considers the assumed amodality of the symbols. Its claim is that for a system to be truly intelligent, it has to have its symbols grounded in the physical world. In other words, it has to understand the meaning of the symbols [96]. However, no perception system has yet been built that robustly outputs a general-purpose symbolic representation of the perceived environment. Furthermore, perception has been claimed to be active and task dependent. Given different tasks, different parts of the scene or different aspects of it may be relevant [22]. The same percept can be interpreted differently. Therefore, no objective truth exists [159].

Brooks [46, 47] has carried this view to its extreme and asks whether intelligence needs any representations at all. Figure 1.3 shows the model of an intelligent machine that completely abandons the need for representations. Brooks coined the sentence: "The world is its own best model." The more details of the world you store, the more you need to keep up-to-date. He proposes to dismiss any intermediate representation that interfaces between the world and computational modules as well as in between computational modules. Instead, the system should be decomposed into subsystems by activities that describe patterns of interactions with the world. As outlined in Figure 1.3, each subsystem tightly connects sensing to action, and senses whatever is necessary and often enough to accommodate dynamic changes in the world. No information is transferred between submodules. Instead, they compete for control of the robot through a central system.


Figure 1.4: Model of an Intelligent System from the Viewpoint of Grounded Cognition.

People have challenged this view as being only suitable for creating insect-like robots, without a proof that higher cognitive functions are possible to achieve [51]. This approach is somewhat similar to the Turing test in that it is claimed that if the behavior of the machine is intelligent in the eye of the beholder, then it is intelligent. Parallels can also be drawn to Simon [207], in which an intelligent system is seen as essentially a very simple mechanism. Its complexity is a mere reflection of the complexity of the environment.

Figure 1.4 shows the schematic of an intelligent agent as envisioned, for example, in psychology and neurophysiology. The field of Grounded Cognition focuses on the role of simulation as exhibited by humans in behavioral studies [23]. The notion of amodal symbols that reside in semantic memory separate from the brain's modal system for perception, action and introspection is rejected. Symbols are seen to be grounded and stored in a multi-modal way. Simulation is a re-enactment of perceptual, motor and introspective states acquired during a specific experience with the world, body and mind. It is triggered when knowledge is needed to represent a specific object, category, action or situation. In that sense it is compatible with other theories such as the simulation theory [84] or the common-coding theory of action and perception [176].

Through the discovery of the so-called Mirror neurons (MNs), these models gained momentum [185]. MNs were first discovered in the F5 area of a monkey brain. During the first experiments, this area was monitored during grasping actions. It was discovered that there are neurons that fire both when a grasping action is performed by the recorded monkey and when this monkey observes the same action performed by another individual. Interestingly, this action has to be goal directed, which is in accordance with the common-coding theory. Its main claim is that perception and action share a common representational domain. Evidence from behavioral studies on humans showed that actions are planned and controlled in terms of their effect [176]. There is also strong evidence that MNs exist in the human brain. Etzel et al. [74] tested the simulation theory through standard machine learning techniques. The authors trained a classifier on functional magnetic resonance imaging (fMRI) data recorded from humans while hearing either of two actions. They showed that the same classifier can also distinguish between these actions when provided with fMRI data recorded from the same person performing them.

Up till now, no agreement has been reached on whether a central representational system is used or a distributed one. Also, no computational model has yet been proposed in the field of psychology or neurophysiology [23]. However, there are some examples in the field of robotics following the general concept of grounded cognition. Ballard [22] introduces the concept of Animate Vision that aims at understanding visual processing in the context of the tasks and behaviors the vision system is currently engaged in. In this paradigm, the need for maintaining a detailed model of the world is avoided. Instead, real-time computation of a set of simple visual features is emphasized such that task-relevant representations can be rapidly computed on demand. Christensen et al. [51] present a system for keeping multi-modal representations of an entity in memory and refining them in parallel. Instead of one monolithic memory system, representations are shared among subsystems. A binder is proposed for connecting each representation to one entity in memory. This can be related to the Prediction box in Figure 1.4. Through this binding, perception of one modality corresponding to an entity could predict the other modalities associated with the same entity in memory. Another example is presented by Kjellström et al. [116]. The authors show that the simultaneous observation of two separate modalities (action and object features) helps to categorize both of them.

1.3 This Thesis

We are specifically concerned with perception for grasping and manipulation. The previously mentioned theories stemming from psychology and neurophysiology study the tight link between perception and action. They place the concept of prediction through simulation into their focus. In this thesis, we will equip a robot with mechanisms to predict unobserved parts of the environment and the effects of its actions. We will show how this helps to efficiently explore the environment but also how it improves perception and manipulation.

We will study this approach in a table-top scenario populated with either known, unknown or familiar objects. As already outlined in the example scenario in Section 1.1, scene understanding is one important capability of a robot. Specifically, this involves segmentation of objects from each other and from the background. Once this is achieved, many other tasks become much easier. The configuration of objects on the table can be determined; they can be identified or categorized; their pose can be estimated; free and occupied space in the environment can be outlined. This kind of scene model can then inform grasp inference mechanisms to finally grasp objects from the table top.

In Figure 1.1, we can observe why scene understanding is challenging.


Figure 1.5: Schematic of this thesis. Given some observations of the real world, the state of the world can be estimated. This can be used to predict unobserved parts of the scene (Prediction box) as formalized in Equation 1.1. It can be used to predict action outcomes (Action box) as formalized in Equation 1.2 and to predict what certain sensors should perceive (Perception box) as in Equation 1.3 and 1.4.

On the depicted office desk, there are piles of clutter that even humans have problems separating visually into single objects. Deformable objects like gloves are shown that come in very many different shapes and styles. Still, we are able to recognize them correctly even though we might not have seen this specific pair before. And there are objects that are completely unknown to us. Independent of this, we would be able to grasp them.

Figure 1.5 is related to Figure 1.4 by adopting the notion of prediction into the perception-action loop. On the left side, you see the real world. It can be perceived through the sensors the robot is equipped with. This is an active process in which the robot can choose what to perceive by moving through the environment or by changing it using its different actuators. On the right, you see a visualisation of the current model the robot has of its environment. This model makes simulation explicit in that perception feeds into memory to update a multi-modal representation of the current scene of interest. This scene is estimated at discrete points in time t to be in a specific state x̂_t. This estimate is then used to implement prediction in three different ways.

• Based on earlier observations, we have a partial estimate of the state of the scene x̂_t. Given this estimate and some prior knowledge, we can predict unobserved parts of it to obtain an a priori state estimate:

    x̂_t^+ = g(x̂_t)    (1.1)

• Given the current state estimate, some action u and a process model, we can predict how the scene model will change over time to obtain a different a priori state estimate:

    x̂_{t+1}^- = f(u, x̂_t)    (1.2)

• Given an a priori state estimate of the scene, we can predict what different sensor modalities should perceive:

    ẑ_t = h(x̂_t^+)    (1.3)

or

    ẑ_{t+1} = h(x̂_{t+1}^-)    (1.4)

These functions can be associated with the boxes positioned around the simulated environment on the right hand side of Figure 1.5. While the Prediction box refers to prediction in space and is related to Equation 1.1, the Action box stands for prediction in time and the implementation of Equation 1.2. The Perception box represents the prediction of sensor measurements as in Equations 1.3 and 1.4.

The scene model serves as a container representation that integrates multi-modal sensory information as well as predictions. In this thesis, we study different kinds of scene representations and their applicability for grasping and manipulation. Independent of the specific representation, we are not only interested in a point estimate of the state. Instead, we consider the world state as a random variable that follows a distribution x_t ∼ N(x̂_t, Σ_t) with the state estimate x̂_t as the mean and a variance Σ_t. This variance is estimated based on the assumed divergence between the output of the functions g(·), f(·) and h(·) and the unknown real value. It quantifies the uncertainty of the current state estimate.

In this thesis, we propose approaches that implement the function g(·) in Equation 1.1 to make predictions about space. We will demonstrate that the resulting a priori state estimate x̂_t^+ and the associated variance Σ_t^+ can be used for either (i) guiding exploratory actions to confirm particularly uncertain areas in the current scene model, (ii) comparing predicted observations ẑ_t or action outcomes with the real observations z_t and updating the state estimate and variance accordingly to obtain an a posteriori estimate, or (iii) exploiting the prediction by using it to execute an action to achieve a given goal.
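To make the roles of g(·), f(·), h(·) and of the uncertainty Σ_t concrete, the following Python sketch implements the predict-and-update cycle for a toy linear-Gaussian case. It only illustrates the structure of Equations 1.1-1.4; the function names, linear models and noise terms are assumptions for this example and not the representations used in later chapters.

```python
import numpy as np

class SceneState:
    """State estimate (mean) and its uncertainty (covariance)."""
    def __init__(self, x_hat, Sigma):
        self.x_hat = np.asarray(x_hat, dtype=float)
        self.Sigma = np.asarray(Sigma, dtype=float)

def predict_space(state, g, Q_space):
    """Eq. 1.1: predict unobserved parts of the scene, x^+ = g(x)."""
    return SceneState(g(state.x_hat), state.Sigma + Q_space)

def predict_time(state, f, u, Q_process):
    """Eq. 1.2: predict the effect of action u, x_{t+1}^- = f(u, x_t)."""
    return SceneState(f(u, state.x_hat), state.Sigma + Q_process)

def predict_measurement(state, H):
    """Eq. 1.3/1.4: predict what a sensor should perceive, z_hat = H x."""
    return H @ state.x_hat

def update(state, z, H, R):
    """Compare predicted and real observation and fuse them (Kalman-style)."""
    z_hat = predict_measurement(state, H)
    S = H @ state.Sigma @ H.T + R                 # innovation covariance
    K = state.Sigma @ H.T @ np.linalg.inv(S)      # gain
    x_post = state.x_hat + K @ (z - z_hat)
    Sigma_post = (np.eye(len(state.x_hat)) - K @ H) @ state.Sigma
    return SceneState(x_post, Sigma_post)

# Example: 2D state, an additive action model and one noisy observation.
state = SceneState([0.0, 0.0], np.eye(2))
state = predict_time(state, f=lambda u, x: x + u, u=np.array([0.1, 0.0]),
                     Q_process=0.01 * np.eye(2))
state = update(state, z=np.array([0.12, 0.05]), H=np.eye(2), R=0.1 * np.eye(2))
print(state.x_hat, np.diag(state.Sigma))
```

The same pattern, predicting and then comparing against real observations, recurs in later chapters with richer models in place of the linear ones used here.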

1.4 Outline and Contributions

This thesis is structured as follows:

Chapter 2 – Foundations

Chapter 2 introduces the hardware and software foundation of this thesis. The applied vision systems have been developed and published prior to this thesis and are only briefly introduced here.

Chapter 3 – Active Scene Understanding

The chapter introduces the problem of scene understanding and motivates the approach followed in this thesis.

Chapter 4 – Predicting and Exploring a Scene

Chapter 4 is the first of the three chapters that propose different prediction mechanisms. Here, we study a classical occupancy grid and a height map as a representation for a table top scene. We propose a method for multi-modal scene exploration where initial object hypotheses formed by active visual segmentation are confirmed and augmented through haptic exploration with a robotic arm. We update the current belief about the state of the map with the detection results and predict yet unknown parts of the map with a Gaussian Process. We show that through the integration of different sensor modalities, we achieve a more complete scene model. We also show that the prediction of the scene structure leads to a valid scene representation even if the map is not fully traversed. Furthermore, we propose different exploration strategies and evaluate them both in simulation and on our robotic platform.

Chapter 5 – Enhanced Scene Understanding through Human-Robot Dialog

While Chapter 4 is more focussed on the prediction of occupied and empty space in the scene, this chapter proposes a method for improving the segmentation of objects from each other. Our approach builds on top of state-of-the-art computer vision methods for segmenting stereo-reconstructed point clouds into object hypotheses. In collaboration with Matthew Johnson-Roberson and Gabriel Skantze, this process is combined with a natural dialog system. By putting a 'human in the loop' and exploiting the natural conversation of an advanced dialogue system, the robot gains knowledge about ambiguous situations beyond its own resolution. Specifically, we introduce an entropy-based system allowing the robot to predict which object hypotheses might be wrong and to query the user for arbitration. Based on the information obtained from the human-to-robot dialog, the scene segmentation can be re-seeded and thereby improved. We analyse quantitatively what features are reliable indicators of segmentation quality. We present experimental results on real data that show an improved segmentation performance compared to segmentation without interaction and compared to interaction with a mouse pointer.

Chapter 6 – Predicting Unknown Object Shape

The purpose of scene understanding is to plan successful grasps on the object hypotheses. Even if these hypotheses each correspond correctly to only one object, we commonly do not know the geometry of their backside. The approach to object shape prediction proposed in this chapter aims at closing these knowledge gaps in the robot's understanding of the world. It is based on the observation that many objects in a service robotic scenario possess symmetries. We search for the optimal parameters of these symmetries given visibility constraints. Once found, the point cloud is completed and a surface mesh reconstructed. This can then be provided to a simulator in which stable grasps and collision-free movements are planned. We present quantitative experiments showing that the object shape predictions are valid approximations of the real object shape.

Chapter 7 – Generation of Grasp Hypotheses

In this chapter, we will show how the object hypotheses that have been formed through different methods can be used to infer grasps. In the first section, we review existing approaches divided into three categories: grasping known, unknown or familiar objects. Given these different kinds of prior knowledge, the problem of grasp inference reduces to either object recognition and pose estimation, finding a similarity metric between functionally similar objects, or developing heuristics for mapping grasps to object representations. For each case, we propose our own methods that extend the existing state of the art.

Chapter 8 – Grasp Execution

Once grasp hypotheses have been inferred, they need to be executed. We demonstrate the approaches proposed in Chapter 7 in an open-loop fashion. In collaboration with Beatriz León and Javier Felip, experiments on different platforms are performed. This is followed by a review of the existing closed-loop approaches towards grasp execution. Although there are a few exceptions, these are usually focussed on either the reaching trajectory or the grasp itself. In collaboration with Xavi Gratal and Javier Romero, we demonstrate visual and virtual visual servoing to execute grasping of known or unknown objects. This helps in controlling the reaching trajectory such that the end effector is accurately aligned with the object prior to grasping.

1.5 Publications

Parts of this thesis have previously been published as listed in the following.

Conferences

[1] Niklas Bergström, Jeannette Bohg, and Danica Kragic. Integration of visual cues for robotic grasping. In Computer Vision Systems, volume 5815 of Lecture Notes in Computer Science, pages 245–254. Springer Berlin / Heidelberg, 2009.

[2] Jeannette Bohg and Danica Kragic. Grasping familiar objects using shape context. In International Conference on Advanced Robotics (ICAR), pages 1–6, Munich, Germany, June 2009.

[3] Jeannette Bohg, Matthew Johnson-Roberson, Mårten Björkman, and Danica Kragic. Strategies for Multi-Modal Scene Exploration. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 4509–4515, October 2010.

[4] Jeannette Bohg, Matthew Johnson-Roberson, Beatriz León, Javier Felip, Xavi Gratal, Niklas Bergström, Danica Kragic, and Antonio Morales. Mind the Gap - Robotic Grasping under Incomplete Observation. In IEEE International Conference on Robotics and Automation (ICRA), May 2011.

[5] Matthew Johnson-Roberson, Jeannette Bohg, Mårten Björkman, and Danica Kragic. Attention-based active 3d point cloud segmentation. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1165–1170, October 2010.

[6] Matthew Johnson-Roberson, Jeannette Bohg, Gabriel Skantze, Joakim Gustafson, Rolf Carlson, Babak Rasolzadeh, and Danica Kragic. Enhanced visual scene understanding through human-robot dialog. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), September 2011.

[7] Beatriz León, Stefan Ulbrich, Rosen Diankov, Gustavo Puche, Markus Przybylski, Antonio Morales, Tamim Asfour, Sami Moisio, Jeannette Bohg, James Kuffner, and Rüdiger Dillmann. OpenGRASP: A toolkit for robot grasping simulation. In International Conference on Simulation, Modeling, and Programming for Autonomous Robots (SIMPAR), November 2010. Best Paper Award.

Journals

[8] Jeannette Bohg and Danica Kragic. Learning grasping points with shape context. Robotics and Autonomous Systems, 58(4):362–377, 2010.

[9] Jeannette Bohg, Carl Barck-Holst, Kai Huebner, Babak Rasolzadeh, Maria Ralph, Dan Song, and Danica Kragic. Towards grasp-oriented visual perception for humanoid robots. International Journal on Humanoid Robotics, 6(3):387–434, September 2009.

[10] Xavi Gratal, Javier Romero, Jeannette Bohg, and Danica Kragic. Visual servoing on unknown objects. IFAC Mechatronics: The Science of Intelligent Machines, 2011. To appear.

Workshops and Symposia

[11] Niklas Bergström, Mårten Björkman, Jeannette Bohg, Matthew Johnson-Roberson, Gert Kootstra, and Danica Kragic. Active scene analysis. In Robotics Science and Systems (RSS'10) Workshop on Towards Closing the Loop: Active Learning for Robotics, June 2010. Extended Abstract.

[12] Jeannette Bohg, Niklas Bergström, Mårten Björkman, and Danica Kragic. Acting and Interacting in the Real World. In European Robotics Forum 2011: RGB-D Workshop on 3D Perception in Robotics, April 2011. Extended Abstract.

[13] Xavi Gratal, Jeannette Bohg, Mårten Björkman, and Danica Kragic. Scene representation and object grasping using active vision. In IROS'10 Workshop on Defining and Solving Realistic Perception Problems in Personal Robotics, October 2010.

[14] Matthew Johnson-Roberson, Gabriel Skantze, Jeannette Bohg, Joakim Gustafson, Rolf Carlson, and Danica Kragic. Enhanced visual scene understanding through human-robot dialog. In 2010 AAAI Fall Symposium on Dialog with Robots, November 2010. Extended Abstract.

2 Foundations

In this thesis, we present methods to incrementally build a scene model suitable for object grasping and manipulation. This chapter introduces the foundations of this thesis. We first present the hardware platform that was used to collect data as well as to evaluate and demonstrate the proposed methods. This is followed by a brief summary of the real-time active vision system that is employed to visually explore a scene. Two different segmentation approaches using the output of the vision system are presented.

2.1 Hardware

The main components of the robotic platform used throughout this thesis are the Karlsruhe active head [17] and a Kuka arm [126] equipped with a Schunk Dexterous Hand 2.0 (SDH) [203] as shown in Figure 2.1. This embodiment enables the robot to perform a number of actions like saccading, fixating, tactile exploration and grasping.

2.1.1 Karlsruhe Active Head

We are using a robotic head [17] that has been developed as part of the Armar III humanoid robot as described in Asfour et al. [16]. It actively explores the environment through gaze shifts and fixation on objects. The head has 7 DoF. For performing gaze shifts, the lower pitch, yaw, roll as well as the eye pitch are used. The upper pitch is kept static. To fixate, the right and left eye yaw are actuated in a coupled manner. Therefore, only 5 DoF are effectively used. The head is equipped with two stereo camera pairs. One has a focal length of 4 mm and therefore a wide field of view. The other one has a focal length of 12 mm, providing the vision system with a close-up view of objects in the scene. For example images, see Figure 2.2.

To model the scene accurately from stereo measurements, the relation between the two camera systems as well as between the cameras and the head origin needs to be determined after each head movement. This is different from stereo cameras with a static epipolar geometry that need to be calibrated only once.


(a) Karlsruhe Active Head. (b) Kuka Arm with Schunk Hand. (c) Schunk Hand.

Figure 2.1: Hardware Components of the Robotic Platform

In the case of the active head, these transformations should ideally be obtainable from the known kinematic chain and the readings from the motor encoders. In reality however, these readings are affected by random and systematic errors. The latter are due to inaccuracies in the kinematic model, and random errors are induced by noise in the motor movements. In our previous work [10], we describe how the whole kinematic chain is calibrated through controlled movements of the robotic arm and of the head. The arm essentially describes a pattern that is uniformly distributed in image space as well as in the depth of field. We use an LED rigidly attached to the hand for tracking the positions of the arm relative to the camera. Through this procedure, we obtain a set of 3D positions relative to the robot arm and the corresponding projections of these points onto the image plane. From these correspondences, we use standard approaches as implemented in [43] to solve for the intrinsic and extrinsic stereo camera parameters. This process is repeated for different joint angles of the head to calibrate the kinematic chain. Thereby, systematic errors are significantly reduced. Regarding random errors, the last five joints in the kinematic chain achieve a repeatability in the range of ±0.025 degrees [17]. The neck pitch and neck roll joints achieve a repeatability in the range of ±0.13 and ±0.075 degrees respectively. How we cope with this noise will be described in more detail in Section 2.2.
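As a rough illustration of solving for camera parameters from such 3D-2D correspondences, the sketch below uses OpenCV's calibrateCamera on synthetic LED positions. It is a simplified stand-in for the procedure in [10] and the implementation referred to in [43]; the single-camera setup, the intrinsic guess and all numeric values are assumptions for the example.

```python
import numpy as np
import cv2

rng = np.random.default_rng(0)
K_true = np.array([[800.0, 0, 640], [0, 800.0, 480], [0, 0, 1]])
image_size = (1280, 960)  # (width, height)

# Synthetic correspondences: 3D "LED" positions and their projections,
# gathered for five different configurations of the kinematic chain.
object_points, image_points = [], []
for _ in range(5):
    pts3d = rng.uniform([-0.2, -0.2, 0.6], [0.2, 0.2, 1.2], (20, 3)).astype(np.float32)
    rvec = rng.normal(0, 0.05, 3)
    tvec = rng.normal(0, 0.05, 3)
    proj, _ = cv2.projectPoints(pts3d, rvec, tvec, K_true, np.zeros(5))
    object_points.append(pts3d)
    image_points.append(proj.astype(np.float32))

# Non-planar points require an initial intrinsic guess.
K_init = np.array([[700.0, 0, image_size[0] / 2],
                   [0, 700.0, image_size[1] / 2],
                   [0, 0, 1]])
rms, K_est, dist, rvecs, tvecs = cv2.calibrateCamera(
    object_points, image_points, image_size, K_init, None,
    flags=cv2.CALIB_USE_INTRINSIC_GUESS)

# The recovered poses give the extrinsics for each head configuration;
# repeating this over the joint space calibrates the kinematic chain.
print("reprojection error [px]:", rms)
print("estimated focal lengths:", K_est[0, 0], K_est[1, 1])
```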

2.1.2 Kuka Arm

The Kuka arm has 6 DoF and is the most reliable component of the system. It has a repeatability of less than 0.03 mm [126].


Figure 2.2: Overview of the Input and Output of the Visual Front End. Left) Foveal and Peripheral Images of a Scene. Middle Left) Karlsruhe Active Head. Middle Right) Visual Front End. Right) Segmented 3D Point Cloud.

2.1.3 Schunk Dexterous Hand

The SDH has three fingers, each with two joints. An additional degree of freedom executes a coupled rotation of two fingers around their roll axis. Each finger is padded with two tactile sensor arrays: one on the distal phalanges with 6 × 13 − 10 cells and one on the proximal phalanges with 6 × 14 cells [203].

2.2 Vision Components

As input to all the methods that are proposed in this thesis, we expect a 3D point cloud of the scene. To produce such a representation, we use a real-time active vision system running on the Karlsruhe active head. In the following, we will introduce this system and present two different ways in which this representation is segmented into background and several object hypotheses.

2.2.1 Real-Time Vision System

In Figure 2.2, we show the structure of the real-time vision system that is capable of gaze shifts and fixation. Segmentation is tightly embedded into this system. In the following, we will briefly explain all the components. For a more detailed description, we refer to Rasolzadeh et al. [182], Björkman and Kragic [35], Björkman and Eklundh [33].

2.2.1.1 Attention

As mentioned in Section 2.1.1, our vision system consists of two stereo camera pairs, a peripheral and a foveal one. Scene search is performed in the wide-field camera by computing a saliency map on it. This map is computed based on the Itti & Koch model [106] and predicts the conspicuity of every position in the visual input. An example of such a map is given in Figure 2.3a. Peaks in this saliency map are used to trigger a saccade of the robot head such that the foveal cameras are centered on this peak.
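As a rough illustration of this attention step, the sketch below computes a simple center-surround intensity saliency map and picks its peak as the saccade target. It is a much reduced stand-in for the Itti & Koch model used here, which also includes color and orientation channels; the scales and the synthetic peripheral image are assumptions.

```python
import numpy as np
import cv2

def intensity_saliency(image_bgr, scales=(2, 4, 8)):
    """Center-surround intensity saliency: |fine - coarse| summed over scales."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY).astype(np.float32) / 255.0
    saliency = np.zeros_like(gray)
    for s in scales:
        center = cv2.GaussianBlur(gray, (0, 0), sigmaX=s)
        surround = cv2.GaussianBlur(gray, (0, 0), sigmaX=4 * s)
        saliency += np.abs(center - surround)
    return saliency / (saliency.max() + 1e-9)

# Synthetic stand-in for a peripheral (wide-field) image with one bright object.
peripheral = np.full((480, 640, 3), 40, dtype=np.uint8)
cv2.circle(peripheral, (450, 200), 25, (200, 220, 255), -1)

saliency_map = intensity_saliency(peripheral)
y, x = np.unravel_index(np.argmax(saliency_map), saliency_map.shape)
print("saccade target (pixel):", (x, y))  # would be converted to pan/tilt angles
```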

2.2.1.2 Fixation

When a rapid gaze shift to a salient point in the wide field is completed, the fixation process is immediately started. The foveal images are initially rectified using the vergence angle read from the calibrated left and right eye yaw joints. This rectification is then refined online by matching Harris corner features extracted from both views and computing an affine essential matrix. The resulting rectified images are then used for stereo matching [43]. A disparity map for the foveal image in Figure 2.2 is given in Figure 2.3c. The vergence angle of the cameras is controlled such that the highest density of points close to the center of the views is placed at zero disparity.
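The stereo-matching step on rectified foveal images can be illustrated with a generic block matcher. OpenCV's StereoSGBM below is not necessarily the method of [43]; the parameter values and the synthetic image pair (a texture shifted by a known disparity) are assumptions for the example.

```python
import numpy as np
import cv2

# Synthetic rectified pair: a textured image and a copy shifted by 8 pixels.
rng = np.random.default_rng(0)
texture = rng.uniform(0, 255, (480, 640)).astype(np.uint8)
texture = cv2.GaussianBlur(texture, (0, 0), 2)
left = texture
right = np.roll(texture, -8, axis=1)   # simulates a uniform disparity of 8 px

matcher = cv2.StereoSGBM_create(minDisparity=0, numDisparities=64, blockSize=7)
disparity = matcher.compute(left, right).astype(np.float32) / 16.0  # fixed-point

# Fixation control would adjust the vergence angle so that the densest region
# near the image center sits at zero disparity.
h, w = disparity.shape
center = disparity[h // 3: 2 * h // 3, w // 3: 2 * w // 3]
valid = center[center > 0]
print("median disparity around the center [px]:", float(np.median(valid)))
```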

2.2.1.3 Segmentation

For 3D object segmentation we use a recent approach by Björkman and Kragic [35]. It relies on three possible hypotheses: figure, ground and a flat surface. It is assumed that most objects are placed on flat surfaces, thereby simplifying the segregation of the object from its supporting plane.

The segmentation approach is an iterative two-stage method that first performs pixel-wise labeling using a set of model parameters and then updates these parameters in the second stage. This is similar to Expectation-Maximization, with the distinction that instead of enumerating over all combinations of labelings, model evidence is summed up on a per-pixel basis using marginal distributions of labels obtained through belief propagation. The model parameters consist of the following three parts, corresponding to the foreground, background and flat surface hypotheses:

    θ_f = {p_f, Σ_f, c_f},
    θ_b = {d_b, Σ_b, c_b},
    θ_s = {α_s, β_s, δ_s, Σ_s, c_s}.

p_f denotes the mean 3D position of the foreground. d_b is the mean disparity of the background, with the spatial coordinates assumed to be uniformly distributed. The surface disparities are assumed to be linearly dependent on the image coordinates, i.e., d = α_s x + β_s y + δ_s. All these spatial parameters are modeled as normal distributions, with Σ_f, Σ_b and Σ_s being the corresponding covariances. The last three parameters, c_f, c_b and c_s, are represented by color histograms expressed in hue and saturation space.

For initialization, there has to be some prior assumption of what is likely to belong to the foreground. In this thesis, we have a fixating system and assume that points close to the center of fixation are most likely to be part of the foreground. An imaginary 3D ball is placed around this fixation point and everything within the ball is initially labeled as foreground. For the flat surface hypothesis, RANSAC [80] is applied to find the most likely plane. The remaining points are initially labeled as background points.

Other approaches to initializing new foreground hypotheses, for example through human-robot dialogue or motion of objects induced by either a person or the robot itself, are presented in [28, 29, 140]. Furthermore, these articles present the extension of the segmentation framework to keep multiple object hypotheses segmented simultaneously.
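The initialization step can be illustrated with a few lines of RANSAC plane fitting on the point cloud followed by the fixation-ball labeling. This is a generic sketch of the idea, not the implementation of [80] or [35]; the inlier threshold, the ball radius and the synthetic table-top data are assumed values.

```python
import numpy as np

def ransac_plane(points, n_iter=200, threshold=0.01):
    """Fit a plane n.x + d = 0 to an (N, 3) cloud; return (n, d) and inlier mask."""
    rng = np.random.default_rng(1)
    best_inliers = np.zeros(len(points), dtype=bool)
    best_plane = None
    for _ in range(n_iter):
        sample = points[rng.choice(len(points), 3, replace=False)]
        n = np.cross(sample[1] - sample[0], sample[2] - sample[0])
        norm = np.linalg.norm(n)
        if norm < 1e-9:
            continue  # degenerate (collinear) sample
        n = n / norm
        d = -n @ sample[0]
        inliers = np.abs(points @ n + d) < threshold
        if inliers.sum() > best_inliers.sum():
            best_inliers, best_plane = inliers, (n, d)
    return best_plane, best_inliers

# Synthetic scene: a table plane at z = 0.5 m plus a small object above it.
rng = np.random.default_rng(0)
table = np.column_stack([rng.uniform(-0.3, 0.3, (500, 2)), np.full(500, 0.5)])
object_pts = rng.uniform([-0.05, -0.05, 0.5], [0.05, 0.05, 0.6], (100, 3))
cloud = np.vstack([table + rng.normal(0, 0.002, table.shape), object_pts])

(plane_n, plane_d), surface_mask = ransac_plane(cloud)
fixation = np.array([0.0, 0.0, 0.55])  # assumed fixation point on the object
foreground_mask = ~surface_mask & (np.linalg.norm(cloud - fixation, axis=1) < 0.1)
background_mask = ~surface_mask & ~foreground_mask
print(surface_mask.sum(), foreground_mask.sum(), background_mask.sum())
```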

2.2.1.4 Re-Centering

The attention points, i.e., the peaks in the saliency map of the wide-field images, tend to be on the border of objects rather than on their center. Therefore, when performing a gaze shift, the center of the foveal images does not necessarily correspond to the center of the objects. We perform a re-centering operation to maximize the amount of an object visible in the foveal cameras. This is done by letting the iterative segmentation process stabilize for a specific gaze direction of the head. Then the center of mass of the segmentation mask is computed. A control signal is sent to the head to correct its gaze direction such that the center of the foveal images is aligned with the center of segmentation. After this small gaze shift has been performed, the fixation and segmentation processes are re-started. This process is repeated until the center of the segmentation mask is sufficiently aligned with the center of the images.

An example of the resulting segmentation is given in Figure 2.3b, in which the object boundaries are drawn on the overlaid left and right rectified images of the foveal cameras. Examples of the point clouds calculated from the segmented disparity map are depicted in Figure 2.4. A complete labeled scene is shown on the right of Figure 2.2.
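One iteration of the re-centering loop reduces to computing the segmentation centroid and converting its offset from the image center into a small gaze correction. The sketch below shows one way to do this; the focal length, the small-angle conversion and the toy mask are assumptions and would in practice come from the head's calibration.

```python
import numpy as np

def recenter_offset(mask, focal_length_px=1200.0):
    """Pixel offset of the segmentation centroid from the image center and the
    corresponding (yaw, pitch) correction in radians (small-angle model)."""
    ys, xs = np.nonzero(mask)
    cy, cx = ys.mean(), xs.mean()
    h, w = mask.shape
    dx, dy = cx - w / 2.0, cy - h / 2.0
    return (dx, dy), (np.arctan2(dx, focal_length_px), np.arctan2(dy, focal_length_px))

# One iteration on a toy mask: an object segmented off-center in a 480x640 image.
mask = np.zeros((480, 640), dtype=bool)
mask[100:200, 400:500] = True
(dx, dy), (yaw_corr, pitch_corr) = recenter_offset(mask)
print(f"offset ({dx:.0f}, {dy:.0f}) px -> yaw {yaw_corr:.4f} rad, pitch {pitch_corr:.4f} rad")
# The loop re-runs fixation and segmentation and repeats until |dx|, |dy| are small.
```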

2.2.2 Attention-based Segmentation of 3D Point Clouds

In this section, we present an alternative to the approach for segmenting point clouds into object hypotheses described in the previous section. It uses a Markov Random Field (MRF) graphical model framework. This paradigm allows for the identification of multiple object hypotheses simultaneously and is described in full detail in [5]. Here, we will only give a brief overview.

(a) Saliency Map of Wide-Field Image. (b) Segmentation on Overlayed Fixated Rectified Left and Right Images. (c) Disparity Map.

Figure 2.3: Example Output for Visual Processes run on Wide-Field and Foveal Images shown in Figure 2.2.

(a) Point Cloud of a Toy Tiger. (b) Point Cloud of Mango Can.

Figure 2.4: Example 3D Point Cloud (from two Viewpoints) generated from Disparity Map and Segmentation as shown in Figure 2.3c and 2.3b.

The active humanoid head uses saliency to direct its gaze. By fully reconstructing each stereo view after each gaze shift and merging the resulting partial point clouds into one, we obtain scene reconstructions as shown in Figure 2.5. The fixation points serve as seed points that we project into the point cloud to create initial clusters for the generation of object hypotheses.

For the full segmentation, we perform energy minimization in a multi-label MRF. We use the multi-way cut framework as proposed in Boykov et al. [41]. In our application, the MRF is modelled as a graph with two sets of costs for assigning a specific label to a node in that graph: unary costs and pairwise costs. In our case, the unary cost describes the likelihood of membership to an object hypothesis' color distribution. This distribution is modelled by Gaussian Mixture Models (GMMs) as utilized in GrabCut by Rother et al. [188]. For each salient region, one GMM is created to model the color properties of that object hypothesis.

Pairwise costs enforce smoothness between adjacent labels. The pairwise structure of the graph is derived from a KD-tree neighborhood search directly on the point cloud. The 3D structure provides the links between points and enforces neighbor consistency.


Figure 2.5: Iterative scene modeling through a visual front end similar to the one shown in Figure 2.2. Segmentation is applied to the point cloud of the whole scene by simultaneously assuming several object hypotheses. Seed points are the fixation points.

Once the pairwise and unary costs are computed, the energy minimization can be performed using standard methods. The α-expansion algorithm, with an available implementation [42, 119, 40], efficiently computes an approximate solution that approaches the NP-hard global solution.

There are several differences between this approach and the one presented in Section 2.2.1.3. First of all, no flat surface hypothesis is explicitly introduced. This approach is therefore independent of the presence of a table plane. Secondly, object hypotheses are represented by GMMs instead of color histograms and distributions of disparities. And thirdly, several object hypotheses are segmented simultaneously. This has been shown in [5] to be beneficial for segmentation accuracy over keeping just one object hypothesis, especially in complex scenes. However, the approach is not real-time. Recently, the previous approach to segmentation has been extended to simultaneously keep several object hypotheses segmented by Bergström et al. [29].
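To make the cost construction concrete, the sketch below builds GMM-based unary costs and KD-tree-derived pairwise links for a toy colored point cloud. It only prepares the inputs of the multi-way cut; the actual α-expansion step [42, 119, 40] is not shown, and the number of mixture components, the colors and the neighborhood sizes are assumed values.

```python
import numpy as np
from scipy.spatial import cKDTree
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Toy colored point cloud: two clusters ("objects") with distinct colors.
xyz = np.vstack([rng.normal([0.0, 0.0, 0.5], 0.02, (200, 3)),
                 rng.normal([0.2, 0.0, 0.5], 0.02, (200, 3))])
rgb = np.vstack([rng.normal([0.9, 0.1, 0.1], 0.05, (200, 3)),
                 rng.normal([0.1, 0.1, 0.9], 0.05, (200, 3))])

# Seed clusters around the two fixation points initialize one GMM per hypothesis.
seeds = [np.array([0.0, 0.0, 0.5]), np.array([0.2, 0.0, 0.5])]
tree = cKDTree(xyz)
gmms = []
for seed in seeds:
    idx = tree.query_ball_point(seed, r=0.03)            # points near the seed
    gmms.append(GaussianMixture(n_components=2, random_state=0).fit(rgb[idx]))

# Unary costs: negative log-likelihood of each point's color under each GMM.
unary = -np.column_stack([g.score_samples(rgb) for g in gmms])   # (N, labels)

# Pairwise structure: edges between k nearest neighbors in 3D.
_, nn = tree.query(xyz, k=5)
edges = {(i, j) for i, row in enumerate(nn) for j in row[1:]}
print("unary cost matrix:", unary.shape, "graph edges:", len(edges))
# These costs and edges would then be handed to an alpha-expansion solver.
```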

2.3 Discussion

In this chapter, we presented the hardware and software foundations used in this thesis. They enable the robot to visually explore a scene and segment objects from the background as well as from each other. Furthermore, the robot can interact with the scene with its arm and hand.

Given these capabilities, what are the remaining factors that render the problem of robust grasping and manipulation in the scene difficult? First of all, the scene model that results from the vision components is only a partial representation of the real scene. As can be observed in Figure 2.5, it has gaps and contains noise. The segmentation of objects might also be erroneous if objects of similar appearance are placed very close to each other. Secondly, given such an uncertain scene model, it is not clear how to infer the optimal grasp pose of the arm and hand such that segmented objects can be grasped successfully. And lastly, the execution of a grasp is subject to random and systematic errors originating from noise in the motors and from an imprecise model of the hardware kinematics and of the relation between cameras and actuators.

In the following, we will propose different methods to (i) improve scene understanding, (ii) infer grasp poses under significant uncertainty and (iii) demonstrate robust grasping through closed-loop execution.

3 Active Scene Understanding

The term scene understanding is very general and means different things depending on the scale of the scene, the kind of sensory data, the current task of the intelligent agent and the available prior knowledge. In the computer vision community, it refers to the ultimate goal of being able to parse an image (usually monocular), outline the different objects or parts in it and finally label them semantically. A few recent examples of approaches towards this goal are presented by Liu et al. [137], Malisiewicz and Efros [143] and Li et al. [134]. Their common denominator is the reliance on labeled image databases on which, for example, similarity metrics or category models are learnt.

In the robotics community, the goal of scene understanding is similar; however, the prerequisites for achieving it are different. First, a robot is an embodied system that can not only move in the environment but also interact with it and thereby change it. And second, robotic platforms are usually equipped with a whole sensor suite delivering multi-modal observations of the environment, e.g., laser range data, monocular or stereo images and tactile sensor readings.

Depending on the research area, the task of scene understanding changes. In mobile robotics, the scene can be an office floor or even a whole city. The aim is to acquire maps suitable for navigation, obstacle avoidance and planning. This map can be metric as for example in O'Callaghan et al. [162] and Gallup et al. [85]; it can use topological structures for navigation as in Pronobis and Jensfelt [178] and Ekvall et al. [71]; or it can contain semantic information as for example in Wojek et al. [232] and Topp [220]. Pronobis and Jensfelt [178], Ekvall et al. [71] and Topp [220] combine these different representations into a layered architecture. The metric map constitutes the most detailed and therefore bottommost layer. The higher up a layer is in this architecture, the more abstracted it is from the metric map.

If the goal is mobile manipulation, the requirements for the maps change. More emphasis has to be placed on the objects in the scene that have to be manipulated. These objects can be doors as for example in Rusu et al. [196], Klingbeil et al. [117] and Petersson et al. [170], but also items we commonly find in our offices and homes like cups, books or toys, as shown for example in Kragic et al. [122], Ciocarlie et al. [52], Grundmann et al. [95] and Marton et al. [148]. Approaches towards mobile manipulation mostly differ in what kind of prior knowledge is assumed.

from full CAD models of objects to very low-level heuristics for grouping 3D points into clusters. A more detailed review of these will be presented in Chapter 7. Other approaches towards scene understanding exploit the capability of a robot to interact with the scene. Bergström et al. [28] and Katz et al. [110] use motion induced by a robot to segment objects from each other or to learn kinematic models of articulated objects. Luo et al. [140] and Papadourakis and Argyros [164] segment different objects from each other by observing a person interacting with them. An insight that can be derived from these very different approaches towards scene understanding is that no representation has yet been widely adopted across the different areas of robotics. The choice is usually made depending on the task, the available sensor data and the scale of the scene. A common situation in robotics is that the scene cannot be captured completely in one view. Instead, several subsequent observations are merged. The task of the robot usually determines what parts of the environment are currently interesting and what parts are irrelevant. Hence, it is desirable for the robot to adopt an exploration strategy that accelerates the understanding of the task-relevant details of the scene as opposed to exhaustively examining its whole surroundings. The research field of active perception as defined by Bajcsy [20] is concerned with planning a sequence of control strategies that minimise a loss function while at the same time maximising the amount of obtained information. Especially in the field of active vision, this paradigm has led to the development of head-eye systems as in Ballard [22] and Pahlavan and Eklundh [163]. These systems could shift their gaze and fixate on task-relevant details of the scene. It was shown that through this approach, visual computation of task-relevant representations could be made significantly more efficient. The active vision system utilized in this thesis is based on the findings in Rasolzadeh et al. [182]. As described in Section 2.2.1, it explores a scene guided by an attention system. The resulting high resolution detail of the scene is then used to grasp objects. Other examples of active vision systems are those concerned with view planning for object recognition, like for example Roy [190], or reconstruction, like Wenhardt et al. [230]. Aydemir et al. [18] applied active sensing strategies for efficiently searching for objects in large scale environments.

Problem Formulation

In the following three chapters of this thesis, we study the problem of scene understanding for grasping and manipulation in the presence of multiple instances of unknown objects. The scenes considered here will be the very common case of table-top environments. We are considering different sources of information that will help us with this task. These can be sensors such as the vision system and tactile arrays described in Section 2.1, or they can be humans that the robot is interacting with through a dialog system.

Requirements

The proposed methods have to be able to integrate multi-modal information into a unified scene representation. They have to be robust to noise in order to estimate an accurate model of the scene suitable for grasping and manipulation.

Assumptions

Throughout most of this thesis, we make the common assumption that the dominant plane in the scene can be detected. As already noted by Ballard [22], this kind of context information makes it possible to constrain different optimisation problems or allows for an efficient scene representation. However, we will discuss how the proposed methods are affected when no dominant plane can be detected and how they could be generalized to this case. We bootstrap the scene understanding process by obtaining an initially segmented scene model from the active vision system. First object hypotheses will already be segmented from the background as visualized in Figure 2.2. Throughout most of the next three chapters of the thesis, we assume that we do not have any prior knowledge on specific object instances or object categories. Instead, we start from assumptions about objectness, i.e., general characteristics of what constitutes an object. A commonly used assumption is that objects tend to be homogeneous in their attribute space. The segmentation methods described in Section 2.2.1.3 use color and disparity as these attributes to group specific locations in the visual input into object hypotheses. Other characteristics can be general object shape, as for example symmetry.

Approach

As emphasized by Bajcsy [20], prediction is essential in active perception. Without being able to predict the outcome of specific actions, neither their cost nor their benefit can be computed. Hence, the information for choosing the best sequence of actions is missing. To enable prediction, sensors and actuators as well as their interaction have to be modelled with minimum error. Consider, for example, the control of the active stereo head in Figure 2.1a. If we had no model of its forward kinematics, we would not be able to predict the direction in which the robot is going to look after sending commands to the motors. Modeling the noise profiles of the system components also gives us an idea about the uncertainty of the prediction. For example, we are using the haptic sensors of our robotic hand for contact detection. By modeling the noise profile of the sensor pads, we can quantify the uncertainty in each contact detection. In the following chapters, we will propose different implementations of the function g(·) in Equation 1.1 to make predictions about parts of the scene that have not been observed so far:

x̂_t^+ = g(x̂_t)

in which the current estimate x̂_t of the state of the scene is predicted to be x̂_t^+. Additionally, we are also interested in quantifying how uncertain we are about this prediction, i.e., in the associated variance Σ_t^+. We will study different models of the scene state x but also different kinds of assumptions and priors about objectness. These will have different implications on what can be predicted and how accurately. The resulting predictions will help to guide active exploration of the scene, to update the current scene model and to interact with it in a robust manner.

In Chapter 4, we will use the framework of Gaussian Processes (GP) to predict the geometric structure of a whole table-top scene and guide haptic exploration. The state x is modelled as a set of 2D locations x_i = (u_i, v_i)^T on the table. In this chapter, we are studying two different scene representations and their applicability to the task of grasping and manipulation. In an occupancy grid, each location has a value indicating its probability p(x_i = occ | {z}_t) to be occupied by an object given the set of observations {z}_t. In a height map, each location has a value that is equal to the height h_i of the object standing on it. Using the GP framework, we approximate the function g(·) by a distribution g(x̂_t) ∼ N(µ, Σ) where x̂_t^+ = µ and Σ_t^+ = Σ. As will be detailed in Chapter 4, this distribution is computed given a set of sample observations and a chosen covariance function. This choice imposes a prior on the kind of structure that can be estimated.

In Chapter 5, we explore how a person interacting through a dialogue system can help a robot in improving its understanding of the scene. We assume that an initial scene segmentation is given by the vision system in Section 2.2. The state of the scene x is modelled as a set of 3D points x_i = (x_i, y_i, z_i)^T. These are carrying labels l_i which indicate whether they belong to the background (l_i = 0) or to one of N object hypotheses (l_i = j with j ∈ 1...N). In this chapter, we are specifically dealing with the case of an under-segmentation of objects that commonly happens when applying a bottom-up segmentation technique to objects of similar appearance standing very close to each other. We utilise the observation that single objects tend to be more uniform in certain attributes than two objects that are grouped together. Given this prior, a set of points {x}_j that are assumed to belong to an object hypothesis j and a feature vector z_j, we can estimate the probability p({x}_j = j | z_j) of the set belonging to one object hypothesis j. If this estimate falls below a certain threshold, disambiguation is triggered with the help of a human operator. Based on this information, the function g(·) improves the current segmentation of the scene by re-estimating the labels of each scene point.

In Chapter 6, we focus on single object hypotheses rather than on whole scene structures. We show how we can predict the complete geometry of unknown objects to improve grasp inference. As in the previous chapter, the state of the scene is modelled as a set of 3D points x_i = (x_i, y_i, z_i)^T and we want to estimate which ones belong to a specific object hypothesis. Different to the previous section, our goal is not to re-estimate the labels of observed points. Instead, the function g(·) has to add points to the scene that are assumed to belong to an object hypothesis. As a prior, we assume that especially man-made objects are commonly symmetric. The problem of modeling the unobserved object part can then be formulated as an optimisation of the parameters of the symmetry plane. The uncertainty about each added point is computed based on visibility constraints.

4 Predicting and Exploring Scenes

The ability to interpret the environment, detect and manipulate objects is at the heart of many autonomous robotic systems as for example in Petersson et al. [170] and Ekvall et al. [71]. These systems need to represent objects for generating task-relevant actions. In this chapter, we present strategies for autonomously exploring a table-top scene. Our robotic setup consists of the vision system presented in Section 2.2 that generates initial object hypotheses using active visual segmentation. Thereby, large parts of the scene are explored in a few glances. However, without significantly changing the viewpoint, areas behind objects remain occluded. For finding suitable grasps or for deciding where to place an object once it is picked up, a detailed representation of the scene in the current radius of interaction is essential. To achieve this, parts of the scene that are not visible to the vision system are actively explored by the robot using its hand with tactile sensors. Compared to a gaze shift, moving the arm is expensive in terms of time relative to the gain in information. Therefore, the next best measurement has to be determined to explore the unknown space efficiently. In this chapter, we study different aspects of this problem. First of all, we need to find an accurate and efficient scene representation that can accommodate multi-modal information. In Sections 4.1 and 4.2, we analyse the feasibility of an occupancy grid and a height map for grasping and manipulation. Second, we need to find efficient exploration strategies that provide maximum information at minimum cost. We compare two approaches from the area of mobile robotics. First, we use Spanning Tree Covering (STC) as proposed by Gabriely and Rimon [83]. It is optimal in the sense that every place in the scene is explored just once. Secondly, we extend the approach presented in O'Callaghan et al. [162] where unexplored areas are predicted from sparse sensor measurements by a Gaussian Process (GP). Exploration then aims at confirming this prediction and reducing its uncertainty with as few sensing actions as possible. The resulting scene model is multi-modal in the sense that it i) generates object hypotheses emerging from the integration of several visual cues, and ii) fuses visual and haptic information.


Algorithm 1: Pseudo Code for Scene Exploration
Data: Segmented point cloud S from active segmentation
Result: Fully explored P
begin
    t = 0, j = 0
    P = project(S)
    while |P_u| > 0 do
        P̂ = predict(P)
        p_j = planNextMeasurements(P, P̂)
        repeat
            t++
            z_t = observe(P, p_j)
            P = update(P, z_t)
        until z_t ≠ occ
        j++
    end
end
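For illustration only, the main loop of Algorithm 1 can be transcribed into a few lines of Python. The sketch below is not the implementation used in this thesis; the helper functions (project, predict, plan_next_measurements, observe, update, unknown_cells) are placeholders for the components described in the following sections.

# Illustrative sketch of Algorithm 1; all helpers are placeholders.
def explore_scene(segmented_cloud, project, predict, plan_next_measurements,
                  observe, update, unknown_cells):
    """Explore a table-top scene until no cell of the grid is unknown."""
    grid = project(segmented_cloud)                # P = project(S)
    t, j = 0, 0
    while len(unknown_cells(grid)) > 0:            # while |P_u| > 0
        predicted = predict(grid)                  # P_hat = predict(P)
        path = plan_next_measurements(grid, predicted)   # p_j
        while True:                                # repeat ... until z_t != occ
            t += 1
            z = observe(grid, path)                # z_t = observe(P, p_j)
            grid = update(grid, z)                 # P = update(P, z_t)
            if z != "occ":
                break
        j += 1
    return grid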

4.1 Occupancy Grid Maps for Grasping and Manipulation

In this section, we analyse the suitability of the traditional 2D occupancy grid (OG) as a scene representation. It was originally proposed by Elfes [73] and is nowadays widely used in mobile robotics. One of the assets of this representation is that it is well suited for integrating measurements from different sources.

4.1.1 Scene Representation

The grid, which is aligned with the table top, uniformly subdivides the scene into N cells c_i with coordinates (w_i, v_i). Each cell has a specific state s(c_i). For simplicity, we will refer to it as s_i. It is defined as a binary random variable with two possible values: occupied (occ) or empty (emp). It holds that p(s_i = occ) + p(s_i = emp) = 1. We define the set P = {c_i | 0 < i ≤ N} of all cells, the subset P_k ⊆ P of cells that have already been observed, and P_u = P \ P_k as the cells that remain unobserved.

Figure 4.1: Generation of an occupancy grid from individual views. (a) ARMAR robot head that (b) gathers several views. (c) Views are projected into a common reference frame and (d) cleaned to remove noise. (e) Points are labeled according to the 3D object segmentation (Section 2.2). (f) Scene is voxelized. The voxels that belong to objects are projected down into the map (g). Blue labels are unseen cells and gray levels correspond to occupancy probability.

The initial P contains a large amount of unknown space that needs to be explored with the hand. The single steps in the main loop of Algorithm 1 will be explained in the following sections.

4.1.2 Scene Observation

For visual exploration of the scene, we use the active vision system as presented in Section 2.2.1. It produces an initial scene model segmented into background and several object hypotheses. An example of such a model can be seen in Figure 4.1. For haptic exploration, we use the proximal and the distal sensor matrices H_p and H_d of one of the fingers of the Schunk hand. Our goal is to distinguish between the cases when this finger is in contact with an object and when it is not. For this purpose, we compute a noise profile of the sensor matrices. Let h_p be the vector of measurements formed by flattening H_p and, similarly, h_d be the measurement vector corresponding to H_d. We model the distribution of these random variables as multivariate normal distributions h_p ∼ N(µ_p, Σ_p) and h_d ∼ N(µ_d, Σ_d), respectively. The means and covariance matrices are computed from a number of measurements recorded while the two sensor matrices were in a known non-contact situation. A contact with an object can then be seen as a multivariate outlier. For outlier detection, we compute the Mahalanobis distance between the current measurement h_{t|p,d} and the respective mean µ_{p,d}.

d(h_t) = sqrt( (h_t − µ)^T Σ^{-1} (h_t − µ) )    (4.1)

Note that the subscripts p and d are skipped in this equation for simplicity. If d(h_t) is greater than a threshold φ, then we know that this haptic sensor matrix is in contact and z_t = contact. Otherwise, z_t = ¬contact.

4.1.3 Map Update

For each movement of the haptic sensors along the planned path, we receive a measurement z_{t+1}. Based on this and the current estimate p(s_i = occ | {z}_t) of the state of each cell s_i in the occupancy grid, we want to estimate

p(s_i = occ | {z}_{t+1}) = p(z_{t+1} | s_i = occ) p(s_i = occ | {z}_t) / Σ_{s_i} p(z_{t+1} | s_i) p(s_i | {z}_t)    (4.2)

In this recursive formulation, the resulting new estimate p(s_i = occ | {z}_{t+1}) is stored in the occupancy grid. p(z_{t+1} | s_i) constitutes the haptic sensor model. To compute p(z_{t+1} | s_i = emp), the current measurement h_{t+1|p,d} and Equation 4.1 are used as follows

p(z_{t+1} | s_i = emp) = exp(−d(h_{t+1}) / 2)    (4.3)
                       = exp(−(1/2) sqrt( (h_{t+1} − µ)^T Σ^{-1} (h_{t+1} − µ) ))    (4.4)

For the case s_i = occ, we empirically determined the following discrete probability values

p(z_{t+1} = contact | s_i = occ) = 0.9    (4.5)
p(z_{t+1} = ¬contact | s_i = occ) = 1 − p(z_{t+1} = contact | s_i = occ).    (4.6)

An alternative to this way of modeling the sensor response (when in contact with an object) is to collect a set of measurements. From this set, a distribution could be estimated similar to what is described in Section 4.1.2. However, given the many different shapes of objects and the resulting variety of haptic sensor patterns, modeling the opposite case of a non-contact is the simpler but still effective solution.
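To make the interplay between the contact test of Equation 4.1 and the cell update of Equation 4.2 concrete, the following Python sketch implements both for a single cell. It is illustrative only: the function names, the regularisation term and the value of the threshold phi are assumptions, while the constant 0.9 and the exponential sensor model are taken from Equations 4.3 to 4.6.

import numpy as np

def fit_noise_profile(calibration_readings):
    """Mean and inverse covariance of flattened tactile readings recorded
    in a known non-contact situation (Section 4.1.2)."""
    H = np.asarray(calibration_readings, dtype=float)   # (num_samples, num_cells)
    mu = H.mean(axis=0)
    cov = np.cov(H, rowvar=False) + 1e-6 * np.eye(H.shape[1])  # assumed regularisation
    return mu, np.linalg.inv(cov)

def mahalanobis(h, mu, cov_inv):
    """Equation 4.1: distance of a measurement from the non-contact model."""
    d = h - mu
    return float(np.sqrt(d @ cov_inv @ d))

def update_cell(p_occ, h, mu, cov_inv, phi=3.0):
    """One recursive update of p(s_i = occ) for a single cell (Equation 4.2).
    phi is a user-chosen contact threshold (assumed value here)."""
    d = mahalanobis(h, mu, cov_inv)
    contact = d > phi                     # discretised haptic outcome z_t
    p_z_emp = np.exp(-0.5 * d)            # Equations 4.3/4.4
    p_z_occ = 0.9 if contact else 0.1     # Equations 4.5/4.6
    num = p_z_occ * p_occ
    den = num + p_z_emp * (1.0 - p_occ)   # normaliser of Equation 4.2
    return num / den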

4.1.4 Map Prediction

In the traditional occupancy grid, cells that have not been observed yet will have a probability p(s_i = occ | {z}_t) = 0.5, i.e., there is no information available about the state of these cells. However, we know that cells in the grid that are close to occupied spaces, but are not directly observable due to occlusions, are likely to be part of the occluding object. By modeling this spatial correlation, we can predict unobserved places from observed ones. Instead of exploring the whole environment exhaustively, we want to confirm the predicted map at specifically uncertain places. Recently, O'Callaghan et al. [162] proposed that the spatial correlation in a 2D occupancy grid can be modeled with a Gaussian Process. The assumption of independence of neighboring cells in traditional occupancy grids is thereby removed. A GP is used to fit a likelihood function to training data. In our case this is the set of cells c_r ∈ P_k and their estimated state s_r of occupancy. Given the estimated continuous function over the occupancy grid, we can then estimate the state s_j of the cells c_j ∈ P_u that have not been observed yet. We will briefly introduce GPs first for the simpler case of regression and then for the problem of classifying a cell to be either occupied or empty, which is the core of this section. For a more detailed explanation, we refer to Rasmussen and Williams [181].

4.1.4.1 Gaussian Process Regression

A GP is defined as a collection of a finite number of random variables with a joint Gaussian distribution. A GP can be seen as a distribution over functions with a mean µ and covariance Σ. Given a set of input values x and corresponding target values y, the goal in GP regression is to estimate the underlying latent function g(·) that generated these samples. We will also refer to the set {x_i, y_i}_M as the training set with cardinality M. Expressed as a linear regression model with Gaussian noise ǫ, the relation between input and target values can be modelled as follows:

g(x_i) = φ(x_i)^T w,    y_i = g(x_i) + ǫ    (4.7)
with the noise ǫ ∼ N(0, σ_M^2)    (4.8)

following an independently and identically distributed Gaussian with zero mean and variance σ_M^2. w is a vector of weight parameters of the linear model. φ(x_i) is a function that maps D-dimensional input data into an N-dimensional feature space. Under this formulation, the likelihood of the training data given the model is computed as a factored distribution over all training samples:

p(y | x, w) = Π_{i=1}^{M} p(y_i | x_i, w)    (4.9)
            = Π_{i=1}^{M} 1/(√(2π) σ_M) exp(−(y_i − φ(x_i)^T w)^2 / (2σ_M^2))    (4.10)
            = N(φ(x)^T w, σ_M^2 I).    (4.11)

If we use a zero-mean Gaussian prior on the weights w ∼ N(0, Σ_w), then the posterior likelihood p(w | x, y) is proportional to the product of the likelihood and the prior

p(w | x, y) ∝ p(y | x, w) p(w)    (4.12)

which is again Gaussian. We are primarily interested in predicting the target value y′ for new input values x′. The predictive distribution p(g(x′) | x′, x, y) is computed by averaging the outputs of all possible linear models weighted with the posterior distribution of the model given the training set:

p(g(x′) | x′, x, y) = ∫ p(g(x′) | x′, w) p(w | x, y) dw.    (4.13)

Since all the factors over which we want to integrate follow a Gaussian distribution, we can evaluate the integral in Equation 4.13 analytically. The predictive distribution is then Gaussian again and defined as g(x′) ∼ N(µ, Σ) where

µ = k(x′, x)^T [K(x, x) + σ_M^2 I]^{-1} y    (4.14)
Σ = k(x′, x′) − k(x′, x)^T [K(x, x) + σ_M^2 I]^{-1} k(x′, x).    (4.15)

The entries of the Gram matrix K(x, x)_{a,b} at row a and column b are defined based on a covariance function k(x_a, x_b) with some hyperparameters θ and with a, b ∈ {1...M}. This covariance function is related to the function φ(·) from Equation 4.7 as follows:

k(x_a, x_b) = φ(x_a)^T Σ_w φ(x_b).    (4.16)

The most widely used covariance function is the squared exponential (SE)

k(x_a, x_b) = σ_l^2 exp(−(x_a − x_b)^T L^{-1} (x_a − x_b) / 2)    (4.17)

where the hyperparameters are σ_l, the signal variance, and L, the identity matrix multiplied with the length scale l. There is a large set of other covariance functions proposed in the literature. More details about those applied in this thesis can be found in Appendix A. As will be shown later in Section 4.1.6, the simple SE kernel achieves the best results for scene prediction during exploration. The kind of covariance function as well as its hyperparameters have to be chosen such that g(·) optimally models the underlying process. As suggested in [181], this model selection problem can be treated in a Bayesian way. In this thesis, we want to choose the hyperparameters of a given covariance function such that the marginal likelihood p(y | x, θ) of the data given the model is maximised. This marginal likelihood is defined as follows:

p(y | x, θ) = ∫ p(y | x, f, θ) p(f | x) df    (4.18)

where f = g(x). A desirable property of p(y | x, θ) is that it automatically trades off model fit against model complexity. This can be observed when writing out the logarithm of the marginal likelihood. The integral can be computed analytically since both the prior and the likelihood in Equation 4.18 are Gaussian:

log p(y | x, θ) = −(1/2) y^T K^{-1} y − (1/2) log |K| − (M/2) log 2π    (4.19)

where K is shorthand for K(x, x) + σ_M^2 I. The first part of the sum in Equation 4.19 reflects how well the data fits the model. The second part penalizes model complexity and is only dependent on the covariance function and its input. The last part is a normalisation term. By equating the partial derivatives of the log marginal likelihood with respect to the hyperparameters to zero, we can solve for the optimal hyperparameters of a given covariance function and training set.
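The predictive equations 4.14 and 4.15 together with the squared exponential kernel of Equation 4.17 and the log marginal likelihood of Equation 4.19 amount to a few lines of linear algebra. The following numpy sketch is meant purely as an illustration of these formulas; the hyperparameters sigma_l, length and sigma_M are assumed to be fixed here rather than optimised, and in practice a Cholesky factorisation would replace the explicit inverse for numerical stability.

import numpy as np

def se_kernel(A, B, sigma_l=1.0, length=1.0):
    """Squared exponential covariance (Equation 4.17) between two point sets."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return sigma_l**2 * np.exp(-0.5 * sq / length**2)

def gp_predict(X_train, y_train, X_test, sigma_M=0.1, **kern):
    """Predictive mean and per-point variance (Equations 4.14 and 4.15)."""
    K = se_kernel(X_train, X_train, **kern) + sigma_M**2 * np.eye(len(X_train))
    K_s = se_kernel(X_train, X_test, **kern)       # k(x', x) for all test points
    K_ss = se_kernel(X_test, X_test, **kern)
    K_inv = np.linalg.inv(K)
    mean = K_s.T @ K_inv @ y_train                 # Equation 4.14
    cov = K_ss - K_s.T @ K_inv @ K_s               # Equation 4.15
    return mean, np.diag(cov)

def log_marginal_likelihood(X, y, sigma_M=0.1, **kern):
    """Equation 4.19, used for choosing the hyperparameters."""
    K = se_kernel(X, X, **kern) + sigma_M**2 * np.eye(len(X))
    sign, logdet = np.linalg.slogdet(K)
    return (-0.5 * y @ np.linalg.solve(K, y)
            - 0.5 * logdet
            - 0.5 * len(y) * np.log(2 * np.pi))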

4.1.4.2 Gaussian Process Classification

Predicting the binary state of a cell in an occupancy grid is a classification rather than a regression problem. Our training set consists of the coordinates (w, v)_r of all the cells c_r ∈ P_k as x and their observed state s_r as y. This state is binary and can be either occupied or empty. Our goal is to compute p(s_j = occ | {z}_t), the probability that the state s_j of a cell c_j ∈ P_u is occupied. Note that this cell has not been observed yet.
The basic idea behind Gaussian Process classification is very simple. Let x′ = c_j and y′ = s_j; then we can compute p(y′ = occ | {z}_t) by squashing the predictive distribution g(x′) through the cumulative Gaussian function:

p(y′ = occ | {z}_t) = 1/2 · (1 + erf(g(x′)/√2)).    (4.20)

However, the distribution g(x′) cannot be computed analytically as in the case of regression where the posterior p(w | x, y) is Gaussian. For classification, where the target values are either +1 (occupied) or −1 (empty), the likelihood p(y_i | x_i, w) of a data point given the model is a sigmoid function

p(y_i | x_i, w) = σ(y_i x_i^T w).    (4.21)

The likelihood p(y | x, w) of the whole training data will therefore be non-Gaussian. Even if we use the same prior on the weights w, the posterior as in Equation 4.12 will be non-Gaussian. Therefore, calculating the predictive distribution g(x′) as in Equation 4.13 requires evaluating an integral over a non-Gaussian posterior. Different approximations for this problem have been proposed. We follow the Laplace approach in which the posterior p(w | y, x) is approximated as a Gaussian. An example of this prediction given a 2D map of a partially explored scene is shown in Figure 4.2. In Section 4.1.6, we will show quantitatively on synthetic data that a GP predicted map is a reasonable estimate of the ground truth.
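As a small illustration of Equation 4.20, independent of how the latent predictive distribution is obtained, the squashing through the cumulative Gaussian can be written as follows (the function name is chosen here for illustration):

import math

def occupancy_probability(g_value):
    """Equation 4.20: map a latent GP value to p(occupied) via the
    cumulative Gaussian (probit) function."""
    return 0.5 * (1.0 + math.erf(g_value / math.sqrt(2.0)))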

4.1.5 Action Selection for Exploration

Given a partial map of the environment, we want to efficiently explore the remaining unknown parts with haptic sensors, i.e., we want to minimize the number of measurement actions needed to reach a sufficient scene understanding. We will present two algorithms for planning a measurement path in the given map. First, we will use Spanning Tree Covering [83]. Second, we propose an active learning scheme based on the predicted map.

(a) Ground truth 2D occupancy grid. (b) Occupancy grid after camera measurements taken from the left side. (c) Probability of occupancy for each cell predicted by GP. (d) Predictive mean and variance along the 50th row of the occupancy grid with training points labeled.
Figure 4.2: Example for the prediction of a 2D map from camera measurements using GPs.

4.1.5.1 Spanning Tree Covering

STC tackles the covering problem that can be formulated as follows. Given the haptic sensor of size d and a planar work-area P_u, the sensor has to be moved along a path such that every point in P_u is covered by it only once.
STC first defines a graph G(V, E) on P_u with cells of size 2d, the double tool size. In our case, where G(V, E) has uniform edge weights, Prim's algorithm [175] can be used to construct a Minimum Spanning Tree (MST) that covers every vertex in V at minimum cost regarding the edges E. The haptic measurement path is defined on the original grid with cells of size d such that the MST is circumnavigated in counterclockwise direction. This circular path starts and ends at the current arm position. In case an obstacle is detected along the path, a new spanning tree has to be computed based on the updated grid. An example of such a path is shown in Figures 4.3a and 4.3b.
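A sketch of the core step, growing a spanning tree over the coarse 2d-cells with Prim's algorithm, is given below. Since all edge weights are uniform here, any spanning tree is minimal; the circumnavigation of the tree and the re-planning upon contact are omitted. The function names and the 4-connected neighborhood are assumptions made for illustration.

import heapq

def neighbors(cell):
    """4-connected neighborhood of a grid cell given as (row, col)."""
    r, c = cell
    return [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]

def prim_spanning_tree(free_cells):
    """Build a (minimum) spanning tree over a set of 4-connected grid cells.
    Returns a dict mapping each reached cell to its parent in the tree."""
    cells = set(free_cells)
    start = next(iter(cells))
    parent = {start: None}
    # All edges have weight 1; the heap mirrors the general form of Prim's algorithm.
    frontier = [(1, start, nbr) for nbr in neighbors(start) if nbr in cells]
    heapq.heapify(frontier)
    while frontier:
        w, u, v = heapq.heappop(frontier)
        if v in parent:
            continue
        parent[v] = u
        for nbr in neighbors(v):
            if nbr in cells and nbr not in parent:
                heapq.heappush(frontier, (1, v, nbr))
    return parent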

4.1.5.2 Active Learning

Our goal is to estimate the scene structure early in the whole exploration process without exhaustive observation. Thus, we want to support the map prediction by selecting the most informative observations. Let us consider a set of measurements made along an MST as described above. These measurements will tend to be very close to each other without leaving any unobserved holes in the map. The GP prediction of the map based on these measurements will not be significantly different from the prediction based on only half of them. By using a GP and thereby exploiting spatial correlation in a map, the probability for a cell to be occupied can be inferred from its neighbors without explicitly observing it. We will present exploration strategies that follow an active-learning paradigm of selecting new measurement points that maximize the expected information gain.

(a) Initial Prim STC path. (b) Updated scene and weighted STC path after 250 measurements. (c) First measurement path along PRM. (d) Updated scene and measurement path after 250 measurements.

Figure 4.3: Examples for potential measurement paths generated with different exploration strategies. Red stars label current and previously traversed arm positions, respectively.

As has been shown by Chaloner and Verdinelli [49], this is equivalent to minimizing the predictive variance Σ from Equation 4.15.

c* = arg max_{c_i ∈ P_u} U_1(c_i)    (4.22)

with U_1(c_i) = Σ_i    (4.23)

It can occur that the hand detects an obstacle along the chosen measurement path. It then has to re-plan and might never reach the initially selected optimal observation point. Instead of considering the predictive variance only, the expected gain in information along the whole measurement path has to be taken into consideration. Different from U_1, the following utility functions take this path into account in addition to the predictive variance. The first one is

U_2(c_i) = α Σ_i − (1 − α) d(c_s, c_i)    (4.24)

where Σ_i is the predictive variance of the cell c_i and d(c_s, c_i) is any distance function between the current position of the arm c_s and cell c_i. The parameter 0 < α < 1 is user determined. The closer it is to 1, the more important the predictive variance becomes. The second utility function uses a discount factor δ.

U_3(c_i) = Σ_{r=1}^{R} δ^r Σ_{p(r)}    (4.25)

where R is the number of measurements that are needed to reach the final cell c_i along the path p = [c_{s+1} ... c_i]. Not just the final cell c_i is considered. Instead, the predictive variance of all the cells along the path contributes to the utility value of

        STC                                       PRM
OG      Occupancy grid explored through           Occupancy grid explored through
        Spanning Tree Covering.                   Probabilistic Roadmap approach.
GP      Gaussian Process-based map explored       Gaussian Process-based map explored
        through Spanning Tree Covering.           through Probabilistic Roadmap approach.

Table 4.1: Summary of scene models and exploration strategies evaluated in this section.

c_i. The parameter δ has to be chosen by the user. It steers how steeply the influence of the cells in the path decreases with their distance from the current arm position. To find the global maximum of Equation 4.22, we have to maximize over all cells c_i in P_u and over all the paths through which a cell can be reached from c_s. Since this is a prohibitive number of possibilities to compute, we use sampling techniques to find a local maximum of the utility functions. We build a probabilistic road map (PRM) in the two-dimensional C-space of the occupancy grid [111]. A set of cells T from P_u is sampled according to their predictive variance and connected to the PRM. The Dijkstra algorithm is used to compute the shortest path from the current arm position to each cell in T through the PRM. The result is used to compute the utility of each cell in T. An example of the PRM and therefore the possible paths to traverse is shown in Figures 4.3c and 4.3d.
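The three utility functions are straightforward to compute once the GP predictive variances and a candidate path through the PRM are available. The Python sketch below is illustrative only: sigma is assumed to be a mapping from cells to predictive variances, dist an arbitrary distance function, and the values of alpha and delta user choices as described above.

def u1(goal, sigma):
    """Equation 4.23: predictive variance of the goal cell."""
    return sigma[goal]

def u2(goal, current, sigma, dist, alpha=0.7):
    """Equation 4.24: variance traded off against travel distance (0 < alpha < 1)."""
    return alpha * sigma[goal] - (1.0 - alpha) * dist(current, goal)

def u3(path, sigma, delta=0.9):
    """Equation 4.25: discounted sum of variances along the path to the goal.
    path: list of cells [c_{s+1}, ..., c_i] excluding the current arm position."""
    return sum(delta**r * sigma[cell] for r, cell in enumerate(path, start=1))

def best_path(candidate_paths, sigma, delta=0.9):
    """Pick the candidate path (and hence goal cell) with maximum discounted utility."""
    return max(candidate_paths, key=lambda path: u3(path, sigma, delta))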

4.1.6 Experiments

In this section, we evaluate the proposed scene models and exploration strategies. We are specifically interested in how accurate the scene estimates are when using an occupancy grid map or a Gaussian Process based prediction. Furthermore, we want to compare the different exploration strategies in terms of how fast an accurate scene estimate is achieved. The different combinations of scene representations and exploration methods are summarized in Table 4.1.

4.1.6.1 Synthetic Data Set

Data Set and Measure of Comparison

We generated 100 different 2D occupancy grids (70 × 70 cells) of which an example appears in Figure 4.2a. Every scene contains ten objects that can either be of circular, elliptical or rectangular shape with a random size, aspect ratio, orientation and position. Overlapping is allowed so that fewer than ten connected components as well as more complex contours can occur. For every scene, we simulated three camera observations made from a fixed position on their left side with a random direction. As a sensor model, we used a beam model with a Gaussian profile [73]. Given a measurement, the occupancy grid gets updated according to Equation 4.2. An example of the result of this simulation is shown in Figure 4.2b.
We posed the validation problem of occupancy grid estimation as a binary classification into empty or occupied cells. For each estimated grid at any time in the exploration process, the number of false and true positives can be computed for different thresholds, resulting in an ROC curve. For evaluating the development of this curve over time, we choose the area under the ROC curve (AUC) as a measure. It corresponds to the probability that the state of a cell is correctly classified.
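The AUC can be computed without explicitly sweeping thresholds by using its rank-statistic interpretation, i.e., the probability that a randomly drawn occupied cell receives a higher occupancy estimate than a randomly drawn empty cell. A small numpy sketch of this computation, independent of the evaluation code used for these experiments, is given below (ties between estimates are ignored).

import numpy as np

def area_under_roc(p_occ, truth):
    """AUC for binary occupancy classification.
    p_occ: estimated occupancy probabilities of all cells (1D array).
    truth: ground-truth labels, 1 for occupied and 0 for empty."""
    p_occ, truth = np.asarray(p_occ, float), np.asarray(truth, int)
    order = np.argsort(p_occ)
    ranks = np.empty(len(p_occ))
    ranks[order] = np.arange(1, len(p_occ) + 1)      # 1-based ranks
    n_pos, n_neg = truth.sum(), (1 - truth).sum()
    # Mann-Whitney U statistic of the positive class, normalised to [0, 1].
    u = ranks[truth == 1].sum() - n_pos * (n_pos + 1) / 2.0
    return u / (n_pos * n_neg)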

Covariance Functions Compared

O'Callaghan et al. [162] claim that the neural network covariance function is especially suitable for predicting the non-stationary behavior of a typical map data set. That data is recorded from indoor and outdoor environments either with hallways, rooms, walls or streets bounded by buildings. However, evaluations on our experimental data showed superior performance when using the squared exponential covariance function. It poses a better prior for scenes in which blob-like objects are spread on a table.
We predicted each of the 100 occupancy grids with a GP by sampling different numbers of training points from the space observed by the camera and then querying p(s_i = occ | {z}_t) for c_i ∈ P_u. We compared seven different covariance functions with each other. Among these are the already discussed neural network covariance function (NN), the squared exponential function with and without automatic relevance detection (SE or SEARD), the Matérn covariance function with different parameters ν (Mat) and the rational quadratic (RQ). A more detailed explanation of these kernels can be found in Appendix A. While the SE covariance function is determined by only one length scale, the others expose more flexibility. For example, SEARD can be parameterized in terms of its hyperparameters and adapt the length scale of each input dimension according to its detected relevancy. An RQ kernel can be seen as a scale mixture of SE kernels of different length scales. And a Mat kernel is characterized by varying degrees of smoothness.
Figure 4.4 lists the classification accuracy given different kernels and number of samples from the free space. Since the occupied space is significantly smaller than the free space, we keep the number of samples from there fixed to a maximum of 200 points. Points are sampled according to their confidence.
SEARD outperforms the other kernels in almost all cases. For visualisation, the corresponding ROC curves of all kernels for 400 samples are shown in Figure 4.5a. They show that the neural network kernel is clearly outperformed by all the other kernels. The difference between the remaining kernels is not large. Furthermore, Figure 4.5b shows that the number of samples from the free space does not have a significant influence on the classification performance.

Prediction vs Occupancy Grid

An important question is whether the GP makes a valid prediction of the scene map and how this compares to the traditional occupancy grid. In an OG, only those estimates for p(s_i = occ | {z}_t) are different

Samples   Mat1     Mat3     Mat5     NN       RQ       SEARD    SE
200       0.9119   0.9167   0.9176   0.9041   0.9158   0.9191   0.9192
400       0.9152   0.9207   0.9220   0.9035   0.9198   0.9226   0.9218
600       0.9156   0.9207   0.9223   0.9023   0.9171   0.9242   0.9234
800       0.9162   0.9230   0.9243   0.9082   0.9206   0.9265   0.9256

Figure 4.4: Evaluation of classification accuracy for different kernels and number of samples. Mat: Matérn class. NN: Neural Network. RQ: Rational Quadratic. SEARD: Squared Exponential with Automatic Relevance Detection. SE: Squared Exponential.

(a) ROC curves for different covariance functions and sample size 400. (b) ROC curves for the SEARD covariance function and different sample sizes. (Axes: false positives vs. true positives.)
Figure 4.5: Visualisation of classification accuracy through ROC curves for different covariance functions and sample sizes. While the choice of covariance function can make a significant difference in classification accuracy, the number of samples does not.

from the initial value of 0.5 for which at least one measurement has been obtained. However, the GP predicts the occupancy probability for all cells. To confirm that this inference is valid, we calculated the mean AUC for all occupancy grids after they had been observed by the camera and compared it to the mean AUC of the GP predicted maps. While for the unpredicted occupancy grid this value is 0.765205, it is 0.90705 for the predicted maps; a clear increase of 18%. Therefore, we conclude in agreement with O'Callaghan et al. [162] that a GP prediction provides a valid inference about regions of the scene in which no measurements are available.

Variance    Scene explored through PRM approach with utility function based on variance of goal point (Equation 4.22).
Trade-Off   Scene explored through PRM approach with utility function based on variance of goal point traded off with distance to travel there (Equation 4.24).
Discount    Scene explored with a PRM approach using a utility function that includes variance of all positions along the path to the goal (Equation 4.25).

Table 4.2: Summary of the three PRM-based exploration strategies differing in utility function. These are evaluated on the scene estimate modelled as OG or GP.

Exploration Strategies Compared

We compare the different exploration strategies based on the mean and variance of the AUC measure over time and all synthetic scenes. We start from the scenes partially explored with the camera. As a kernel, we use SEARD.

Utility Functions Compared

Three functions were proposed in Section 4.1.5.2 that incorporate uncertainty in the map prediction and/or the distance to traverse. They are summarized in Table 4.2. Figure 4.6 shows the results for the OGs and for the GP predicted maps. There is a clear difference between the utility functions. The discounted version (Equation 4.25) performs best both in terms of mean and variance in the OGs and GP predicted maps. This is due to the explicit consideration of the whole exploration path instead of just a high predictive variance goal point that might never be reached. However, when compared to our previous results in [3] that were derived on a smaller dataset, the difference between the utility functions, especially for the GP maps, is not as pronounced. In [3], we used the simple squared exponential covariance function that has a fixed length scale. Figure 4.7 shows the development of the map accuracy when using this kernel instead of SEARD. The results confirm those in [3] by also showing a more significant difference between the three utility functions. The reason for the superior performance of the simple kernel over SEARD lies in how we learn the hyperparameters. They are learnt only once in the beginning, by maximizing the log marginal likelihood given the GP prior and the training data as described in Section 4.1.4.1. During the exploration process, only the Gram matrix K is updated with new data. Since the SEARD kernel is more flexible, the point estimate of its hyperparameters leads to an overfitting to the initial map data. This effect is less pronounced when using the simpler SE kernel.
A solution to this problem would be to iteratively learn also the hyperparameters in an online GP approach or to assume an additional prior over the hyperparameters. Instead of computing a maximum likelihood point estimate through Equation 4.19, one could then compute a maximum a posteriori estimate based on a distribution over the hyperparameters.

(Plots: area under ROC curve over measurement steps /50; legend: Variance Mean/STD, Trade Off Mean/STD, Discount Mean/STD.)
Figure 4.6: Mean and standard deviation of area under ROC curve (AUC) using different utility-based exploration strategies and the SEARD kernel. Left) Classification performance averaged over 100 occupancy grids (OG) and shown over time. Right) Classification performance averaged over 100 Gaussian Process (GP) based predicted grids. Discounted predictive variance outperforms pure variance and trading off as utility values both in the OG and GP case.

STC versus PRM

We also compared STC based exploration with the one based on PRM. Measurement paths are planned such that each c_i ∈ P_u is traversed only once. Figure 4.8 directly compares the results for the occupancy grids and for the GP predicted maps. The estimation of the occupancy grid converges relatively fast towards ground truth when traversed using STC. Furthermore, Figure 4.8 shows a comparison of the development of the predictive variance during haptic exploration. Intuitively, this is a measure of uncertainty over the map estimate. We can see that on average the PRM-based approach decreases uncertainty about the scene structure faster.

Summary

Figure 4.8 contains the results for the discounted utility function for direct comparison with the STC-based exploration. GP predicted maps traversed based on the discounted utility are more accurate early in the exploration process. This is an expected result since here points of high variance are chosen to be measured first. They will therefore have a high positive influence on the quality of the prediction. However, the occupancy grid does not converge as fast towards ground truth as in the STC-based exploration. This is because the PRM-based exploration might traverse a number of cells more than once. We conclude that if a good map estimate is needed quickly, active learning based exploration is advantageous over systematically traversing the space. If there is time for an exhaustive exploration, STC-based measurement paths are more beneficial.

(Plots: area under ROC curve over measurement steps /50; legend: Variance Mean/STD, Trade Off Mean/STD, Discount Mean/STD.)
Figure 4.7: Mean and standard deviation of area under ROC curve (AUC) using different utility-based exploration strategies and the SE kernel. Left) Classification performance averaged over 100 occupancy grids (OG) and shown over time. Right) Classification performance averaged over 100 Gaussian Process (GP) based predicted grids. Discounted predictive variance outperforms pure variance and trading off as utility values both in the OG and GP case.

4.1.6.2 Demonstration in the Real World

In this section, we demonstrate our approach in the scenario of exploring a table top populated with several unknown objects, using both point clouds reconstructed from a stereo camera and haptic data from a robotic hand. As exploration strategies, we consider STC and a PRM based scheme with the discounted utility function defined in Equation 4.25. The real world scene contains three objects. For visual exploration, we manually select the initial fixation point for two of them. This could be replaced by using an attention system. The objects are segmented and stereo reconstructed as described in Section 2.2.1. The resulting point cloud is projected onto an occupancy grid aligned with the table. By excluding the third object from visual observation, we can demonstrate the map update upon finger contact. Examples of the occupancy grid are given in Figures 4.9b and 4.10b. In this stereo based grid, we can now detect locations that were not observed with the vision system. The reachable unexplored spaces are explored with the haptic sensors on the hand. For doing so, we are using one of the three fingers

(Plots: left, area under ROC curve; right, sum of predictive variance (×10^6); both over measurement steps /50. Legend: GP/OG with STC/PRM, mean and STD.)
Figure 4.8: Comparison between the Spanning Tree Covering (STC) based exploration method and the discounted utility based method. Left) Mean and variance of area under ROC curve (AUC) for occupancy grids (OG) as well as Gaussian Process (GP) based prediction. Discounted utility based exploration achieves a more accurate GP prediction early in the process and keeps this accuracy over time as visible in the low standard deviation. STC explores the unknown space in the OG faster. Right) Mean and variance of sum of predictive variance. Discounted utility-based exploration decreases the overall uncertainty about the scene structure faster.

pointing downwards as shown in Figure 4.9c. The hand is moved at a constant height over the table. The haptic sensor arrays on the finger are always pointing in the direction of movement. Planning is done in a slice of the robot task space that is aligned with the table top. This is a reasonable simplification since all the points on the table within a radius of 780 mm are reachable with a valid joint configuration. For more complex environments, planning has to be done in the six-dimensional C-space of the arm. To this end, other stochastic planning methods besides PRM have been proposed, as for example Rapidly-exploring Random Trees (RRTs) [125]. Adapting the proposed exploration strategies accordingly is considered future work. The fingers of the Schunk hand are approximately 30 × 30 mm thick. Therefore, one cell in the planning grid has to have the same dimensions to allow for a rotation around the axis of the finger without entering a neighboring cell. Compared to the stereo based grid with a resolution of approximately 5 × 5 mm, this is quite coarse. However, this also reduces the planning space. For every measurement, both grids are updated in parallel: one cell in the planning grid, a set of 6 × 6 cells in the stereo grid.

(a) Initial STC based plan. (b) Initial OG from stereo. (c) Initial Scene.

(d) STC based plan after 73 steps. (e) Stereo OG after 73 steps. (f) Scene after 73 steps.
Figure 4.9: Snapshots from an exploration using an STC based plan (covers only reachable workspace). Left Column) Coarse planning grid for the finger. Middle Column) Fine grid for stereo data.

In Figure 4.9, the STC based measurement path planned on the initial occupancy grid is shown as well as the updated grid after 73 measurement steps. As expected from the results on the synthetic data, the area close to the starting position of the hand at the top left corner is explored systematically without leaving holes. In Figure 4.10, the first PRM based measurement path is shown as well as the updated grids after 68 measurements. The area close to the start position is not yet fully explored, but the next measurement path leads towards the lower left of the grid, which has a high uncertainty. As opposed to the synthetic experiments, in the real world objects can move upon contact with the hand. This situation can be observed when comparing Figures 4.9c and 4.9f or 4.10d and 4.10h. To avoid the map becoming inconsistent, one could employ visual tracking.

(a) Initial PRM based plan. (b) Initial OG from stereo. (c) Initial prediction of OG. (d) Initial scene.

(e) PRM based plan after 68 steps. (f) Stereo OG after 68 steps. (g) Prediction after 68 steps. (h) Scene after 68 steps.
Figure 4.10: Snapshots from an exploration using a PRM based plan (covers only reachable workspace). Left Column) Coarse planning grid for the finger. Middle Left Column) Fine grid for stereo data. Middle Right Column) GP predicted grid using the fine scale.

4.2 Height Map for Grasping and Manipulation

In the previous section, we have proposed a method for multi-modal scene exploration. Initial object hypotheses formed by active visual segmentation are confirmed and augmented through haptic exploration. The current belief about the state of the map is updated with measurements, and yet unknown parts of the map are predicted with a Gaussian Process. Through the integration of different sensor modalities, a more complete scene model can be acquired by the robot. We showed that the prediction of the scene structure leads to a valid scene representation even if the map is not fully traversed. Furthermore, different exploration strategies were proposed and evaluated quantitatively on synthetic data. Finally, we showed the feasibility of our scene representation and exploration strategies in a real world scenario.
The proposed method has been studied using the classical occupancy grid as a scene representation. Although this representation is useful for tasks like collision avoidance or detection of free space for placing objects, it does not provide accurate enough information for grasp and manipulation planning.
In this section, we will re-use the previously gained insights and apply them to the problem of representing a table-top scene as a height map. This 2.5D representation is a good compromise between a traditional occupancy grid and a full 3D reconstruction. It provides a simplified 3D model and is therefore better suited for manipulation and grasp planning. In [85], the authors showed that a 2.5D representation provides a good surface model and is more efficient than a full 3D reconstruction at the expense of some details.
As opposed to the previous approach in which scene understanding was posed as a classification problem (free or occupied space), here we treat it as a regression problem. As an exploration strategy, we adopt the PRM method using a discounted utility function, which we showed to work best for rapidly producing a valid scene prediction. As a covariance function, we use the simple squared exponential kernel that provides the most suitable prior in our scenario. Additionally, we will study the performance of different inference methods. While in the previous section we have used a sampling approach, here we will also evaluate approximate inference methods. As an additional virtual sensing modality, we will consider object recognition and pose estimation.

4.2.1 Scene Representation

The height map, which is aligned with the table top, uniformly subdivides the scene into N cells c_i with coordinates (w_i, v_i). Each cell has a state x_i that specifies its height h_i. We model x_i as a normally distributed random variable with mean x̂_i and covariance P_i. Our goal is to obtain an estimate x̂_i of the real state x_i given a sensor measurement z_t at the point t in time. We use the Kalman Filter as an estimator. Since we are assuming a static environment, the process model that predicts how the state of a cell changes over time becomes trivial:

x_{i,t+1} = x_{i,t}    (4.26)

where t and t + 1 indicate subsequent points in time. We are considering three different sensor modalities s_r, all providing a height measurement directly. We assume that the measurement noise ǫ_r ∼ N(0, R_r) of each of them is additive and Gaussian with zero mean and a specific covariance R_r. These models are explained in further detail in Section 4.2.2. The measurement model that relates the true state x_i of a cell c_i to a noisy measurement z_{(i,r)} is therefore

z_{(i,r)} = x_i + ǫ_r.    (4.27)

We define X as the set of all cell states x_i, P as the set of the associated covariances P_i, and X_u as the subset of states corresponding to cells that have not been observed yet.

Algorithm 2: Pseudo Code for Scene Exploration

Data: Point cloud S segmented into a set of object hypotheses o_s
Result: Fully explored X
begin
    t = 0, j = 0
    [X, P] = generateHeightMap(S)
    foreach o_s ∈ S do
        id = recognize(o_s)
        if id ≠ 0 then
            δ = estimatePose(o_s, id)
            [X, P] = update(X, P, id, δ)
        end
    end
    while |X_u| > 0 do
        X̂ = predict(X)
        p_j = planNextMeasurements(X, X̂)
        repeat
            t++
            z_t = observe(X, p_j)
            [X, P] = update(X, P, z_t)
        until z_t ≠ occ
        j++
    end
end

Our approach to scene understanding is summarized in Algorithm 2. As a first step, a 3D point cloud S from the scene is gathered. The function generateHeightMap uses this as input to build an initial scene model consisting of X and P. The next step is to recognize object hypotheses generated through an active segmentation approach and estimate their pose. If this is successful, the initial X and P are updated. The remaining unexplored space X_u is predicted and then explored haptically with an active learning scheme. In the following, we will introduce the three different sensor modalities in more detail and describe the exploration planning and map update steps.

4.2.2 Sensor Modalities

In this section, we present three different sensing modalities. They provide height measurements of positions in the scene. The sensor models including measurement noise are formalized.

4.2.2.1 Stereo Vision

For acquiring an initial point cloud from the scene, we use the active vision system presented in Section 2.2. An example point cloud can be seen in Figure 4.11a. For creating a height map from stereo measurements, we are applying a similar approach as presented in Gallup et al. [85].

From Point Cloud to Height Map

The basis of the height map computation is formed by a probabilistic 3D occupancy grid of the point cloud. We use the technique proposed in Wurm et al. [233] that computes an octree for an efficient decomposition of the space. The method employs the occupancy update technique proposed by Moravec and Elfes [157]. Given a uniform prior (p(v) = 0.5) over the whole grid, all the stereo observations are incorporated to determine the probability of each voxel v to be occupied. In detail, the value of the vote φ_v for each voxel is computed as follows:

φ_v = −1 if l(v) < l_free,    φ_v = +1 if l(v) > l_occ

where l(v) returns the current log occupancy value of voxel v. As thresholds, we use log occupancy values of l_occ = 0.85 and l_free = −0.4, corresponding to probabilities of 0.7 and 0.4 for occupied and free volumes as proposed by Wurm et al. [233]. Given these votes, our goal is to compute the height of all cells c_i on the table. Let {v_j}_{N_i} refer to the set of voxels that have the same (w, v) coordinates as c_i but whose voxel centers are positioned at different heights h_j. They can also be seen as a column on cell c_i. The height value h_{(i,1)} of this cell is then chosen as the height h_j that minimises the following cost:

c*_{(i,1)} = min_{h_j} ( Σ_{v ∈ V_{>h_j}} φ_v − Σ_{v ∈ V_{<h_j}} φ_v )    (4.28)

The set V_{>h_j} contains all those voxels whose height is larger than h_j. Similarly, V_{<h_j} contains all those voxels whose height is smaller than h_j. This minimisation yields the height h_{(i,1)} of the cell together with the associated cost c*_{(i,1)}, which is normalised by the number of voxels N_i in the column:

c̄_{(i,1)} = c*_{(i,1)} / N_i + 1    (4.30)
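Reading Equations 4.28 and 4.30 together, each voxel column yields both a height and a normalised cost. The following sketch shows this per-column computation; it assumes the votes φ_v have already been computed as described above, and the exact form of the height selection is a reconstruction that may differ in detail from the original implementation.

def column_height_and_cost(heights, votes):
    """Select the height of one map cell from its voxel column.
    heights: voxel centre heights h_j of the column, in ascending order.
    votes:   corresponding votes phi_v in {-1, +1} (free / occupied)."""
    best_cost, best_height = None, None
    for j, h in enumerate(heights):
        above = sum(votes[j + 1:])       # voxels with height > h_j
        below = sum(votes[:j])           # voxels with height < h_j
        cost = above - below             # cost term of Equation 4.28
        if best_cost is None or cost < best_cost:
            best_cost, best_height = cost, h
    normalized_cost = best_cost / len(heights) + 1.0   # Equation 4.30
    return best_height, normalized_cost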

Measurement Noise Model

As discussed above, each measurement modality r has an individual noise model with ǫ_r ∼ N(0, R_r). The measurement covariance R_1 for stereo vision and a specific cell c_i is computed as follows. The normalized cost value c̄_{(i,1)} of choosing h_{(i,1)} will be high for columns that contain many noisy 3D points. For the corresponding cells on the table, we want this uncertainty about their true height value to be reflected in the measurement covariance.

(a) Point cloud of a scene (b) Estimated height map of the scene.

(c) Height map from above with color scale in meters visualizing h_{(i,1)}. (d) Cost map of the scene visualizing c*_{(i,1)}.
Figure 4.11: From a 3D point cloud to a height map.

Therefore, we parameterize R_1 with c̄_{(i,1)} as follows

R_1(c̄_{(i,1)}) = c̄_{(i,1)} · σ_1^2    (4.31)

where σ_1 is a user-selected fixed noise variance. This parameterization has the effect that height values determined with a higher cost also have a higher measurement noise variance.

4.2.2.2 Recognition and Pose Estimation of Known Objects

The second measurement modality recognizes objects and estimates their pose. This gives us information beyond the directly visible parts of the scene, like object height estimates in occluded parts. Different from the other two sensing modalities, object recognition and pose estimation can be considered as a virtual sensor.

Figure 4.12: Wide-field camera view and segmentation results of objects in the scene.

In Persson et al. [169], this is defined as a collection of several physical sensors dedicated to a specific recognition task of real world concepts. The weights of the different data streams can be learned, for example, by AdaBoost as in [169] or by soft cue integration using an SVM as proposed by Björkman and Eklundh [33]. For simplicity, here we empirically determined values that result in robust object recognition. For each visited view point, the object in the center of the image is segmented from its background using the vision system described in Section 2.2.1. Figure 4.12 shows an example. The identity of the foveated object is sought in a database of 106 known objects. Two complementary cues are used for recognition: scale invariant features (SIFT) [139] and color co-occurrence histograms (CCH) [88]. The cues are complementary in the sense that SIFT features rely on objects being textured, while CCH works best for objects of a few, but distinct, colors. An earlier version of this system, including a study on the benefits of segmentation for recognition, can be found in Björkman and Eklundh [33]. If an object is identified and it is known to be either rectangular or cylindrical, the pose is estimated using the cloud of extracted 3D points, with object dimensions given by a lookup in the database. However, if object identification fails, i.e., if the object is unknown or the segmented region does not correspond to a real physical object, the recognition modality is ignored.
The 3D point cloud is initially projected down to the 2D supporting surface, where fitting of points to object models is done using random sampling (RANSAC) [80]. For cylinders, pairs of sample points are drawn from the point cloud. For each such pair, two hypothetical 2D circles are found with the radius provided by the object identity. These hypotheses are then tested against the full set of points. The one with the minimum median distance from each point to the corresponding circle is selected as an initial estimate. The selected hypothesis is then refined as follows. The full point cloud is moved so as to be aligned with a distance map based on an ideal circle of the known radius, using sums of distances as fitting criteria. For rectangular objects, the fitting procedure is similar. Pairs of points are drawn and six hypotheses are given for each pair, assuming that they fit either side of the known object. Unlike for cylinders, in addition to the position also the orientation

Figure 4.13: Results of the pose estimation approach applied to the segmented point clouds shown in Figure 4.12. Left) Pose estimation results. Right) Error probability of the pose estimation.

needs to be estimated. The most likely orientation is defined as the one that leads to the smallest rectangle enclosing at least 90% of the 2D points in the projected point set. The position of the rectangle is refined with a distance map, similarly to the circle case. The combination of random sampling, least-median fitting, distance maps and enclosing rectangles is chosen to ensure maximum robustness in the presence of a large number of outliers. Once the position and orientation of an object is estimated, a probabilistic height map is computed. This is done by anisotropic blurring of an ideal height map. The amount of blur is assumed to be proportional to the errors of the individual 3D points, which depend on their depth relative to the camera and a given amount of image noise. Blur is also introduced to account for errors in each object parameter. This is approximated by the shape of the distance-map fitness function when each parameter is varied. Example results of the pose estimation and error modeling are shown in Figures 4.13 and 4.14.
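To make the cylinder case concrete, the following sketch implements the hypothesize-and-test loop with a known radius and a least-median-of-squares criterion. The subsequent distance-map refinement and all thresholds are omitted, and the function names are our own.

```python
import numpy as np

def fit_circle_known_radius(points_2d, radius, n_iterations=200, rng=None):
    """RANSAC-style fit of a circle of known radius to projected 2D points.
    Returns the center of the hypothesis with the minimum median residual."""
    rng = np.random.default_rng() if rng is None else rng
    best_center, best_score = None, np.inf
    for _ in range(n_iterations):
        p, q = points_2d[rng.choice(len(points_2d), size=2, replace=False)]
        d = np.linalg.norm(q - p)
        if d == 0 or d > 2 * radius:
            continue                      # no circle of this radius passes through both points
        mid = (p + q) / 2.0
        # Distance from the chord midpoint to the two possible centers.
        h = np.sqrt(radius ** 2 - (d / 2.0) ** 2)
        normal = np.array([-(q - p)[1], (q - p)[0]]) / d
        for center in (mid + h * normal, mid - h * normal):
            residuals = np.abs(np.linalg.norm(points_2d - center, axis=1) - radius)
            score = np.median(residuals)  # least-median-of-squares criterion
            if score < best_score:
                best_center, best_score = center, score
    return best_center, best_score
```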

From Pose to Height Map Given a probability map that contains for each grid cell c_i the probability p(c_i == o_s) that it is occupied by the recognized object o_s, we generate a height map as follows. For each cell with p(c_i == o_s) > 0.5, we set the corresponding height measurement h(i,2) to the known height of the recognized object o_s.

Measurement Noise Model The less likely a cell is to be occupied by the object, the higher the uncertainty of its corresponding height value should be. We compute the measurement noise variance R_2 by parameterizing it with p(c_i == o_s) as follows

c_{(i,2)} = p(c_i == o_s) \cdot (1 - p(c_i == o_s))    (4.32)
R_2(c_{(i,2)}) = \sqrt{c_{(i,2)}} \, \sigma_2^2    (4.33)

where \sigma_2 is again a user-chosen fixed noise variance.
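A minimal sketch of Equations 4.32 and 4.33, turning the occupancy probability map of a recognized object into a height measurement and its noise variance (function and parameter names are illustrative):

```python
import numpy as np

def recognition_height_map(p_occupied, object_height, sigma2=0.01):
    """Convert a per-cell occupancy probability map for a recognized object
    into a height measurement map h2 and its noise variance R2
    (Equations 4.32 and 4.33)."""
    h2 = np.where(p_occupied > 0.5, object_height, 0.0)
    c2 = p_occupied * (1.0 - p_occupied)   # peaks at p = 0.5, i.e. maximal ambiguity
    R2 = np.sqrt(c2) * sigma2 ** 2         # uncertain cells get a larger noise variance
    return h2, c2, R2
```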

Figure 4.14: Error of the pose estimation shown in the context of the point cloud.

4.2.2.3 Haptic Measurements As the third sensor modality, we consider haptic measurements that can be actively acquired by moving the robot hand through the scene. These observations are complementary to stereo reconstruction and object recognition. They can provide the robot with height measurements of visually occluded positions in the scene occupied by unknown objects. In this section, we will evaluate haptic exploration only in simulation. We therefore assume that we can directly measure the height h(i,3) of a cell ci. The measurement noise R3 is assumed to be constant for all haptic measurements. For height measurements with the real system, haptic exploration primitives have to be developed. These could be based on the method presented before in Section 4.1.2.

4.2.3 Scene Update As mentioned previously, we model the state of each cell, i.e. its height, as a random variable with Gaussian distribution. For estimating its mean and covariance, we use the Kalman filter. Following the process and measurement model from Equations 4.26 and 4.27, we update the state estimate \hat{x}_i and covariance P_i for cell c_i at time t + 1 as follows:

\hat{x}^-_{i,t+1} = \hat{x}_{i,t}
P^-_{i,t+1} = P_{i,t}
K_{i,t+1} = \frac{P^-_{i,t+1}}{P^-_{i,t+1} + R_r}    (4.34)
\hat{x}_{i,t+1} = \hat{x}^-_{i,t+1} + K_{i,t+1} \, (z_{(i,t+1,r)} - \hat{x}^-_{i,t+1})
P_{i,t+1} = (I - K_{i,t+1}) \, P^-_{i,t+1}

given an observation z_{(i,t+1,r)} made with modality r.
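Since the process model is static and each cell is treated independently, the update in Equation 4.34 reduces to a scalar Kalman filter per cell. A minimal sketch:

```python
def kalman_cell_update(x_hat, P, z, R_r):
    """One Kalman filter update of a single height-map cell (Equation 4.34).
    The process model is static, so prediction leaves mean and covariance unchanged."""
    x_pred, P_pred = x_hat, P            # x^-_{t+1} = x_t,  P^-_{t+1} = P_t
    K = P_pred / (P_pred + R_r)          # Kalman gain for measurement noise R_r
    x_new = x_pred + K * (z - x_pred)    # fuse the new observation z
    P_new = (1.0 - K) * P_pred           # scalar case: I - K reduces to 1 - K
    return x_new, P_new
```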

4.2.4 Scene Prediction In traditional exploration techniques, no assumptions are made about places in space that have not been observed yet. In accordance with related work [162, 85], we have shown in Section 4.1 that we can make use of spatial correlation to predict the structure of an environment based on already measured parts of it. Spatial correlation in our case means that cells in the grid which are close to occupied space but, due to occlusions, not directly observable are likely to be part of the occluding object. By modeling this spatial correlation between physically close locations in the environment, we can predict unobserved places from observed ones. Then, instead of exploring the whole environment exhaustively, we can choose to confirm the predicted map at particularly uncertain places. In Section 4.1, we have shown that the spatial correlation in a 2D occupancy grid can be modeled with a Gaussian Process. The assumption of independence between neighboring cells in a traditional occupancy grid is then removed. In this section, we are using a height map instead of an occupancy grid. We will show that also in this representation spatial correlation can be similarly modeled with a GP. Compared to the previous section, scene prediction is conducted through Gaussian Process regression instead of classification of cells into occupied or empty. Given a set of input values x and the corresponding target values y, the goal of GP regression is to estimate the latent function g(·) that generated these samples. In our case, x is constituted by the set of observed cells c_k ∈ P_k. The target values y are the corresponding noisy height estimates h_k. Together they form the training set {x_i, y_i}_M with cardinality M. As has already been detailed in Section 4.1.4.1, the relation between the input and target values assuming Gaussian additive noise can be modelled as

g(x_i) = \phi(x_i)^T w, \qquad y_i = g(x_i) + \epsilon

where w is a vector of weight parameters of the linear model and \phi(x_i) is a function that maps the D-dimensional input data into an N-dimensional feature space.

From this formulation, we can derive the posterior likelihood of the weight parameters given the training data. Primarily, we are interested in predicting the state y' = h_j of an unobserved cell c_j ∈ P_u, referred to as x' in the following. In Section 4.1.4.1, we have shown that this prediction g(x') ∼ N(µ, Σ) follows a Gaussian distribution with

\mu = k(x', x)^T \, [K(x, x) + \sigma_M^2 I]^{-1} \, y
\Sigma = k(x', x') - k(x', x)^T \, [K(x, x) + \sigma_M^2 I]^{-1} \, k(x', x).

The Gram matrix K(x, x) is computed based on a covariance function k(·, ·). In Section 4.1, we showed that the squared exponential covariance function is a good prior for our scenario. Therefore, we are using it for the height map representation as well. An example for this prediction given a 2D map of a partially explored scene is given in Figure. In Section 4.2.6, we will show quantitatively on synthetic data that a GP predicted map is a reasonable estimate of the ground truth.
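For reference, a compact sketch of these predictive equations with a squared exponential covariance function; the hyperparameter values are placeholders and would have to be chosen or learned for the actual height map data.

```python
import numpy as np

def squared_exponential(A, B, length_scale=5.0, signal_var=1.0):
    """Squared exponential covariance between two sets of 2D cell coordinates."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return signal_var * np.exp(-0.5 * d2 / length_scale ** 2)

def gp_predict(X_train, y_train, X_test, noise_var=1e-2):
    """Exact GP regression: predictive mean and variance of the height
    at unobserved cells X_test given observed cells X_train with heights y_train."""
    K = squared_exponential(X_train, X_train) + noise_var * np.eye(len(X_train))
    K_star = squared_exponential(X_test, X_train)          # k(x', x)
    alpha = np.linalg.solve(K, y_train)
    mean = K_star @ alpha
    v = np.linalg.solve(K, K_star.T)
    var = squared_exponential(X_test, X_test).diagonal() - np.einsum('ij,ji->i', K_star, v)
    return mean, var
```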

Inference Methods To predict the height value of an unobserved cell x' and its uncertainty, we need to evaluate the mean and variance of g(x'). In particular, the computation of the variance is expensive since it involves inverting the covariance matrix K(x, x). If the set P_k of previously observed grid cells is of cardinality M, then K is of dimension M × M. Hence, inverting K is of complexity O(M^3). Ideally, we would like the robot to be able to represent, predict and explore a scene of its current radius of action in real-time. Therefore, we not only need an efficient scene representation that scales well to different resolutions, but also an efficient inference method for prediction. In this section, we will evaluate and compare three different inference methods.

Confidence-based Sampling In Section 4.1, we used a sampling approach in which data points are sampled according to their confidence. In this section, we will use the uncertainty map P containing the covariance values P_i of all observed grid cells c_i in P_k. Cells with a smaller covariance, i.e., a smaller uncertainty about their height estimate, are more likely to be part of the training set x. This set will contain a fixed number of samples from free and from occupied space. The number of free samples will be higher than the number of occupied samples, thereby reflecting the structure of the scene.

Uniform Sampling In addition to the confidence-based sampling, we will also use uniform sampling to assemble the training set x. The same number of free and occupied samples are drawn as in the confidence-based case.

Subset of Regressors The aforementioned sampling methods only consider the selected subset of the observed grid cells for constructing the Gram matrix K. A different approximation to the full GP regression problem is proposed in [181], which we will briefly summarize here. It also relies on a subset of regressors (SR). However, it takes the correlation of this subset with the remaining data in the training set into account. It has been shown that the mean predictor of g(x') can be obtained by formulating it as a finite-dimensional generalized linear regression model:

g(x') = \sum_{i=1}^{M} \alpha_i \, k(x', x_i) \quad \text{with} \quad \alpha \sim \mathcal{N}(0, K^{-1}).    (4.35)

If we consider only a subset \{x_j, y_j\}_N of regressors of cardinality N, then Equation 4.35 becomes

g_{SR}(x') = \sum_{j=1}^{N} \alpha_j \, k(x', x_j) \quad \text{with} \quad \alpha \sim \mathcal{N}(0, K_{NN}^{-1}).    (4.36)

Given the subset of regressors x_{SR}, it follows that g_{SR}(x') \sim \mathcal{N}(\mu_{SR}, \Sigma_{SR}) with

\mu_{SR} = k(x', x_{SR})^T \, [K_{NM} K_{MN} + \sigma_M^2 K_{NN}]^{-1} \, K_{NM} \, y    (4.37)
\Sigma_{SR} = \sigma_M^2 \, k(x', x_{SR})^T \, [K_{NM} K_{MN} + \sigma_M^2 K_{NN}]^{-1} \, k(x', x_{SR}).    (4.38)

To keep the notation uncluttered, K_{NM} refers to K(x_{SR}, x), K_{MN} to K(x, x_{SR}) and equivalently K_{NN} to K(x_{SR}, x_{SR}). The matrix computations in Equations 4.37 and 4.38 are then of complexity O(N^2 M). Once they are carried out, evaluating the mean for a test cell takes time O(N) and the variance O(N^2). Note that under the SR model, the covariance function of two data points x_i and x_j becomes k(x_i, x_j) = k(x_i, x_{SR})^T K_{NN}^{-1} k(x_j, x_{SR}) and is therefore only dependent on the correlation with the so-called active set. This leaves the question of how this set is chosen. Many different methods have been proposed in the literature. Here, we will use the result of the confidence-based sampling as the subset of regressors.
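A sketch of the SR predictor of Equations 4.37 and 4.38, assuming a kernel function such as the squared exponential from the previous sketch; the weight vector only has to be computed once, after which each test cell costs O(N) for the mean.

```python
import numpy as np

def sr_predict(X_train, y_train, X_active, X_test, kernel, noise_var=1e-2):
    """Subset-of-regressors GP prediction (Equations 4.37 and 4.38).
    X_active is the active set of N regressors drawn from the M training cells."""
    K_NM = kernel(X_active, X_train)                     # N x M
    K_NN = kernel(X_active, X_active)                    # N x N
    A = K_NM @ K_NM.T + noise_var * K_NN                 # K_NM K_MN + sigma^2 K_NN
    k_star = kernel(X_test, X_active)                    # k(x', x_SR), T x N
    w = np.linalg.solve(A, K_NM @ y_train)               # weights of the linear model
    mean = k_star @ w
    var = noise_var * np.einsum('ij,ji->i', k_star, np.linalg.solve(A, k_star.T))
    return mean, var
```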

4.2.5 Active Exploration Given a partial map of the environment, we want to efficiently explore the remaining unknown parts with haptic sensors. Efficiently here means that we want to minimize the number of measurement actions needed to reach a sufficient understanding of the scene. In Section 4.1, we found that an active learning scheme based on PRM path planning in combination with a discounted utility function performs best when the goal is to reach a valid scene prediction early on in the exploration process. This scheme supports the Gaussian Process based map prediction by selecting the most informative observations, i.e., measurements that maximise the expected gain in Shannon information.

4.2.6 Experiments We evaluate the proposed approach quantitatively on synthetic data. We are specifically interested in analyzing how well a Gaussian Process can predict a height map from partial observations compared to different baselines. In that context, we will also study the three inference methods we proposed. Furthermore, we will explore how much the different sensor modalities improve the prediction. In addition to stereo vision, we consider object recognition in combination with pose estimation and haptic exploration.

4.2.6.1 Data Set We generated 50 different synthetic environments (1000 × 1000 mm) of the kind shown in Figure 4.15g. Every scene contains ten objects that can be either of cylindrical or box shape and are spread on a table. Their intrinsic attributes like height, aspect ratio or radius as well as their extrinsic attributes like position and orientation were chosen randomly within pre-determined ranges. Overlapping of objects was allowed, so that fewer than ten connected components as well as more complex shapes can occur. For every scene, we simulated three camera observations such that in each of them one randomly chosen object is in fixation. This is close to our real active vision system, in which stereo reconstruction is performed once an object is properly fixated. For each observation, we reconstruct a point cloud using the ray-tracer YafaRay [234]. We model the noise from the stereo system by assuming zero-mean Gaussian noise of a specific covariance in the image. This results in a point cloud with stronger noise in locations that are farther away from the camera. Examples of the camera observations as well as the point clouds can be seen in Figures 4.15a to 4.15f. The height maps corresponding to the observations were computed as described in Section 4.2.2.1. For obtaining an initial height map estimate X as well as a corresponding uncertainty map P, the update Equations 4.34 with the corresponding noise covariance from Equation 4.31 are applied. We assumed that each of the three fixated objects got segmented and recognized so that its pose can be estimated from the segmented point cloud as described in Section 4.2.2.2. Examples of the recognition results on the synthetic data can be seen in Figures 4.15h to 4.15m. This results in three additional height and cost maps that can be used to update the current X and P. The same update equations as for the height maps from stereo measurements are applied, with the noise covariance corresponding to the recognition sensor modality. The difference between the state estimate and corresponding covariance with and without the recognition results can be observed in Figure 4.16. The height map that includes additional recognition results is more complete. The covariance of this map shows more certainty about height estimates at the center of recognized objects and more uncertainty at their boundaries.
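For illustration, a sketch of how such a synthetic ground-truth height map can be generated; the exact scene size and attribute ranges used in the thesis are not reproduced here, and box orientations are omitted for brevity.

```python
import numpy as np

def random_scene(grid=250, n_objects=10, rng=None):
    """Generate a ground-truth height map with randomly placed boxes and cylinders;
    ranges for size, height and position are illustrative."""
    rng = np.random.default_rng() if rng is None else rng
    gt = np.zeros((grid, grid))
    yy, xx = np.mgrid[0:grid, 0:grid]
    for _ in range(n_objects):
        height = rng.uniform(20, 120)                  # object height in mm
        cx, cy = rng.uniform(0.1 * grid, 0.9 * grid, size=2)
        if rng.random() < 0.5:                         # cylinder
            r = rng.uniform(0.02 * grid, 0.08 * grid)
            mask = (xx - cx) ** 2 + (yy - cy) ** 2 <= r ** 2
        else:                                          # axis-aligned box (orientation omitted)
            sx, sy = rng.uniform(0.03 * grid, 0.1 * grid, size=2)
            mask = (np.abs(xx - cx) <= sx) & (np.abs(yy - cy) <= sy)
        gt[mask] = np.maximum(gt[mask], height)        # overlapping objects keep the taller one
    return gt
```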

4.2.6.2 Measure of Comparison We considered the problem of height map estimation as a regression problem. As a measure of divergence between ground truth and the estimated scene, we use the

Figure 4.15: Example scene from the synthetic dataset. (a)-(c) Camera observations. (d)-(f) Corresponding point clouds. (g) Ground truth height map. (h), (j), (l) Object model fitted to the segmented point cloud. (i), (k), (m) Probability for each grid cell to belong to the recognized object.

Figure 4.16: (a) State estimate X and (c) covariance P without including results from recognition and pose estimation. (b) State estimate X and (d) covariance P after including results from recognition and pose estimation. Unexplored areas are marked in blue. In X, the darker a pixel, the larger its estimated height. In P, the darker a pixel, the more uncertain its height estimate.

standardized mean squared error (SMSE) that is defined as follows

e = \frac{1}{\sigma_y^2} \, \mathbb{E}\left[(y - g(x))^2\right]    (4.39)

where \sigma_y^2 is the variance of the target values of all test cases and the remainder is the expected value of the squared residual between the target value y and the mean prediction g(x) for each cell c ∈ X. As a baseline for the height map prediction we use three methods. The first is nearest-neighbor interpolation, in which each grid cell that has not been observed yet is assigned the value of its nearest observed data point. Second, we apply natural-neighbor interpolation as proposed in [75] and implemented in Sakov [199]. This method is based on a Voronoi tessellation of the space and is claimed to provide a smoother interpolation than the simpler nearest-neighbor interpolation. We only use interpolation values within the convex hull of the dataset. And third, we use the trivial model, which predicts each unknown data point using the mean height value of the whole test data. Figure 4.17 shows the resulting predicted height maps of the reference scene from Figure 4.15g. Different from the proposed approach based on a Gaussian Process, the baseline methods do not provide information about the expected quality of the prediction.
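The following sketch shows the SMSE of Equation 4.39 together with the nearest-neighbor and trivial baselines, using SciPy's nearest-neighbor interpolator as a stand-in for the implementations referenced above.

```python
import numpy as np
from scipy.interpolate import NearestNDInterpolator

def smse(y_true, y_pred):
    """Standardized mean squared error (Equation 4.39)."""
    return float(np.mean((y_true - y_pred) ** 2) / np.var(y_true))

def nearest_neighbor_prediction(obs_coords, obs_heights, query_coords):
    """Assign every queried cell the height of its nearest observed cell."""
    return NearestNDInterpolator(obs_coords, obs_heights)(query_coords)

def trivial_prediction(obs_heights, n_cells):
    """Predict the mean observed height for every cell."""
    return np.full(n_cells, obs_heights.mean())
```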

4.2.6.3 Predicting Height Maps with a GP In this section, we will evaluate the accuracy of the predicted height map given different sensor modalities and using three different inference methods. The performance will be compared to the proposed baselines.

Figure 4.17: Example results of the different methods for predicting a height map. (a) Predictive mean of a GP as estimate of the complete height map. (b) Height map predicted with natural-neighbor interpolation. (c) Height map predicted with nearest-neighbor interpolation. Unpredicted areas are marked in blue. Only stereo-vision based map data is used as input.

Inference Methods Compared In Section 4.2.4, we described three different approximate inference methods. While the first two form the Gram matrix only on a subset that is sampled in different ways (uniform or confidence-based), the third one also takes the correlation between the subset and the rest of the data into account. In the following, we will refer to them as Confidence, Uniform or Approximate. In Figure 4.18a, we compare the standardized mean squared error (SMSE) between the observed regions of all 50 synthetic environments and the ground truth for the three different inference methods. As a reference, we plot the SMSE between ground truth and the observations. This shows how high an SMSE is induced by the noise in the observations alone. Ideally, we would like the GP to predict the scene structure with a similar or even lower SMSE. Furthermore, we plot the SMSE between the predictive mean of the GP and all observations in P_k. This visualises how well the GP prediction complies with the observations. In Figure 4.18a, we see that sampling the training points according to their confidence leads to similar results as uniform sampling. For both of these, the observations are closer to the ground truth than the GP based prediction. Applying the approximate method leads to improved performance in predicting the ground truth. We can also observe a clear difference in the SMSE between prediction and observations for all three inference methods. For the approximate method, the mean of this SMSE is less than half of that achieved by the sampling based methods. It also has a significantly lower variance over the 50 environments. The reason for this is that the approximate method takes all the data in P_k into account, while the sampling based approaches use just a subset of this data to compute an exact solution. Furthermore, when comparing Figure 4.18b with Figure 4.18a we can observe

Figure 4.18: Standardized Mean Squared Error (SMSE) in P_k of the scene between (i) observation and ground truth (reflects the overall noise in the system), (ii) prediction and observation and (iii) prediction and ground truth, for the Confidence, Uniform and Approximate inference methods. Top row: 250 × 250 cells; bottom row: 500 × 500 cells. Left column: without recognition; right column: with recognition. Sampling the training points according to their confidence leads to similar results as uniform sampling: the actual observations are closer to the ground truth than the GP based prediction. Applying the approximate method to obtain the GP predictor improves the prediction of the ground truth. Furthermore, there is a clear overall improvement in mean and standard deviation when predicting ground truth including recognition information over the case when the recognition modality is not available. Increasing the resolution of the environment yields no significant improvement in prediction accuracy.

a clear overall improvement in both mean and standard deviation when predicting ground truth including recognition information over the case when the recognition modality is not available. In Figure 4.18, we also compare the prediction performance when using a scene resolution of 250 × 250 cells (top row) to a resolution of 500 × 500 cells (bottom row). We did not find a significant difference between the two.

Prediction Methods Compared Furthermore, we are interested in the performance of the GP in predicting the whole scene, i.e. all cells in P. For this purpose, we compared it to three baselines: nearest-neighbor interpolation, natural-neighbor

Figure 4.19: Standardized Mean Squared Error (SMSE) in P between ground truth and (i) nearest-neighbor interpolation, (ii) natural-neighbor interpolation, (iii) trivial prediction and (iv) Gaussian Process prediction, which always performs best. Top row: 250 × 250 cells; bottom row: 500 × 500 cells. Left column: without recognition; right column: with recognition. As in Figure 4.18, the two sampling methods for selecting training points (Confidence and Uniform) perform worse in predicting the ground truth than the Approximate method for obtaining a GP predictor. Confidence-based sampling performs slightly better than uniform sampling.

interpolation and trivial prediction. Figure 4.19 shows that the GP based approach outperforms the three baselines. It also confirms the result that the approximate method achieves a higher accuracy than the sampling approaches. The trivial prediction performs surprisingly well, which indicates that the area of the scenes covered by objects is relatively small compared to free space. It delivers a significantly better prediction of the scene in the approximate case, because the mean object height is then estimated on all observations and not just on a sampled subset. As opposed to Figure 4.18, there is no clear improvement in the prediction of the ground truth when including information from the recognition modality. This is due to the smoothing characteristic of the GP. Although it has more data points to model the objects, it has none for the space surrounding their backsides. At these places, a false smoothing of the boundary occurs.

Figure 4.20: The development of the standardized mean squared error (SMSE) over exploration time (measurement steps), averaged over 50 different synthetic environments, for nearest-neighbor interpolation, natural-neighbor interpolation, the trivial model and the GP prediction (mean and standard deviation of each).

Prediction Accuracy over Time The previous evaluations have been generated on the scene observations from stereo vision or recognition and pose estimation. To increase the accuracy of scene prediction, we want to enhance the current scene model with haptic data. We adopt the best exploration strategy from Section 4.1: a PRM based method using a discounted utility function. In Figure 4.20, we evaluate the development of the SMSE during this exploration process for all three prediction methods. To reduce the exploration time, we used the 50 example scenes with a resolution of 100 × 100 cells. Here, no recognition information is taken into account. Please note that the exploration actions are chosen based on the predictive variance of the GP. In general, the GP predictor produces a result that is much closer to the ground truth than the trivial model or the predictors based on nearest-neighbor and natural-neighbor interpolation. The error for the trivial model stays, as expected, nearly constant at an SMSE of 1 with a decreasing variance. The error achieved through interpolation increases quickly in the beginning, but then stagnates at a high error level. Natural-neighbor interpolation performs better than nearest-neighbor interpolation. In Figure 4.21, we compare the SMSE between ground truth and GP prediction when taking recognition information into account or not. There is no significant difference between the two error means. When using recognition information, the error shows a slightly larger variance and higher mean over the 50 environments during the beginning of the exploration process. Again, this is due to the smoothing characteristic of the GP in the absence of data suggesting otherwise.

Figure 4.21: Comparison of the development of the SMSE during exploration time, averaged over 50 different synthetic environments, with and without including recognition information. There is no significant gain in including recognition information.

4.2.6.4 Summary In this section, we have shown that a scene represented by a height map can be accurately predicted by a Gaussian Process. It consistently outperforms other prediction methods such as nearest-neighbor and natural-neighbor interpolation or simply taking the mean of all observations. We also found that approximate inference, in which all the observations are taken into account, is preferable over the sampling methods. Additional recognition information does not increase the accuracy of map prediction in the proposed GP framework. To better model a scene containing a mixture of known and unknown objects, clustering methods might be interesting to study. In that case, object hypotheses have to be taken into account explicitly. Clustering should be treated in a Bayesian framework to provide exploration with a measure of uncertainty.

4.3 Discussion

In this chapter, we proposed two approaches to scene representation for grasping and manipulation. One is based on an occupancy grid and the other on a height map. Their state x consists of a set of cells indexed by their coordinates x_i = (u_i, v_i)^T on the table. In the case of the occupancy grid, each cell has a value indicating its probability of being occupied. In the case of the height map, the cell value corresponds to the height of the object that is occupying this cell. We have also demonstrated how multiple modalities such as stereo vision, object recognition and pose estimation as well as haptic sensor data can be integrated into either of these representations. Each representation has different utilities in a grasping and manipulation scenario. The occupancy grid map outlines free and occupied space and is therefore suitable for deciding, for example, where to place objects that the robot has picked up. A height map on the other hand carries additional information about the object height and therefore facilitates grasp planning. We have shown that a Gaussian Process can be used as an implementation of the function g(·) for accurately predicting the value of unobserved cells in the scene. The remaining challenge is to make this approach capable of real-time processing. The map of the scene should be predicted and updated rapidly during interaction. In this chapter, we have proposed to use knowledge about the table plane in the scene to reduce the dimensionality of the representation from 3D to 2D. Furthermore, we experimented with approximate inference methods. However, these approximate techniques still do not scale well enough to allow for real-time computation in large environments. Furthermore, a table plane may not be detectable in every scene. An approach inspired by the ideas of Brooks [46, 47] and by the field of Active Vision [22, 20] would be to restrict the proposed high-resolution prediction to the task-relevant parts of the scene. Through this limitation, the map could be computed more often. Changes in the currently interesting part of the scene could be taken into account more rapidly. To illustrate this idea, consider the following scenario. The goal is to pick up an object from a table-top scene. The active head fixates on this object and stereo-reconstructs this part of the scene. The object is segmented from the background. However, the scene is only partially known due to occlusions. Using the proposed approach of height map prediction through a Gaussian Process, we can predict the whole scene in the current view. This information can then be used to plan a safe grasp. In this case, this means that the grasp is chosen such that the probability of the hand colliding with any other object on the table is minimal. Furthermore, during the grasp, the environment might change due to moving objects. By having restricted the map to the region in the current viewpoint, a real-time update of the map that takes these changes into account becomes possible. This would then allow for online adaptation of the grasp parameters. Once the object is picked up, we might want to put it down at another free spot in the environment. Given an approximate position in space where this object should be put down, the head can start exploring the environment from there until a suitable place is found. Instead of using a fixed size of the scene as in this chapter, it could start from an initial size and extend it on demand.
This application of the scene prediction is in the spirit of active vision and behavioral robotics in that perception is done rapidly and only on demand in the context of a specific behavior.

5

Enhanced Scene Understanding through Human-Robot Dialog

A full segmentation of all objects in a scene is desirable for many tasks such as recognition [32], grasp planning [8] or learning models of previously unseen objects that are re-usable for later re-recognition. This is a different subproblem of scene understanding from what has been proposed in Chapter 4. In the previous chapter, geometric scene structure was predicted independently of correct object segmentation. A common problem of vision systems employed for object segmentation is that they might incorrectly group objects together or split them into several parts. This is commonly referred to as under- or over-segmentation. The two vision systems presented in Section 2.2 suffer from this limitation as well. The sensitivity to this problem depends on how close objects are to each other, how similar they are in appearance and on whether the correct number of objects in the scene is known in advance. Interactive segmentation methods that aim at refining the robot's current scene model have recently gained a lot of attention. One approach addressing this problem is to let the robot interact with the scene, e.g., through pushing movements, and then use motion cues for verifying or refining object hypotheses [114, 151, 110, 28]. These papers focus on the question of what knowledge can be gained from observing the outcomes of robotic actions. The majority of them leaves open the question of how to select the action that provides the greatest amount of information. In this chapter, we propose to put a "human in the loop" to achieve the objective of correct scene segmentation. This has the advantage that the robot becomes capable of learning new labels or symbols online and of grounding them in the current sensory data. We move towards this goal by combining state-of-the-art computer vision with a natural dialog system, allowing a human to rapidly refine the model of a complex 3D scene. Through that approach, we harness the strengths of 'bottom-up' automated processing together with the 'top-down' decision-making power of a human operator. The resulting refined scene model forms the basis for a general symbol grounding problem [96]. Human input is not a new concept in object segmentation and scene understanding. Several previous systems have used it to guide automated processing or restrict the search space of algorithms, both improving efficiency and decreasing

false positives [109, 227, 115]. GrabCut [188] and Lazy Snapping [136] are the most closely related works on interactive segmentation. They require the user to operate a mouse, either for selecting a region containing the object in the image or for drawing its rough outline. In this chapter, we aim for human-robot interaction using only natural dialog and no additional tools. Thereby, we are moving closer to how humans naturally interact with one another in such situations. Several unique challenges exist when designing a framework for human-robot interaction using computer vision. One of them is to minimize the iteration time of the vision component of such a system. This processing must be rapid enough so as not to induce a large lag in the system, which would increase the mental load on the user [212]. As most state-of-the-art vision algorithms for scene understanding run in tens of minutes or hours per scene [215, 143, 134], they would be unacceptable for any "human-in-the-loop" system. Here, we propose to use the multi-object segmentation algorithm presented in Section 2.2.2, which is capable of providing acceptable responsiveness for the proposed system. Another design challenge is to minimize the necessary interaction between the autonomous system and the human operator. This means maximizing the value of each interaction to obtain the greatest discriminatory information. A very common approach to this kind of problem is to make a decision that maximises the expected gain in information or, equivalently, minimises the uncertainty in the system [109][3]. In this chapter, we propose to use the concept of entropy to characterize the quality of any object hypothesis. This helps in guiding the dialog system to resolve the most challenging and ambiguous segments first.

5.1 System Overview

In the following, we will give a brief system overview. Figure 5.1 visualises the different modules and their interconnections. As mentioned earlier, we use the vision system proposed in Section 2.2.2 to acquire a point cloud of the scene. An initial segmentation is performed that groups close points with similar color traits [5]. The dialog system allows a human operator to provide responses to the robot's questions in a natural manner. The robot can be interrupted and corrected in mid-utterance, allowing the operator to avoid the latencies associated with traditional dialog call-and-response conversation patterns. The main contributions of this chapter lie in the scene analysis module that bridges the vision and dialog modules. It fulfills two tasks: (i) determine the areas of the scene that are the poorest object hypotheses and seek human arbitration, and (ii) translate the human input into a re-seeding of the segmentation.

5.2 Dialog System

To facilitate interaction in the proposed system, a dialog system was implemented as shown in Figure 5.1. It is based on Jindigo – a Java-based open source dialog

Figure 5.1: Overview of the system. Left) Karlsruhe Head. Middle) Vision system that actively explores the scene, reconstructs and segments a point cloud (PCD), as also visualized in Figure 2.5. Right) Architecture of the Jindigo dialog system with its Recognizer, Interpreter, Contextualizer, Action Manager and Vocalizer components. Vision and dialog system communicate through a scene analysis module.

framework by Skantze [208]. This system overcomes a typical limitation of other dialog systems: strict turn-taking between user and system. The minimal unit of processing in such systems is the utterance, which is processed in whole by each module of the system before it is handed on to the next. Contrary to this, Jindigo processes the dialog incrementally, word by word, allowing the system to respond more quickly and to give and receive feedback in the middle of utterances [202]. All components of the system are outlined on the right side of Figure 5.1. Here, we will only briefly describe the parts that are relevant to this thesis. For more details on Jindigo, we refer to Skantze [208] and [6]. When the user says something, the utterance is recognized, interpreted in the context of the whole dialog history and sent to the Action Manager (AM). The task of the AM is to decide which step to take next. To make these decisions, the AM communicates directly with the Scene Analysis (SA) component. If the AM or SA decides that the robot should say something, the semantic representation of this utterance is sent to the Vocalizer. In Section 5.3.1, this will be clarified with the help of an example dialog.

5.3 Refining the Scene Model through Human-Robot Interaction

From the visual scene segmentation as described in Section 2.2.2, we obtain an initial bottom-up analysis of the scene resulting in several object hypotheses. The quality of the initial segmentation can vary due to occlusions, objects that are very close to each other or when the position and number of seed points does

Figure 5.2: Graph of the dialog loop showing the messages being passed between the action manager (AM) and the scene analysis (SA) module: RequestHold, AssertScene/ConfirmScene and RequestResegment/AssertResegment, the latter carrying either the number of objects (objID, nobjects) or a spatial relation (objID, relation), depending on whether an object hypothesis or a relation is ambiguous.

not correspond to the actual objects. The demonstrated system was configured to handle the most common configurations and given a corresponding vocabulary for verbal disambiguation. However, we will show later how our system can easily be extended to deal with more complex cases. In the following sections, we will show how user utterances are interpreted and how natural language is generated, using the example dialog shown in Table 5.1. This dialog is an instantiation of the general dialog loop shown in Figure 5.2. Furthermore, in Sections 5.3.2 and 5.3.3, we propose a method to (i) rank the object hypotheses in terms of their uncertainty and (ii) re-segment the incorrect hypotheses.

5.3.1 Contextual Interpretation and Generation of Natural Language In a natural dialog about the structure of a scene, objects may be referred to in different ways. One way is to use their properties (e.g., the red object or the leftmost object). To be able to ground these descriptions in the actual objects that are being talked about, the dialog system must have a representation of the robot's hypothesis of the set of objects. Therefore, the initial segmentation of the scene is sent from the SA component to the AM, including the properties of the objects and their IDs (see Table 5.1, Row 1). Another way of referring to objects in dialog is by anaphoric expressions. For example, a speaker may say it or the segment to refer to an object that has just been mentioned. A dialog system must be able to both understand and generate such expressions with the help of the dialog context. Another phenomenon where the dialog context needs to be taken into account is elliptical (or fragmentary) expressions. These are expressions that lack propositional content, such as the red one or to the left. The dialog system needs to resolve these into full propositions,

1  SA → AM: Scene(objects:[Object(id:32 color:Green ...) Object(id:47 color:Yellow ...)])
2  AM → Voc: CheckScene(Scene(nobjects:2))
3  Voc → User: "I think there are two objects"
4  User → Rec: "No there are three objects"
5  Int → Cont → AM → SA: AssertScene(Scene(nobjects:3))
6  SA → AM: RequestResegment(object:Object(id:32) nobjects:2)
7  AM → Voc: RequestResegment(object:Object(color:Green) nobjects:2)
8  Voc → User: "So, should I split the green segment?"
9  User → Rec: "No, the yellow one!"
10 Int → Cont: NegFeedback Object(color:Yellow)
11 Cont → AM: AssertResegment(object:Object(color:Yellow) nobjects:2)
12 AM → SA: AssertResegment(object:Object(id:47) nobjects:2)
13 SA → AM: RequestResegment(object:Object(id:47) relation:TopOf)
14 AM → Voc: RequestResegment(object:Object(ref:Anaphora) relation:TopOf)
15 Voc → User: "Are the objects on top of each other?"
16 User → Rec: "No, next to each other"
17 Int → Cont: NegFeedback NextTo
18 Cont → AM: AssertResegment(object:Object(ref:Anaphora) relation:NextTo)
19 AM → SA: AssertResegment(object:Object(id:47) relation:NextTo)
20 SA → AM → Voc: RequestHold
21 Voc → User: "Okay, just wait a moment"
22 SA → AM: Scene( ... )
23 AM → Voc: CheckScene
24 Voc → User: "Is this correct?"
25 User → Rec: "Yes"
26 Int → Cont: PosFeedback
27 Cont → AM → SA: ConfirmScene

Table 5.1: Example dialog between human and robot and the information flow in the system. The second column indicates the currently active system modules. The third column shows the messages being sent between the modules. SA = Scene Analysis. AM = Action Manager. Voc = Vocalizer. Rec = Recognizer. Cont = Contextualizer.

so-called ellipsis resolution (e.g., the red one should perhaps be understood as the red segment should be split into two objects given the dialog context). When the AM receives the initial scene, it starts to describe the scene, allowing the user to correct anything that is incorrect. It starts by stating the number of objects using the CheckScene action (Table 5.1, Row 2). This is transformed into text and speech by the Vocalizer (Row 3). Since the robot's hypothesis is incorrect, the user now corrects the robot. This utterance is unambiguous and is therefore sent as it is all the way to the SA (Row 5). Notice how the semantic representations in Rows 2 and 5 and the textual representations in Rows 3 and 4 reflect each other. This illustrates how large parts of the domain grammars may be used for both interpretation and generation of utterances. The SA requests help with refining the segment that is most likely to be incorrect (see the next section for how this choice is made). As can be seen in the example, the SA only refers to objects by their IDs. The AM then chooses the best way of referring to the object. It may for example be by color (as in Row 7: the green segment) or by using an anaphoric expression (as in Row 14: the objects), depending on the current dialog history. The user now replies with an elliptical (fragmentary) expression (Row 9). Note that this utterance cannot be fully understood unless the dialog context is taken into account. This is precisely what the Contextualizer does: it re-interprets the ellipsis into a full proposition (Row 11). Thus, the RequestResegment from the robot combined with the Object from the user is transformed into an AssertResegment with this Object as argument. The same principles are applied in the ellipsis resolution in Row 18 (but where the spatial relation is replaced instead of the object). As the scene is refined, the SA sends new scene hypotheses to the AM (Row 22). Finally, it will receive a ConfirmScene message from the AM (Row 27).

5.3.2 Identifying Poor Object Hypotheses through Entropy To bootstrap the process of scene model refinement, the robot should be able to identify the most ambiguous object hypotheses and query the user about these first. We propose an entropy-based system to rank the object hypotheses according to their uncertainty. In general, the entropy H[p(x)] of a distribution over a discrete random variable x with values \{x_j\}_{j=0}^{M} is defined as follows:

H[p(x)] = - \sum_{j=0}^{M} p(x_j) \ln p(x_j)    (5.1)

and equivalently for a continuous variable with the integral. Intuitively, this means that sharply peaked distributions will have a low entropy while rather flat distributions will have a relatively high entropy. The maximum entropy is reached with a uniform distribution and is computed as H_max[p(x)] = -\ln(1/M). Our observation is that single objects tend to be relatively homogeneous in an appropriate attribute space. Consider for example a correctly segmented single object and a segment in which two objects are contained, as in Figure 5.3. The hue histogram of the correct hypothesis will show a narrower distribution than that of the segment containing two objects. The same holds for the distribution of 3D points, which will vary more in the second segment. To test the hypothesis that entropy is a good indicator for the quality of object segmentation, we formalise it as a binary classification problem. We compute the entropy for each initial object hypothesis over a number of different attribute histograms. Let N be the set of points in a segment and let us define a set of normalized histograms as follows: 1. A 1D histogram p(c) with 30 bins over the color hue c.

2. A 3D histogram p_BB(v) over the voxelized bounding box BB. 3. A 3D histogram p(v) over a voxelized grid (10 mm side length) of fixed size, carrying the number of object points in each voxel v. The grid is aligned with the oriented object bounding box BB.

4. One normalized 1D histogram p_BBx(s), p_BBy(s), p_BBz(s) for each axis x, y and z of BB.

5. And similarly, one normalized 1D histogram p_x(s), p_y(s), p_z(s) for each axis x, y and z of the fixed-size grid, counting the number of 3D points falling into 10 mm wide slices s along the respective axis.

For computing the oriented bounding box, we project the segmented point cloud onto the table, as for example shown in Figure 4.13. The axes of the bounding box are aligned with the two eigenvectors of this projection, where e_a refers to the longer and e_b to the shorter one. The fixed-size grid that is aligned with this bounding box can for example be chosen dependent on the visible portion of the scene in the foveal cameras. Both 3D histograms, p(v) and p_BB(v), will be relatively sparse compared to the 1D histograms. While p(v) reflects the size of an object segment relative to what is visible in the foveal cameras, p_BB(v) is normalized to the dimensions of the oriented bounding box and instead measures the complexity of the 3D distribution of points. The one-dimensional histograms break this complexity down to the single axes of either the bounding box or the fixed-size grid. The entropy values of these histograms are normalized with their respective maximum entropy. We use these values to form a feature vector f_i containing for example the following values:

f_i = (H[p(c)], H[p(v)], H[p_BB(v)], H[p_x(s)], H[p_y(s)], H[p_z(s)], H[p_BBy(s)])    (5.2)

for each object hypothesis i. From a set of initially segmented example scenes, we can extract labeled data to train an SVM with an RBF kernel for classifying an object hypothesis as either being correct or not. The probability estimates provided by [50] are used to rank the object hypotheses according to their uncertainty.
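As an illustration, the following sketch computes a few of the entropy features of Equation 5.2 (the hue entropy and the per-axis slice entropies of the fixed-size grid). The oriented bounding box variants and the SVM-based ranking are omitted; bin sizes follow the values stated above.

```python
import numpy as np

def normalized_entropy(histogram):
    """Entropy of a histogram, normalized by the maximum entropy -ln(1/M)
    of a uniform distribution over its M bins (Equation 5.1)."""
    if len(histogram) < 2 or histogram.sum() == 0:
        return 0.0
    p = histogram / histogram.sum()
    p = p[p > 0]                              # 0 * ln 0 is treated as 0
    return float(-(p * np.log(p)).sum() / np.log(len(histogram)))

def hue_entropy(hues, n_bins=30):
    """Entropy of the 30-bin hue histogram of a segment (feature H[p(c)])."""
    hist, _ = np.histogram(hues, bins=n_bins, range=(0.0, 1.0))
    return normalized_entropy(hist)

def slice_entropies(points, slice_mm=10.0):
    """Entropies of 1D histograms of 3D points in 10 mm wide slices along each
    axis of the fixed-size grid (features H[p_x(s)], H[p_y(s)], H[p_z(s)])."""
    entropies = []
    for axis in range(3):
        coords = points[:, axis]
        n_bins = max(2, int(np.ceil((coords.max() - coords.min()) / slice_mm)))
        hist, _ = np.histogram(coords, bins=n_bins)
        entropies.append(normalized_entropy(hist))
    return entropies
```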

Figure 5.3: Visualization of the different stages of the scene understanding. Top: Point cloud from stereo matching with bounding boxes (BBs) around segmented objects and their hue histograms. The left segment, containing two objects, has a higher entropy in hue and 3D point distribution and is therefore considered more uncertain. Middle: Initial labeling of the segments. The left segment is re-seeded by splitting its BB in two parts based on human dialog input. Bottom: Re-segmented objects. Three objects are correctly detected.

5.3.3 Re-Segmentation Based on User Input

Once a possible candidate for refinement has been identified, the algorithm proceeds to query the user to determine two things: (i) Is the current segmentation correct? And (ii) if not, what is the relative relationship of the objects in the current incorrect segment? We limit the user's options for spatial relationships to two objects being either next to, in front of or on top of one another. We propose this limited set of configurations to simplify the interaction. Furthermore, the robustness of the Markov Random Field (MRF) segmentation algorithm eliminates the need for tight boundaries. Despite the simplicity of these three rules, quite challenging scenes can be resolved in practice. To begin the re-segmentation, the extents of the original segment are calculated to produce an object-aligned bounding box. This bounding box is then divided in half along one of the three axes based upon the human operator's input. Once the bounding box is divided, the initially segmented points are relabeled and new Gaussian Mixture Models for the regions are iteratively calculated as in [188]. A new graph is constructed based on the probability of membership to these new models. Energy minimization is performed for the new regions and repeated until convergence. In a similar manner, the rarer situation in which two object hypotheses have to be merged can be dealt with: the 3D points initially labeled with two different labels are re-labeled to carry just one.
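A minimal sketch of the re-seeding step: the segment's bounding box is halved along the axis implied by the user-provided relation and the points are relabeled accordingly. For simplicity, the split here is along world axes rather than the object-aligned bounding box used in the thesis, and the subsequent GMM and graph-cut refinement is omitted.

```python
import numpy as np

def split_relabel(points, labels, segment_id, relation, new_id):
    """Split an under-segmented object hypothesis in two by halving its bounding
    box along the axis given by the user's spatial relation ('next_to',
    'in_front_of' or 'on_top_of') and relabel the points on one side."""
    axis = {'next_to': 0, 'in_front_of': 1, 'on_top_of': 2}[relation]
    mask = labels == segment_id
    coords = points[mask][:, axis]
    midpoint = 0.5 * (coords.min() + coords.max())   # halve the bounding box along the axis
    new_labels = labels.copy()
    upper = mask.copy()
    upper[mask] = coords > midpoint
    new_labels[upper] = new_id                       # points beyond the split get the new label
    return new_labels
```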

5.4 Experiments

In this section, we will show that through the interaction of a robot with a human operator, the resulting segmentation is improved and that this interaction is made more efficient through the use of an entropy-based selection mechanism. There exist only a few databases containing RGB-D data, e.g. [129]. They are targeted at object recognition and pose estimation instead of segmentation and usually contain scenes with single objects only. Therefore, we created our own database by recording 19 scenes containing two to four objects in specifically challenging configurations, see Figure 5.4. Reconstructing the point cloud and obtaining the seeding for the initial segmentation is done with the active vision system described in Section 5.1. An example point cloud for the first scene is shown in Figure 5.3. In the recorded scenes, objects are hard to separate from each other due to similar appearance or close positioning. There is one incorrectly segmented pair of objects in each of the 19 scenes. Each scene is labeled with human ground truth, thereby separating each object perfectly.

Figure 5.4: Pictures of the 19 scenes in the data set on which the experiments are performed. Ordering from left to right and then top to bottom.

5.4.1 Identification of Incorrect Object Hypotheses

We proposed an entropy-based system to select the most uncertain object hypotheses to be confirmed by the user. For evaluation, we captured and labeled 303 examples: 128 incorrect (positive) and 175 correct (negative) hypotheses. The latter consist mainly of the data presented in Chapter 6: 16 different objects, each in 8 different orientations, as shown in Figures 6.5 and 6.6. Additionally, we include correct segments from scenes similar to those shown in Figure 5.4. Thereby we ensure that the positive examples in the dataset contain objects in a variety of different orientations or with parts missing due to occlusions. We were interested in i) which entropy features are important for distinguishing between correct and incorrect segments and ii) how well we can perform this classification task when using a combination of the best features in a feature vector. For both experiments, we randomly divided the data into a training and a test set. The training set contains 25 positive and 25 negative examples, the test set 253 examples. The complete set was randomly divided five times and the results were combined into an average receiver operating characteristic (ROC) curve. Figure 5.5 Left) shows a comparison of different single or combined entropy features in terms of area under the ROC curve (AUC). First of all, we can observe that the entropy features computed on the fixed-size grid provide enough information to achieve a classification performance of AUC > 0.8. This shows that the size of the segment relative to the size of the visible portion of the scene in the foveal cameras is a good indicator of segment quality. On the other hand, the entropy features computed on the tight oriented bounding box achieve a significantly lower AUC. Especially when looking at the single dimensions, the classification performance of a feature vector containing H[p_BBx(s)]

Figure 5.5: Entropy-based detection of under-segmentation. Left) Performance of single and subgroups of features in terms of area under the ROC curve. XYZ Scene and XYZ BB indicate the three-dimensional feature vectors consisting of H[p_x(s)], H[p_y(s)], H[p_z(s)] and H[p_BBx(s)], H[p_BBy(s)], H[p_BBz(s)], respectively. Right) Average ROC curve (AUC = 0.92) for the classifier using a combination of the best features as in Equation 5.2.

or H[p_BBz(s)] is essentially random. H[p_BBy(s)], however, reaches an AUC of 0.7. This is due to the way we compute the oriented bounding box, whose y-axis is always aligned with e_b, the shorter eigenvector of the projection of the point cloud. This means that the entropy values along the x and z axes for correct and incorrect segments will be very similar. However, for segments containing two objects standing in front of one another, the entropy value over the y axis will be significantly different from that of segments containing single objects. This case can for example be observed in Figure 5.3. Also the hue of an object hypothesis reaches an AUC of 0.7 for separating correct from incorrect object hypotheses. Figure 5.5 Right) shows the ROC curve for a classifier using a feature vector as in Equation 5.2. This is the combination of the best single features: (i) entropy over the color hue histogram H[p(c)], (ii) entropy over the point histogram of the bounding box H[p_BB(v)], (iii) entropy over the point histogram of the fixed-size grid H[p(v)], (iv) entropy over each of the one-dimensional histograms of 3D points in slices along the three axes H[p_x(s)], H[p_y(s)], H[p_z(s)] and (v) entropy over the histogram of 3D points in the slices along the y-dimension of the bounding box H[p_BBy(s)]. As can be seen from the plot and the AUC of 0.92, the classifier is easily able to distinguish between correctly and incorrectly labeled examples in this dataset.

Figure 5.6: Scene 2: (a) image of the left foveal camera, (b) point cloud and (c) ground truth segmentation.

5.4.2 Improvement of Segmentation Quality To validate the technique, we present experimental results that show its performance on the 19 scenes of the data set displayed in Figure 5.4. Each contains one incorrect segment for which we have the hand-labeled ground truth segmentation. Figure 5.6 shows scene 2 as an example. We compare the resulting total segmentation accuracy on the mislabeled segments before and after interaction with the human operator.

Baseline As a baseline, we use an interactive segmentation method in which mouse clicks of the user provide the seed points for the re-segmentation. In detail, the user sees an image of the part of the scene that is detected to contain a wrong object hypothesis. An example of such an image is shown in Figure 5.6a. The user is then asked to click with a mouse on the center of each object that he or she can identify. For each click position (u_i, v_i), we know the corresponding 3D point (x_i, y_i, z_i) from the prior stereo reconstruction. These points are then used as seeds for re-initializing the segmentation by placing a ball of a certain radius around them.
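A sketch of this baseline seeding, assuming the stereo reconstruction provides pixel coordinates and 3D positions for the reconstructed points; the ball radius is an illustrative value.

```python
import numpy as np

def seeds_from_clicks(click_uv, depth_points_uv, depth_points_xyz, radius=30.0):
    """Baseline re-seeding: for each mouse click (u, v), look up the corresponding
    3D point from the stereo reconstruction and collect all points within a ball
    of the given radius (mm) around it as seeds for one object."""
    seeds = []
    for (u, v) in click_uv:
        # Nearest reconstructed pixel to the click position.
        idx = np.argmin(np.linalg.norm(depth_points_uv - np.array([u, v]), axis=1))
        center = depth_points_xyz[idx]
        in_ball = np.linalg.norm(depth_points_xyz - center, axis=1) < radius
        seeds.append(np.where(in_ball)[0])   # indices of seed points for this object
    return seeds
```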

Results The graph in Figure 5.7a displays three bars for each segment. They show the total classification rate for all object points in the segment as compared to ground truth. The first bar in each set shows the performance achieved with the aid of human dialog input in the proposed framework; the second bar shows the results based on mouse clicks, while the third is without any human interaction. The proposed method achieves an increase in performance of 33.25% on average. This is attributable to the correct labeling of an object initially missing from the segmentation. A confusion matrix for a typical under-segmentation can be seen in Figure 5.7d and the resulting correction achieved through dialog interaction in Figure 5.7b. The mouse-based interaction only achieves an increase of 21.26% in segmentation performance as compared to the initial segmentation.

(a) Bar plot of the correct classification accuracy [%] for scenes 1-19 under the three conditions Dialog, Mouse and No Interaction. Average of approx. 33.25% increase in overall performance when dialog information is used for re-segmentation. With segmentation seed points from mouse clicks, we achieve only an average increase of 21.26%. Note that in some cases with a quite poor initial segmentation, the total accuracy remains low even with the human input. This suggests interaction is most useful when the automated technique has at least some understanding of the scene.

Confusion matrices for scene 2 (rows: predicted class, columns: actual class):

(b) Dialog CM           Bkrd.   Obj 1   Obj 2
    Bkrd.               23706      57     121
    Obj 1                1073    9793    3823
    Obj 2                 743     114   21447
    F1                   0.96    0.79    0.90

(c) Mouse CM            Bkrd.   Obj 1   Obj 2
    Bkrd.               24079    1951    5417
    Obj 1                 729    7585     657
    Obj 2                 714     428   19317
    F1                   0.85    0.80    0.84

(d) No Interaction CM   Bkrd.   Obj 1   Obj 2
    Bkrd.               24449     100    1450
    Obj 1                1073    9864   23941
    Obj 2                   0       0       0
    F1                   0.95    0.44    0.00

(e) Dialog PC (f) Mouse PC (g) No Interaction PC

Figure 5.7: The results of a comparison to hand-labeled ground truth across an aggregate set (a) and the confusion matrices (CMs) for scene 2. (b) With dialog interaction. (c) With mouse interaction. (d) Without interaction. Note the superior performance of the mouse-based over the dialog-based re-segmentation at the object boundaries. This effect is due to the accurate placement of the seed points by the user as compared to the splitting of a potentially slightly misaligned bounding box. However, the mouse-based interaction fails at including whole objects in one segment.

To clarify this result, Figures 5.7e and 5.7f show the re-segmented point clouds for scene 2. For the dialog-based interaction, segmentation errors occur at the boundary of the two objects and are due to the inflexible splitting of the bounding box in the middle. The mouse-based method fails to include the whole objects in the new segments. This is because, in this simple seeding method, the initial segmentation is not re-used for the new object models. Another possibility for re-seeding would be to group together points from the initial segmentation that are closest to one of the click positions. In that case, we would expect the segmentation quality of the dialog- and mouse-based re-segmentation methods to be comparable.

5.5 Discussion

We presented a novel human-robot interaction framework for the purpose of robust visual scene understanding. A state-of-the-art computer vision method for performing scene segmentation was combined with a natural dialog system. Experiments showed that by putting a human in the loop, the robot could gain improved models of challenging 3D scenes when compared with pure bottom-up segmentation. The proposed method for dialog-based interaction outperformed the simple mouse-based interaction. Furthermore, through an entropy-based approach, the interaction between human and robot could be made more efficient.

As future work, one could address the problem of how to represent objects efficiently in the working memory and label them with symbols that are convenient for a human operator. These could be attributes including positional state, dominant color and relative size, enabling more complex dialogs. Furthermore, user studies could be conducted with non-expert users to gain more insight into interaction patterns for this specific task. Specifically, it would be interesting to analyse how users describe scenes and object relations spatially. Are these descriptions dependent on viewpoint differences between the user and the robot? Which object attributes are chosen for reference resolution? Is this choice dependent on the whole scene structure?

In the scenario used in this chapter, the robot first exhaustively explores the scene by shifting its gaze to different salient points. Then it describes to the user what it has seen. A more realistic scenario is that the robot describes the scene already during exploration. Scene understanding could then be performed iteratively. The model for distinguishing correct from incorrect object hypotheses could also be extended during interaction with the user. To improve reference resolution, the robot could point to the parts of the scene it is currently talking about. In that context, recognition of pointing gestures performed by the user would also be of use.

6 Predicting Unknown Object Shape

Many challenging problems addressed by the robotics community are currently studied in simulation. Examples are motion planning [125, 66, 225] or grasp planning [53, 90, 153], in which a complete world model or models of specific objects are assumed to be known. However, on a real robotic platform this assumption breaks down due to noisy sensors or occlusions. Consider for example the point cloud in Figure 6.1 that was reconstructed with an active stereo head from a table-top scene. Due to occlusions, no information about the backside of the object or about the region behind it is available from purely stereo-reconstructed data. There are gaps in the scene and object representations. Additionally, noise causes the observed 3D structure of an object or scene to deviate from its true shape. Although the previously mentioned planners might be applied on real robotic platforms, a mismatch between the real world and the world model is usually not taken into account explicitly. Exceptions are reactive grasping strategies that adapt their behavior based on sensor data generated during execution time [78, 98, 99]. In this section, we consider the problem of picking up objects from a table-top scene containing several objects. We assume that the robot has reconstructed a point cloud from the scene using the approach from Section 2.2.1. Additionally, we

Figure 6.1: How can a suitable grasp pose be determined based on an incomplete point cloud? Left and Middle) Example views of a raw stereo reconstructed point cloud. Right) Model of the Schunk Hand with some of its degrees of freedom.

assume that this point cloud is correctly segmented, which could be ensured through the additional use of the human-robot dialog framework presented in the previous chapter. To accomplish stable grasping executed through collision-free motion, we follow the idea of filling in the gaps in the scene representation by predicting the full shape of objects. We make use of the observation that many objects, especially man-made ones, possess one or more symmetries. Given this, we can provide the simulator with an estimated complete world model under which it can plan actions or predict sensor measurements. Furthermore, we propose a method to quantify the uncertainty about each point in the completed object model. We study how much perception and manipulation can be improved by taking this uncertainty explicitly into account.

The contributions presented in this chapter are: (i) a quantitative evaluation of the shape prediction on real-world data containing a set of objects observed from a number of different viewpoints. This is different from the related work by Thrun and Wegbreit [219], in which only qualitative results in the form of point clouds are presented rather than quantitative results on polygonal meshes. (ii) We propose to estimate the most general symmetry parameters. This is different from Marton et al. [147, 145], in which, prior to reconstruction, the kind of symmetry an object possesses has to be detected; it can either be box-shaped, cylindrical or have a more complex radial symmetry. Our approach also differs from Blodow et al. [36], in which only rotational symmetry is considered. (iii) We reduce the search space for the optimal symmetry parameters through a good initialization. And (iv), we will demonstrate in Chapter 7 how the prediction of object shape helps perception as well as manipulation.

6.1 Symmetry for Object Shape Prediction

Estimating the occluded and unknown part of an object has applications in many fields, e.g., 3D shape acquisition, 3D object recognition or classification. In this chapter, we are looking at this problem from the perspective of service robotics and are therefore also interested in the advantages of shape completion for collision detection and grasp planning. Psychological studies suggest that humans are able to predict the portions of a scene that are not visible to them through controlled scene continuation [44]. The expected structure of unobserved object parts is governed by two classes of knowledge: i) visual evidence and ii) completion rules gained through prior visual experience. A very strong prior, especially for man-made objects, is symmetry. This claim is supported by Marton et al. [149]. The authors analysed five different databases containing models of objects of daily use. Only around 25% of these were not describable by simple geometric primitives such as cylinders, boxes or rotationally symmetric shapes. And even this comparably small set contains objects that possess planar symmetry.

Thrun and Wegbreit [219] have shown that this symmetry can be detected in partial point clouds and then exploited for shape completion. The authors developed a taxonomy of symmetries in which planar reflection symmetry is the most general one. It is defined as the case in which each surface point P can be uniquely associated with a second surface point Q by reflection on the opposite side of a symmetry plane. Furthermore, in a household environment, objects are commonly placed such that one of their symmetry planes is perpendicular to the supporting plane. Exceptions exist, such as grocery bags, dishwashers or drawers. Given these observations, for our table-top scenario we can make the assumption that objects commonly possess one or several planar symmetries, of which one is usually perpendicular to the table from which we are grasping. By making these simplifications, we can reduce the search space for the pose of this symmetry plane significantly. As will be shown in the experimental section, our method produces valid approximations of the true object shape from very different viewpoints and for varying levels of symmetry.

6.1.1 Detecting Planar Symmetry

Since we assume the symmetry plane to be perpendicular to the table plane, the search for its pose is reduced to a search for a line in the 2 1/2D projection of the partial object point cloud. This line has 3 degrees of freedom (DoF): the 2D position of its center and its orientation. We follow a generate-and-test scheme in which we create a number of hypotheses for these three parameters and determine the plausibility of the resulting mirrored point cloud based on visibility constraints. Intuitively, the fewer points of a mirrored point cloud fall into visible free space, the more plausible this point cloud is. This is visualized in Figure 6.3 and will be detailed in Section 6.1.1.4. We bootstrap the parameter search by initializing it with the major or minor eigenvector, e_a or e_b, of the projected point cloud. As will be shown in Section 6.2, this usually yields a good first approximation. Further symmetry plane hypotheses are generated from this starting point by varying the orientation and position of the eigenvectors, as outlined in Figure 6.2. We thereby further reduce the search space to 2 DoF. In the following, we will describe the details of this search.

6.1.1.1 Initial Plane Hypothesis

The first hypothesis for the symmetry axis is either one of the two eigenvectors of the projected point cloud. Our goal is to predict the unseen object part. We therefore make the choice dependent on the inverse viewing direction v: the eigenvector that is most perpendicular to v is used as the symmetry plane s:

\[
s = \begin{cases} e_a & \text{if } e_a \cdot v \leq e_b \cdot v \\ e_b & \text{if } e_a \cdot v > e_b \cdot v \end{cases} \tag{6.1}
\]


Figure 6.2: A set of hypotheses for the position and orientation of the symmetry plane. ea and eb denote the eigenvectors of the projected point cloud. c is its center of mass. αi denotes one of the variations of line orientation along which the best pose of the symmetry plane is searched. The blue lines translated about dj to dj+2 yield three further candidates with orientation αi.

where ea and eb denote the major and minor eigenvectors of the projected point cloud, respectively.
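The following sketch illustrates this initialization, a small numpy implementation of Equation 6.1 under our own naming; since eigenvector signs are arbitrary, the absolute value of the dot product is compared, which is an assumption on our part:

```python
import numpy as np

def initial_symmetry_axis(points_2d, view_dir_2d):
    """Pick the eigenvector of the projected cloud that is most
    perpendicular to the (inverse) viewing direction v, cf. Equation 6.1."""
    c = points_2d.mean(axis=0)
    cov = np.cov((points_2d - c).T)
    eigvals, eigvecs = np.linalg.eigh(cov)       # eigenvalues in ascending order
    e_b, e_a = eigvecs[:, 0], eigvecs[:, 1]      # minor and major eigenvector
    v = view_dir_2d / np.linalg.norm(view_dir_2d)
    s = e_a if abs(e_a @ v) <= abs(e_b @ v) else e_b
    return c, s                                  # center of mass and initial axis
```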

6.1.1.2 Generating a Set of Symmetry Hypotheses

Given this initial approximation of the symmetry plane s of the considered point cloud, we sample n line orientations α_i in the range between −20° and 20° relative to the orientation of s. The 2D position of these n symmetry planes is varied based on a shift d_j in m discrete steps along the normal of the symmetry plane. Together this yields a set S = {s_(1,1), ..., s_(1,j), ..., s_(i,j), ..., s_(n,m)} of n × m hypotheses.

6.1.1.3 Mirroring the Point Cloud

Let us denote the original point cloud as P, containing points p. Then, given some symmetry plane parameters s_(i,j), the mirrored point cloud Q_(i,j) with points q is determined as follows:

\[
q = R_{\alpha_i}^{-1}\, M\, R_{\alpha_i}\,(p + d_j - c) + c \tag{6.2}
\]

with R_{α_i} denoting the rotation matrix corresponding to α_i, M the reflection that flips the coordinate perpendicular to the symmetry plane, and d_j the translation vector formed from the shift d_j.
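A possible implementation of the hypothesis generation (Section 6.1.1.2) and of the mirroring step is sketched below. It assumes that the table normal is the z-axis, that R_{α_i} maps the symmetry plane normal onto the y-axis, and that d_j is a shift along that normal; apart from the ±20° range of Section 6.1.1.2, the grid sizes and the maximum shift are illustrative values of ours, not taken from the thesis:

```python
import numpy as np

def rot_z(angle):
    """Rotation about the table normal (assumed here to be the z-axis)."""
    ca, sa = np.cos(angle), np.sin(angle)
    return np.array([[ca, -sa, 0.0], [sa, ca, 0.0], [0.0, 0.0, 1.0]])

def mirror_cloud(points, c, alpha, d):
    """Mirror an (N, 3) cloud across a vertical symmetry plane, cf. Equation 6.2.

    The plane passes through c + d; R_alpha maps its normal onto the y-axis and
    M flips that coordinate.
    """
    R = rot_z(alpha)
    M = np.diag([1.0, -1.0, 1.0])        # reflection across the symmetry plane
    local = (points + d - c) @ R.T       # express points in the plane-aligned frame
    return (local @ M) @ R + c           # flip, rotate back and undo the centring

def hypothesis_grid(alpha0, normal, n=9, m=7,
                    max_angle=np.deg2rad(20.0), max_shift=0.03):
    """Enumerate the n x m set S of symmetry plane hypotheses (alpha_i, d_j)."""
    alphas = alpha0 + np.linspace(-max_angle, max_angle, n)
    shifts = np.linspace(-max_shift, max_shift, m)
    return [(a, s * np.asarray(normal)) for a in alphas for s in shifts]
```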


Figure 6.3: Free, occupied and unknown areas. 1. Mirrored point on original cloud. 2. Mirrored point in occluded space. 3. Mirrored point in front of segmented cloud. 4. Mirrored point in free space but not in front of segmented cloud.

6.1.1.4 Computing the Visibility Score

The simplest source of information about visibility constraints is the binary 2D segmentation mask O that separates an object from the background, in combination with the viewpoint of the camera. The mirroring process aims at adding points q to the scene that were unseen from the original viewpoints. Let us assume a mirrored point cloud Q_(i,j) has been generated. Then there are four cases for where a reflected point q can be positioned. These cases are visualized in Figure 6.3. i) If it coincides with a point in P (the original point cloud), then it supports the corresponding symmetry hypothesis. ii) If q falls into previously occluded space, it provides information about potential surfaces not visible from the original viewpoint. And finally, iii) and iv), if q is positioned in free space that has been visible before, then it contradicts the symmetry hypothesis. For selecting the best mirrored point cloud, we want to maximise the number of points that support the symmetry hypothesis while minimizing the number of points contradicting it. Using visibility constraints, we can compute a score for each point q in a mirrored point cloud Q_(i,j), but also for the corresponding symmetry plane hypothesis itself. We back-project each mirrored point cloud Q_(i,j) into the original image, giving us a set of pixels. For each of these pixels u_q, we search for its nearest neighbor pixel u_p that lies inside the original object segmentation mask O and is the projection of a point in P. For each pair (u_q, u_p), we can then compute the absolute value of the depth difference δ between their corresponding 3D points q and p.

Plausibility of a Point

As the plausibility plaus(q) of a point q, we define its probability of belonging to the original object. It can be computed dependent on which case in Figure 6.3 the point belongs to and how big its depth divergence is to its nearest neighbor in P. For all points q falling into occluded space or coinciding with the segmented object, as in cases one and two of Figure 6.3, we compute

\[
\text{plaus}_1(q) = \frac{\delta\,(0.5 - \text{plaus}_{\max})}{\delta_{\max}} + \text{plaus}_{\max} \tag{6.3}
\]

where plaus_max refers to the maximum plausibility and δ_max to the maximum depth divergence. Each point in the occluded space will have a minimum probability of 0.5 to belong to the real object. The higher the depth divergence δ of a point to its nearest neighbor, the lower its plausibility. This relationship is defined as a linear fall-off. For all points q falling into previously visible space (cases three and four in Figure 6.3), we similarly compute their plausibility as

\[
\text{plaus}_2(q) = -\frac{\delta\,\text{plaus}_{\max}}{\delta_{\max}} + \text{plaus}_{\max}. \tag{6.4}
\]

In contrast to Equation 6.3, Equation 6.4 can assign 0 as the minimum probability of a point to belong to the object. All original points p of the point cloud will have plaus_max assigned, which can be lower than 1 to account for noise.
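A direct transcription of Equations 6.3 and 6.4 could look as follows; the default value for plaus_max is an assumption of ours (the thesis only states that it can be lower than 1):

```python
import numpy as np

def plaus_occluded(delta, delta_max, plaus_max=0.9):
    """Equation 6.3: plausibility of a mirrored point in occluded space or on
    the original cloud; falls off linearly from plaus_max (delta = 0) to 0.5."""
    delta = np.clip(delta, 0.0, delta_max)
    return delta * (0.5 - plaus_max) / delta_max + plaus_max

def plaus_visible(delta, delta_max, plaus_max=0.9):
    """Equation 6.4: plausibility of a mirrored point in previously visible
    free space; falls off linearly from plaus_max (delta = 0) to 0.0."""
    delta = np.clip(delta, 0.0, delta_max)
    return -delta * plaus_max / delta_max + plaus_max
```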

Plausibility of a Whole Point Cloud

The plausibility score p_(i,j) for the whole mirrored point cloud Q_(i,j), i.e., for the symmetry plane hypothesis s_(i,j), consists of two parts, each reflecting the third or fourth case as visualized in Figure 6.3. Let q ∉ O refer to all points whose projection does not fall into the original segmentation mask O, and let q ∈ O refer to the set of points whose projections fall into O. Then p_(i,j) is computed as the sum of the expected values of the plausibilities in these two sets:

\[
p_{(i,j)} = \mathbb{E}_{q \notin O}[\text{plaus}_2(q)] + \mathbb{E}_{q \in O}[\text{plaus}_2(q)]. \tag{6.5}
\]

In case the plane parameters are chosen such that there is a large overlap between the original point cloud P and the mirrored cloud Q_(i,j), the second part of Equation 6.5 will be relatively large compared to the first part. The bigger the shift d_j of the symmetry plane (as visualized in Figure 6.2), the more E_{q∈O}[plaus_2(q)] will increase and E_{q∉O}[plaus_2(q)] decrease. We are searching for the symmetry plane parameters s*_(i,j) that produce a mirrored point cloud with a global maximum in the space of plausibility scores:

\[
s^{*}_{(i,j)} = \arg\max_{\mathcal{S}}\, p_{(i,j)}. \tag{6.6}
\]

The corresponding Q_(i,j) will have the fewest points contradicting the symmetry hypothesis and is therefore of maximum plausibility.
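Given per-point plausibilities and the mask membership of each back-projected point, the selection of the best hypothesis (Equations 6.5 and 6.6) reduces to the following sketch; for simplicity it takes the case-dependent point plausibilities as precomputed inputs:

```python
import numpy as np

def cloud_score(plaus, in_mask):
    """Equation 6.5: sum of the expected plausibilities of the points whose
    projections fall outside and inside the segmentation mask O."""
    plaus = np.asarray(plaus, dtype=float)
    in_mask = np.asarray(in_mask, dtype=bool)
    outside = plaus[~in_mask].mean() if (~in_mask).any() else 0.0
    inside = plaus[in_mask].mean() if in_mask.any() else 0.0
    return outside + inside

def best_hypothesis(scored_hypotheses):
    """Equation 6.6: pick the symmetry plane hypothesis with the maximum score.
    scored_hypotheses -- iterable of (hypothesis, score) pairs."""
    return max(scored_hypotheses, key=lambda hs: hs[1])[0]
```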

6.1.2 Surface Approximation of Completed Point Clouds

After the prediction of the backside of an object point cloud, we want to create a surface mesh approximation to support grasp planning and collision detection.

We use Poisson surface reconstruction for this purpose, as proposed by Kazhdan et al. [113]. It is a global method that considers all data at once and produces smooth, watertight surfaces that robustly approximate noisy data.

6.1.2.1 Introduction to Poisson Surface Reconstruction

Poisson surface reconstruction approaches the problem by computing an indicator function χ that takes the value 1 at points inside the object and 0 outside of it. The reconstructed surface then equals an appropriate iso-surface. The key insight used for formulating surface reconstruction as a Poisson problem is that there is a close relationship between a set of oriented points S sampled from the real surface and χ. Specifically, the gradient ∇χ of the indicator function is 0 almost everywhere except close to the surface, where it equals the inward surface normals. The set of oriented points S can therefore be seen as samples of ∇χ. Computing the indicator function can then be formulated as finding the χ whose gradient best approximates a vector field V defined by the points in S:

\[
\hat{\chi} = \min_{\chi}\, \|\nabla\chi - \vec{V}\| \tag{6.7}
\]

~ (q) α f (q)n (6.9) V ≡ o,s o s s∈S s X o∈NgbrXD ( ) where αo,s are the appropriate weights for interpolation and ns the normal of the sample s. To choose the appropriate iso-value γ at which the iso-surface δ ˜ is extracted, is evaluated at all the samples from set . The average of the resultingM values is thenX used as the iso-value. S

δ ˜ q IR3 ˜(q) = γ (6.10) M ≡ { ∈ |X } 88 CHAPTER 6. PREDICTING UNKNOWN OBJECT SHAPE with 1 γ = ˜(s). (6.11) X s |S| X∈S δ ˜ is then extracted by using marching cubes adapted to octrees as proposed in LorensenM and Cline [138] and Kazhdan et al. [113].

6.1.2.2 Taking Non-Uniform Sampling into Account

The aforementioned formulation assumes that samples are uniformly distributed. However, this case seldom applies to real-world situations. In our case, where point clouds are extracted through stereo vision, sampling is non-uniform due to, for example, textureless areas on the real objects. Kazhdan et al. [113] propose to scale the magnitude as well as the width of the kernel associated with each sample point according to the local sampling density. Through that approach, areas of high sampling density will maintain sharp features in their surface approximation, while areas of sparse sampling will be characterized by smoothly fitting surfaces. The local sampling density w_D̂(q) at a point q is computed as the sum of the node functions of the nearest neighbor nodes at a specific depth D̂ ≤ D:

\[
w_{\hat{D}}(q) \equiv \sum_{s \in S}\ \sum_{o \in \text{Ngbr}_{\hat{D}}(s)} \alpha_{o,s}\, f_o(q). \tag{6.12}
\]

w_D̂(q) is then used to modify Equation 6.9 such that each sample point contributes proportionally to the surface area associated with it. Since the local sampling density is inversely related to the surface area, then

\[
\vec{V}(q) \equiv \frac{1}{w_{\hat{D}}(q)} \sum_{s \in S}\ \sum_{o \in \text{Ngbr}_{\text{Depth}(s)}(s)} \alpha_{o,s}\, f_o(q)\, n_s. \tag{6.13}
\]

By interpolating the vector field over the neighbors of s at depth

\[
\text{Depth}(s) \equiv \min\bigl(D,\ D + \log_4(w_{\hat{D}}(s)/W)\bigr) \tag{6.14}
\]

not only the magnitude of the smoothing filter is adapted; its width is also chosen proportionally to the radius of the surface patch associated with point s. W refers to the average sampling density over all samples. Similarly, Equation 6.11 is adapted such that points with a higher local sampling density have a higher influence on the choice of the iso-value:

\[
\gamma = \frac{\sum_{s \in S} w_{\hat{D}}(s)\, \tilde{\chi}(s)}{\sum_{s \in S} w_{\hat{D}}(s)}. \tag{6.15}
\]

6.1.2.3 Plausibility for Improved Robustness of the Surface Reconstruction

Kazhdan et al. [113] propose to adapt their approach to take confidence values of the sampling points into account, which is implemented in [112]. In this thesis, we want to show that different prediction mechanisms are useful for improving perception but also manipulation and grasping. Additionally, we have already shown in Chapter 4 that having an idea about the uncertainty in a prediction helps to efficiently explore a scene. In this section, we want to show that using the probability of a point in the mirrored point cloud to be from the original object improves the robustness of the surface reconstruction. This probability is defined as the plausibility plaus(q) of a point q according to visibility constraints, as in Equations 6.3 and 6.4. The intuition behind the following adaptation of the Poisson equations is that sample points with low plausibility do not need to be represented with fine detail. Equation 6.13 is therefore adapted as follows:

\[
\vec{V}(q) \equiv \frac{1}{w_{\hat{D}}(q)\cdot \text{plaus}(q)} \sum_{s \in S}\ \sum_{o \in \text{Ngbr}_{\text{Depth}(s)}(s)} \alpha_{o,s}\, f_o(q)\, n_s \tag{6.16}
\]

where

\[
\text{Depth}(s) \equiv \min\bigl(D,\ D + \log_4(w_{\hat{D}}(s)\cdot\text{plaus}(s)/W)\bigr). \tag{6.17}
\]

Equation 6.11 is adapted as follows:

\[
\gamma = \frac{\sum_{s \in S} w_{\hat{D}}(s)\cdot\text{plaus}(s)\, \tilde{\chi}(s)}{\sum_{s \in S} w_{\hat{D}}(s)\cdot\text{plaus}(s)}. \tag{6.18}
\]

As will be shown in Section 6.2, this is specifically beneficial at the seams of the completed point cloud where old and new points overlap. These places are prone to noise and potentially contradictory normal information. New points, which tend to have a lower plausibility than old points, should therefore be down-weighted to achieve smooth boundaries.
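The adaptation above requires access to the internals of the Poisson solver. As a pragmatic approximation when only an off-the-shelf implementation is available, one could scale each oriented normal by its point's plausibility before meshing; this only has an effect if the chosen implementation interprets normal magnitude as a per-sample confidence weight, which is an assumption to verify for the implementation at hand:

```python
import numpy as np

def weight_normals_by_plausibility(normals, plaus):
    """Scale each unit normal by its point's plausibility so that a downstream
    Poisson implementation that treats normal magnitude as a confidence weight
    (an assumption about that implementation) down-weights uncertain points,
    e.g. at the seams of the mirrored cloud."""
    return np.asarray(normals) * np.asarray(plaus)[:, None]
```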

6.1.2.4 Computing the Normals of the Completed Point Cloud

The point cloud data that we use as sample set S is not originally oriented. To determine the normals, we use a kd-tree as proposed by Hoppe et al. [97]. A local plane fit is estimated for the k nearest neighbors of the target point. This plane is assumed to be a local approximation of the surface at the current point. More advanced normal estimation techniques, as for example by Cole et al. [54] and Mitra and Nguyen [154], have been proposed and could perhaps increase the performance of the surface reconstruction at the cost of computation speed. Following normal estimation, we ensure that the normals of the mirrored points are consistent with the mirrored viewing direction, i.e., the viewing direction reflected across the plane of symmetry. It should be noted that while Poisson surface reconstruction is robust to some noise, it is sensitive to the normal directions. We therefore filter outliers from the segmented


Figure 6.4: From a segmented point cloud to meshes of different resolutions. (i, j) is the index of a symmetry plane hypothesis with an angle αi and a position dj as visualized in Figure 6.2. The two different meshes are reconstructed based on different octree depths.

point cloud prior to the surface reconstruction step. Example results using a mirrored point cloud for different depth values D of the octree are shown in Figure 6.4.
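A sketch of this meshing pipeline with an off-the-shelf library (Open3D is our choice here and stands in for the components used in the thesis) could look as follows; for mirrored points the camera location should be replaced by the mirrored viewpoint, which is omitted in this simplified version:

```python
import numpy as np
import open3d as o3d

def reconstruct_mesh(points, camera_location, depth=5):
    """Mesh a (completed) point cloud: k-NN normal estimation, normal
    orientation toward the viewpoint, outlier removal and Poisson
    surface reconstruction."""
    pcd = o3d.geometry.PointCloud()
    pcd.points = o3d.utility.Vector3dVector(np.asarray(points, dtype=float))
    # Hoppe-style normals from a local plane fit over the k nearest neighbours
    pcd.estimate_normals(search_param=o3d.geometry.KDTreeSearchParamKNN(knn=20))
    pcd.orient_normals_towards_camera_location(np.asarray(camera_location, dtype=float))
    # Poisson reconstruction is sensitive to outliers and flipped normals
    pcd, _ = pcd.remove_statistical_outlier(nb_neighbors=20, std_ratio=2.0)
    mesh, _ = o3d.geometry.TriangleMesh.create_from_point_cloud_poisson(pcd, depth=depth)
    return mesh
```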

6.2 Experiments

In this section, we will present quantitative experiments showing that the completion of incomplete object point clouds based on symmetry produces valid object models. Furthermore, we will evaluate how much the robustness of registration improves when using completed point clouds. Finally, we will show how the estimated complete object model helps when using a simple grasp strategy based on the center of an object.

6.2.1 Evaluation of the Mesh Reconstruction

In this section, we quantitatively evaluate how accurately the surfaces of objects are reconstructed with the proposed method.

6.2.1.1 Dataset

The database we used for evaluating the point cloud mirroring method is shown in Figure 6.5. For each of the objects in this database, with the exception of the toy tiger and the rubber duck, we have laser scan ground truth1. The test data was captured with the active vision system (see Section 2.2.1) and contains 12 different household or toy objects. Four of them are used both standing upright and lying on their side. Thereby, the database contains 16 different data sets. Each set

1The ground truth object models were obtained from http://i61p109.ira.uka.de/ObjectModelsWebUI/

(a) Amicelli (b) Burti (c) Duck (d) Green Cup

(e) Mango (l) (f) Mango (s) (g) Brandt (h) Salt Box (l)

(i) Salt Box (s) (j) Salt Cyl (l) (k) Salt Cyl (s) (l) Spray

(m) Tiger (n) Soup Can (s) (o) Soup Can (l) (p) White Cup

Figure 6.5: The 12 Objects of the Mesh Reconstruction Database in partially varying poses, yielding 16 Data Sets. Objects are shown in their 0° position, i.e., with their longest dimension, when projected onto the table, parallel to the image plane. Ground truth meshes exist for all objects except the toy tiger and the rubber duck.

Figure 6.6: One of the Data Sets from Figure 6.5 shown in Orientations 0°, 45°, 90°, 135°, 180°, 225°, 270° and 315°.

Figure 6.7: Left) Ground Truth Mesh. Middle Left) Original Point Cloud. Middle Right) Mirrored Cloud with Original Points in Blue and Additional Points in Red. Right) Mirrored Cloud with Plausibility Information Mapped to Gray Scale.

contains 8 stereo images showing the object in one of the following orientations: 0°, 45°, 90°, 135°, 180°, 225°, 270° or 315°. An example for the toy tiger is shown in Figure 6.6. As an orientation reference, we used the longest object dimension when projected down to the table; 0° then means that this reference axis is parallel to the image plane of the stereo camera. This can be observed in Figure 6.5, in which all objects are shown in their 0° pose. In total, the database therefore contains 128 stereo images along with their point clouds.

From all point clouds, we reconstructed the complete meshes based on the methods described in Section 6.1.2, with and without taking plausibility information into account. As the octree depth parameter of the Poisson surface reconstruction we use 5. In our previous work [4], we also used a tree depth of 7 and did not find a significant difference in performance. By limiting this parameter, we enable mesh reconstruction in near real-time. With a tree depth of 5, the meshes are more coarse and blob-like but less sensitive to noise in the point cloud and normal estimation. With a depth of 7, the reconstructed surface is closer to the original point set; however, outliers strongly affect the mesh shape and it is more sensitive to noise.

To obtain the ground truth pose for each item in the database, we applied the technique proposed in [165]. It allows registering ground truth object meshes to the incomplete point clouds. We manually inspected the results to ensure correctness. Figure 6.7 shows an example point cloud from our dataset along with the aligned ground truth mesh, the completed point cloud and the plausibility values.

6.2.1.2 Baseline

As a baseline, we generated a mesh based only on the original point cloud. To do this, we applied a Delaunay triangulation as implemented in [43] to the projection of a uniformly sampled subset of 500 points from the original point cloud. Spurious edges are filtered based on their length in 2D and 3D. Furthermore, we extract the outer contour edges of this triangulation and span triangles between them, which produces a watertight mesh. Figure 6.8 shows an example Delaunay graph with color-coded edges. Figure 6.9 shows example results of this Delaunay-based mesh

Figure 6.8: Delaunay graph as the basis for the baseline mesh reconstruction. Red edges belong to the mesh. Green edges are contour edges that are additionally triangulated separately to create a mesh backside. Blue edges are filtered out based on length and connectivity to the outer triangle.

Figure 6.9: Delaunay based Meshes of Toy Tiger. Top: Front View. Bottom: Top View.

reconstruction.
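A simplified version of this baseline, using scipy's Delaunay triangulation on the 2D projection and a single 3D edge-length filter (the 2D filtering and the contour closing of the thesis are omitted; thresholds and the projection plane are assumptions of ours), is sketched below:

```python
import numpy as np
from scipy.spatial import Delaunay

def baseline_delaunay_mesh(points, n_samples=500, max_edge_3d=0.03):
    """Baseline mesh: Delaunay triangulation of the 2D projection of a
    uniformly sampled subset of the cloud, with over-long triangles discarded."""
    rng = np.random.default_rng(0)
    points = np.asarray(points, dtype=float)
    idx = rng.choice(len(points), size=min(n_samples, len(points)), replace=False)
    pts = points[idx]
    tri = Delaunay(pts[:, :2])                 # triangulate the x/y projection
    keep = []
    for simplex in tri.simplices:              # filter triangles with long 3D edges
        a, b, c = pts[simplex]
        edges = (np.linalg.norm(a - b), np.linalg.norm(b - c), np.linalg.norm(c - a))
        if max(edges) < max_edge_3d:
            keep.append(simplex)
    return pts, np.array(keep)                 # vertices and triangle indices
```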

6.2.1.3 Mesh Deviation Metric

To assess the deviation of the Delaunay meshes and the mirrored meshes from the ground truth, we use MeshDev [189]. As a metric, we evaluated the geometric deviation, i.e., the distance between each point on the reference mesh and the nearest neighbor on the other mesh. We applied the uniform sampling of the surface of the reference mesh as proposed in [189] to calculate this deviation. In Figure 6.10, meshes are visualized that were reconstructed using the baseline approach, the Poisson reconstruction without plausibility and with plausibility information. Figure 6.11 shows the heat maps of the geometric deviation between the meshes in Figure 6.10 and the ground truth mesh in Figure 6.7. In [4], we compared the performance of the different mesh reconstruction results using the mean and variance over the geometric deviation histogram; the resulting Gaussians are visualized in Figure 6.12.
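The geometric deviation can be approximated with uniform surface sampling and nearest-neighbor distances; the sketch below uses Open3D and replaces the point-to-surface distance of MeshDev by a point-to-point distance against a dense sampling of the second mesh, which is an approximation of ours:

```python
import numpy as np
import open3d as o3d

def geometric_deviation(reference_mesh, test_mesh, n_samples=20000):
    """Approximate geometric deviation: uniformly sample the reference mesh and
    measure, for each sample, the distance to a dense sampling of the other mesh.
    Returns the median and the first and third quartiles in the mesh units."""
    ref = reference_mesh.sample_points_uniformly(number_of_points=n_samples)
    other = test_mesh.sample_points_uniformly(number_of_points=n_samples)
    d = np.asarray(ref.compute_point_cloud_distance(other))
    return np.median(d), np.percentile(d, 25), np.percentile(d, 75)
```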

Figure 6.10: Reconstructed Meshes. Left) Delaunay based. Middle) Poisson recon- struction on uniformly weighted points. Right) Poisson reconstruction on points weighted with plausibility.

Figure 6.11: Heat Map of Geometric Deviation to Ground Truth Mesh.

Figure 6.12: Gaussian distributions fit to the histograms of geometric deviation [mm].

Figure 6.13: Median, First and Third Quartile of the Histograms.

(Figure 6.14 plots: Geometric Deviation [mm] over Object Orientation; left panel "Search over Shift", right panel "Search over Shift and Orientation"; legend: Del, NoPlaus, WithPlaus.)

Figure 6.14: Evaluation of the Deviation between the Ground Truth Mesh and i) Mesh based on 3D Point Cloud only (Del), ii) Mesh based on mirroring and Poisson Surface Reconstruction without using plausibility information (WOPlaus) or iii) with using plausibility information (WPlaus). The red line indicates the average median deviation over all orientations.

Figure 6.12 clearly shows that this model does not sufficiently fit the data. To better visualise the distribution of the data, we use a box-and-whiskers plot as shown in Figure 6.13. The box contains the first, second and third quartile, while the whiskers show the minimum and maximum of the distribution.

6.2.1.4 Results

Geometric Deviation of Reconstructed Meshes

Figure 6.14 shows the box-and-whiskers plots of the mesh deviation between the ground truth meshes and the reconstructed meshes for all object orientations over all objects. The left and right graphs present results for two different categories of symmetry planes. On the left, only planes with the same orientation as the selected eigenvector were considered; they were translated along the normal of the eigenvector. On the right of Figure 6.14, additional hypotheses with varying orientation are considered. On average, the mirrored point clouds are always closer to the ground truth than the Delaunay-based meshes. Furthermore, when searching over different positions and orientations of the symmetry plane, the average median deviation is lower (4.93 mm) than when only taking shifts into account (6.64 mm). Figure 6.15 shows the same error measure for each object independently, averaged over all its orientations. The deviation measure is not normalized to the overall object size and shows the deviation in millimeters. For bigger objects, like the

(Figure 6.15 plot: Geometric Deviation [mm] for each object; legend: Del, NoPlaus, WithPlaus.)

Figure 6.15: Evaluation of the Deviation between the Ground Truth Mesh and i) Mesh based on 3D Point Cloud only (Del), ii) Mesh based on mirroring and Poisson Surface Reconstruction without using plausibility information (WOPlaus) or iii) with using plausibility information (WPlaus).

Brandt box or the Burti and Spray bottles, the median geometric deviation between the Delaunay meshes and the ground truth exceeds 20 mm. The mirroring yields a significant improvement for most objects. The green and white cups pose a challenging problem to the Poisson surface reconstruction because they are hollow. Modeling the void is difficult due to viewpoint limitations. On the other hand, holes in the point clouds due to non-uniform texture are usually closed by the surface reconstruction. For the Burti bottle, the original point clouds are very sparse due to this object being only lightly textured. This is reflected by the very high median of the geometric deviation of these object meshes shown in Figure 6.15. This indicates that, given input of very low quality, the proposed completion algorithm cannot achieve a significantly better reconstruction of the whole object.

Plausibility for Surface Reconstruction

Figures 6.14 and 6.15 also compare the geometric deviation for meshes reconstructed with and without taking the point-wise plausibility information into account. They show that, on average, there is no significant difference between them regarding the median of the geometric deviation. However, for some example meshes we can observe a significantly better surface reconstruction when using the plausibility information. Figure 6.16 shows some of these examples and illustrates the differences between the two reconstruction methods. Details are shown of the seams between the original and the mirrored point cloud. Figures 6.16a, 6.16b and 6.16d show specifically how the lower plausibility values of the mirrored points at these regions lead to stronger local smoothing of the mesh. Figure 6.16c shows how using plausibility

(a) Detail of Amicelli box reconstructed with octree depth 5. (b) Detail of Amicelli box reconstructed with octree depth 7.

(c) Reconstructed tiger with octree depth 7 (d) Details of reconstructed tiger head.

Figure 6.16: Comparison of Poisson reconstruction results of the mirrored point cloud with and without using plausibility information. Left) Without Plausibility. Right) With Plausibility. Direct comparison shows that at places where the mirrored cloud consists of contradictorily oriented points, including plausibility information results in coarser meshes at those uncertain places.

information for the choice of the iso-value can result in a more plausible mesh that does not conform to noise potentially introduced by the point cloud completion. The mesh details suggest that the smoother surfaces reconstructed using plausibility information should require fewer faces. As shown in Figure 6.17, there is however no significant difference in the number of faces between the two methods. This suggests that the meshes differ only marginally in small parts, as for example at the seams between the original and mirrored point clouds.

Plausibility as Quality Indicator

Since we do not have ground truth models available for the toy tiger and the rubber duck, we show their reconstructed meshes with tree depth 7 in Figure 6.18. The overall shape of these quite irregular toy objects is well reconstructed. However, because of the complexity of the objects, if an incorrect mirroring plane is chosen we obtain toy animals with either two heads or two tails. In such cases, there are strong violations of the visibility constraints. Therefore, these symmetry plane hypotheses will have a low plausibility value as computed in Equation 6.5. By thresholding this value, we should be able to remove these erroneous hypotheses. Figure 6.19 shows a scatterplot of the best mirrored point clouds for each object and angle combination. Their overall plausibility is plotted against the median

(Figure 6.17 plot: Avg. Number of Faces for each object; legend: NoPlaus, WithPlaus.)

Figure 6.17: Comparison of the number of faces in meshes reconstructed with and without considering plausibility information. Mean and standard deviation are shown for each object in the dataset, averaged over all orientations.

Figure 6.18: Meshes based on mirroring using only shifts of the symmetry plane. First Row: Toy Tiger. Second Row: Rubber Duck.

of the normalized geometric deviation from the ground truth meshes. We used Bayesian line fitting as described in Bishop [31] to reveal the inverse relation of plausibility to geometric deviation: the lower the plausibility of a mirrored point cloud, the higher the geometric deviation from the ground truth. This shows that thresholding the plausibility can be used to reject completions of low quality.
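For reference, a Bayesian line fit in the spirit of Bishop [31] can be written in a few lines of numpy; the prior and noise precisions used here are placeholder values of ours, not those used for Figure 6.19:

```python
import numpy as np

def bayesian_line_fit(x, t, alpha=1e-3, beta=25.0):
    """Bayesian linear regression for y = w0 + w1 * x.

    alpha -- precision of the isotropic Gaussian prior over the weights
    beta  -- precision of the observation noise
    Returns the posterior mean and covariance of the weights.
    """
    x = np.asarray(x, dtype=float)
    t = np.asarray(t, dtype=float)
    Phi = np.column_stack([np.ones_like(x), x])      # design matrix with bias term
    S_inv = alpha * np.eye(2) + beta * Phi.T @ Phi   # posterior precision
    S = np.linalg.inv(S_inv)
    m = beta * S @ Phi.T @ t                         # posterior mean weights
    return m, S

# Example: x = median normalised geometric deviation, t = plausibility;
# a negative slope m[1] reflects the inverse relation reported in Figure 6.19.
```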

6.3 Discussion

In this chapter, we proposed a method that estimates complete object models from partial views. No knowledge about the specific objects is assumed for this task. Instead, we introduce a symmetry prior. We validated the accuracy of the completed models using laser scan ground truth. The results show the effectiveness of the technique on a variety of household objects in table-top environments.

(Figure 6.19 plot: Plausibility over Median of Normalised Geometric Deviation; legend: WO Plaus, W Plaus.)

Figure 6.19: Scatterplot of the best mirrored point clouds for each object and angle combination, showing their plausibility against the median of the normalized geometric deviation from the ground truth meshes. A line fit to this data in a Bayesian manner reveals the inverse relation of plausibility to geometric deviation: the lower the plausibility of a mirrored point cloud, the higher the geometric deviation from the ground truth.

We argue that the proposed object prediction technique is another step towards bridging the gap between simulation and the real world. This will be further discussed in Section 7.3, where we show how the predicted meshes can provide grasp and motion planners with more accurate object models.

As described earlier, we reduced the search space for the optimal symmetry parameters by making the assumption that objects are standing on a table top with the symmetry plane perpendicular to the table. This reduces the search for the position and orientation of a plane to the 3 DoF of a line on the table. By furthermore initializing the search from the eigenvectors of the projection of the object point cloud, only an orientation and shift of the line has to be considered. These restrictions limit the possible object poses under which we can estimate their full shape. However, the generate-and-test scheme can be easily parallelized. Therefore, a search in a space covering more DoF of a symmetry plane can be performed efficiently.

As mentioned by Breckon and Fisher [44], a symmetry prior can easily fail in reconstructing a surface when the object is strongly occluded. Also, wrong segmentations can lead to erroneous results. The dataset on which we quantitatively evaluated the proposed method contains only single objects. We assume they are correctly segmented with only few outliers from the background or table. For this case, we have shown the effectiveness of our method. In Section 8.1.1, we will demonstrate the proposed approach in a grasping scenario with several objects on a table top. Although minor occlusions occur, the method still succeeds in providing models on which successful grasps are planned and finally executed. However, in the presence of strong occlusions the proposed method will most likely fail to estimate the correct object surface. Among others, this is due to using the eigenvectors of the point cloud projection as an initialization for the underlying symmetry plane. This choice has been shown on the dataset to yield reasonable estimates for the orientation of the symmetry plane. However, if the point cloud represents only a minor part of the real object, then the eigenvector of this part will not be close to the optimal symmetry plane of the real object.

We see two possibilities to overcome this limitation. The first one is based on using the plausibility of a symmetry plane hypothesis as an absolute quality indicator. That this is possible has been shown earlier in this chapter. If only a hypothesis can be found that strongly violates the visibility constraints, i.e., has a low plausibility, it can be rejected completely by thresholding this value. Then, a different method for motion and grasp planning has to be used. A second possibility is to keep the predicted object shape but explicitly take the uncertainty about the reconstructed surface into account for grasp and motion planning. This could be done by, for example, avoiding placements of the hand in which fingertip positions coincide with points of low plausibility. Another option is to use the proposed method in combination with haptic exploration, such that especially uncertain parts of the estimated object surface are confirmed with the hand first.

7 Generation of Grasp Hypotheses

Autonomous grasping of objects in unstructured environments remains an open problem in the robotics community. Even if the robot achieves a good understanding of the scene with well-segmented objects, the decision on how to grasp or manipulate them depends on many factors, e.g., the identity and geometry of the object, the scene context, the embodiment of the robot or its task. The number of grasps potentially applicable to a single object is immense. In the following, we will refer to this set as grasp candidates. Two main challenges exist: (i) planning good grasps, which can be seen as choosing a set of promising grasp hypotheses among the candidates, and (ii) executing these robustly under uncertainties introduced by the perception or motor system. In Chapter 8, different approaches towards robust grasp execution will be discussed. In this chapter, we will propose approaches to generate good grasp hypotheses given different prior object knowledge. In our case, one hypothesis is defined by

• the grasping point on the object with which the tool centre point (TCP) should be aligned1,

• the approach vector [70], which describes the 3D angle with which the robot hand approaches the grasping point, and

• the wrist orientation of the robotic hand.

Although humans master grasping easily, no suitable representations of the whole process have yet been proposed in the neuro-scientific literature, making it difficult to develop robotic systems that can mimic human grasping behavior. However, there is some valuable insight. Goodale [91] proposes that the human visual system is characterized by a division into the dorsal and ventral pathways. While the dorsal stream is mainly responsible for the spatial vision targeted at extracting action-relevant visual features, the ventral stream is engaged in the task of object identification. This dissociation also suggests two different grasp choice mechanisms, dependent on whether a known or unknown object is to be manipulated. Support for this can be found in behavioral studies by Borghi [38] and Creem

1In this thesis, the TCP is at the centre of the palm of the robotic gripper.

and Proffitt [57]. The authors claim that in the case of novel objects, our actions are purely guided by affordances as introduced by Gibson [89]. In the case of known objects, semantic information (e.g., through grasp experience) is needed to grasp them appropriately according to their function. However, as argued in [91, 48, 228], this division of labour is not absolute. In the case of objects that are similar to previously encountered ones, the ventral system helps the dorsal stream in the action selection process by providing information about prehensile parts along with their afforded actions.

Grasp synthesis algorithms have previously been reviewed by Sahbani et al. [198]. The authors divide the approaches into analytical and empirical methods. With analytical, they refer to all those methods that find optimal grasps according to some criteria. These can be based on geometric, kinematic or dynamic formulations, as for example force or form closure. Empirical approaches, on the other hand, avoid computing these mathematical or physical models. Instead, a set of example grasps is sampled from the set of candidates, their quality is evaluated and the best subset is chosen as the final grasp hypotheses. Kamon et al. [108] also refer to this as the comparative approach. Existing algorithms differ in the source of grasp candidates and the definition of their quality. In [198], the division is made dependent on whether the method is based on object features or on the observation of humans during grasping. We believe that this falls short of capturing the diversity of the empirical approaches. In this thesis, we instead group existing empirical grasp synthesis algorithms according to the assumed prior knowledge about object hypotheses. The three categories are:

• Known Objects: These approaches consider grasping of a priori known objects. Once the object is identified, the goal is to estimate its pose and retrieve a suitable grasp, e.g., from an experience database.

• Unknown Objects: Approaches that fall into this category commonly represent the shape of an unknown object and apply heuristics to reduce the number of potential grasps.

• Familiar Objects: These approaches try to transfer pre-existing grasp experience from similar objects. Objects can be familiar on different levels. Low-level similarity can for example be defined in terms of shape, color or texture. Higher-level similarity could be defined based on object category. An underlying assumption is that new objects similar to old ones can be grasped in a similar way.

This kind of division is inspired by research in the field of neuropsychology. It is suitable because the assumed prior object knowledge determines the necessary perceptual processing, but also which empirical method can be used for generating grasp candidates. This will be discussed in more detail in Section 7.1. Following the related work, we will propose different grasp inference methods for either known, familiar or unknown objects. Our requirement on the employed object representations is that they have to be rich enough to allow for the inference of the aforementioned grasp parameters.
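For concreteness, a hypothesis with these three parameters could be represented by a minimal data structure such as the following sketch (field names are ours; the thesis does not prescribe any particular representation):

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class GraspHypothesis:
    """Minimal container for the three grasp parameters used in this chapter."""
    grasp_point: np.ndarray       # 3D point the TCP should be aligned with
    approach_vector: np.ndarray   # 3D direction from which the hand approaches
    wrist_orientation: float      # rotation of the hand about the approach vector [rad]
```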

7.1 Related Work

Figure 7.1 outlines the different aspects that influence the generation of grasp hypotheses for an object. In this section, we focus on the significant body of work proposing empirical grasp synthesis methods. For an in-depth review of analytical approaches, we refer to [198]. We use the division proposed in the previous section to review the related work. Given the kind of prior object knowledge, we will discuss the different object representations, which are partially dependent on the kind of sensors the robot is equipped with. Furthermore, we are also interested in which empirical method is employed, as outlined in Figure 7.2. There are a number of methods that learn either from humans, from labeled examples or by trial and error. Other methods use heuristics to prune the search space for appropriate grasp hypotheses. There is relatively little work on task-dependent grasping. Also, embodiment is usually not in the focus of the approaches discussed here. These two aspects will therefore be mentioned but not discussed in depth.

7.1.1 Grasping Known Objects

If the robot knows the object that it has to grasp, then the problem of finding a good grasp reduces to the online estimation of the object's pose. Given this, a set of grasp hypotheses can be retrieved from an experience database and filtered by reachability.

7.1.1.1 Offline Generation of a Grasp Experience Database

In this section, we will look at approaches that set up an experience database. They differ in how candidate grasps are generated and in how their quality is evaluated. These two aspects are usually tightly connected and also depend on the object representation.

3D Mesh Models and Contact-Level Grasping

The main problem in the area of grasp synthesis is the huge search space from which a good grasp has to be retrieved. Its size is due to the large number of hand configurations that can be applied to a given object. In the theory of contact-level grasping [161, 204], a good grasp is defined from the perspective of forces, friction and wrenches. Based on this, different criteria are defined to rate grasp configurations, e.g., force closure, dexterity, equilibrium, stability and dynamic behavior. Several approaches in the area of grasp synthesis exist that apply these criteria to rank grasp candidates for an object with a given 3D mesh model. They differ in the heuristics they use to generate the candidates. Some of them approximate the object's shape with a constellation of primitives such as spheres, cones, cylinders and

Figure 7.1: Aspects Influencing the Generation of Grasp Hypotheses: embodiment, task, prior object knowledge (known, unknown or familiar), grasp synthesis method (analytical or empirical) and object representation (2D or 3D).

boxes, as in Miller et al. [153], Hübner and Kragic [101] and Przybylski and Asfour [179], or with superquadrics (SQ), as in Goldfeder et al. [90]. These shape primitives are then used to limit the amount of candidate grasps and thus prune the search tree for finding the most stable set of grasp hypotheses. Borst et al. [39] reduce the number of candidate grasps by randomly generating a number of them dependent on the object surface and filtering them with a simple heuristic. The authors show that this approach works well if the goal is not to find an optimal grasp but instead a fairly good grasp that works well for "everyday tasks". Diankov [65] proposes to sample grasp candidates dependent on the object's bounding box in conjunction with surface normals. The grasp parameters that are varied are the distance between the palm of the hand and the grasp point as well as the wrist orientation. This scheme is implemented in the OpenRave simulator [66]. The authors find that usually only a relatively small fraction, about 30%, of all grasp samples is in force closure.

Figure 7.2: Different Approaches towards Empirical Grasp Synthesis: heuristics, or learning from human demonstration, from labeled training data or through trial and error.

All these approaches are developed and evaluated in simulation. As also claimed in [65], the biggest criticism of ranking grasps based on force closure computed on point contacts is that relatively fragile grasps might be selected. A common approach to filter these is to add noise to the grasp parameters and keep only those grasps for which a certain percentage of the neighboring candidates also yields force closure. Balasubramanian et al. [21] question classical grasp metrics in principle. The authors systematically tested a number of grasps in the real world that were stable according to classical grasp metrics. Compared to grasps planned by humans through kinesthetic learning on the same objects, these grasps underperformed significantly. The authors found that humans optimise a skewness metric, i.e., the divergence of alignment between the hand and the principal object axes.

Learning from Humans

A different way to generate grasp candidates is to observe how humans grasp an object. Ciocarlie et al. [53] exploit results from neuroscience showing that human hand control takes place in a much lower dimension than the actual number of its degrees of freedom. This finding was applied to directly reduce the configuration space of a robotic hand to find pre-grasp postures. From these so-called eigengrasps, the system searches for stable grasps. Li and Pollard [135] treat the problem of finding a suitable grasp as a shape matching problem between the human hand and the object. The approach starts off with a database of human grasp examples. From this database, a suitable grasp is retrieved when queried with a new object. Shape features of this object are matched against the shape of the inside of the available hand postures. Kroemer et al. [124] learn to grasp specific objects through a combination of a high-level reinforcement learner and a low-level reactive grasp controller. The object is represented by a constellation of local multi-modal contour descriptors. Four elementary grasping actions are associated with specific constellations of these features, resulting in an abundance of grasp candidates. The learning process is bootstrapped through imitation learning, in which a demonstrated reaching trajectory recorded with a VICON system is converted into an initial policy. A similar initialization of an object-specific grasping policy is used in Pastor et al. [167] and Stulp et al. [214] through kinesthetic learning. Detry et al. [62, 64] use the same sparse object model as Kroemer et al. [124]. The set of associated grasp hypotheses is modelled as a non-parametric density function in the space of 6D gripper poses, referred to as a bootstrap density. In [62], human grasp examples are used to build an object-specific empirical grasp density from which grasp hypotheses can be sampled. Faria et al. [76] use a multi-modal data acquisition tool to record humans grasping specific objects. Objects are represented as probabilistic volumetric maps that integrate multi-modal information such as contact regions, fixation points and tactile forces. This information is then used to annotate object parts as graspable or not. However, no results are provided on how a robot can transfer this information to new objects and to its own embodiment. Romero et al. [186] present a system for visually observing humans while they interact with an object. A grasp type and pose are recognized and mapped to different robotic hands in a fixed scheme. For validation of the approach in the simulator, 3D object models are used. Examples of real-world executions are shown. The object is not explicitly modelled. Instead, it is assumed that human and robot act on the same object in the same pose.

Learning through Self-Experience Instead of adopting a fixed set of grasp candidates given for example by a person, the following approaches try to refine them by trial and error. In [124, 214] reinforcement learning is used to improve an initial human demonstration. Kroemer et al. [124] use a low-level reactive controller to perform the grasp, which informs the high-level controller with reward information. Stulp et al. [214] increase the robustness of their non-reactive grasping strategy by learning shape and goal parameters of the motion primitives that are used to model a full grasping action. Through this approach, the robot learns reaching trajectories and grasps that are robust against object pose uncertainties. Detry et al. [64] also adopt an approach in which the robot learns to grasp specific objects by experience. Instead of human examples as in [62], successful grasping trials are used to build the object-specific empirical grasp density. This non-parametric density can then be used to sample grasp hypotheses.

7.1.1.2 Online Object Pose Estimation In the previous section, we reviewed different approaches towards grasping known objects with respect to how they generate candidate grasps and rank them to form a set of grasp hypotheses. During online execution, an object first has to be recognized and its pose estimated before the offline trained grasps can be executed. Furthermore, not all grasps from the set of hypotheses might be applicable in the current scenario; an appropriate selection has to be made. Several of the aforementioned grasp generation methods [124, 62, 64] use the probabilistic approach towards object representation and pose estimation proposed by Detry et al. [63]. As described before, grasps are either selected by sampling from densities [62, 64] or a grasp policy refined from a human demonstration is applied [124]. The following approaches use experience databases generated offline through simulation of grasps on 3D models in either GraspIt! [152] or OpenRave [66]. Berenson et al. [27] then consider this experience database as one factor in selecting an appropriate grasp. The second factor in computing the final grasp score is the whole scene geometry. The approach was demonstrated on a real robot using a motion capture system to determine the positions of the robot and the object. In the online stage of the method presented by Ekvall and Kragic [70], a human demonstrator wearing a magnetic tracking device is observed while manipulating a specific object. The grasp type is recognized and mapped through a fixed schema to a set of robotic hands. Given the grasp type and the hand, the best approach vector is selected from an offline trained experience database. Unlike Detry et al. [62] and Romero et al. [186], the approach vector used by the demonstrator is not adopted. Ekvall and Kragic [70] assume that the object pose is known. However, experiments are conducted in which pose error is simulated. No demonstration on a real robot is shown. Morales et al. [156] use the method proposed by Azad et al. [19] to recognize an object and estimate its pose from a monocular image. Given that information, an appropriate grasp configuration can be selected from a grasp experience database that has been acquired offline. The whole system is demonstrated on the robotic platform described in Asfour et al. [16]. Huebner et al. [102] demonstrate grasping of known objects on the same humanoid platform using the same method for object recognition and pose estimation. The offline selection of grasp hypotheses is based on a decomposition of the object into boxes. Ciocarlie et al. [52] propose a robust grasping pipeline in which known object models are fitted to point cloud clusters using standard ICP [30]. Knowledge about the table plane and the assumption that objects are rotationally symmetric and always standing upright helps in reducing the search space of potential object poses. All of the aforementioned methods assume a rigid 3D object model to be known a priori. Glover et al. [107] consider known deformable objects. For representing them, probabilistic models of their 2D shape are learnt. The objects can then be detected in monocular images of cluttered scenes even when they are partially occluded. The visible object parts serve as a basis for planning a stable grasp under consideration of the global object shape. Collet Romea et al. [55] use a combination of 2D and 3D features as an object model. The authors estimate the object’s pose in a scene from a single image.
The accuracy of their approach is demonstrated through a number of successful grasps on 4 different objects in somewhat cluttered scenes. Other robust methods for estimating the pose of single or multiple objects simultaneously are presented in [95, 68, 165, 166]. Although these approaches are motivated by a grasping scenario, no actual grasping has been shown.

7.1.1.3 Learning Object Models

All of the above-mentioned approaches depend on an a priori known, dense or detailed object model in either 2D or 3D that may be difficult to obtain. Different methods have been proposed to build a model of a new object by combining a sequence of observations. Foix et al. [81] present a method for merging partial object point clouds when facing strong noise in the data and uncertainty in the estimated camera trajectory. The approach is based on standard ICP for registering the point clouds of two consecutive views. The uncertainty in the resulting camera pose estimates is obtained by propagating the sensor covariance through the ICP minimization. Collet Romea et al. [55] propose to acquire object models offline consisting of an image, a sparse 3D model and a simplified mesh model. A method towards autonomous learning of object surface models is demonstrated by Krainin et al. [123]. The robot holds an object in its hand and plans the next best views to acquire a surface mesh. An alternative way to learn object models is taken by Tenorth et al. [218]. Here, web-enabled robots are proposed that can retrieve appearance and 3D models from online databases.

7.1.2 Grasping Unknown Objects

If the currently considered object hypothesis can neither be recognized as a known object nor classified as similar to one, it is considered unknown. In this case, the previously listed approaches are not applicable since in practice it is very difficult to infer the full and accurate object geometry from measurements taken with sensor devices such as cameras and laser range finders. There are various ways to deal with this sparse, incomplete and noisy data. We divide the approaches roughly into methods that approximate the shape of an object, methods that generate grasps based on low-level features and a set of heuristics, and methods that rely mostly on the global shape of the object hypothesis.

7.1.2.1 Approximating Unknown Object Shape

One approach towards generating grasp hypotheses for unknown objects is to approximate them with shape primitives that provide cues for potential grasps. Dunes et al. [67] approximate the rough object shape with a quadric whose minor axis is used to infer the wrist orientation. The object centroid serves as the approach target and the rough object size helps to determine the hand pre-shape. The quadric is estimated from multi-view measurements of the object shape in monocular images. As opposed to the above-mentioned techniques, Bone et al. [37] make no prior assumption about the shape of the object. They apply shape carving for the purpose of grasping with a parallel-jaw gripper. After obtaining a model of the object, they search for a pair of reasonably flat and parallel surfaces that are best suited for this kind of manipulator.

7.1.2.2 From Low-Level Features to Grasp Hypotheses

A common approach is to map low-level features to an abundance of grasp hypotheses and then rank them according to a set of criteria. Kraft et al. [120] use a stereo camera to extract a representation of the scene. Instead of a raw point cloud, they process it further to obtain a sparser model consisting of the local multi-modal contour descriptors already mentioned above in conjunction with the work by Detry et al. [62, 64] and Kroemer et al. [124]. Four elementary grasping actions are associated with specific constellations of these features. With the help of heuristics, the huge number of resulting grasp hypotheses is reduced. Popović et al. [172] present an extension of this system that also uses local surfaces and their interrelations to propose and filter two- and three-fingered grasp hypotheses. The feasibility of the approach is evaluated in a mixed real-world and simulated environment. Hsiao et al. [99] employ several heuristics for generating grasp hypotheses dependent on the shape of the segmented point cloud. These can be grasps from the top, from the side, or grasps applied to high points of the object. The generated hypotheses are then ranked using a weighted list of features such as the number of points within the gripper, the distance between the fingertips and the center of the segment, or whether the object fits into the whole hand. This method is integrated into the grasping pipeline proposed in [52] for those segmented point cloud clusters that did not get recognized as a specific object. The main idea presented by Klingbeil et al. [118] is to search for a pattern in the scene that is similar to the 2D cross section of the interior of the robotic gripper. A depth image serves as input to the method and is sampled to find a set of grasp hypotheses. These are ranked according to an objective function that takes pairs of these grasp hypotheses and their local structure into account.
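The heuristic ranking strategies above boil down to scoring each hypothesis by a weighted sum of simple features. The following Python sketch is a generic illustration of this pattern and not the implementation of [99]; the two features and their weights are hypothetical placeholders.

import numpy as np

def score_grasp(center, cloud, gripper_width=0.08, w_points=1.0, w_dist=0.5):
    """Score a grasp hypothesis by simple heuristic features.

    center : 3-vector, position of the grasp between the fingertips
    cloud  : (N, 3) array with the segmented object point cloud
    """
    # Feature 1: number of points within a sphere of the gripper opening around the center.
    dists = np.linalg.norm(cloud - center, axis=1)
    n_enclosed = np.sum(dists < gripper_width / 2.0)
    # Feature 2: distance between the grasp center and the centroid of the segment
    # (smaller is better, hence it enters with a negative sign).
    d_center = np.linalg.norm(cloud.mean(axis=0) - center)
    return w_points * n_enclosed - w_dist * d_center

def rank_grasps(centers, cloud):
    """Return grasp hypotheses sorted from best to worst score."""
    return sorted(centers, key=lambda c: -score_grasp(np.asarray(c), cloud))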

7.1.2.3 From Global Shape to Grasp Hypothesis

Other approaches use the global shape of an object to infer one good grasp hypothesis. Richtsfeld and Vincze [184] use a segmented point cloud from a stereo camera. They search for a suitable grasp with a simple gripper based on shifting the top plane of an object towards its centre of mass. A set of heuristics is used for selecting promising fingertip positions. Maldonado et al. [142] model the object as a 3D Gaussian. For choosing a grasp configuration, they optimise a criterion in which the distance between palm and object is minimised while the distance between the fingertips and the object is maximised. Stückler et al. [213] generate grasp hypotheses based on the eigenvectors of the object's footprint on the table. By footprint, we refer to the 3D object point cloud projected onto the supporting surface. A similar approach has been taken in our previous work [10]. This approach is related to [21], in which it is shown that people optimise the aforementioned skewness measure.
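The footprint-based strategy can be sketched in a few lines. This is only a minimal illustration of the idea, assuming the table plane is the x-y plane; it is not the exact implementation of [213] or of our previous work [10].

import numpy as np

def footprint_grasp(cloud):
    """Compute a top-grasp pose from the eigenvectors of the object footprint.

    cloud : (N, 3) array of object points, with z pointing up from the table.
    Returns the grasp centre and the horizontal closing direction of the gripper.
    """
    footprint = cloud[:, :2]                    # project the point cloud onto the table
    centre_2d = footprint.mean(axis=0)
    cov = np.cov((footprint - centre_2d).T)     # 2x2 covariance of the footprint
    eigval, eigvec = np.linalg.eigh(cov)        # eigenvalues in ascending order
    closing_dir = eigvec[:, 0]                  # close the gripper along the minor axis
    grasp_centre = np.array([centre_2d[0], centre_2d[1], cloud[:, 2].max()])
    return grasp_centre, closing_dir

The gripper approaches from above at the centre of the footprint, at the height of the object's top, and closes along the minor eigenvector of the footprint.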

7.1.2.4 Assuming Symmetry Many of the man-made objects surrounding us possess one or more symmetries. Exploiting this allows us to predict unobserved parts of an object hypothesis. Marton et al. [147] show how grasp selection can be performed based on such an approximation of the unobserved object shape. Using footprint analysis, objects with a box or a cylinder shape are reconstructed. The remaining object hypotheses are analysed in terms of more complex rotational symmetry and reconstructed by fitting a curve to a cross section of the point cloud. For grasp planning, the reconstructed object is imported into a simulator. Grasp candidates are generated through randomisation of grasp parameters, on which then the force-closure criterion is evaluated. Rao et al. [180] sample grasp points from the surface of a segmented object. The normal of the local surface at such a point serves as a search direction for a second contact point. This is chosen to be at the intersection between the extended normal and the opposite side of the object. By assuming symmetry, this second contact point is assumed to have a contact normal in the direction opposite to the normal of the first contact point. In Chapter 6, we presented a related approach that does not rely on selecting the kind of symmetry prior to reconstruction. Furthermore, it takes the complete point cloud into account for reconstruction and not only a local patch. In Section 7.3, different methods to generate grasp candidates on the resulting completed object models are presented.
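The core operation behind such symmetry-based completion is the reflection of the observed points across an estimated symmetry plane. The following Python fragment is a minimal sketch of this reflection step only; estimating the plane itself (as done in Chapter 6) is assumed to be given.

import numpy as np

def mirror_point_cloud(cloud, plane_point, plane_normal):
    """Reflect a point cloud across a symmetry plane.

    cloud        : (N, 3) array of observed object points
    plane_point  : a point on the estimated symmetry plane
    plane_normal : normal of the symmetry plane
    Returns the observed points together with their mirrored counterparts.
    """
    n = plane_normal / np.linalg.norm(plane_normal)
    signed_dist = (cloud - plane_point).dot(n)           # signed distance to the plane
    mirrored = cloud - 2.0 * signed_dist[:, None] * n    # reflect each point
    return np.vstack([cloud, mirrored])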

7.1.3 Grasping Familiar Objects A promising direction in the area of grasp planning is to re-use experience to grasp familiar objects. Many of the objects surrounding us can be grouped into categories with common characteristics. There are different possibilities for what these commonalities can be. In the computer vision community, objects within one category usually share characteristic visual properties. These can be, e.g., a common texture [205] or shape [79, 26], the occurrence of specific local features [60, 127] or their specific spatial constellation [133, 131]. These categories are usually referred to as basic level categories and emerged from the area of cognitive psychology [187]. In robotics however, and specifically in the area of manipulation, the goal is to enable an embodied, cognitive agent to interact with these objects. In this case, objects in one category should share common affordances [211]. More specifically, this means that they should also be graspable in a similar way. The difficulty then is to find a representation that can encode this common affordance and is grounded in the embodiment and cognitive capabilities of the agent. Given this representation, a similarity metric has to be found under which different objects can be compared to each other. The following approaches try to learn from experience how objects can be grasped. This is different from the aforementioned systems in which objects are assumed to be unknown, i.e., dissimilar to any object encountered previously. There, the difficulty lies in finding appropriate rules and heuristics. In the following, we present related work that tackles the grasping of familiar objects, with a specific focus on the applied representations.

7.1.3.1 Based on 3D Data

First of all, there are approaches that rely on 3D data only. For example, El-Khoury and Sahbani [72] segment a given point cloud into parts and approximate each part by a superquadric (SQ). An artificial neural network (ANN) is used to classify whether or not the part is prehensile. The ANN has been trained beforehand on labeled SQs. If one of the object parts is classified as prehensile, an n-fingered force-closure grasp is computed on this object part. Pelossof et al. [168] use a single SQ to find a suitable grasp configuration for a Barrett hand consisting of the approach vector, wrist orientation and finger spread. An SVM is trained on data consisting of feature vectors containing the parameters of the SQ and of the grasp configuration. These are labelled with a scalar estimating the grasp quality. When feeding the SVM only with the shape parameters of the SQ, their algorithm searches efficiently through the grasp configuration space for parameters that maximise the grasp quality. Curtis and Xiao [59] build upon a database of 3D objects annotated with the best grasps that can be applied to them. To infer a good grasp for a new object, very basic shape features, e.g., the aspect ratio of the object's bounding box, are extracted to classify it as similar to an object in the database. The assumption made in this approach is that similarly shaped objects can be grasped in a similar way. Song et al. [209] learn a Bayesian network from labeled training data to model the joint distribution of different variables describing a grasp. These variables can be object- or action-related (object size and class as well as grasp position and final hand configuration); they can describe constraints or be the task label. Given the learnt joint distribution, different information can be inferred, for example the task after having observed the object and the action features. All these approaches were performed and evaluated in simulation where the central assumption is that accurate and detailed 3D models are available in a known pose. There are only very few approaches that use 3D data from real-world sensors to infer suitable grasps. One of them has been presented by Hübner and Kragic [101]. A point cloud from a stereo camera is decomposed into a constellation of boxes. The simple geometry of a box reduces the number of potential grasps significantly. To decide which of the sides of the boxes provides a good grasp, an ANN is trained offline. The projection of the point cloud inside a box onto its sides provides the input to the ANN. The training data consists of a set of these projections from different objects labeled with grasp quality metrics. These are computed by the GraspIt! simulator while performing the corresponding grasps. Although grasp selection is demonstrated on one object, it remains unclear how well the ANN can generalize from the synthetic training set to real-world data.

3D Descriptors of Objects for Categorization Recently, we have seen an increasing number of new approaches towards pure 3D descriptors of objects for categorization. Although the following methods look promising, it has not been shown yet that they provide a suitable basis for generalizing grasps over an object category. Rusu et al. [197, 195] provide extensions of [194] for either recognizing or categorizing objects and estimating their pose relative to the viewpoint. While in [195] quantitative results on real data are presented, [197] uses simulated object point clouds only. Wohlkinger and Vincze [231] use a combination of different shape descriptors of partial object views to retrieve matching objects of a specific category from a large database. The retrieval problem is formulated as a nearest-neighbor approach. Robustness is achieved by exploiting intra-class view similarities. Although 3D descriptors of object hypotheses are used in Marton et al. [148] and Marton et al. [146], they are combined with 2D features to increase the robustness of detection. These approaches will therefore be discussed in more detail in Section 7.1.3.3 together with other methods integrating several sensor modalities.

7.1.3.2 Based on 2D Data As mentioned previously, the assumption of having a detailed and accurate 3D object model available for learning to generalize grasps may not always be valid. This is particularly the case with real-world data gathered from sensors like laser range finders, the Kinect, stereo or time-of-flight cameras. However, there are experience-based approaches that avoid this difficulty by relying mainly on 2D data. Kamon et al. [108] propose one of the first approaches towards generalizing grasp experience to novel objects. The aim is to learn a function f : Q → G that maps object- and grasp-candidate-dependent quality parameters Q to a grade G of the grasp. An object is represented by its 2D silhouette, its center of mass and its main axis. The grasp is represented by two parameters f1 and f2 from which, in combination with the object features, the fingertip positions can be computed. The authors found that the normals of the object surface at the contact points and the distance of the line connecting the contact points to the object's center of mass are highly reliable quality features for predicting the grade of a grasp. Learning is bootstrapped by the offline generation of a knowledge database containing grasp parameters along with their grade. This knowledge database is then updated while the robot collects experience by grasping new objects. The system is restricted to planar grasps and to visual processing of top-down views on objects. It is therefore questionable how robust this approach is to more cluttered environments and strong pose variations of the object. Saxena et al. [200] propose a system that infers a point at which to grasp an object directly as a function of its image. The authors apply machine learning to train a grasping point model on labelled synthetic images of a number of different objects. The classification is based on a feature vector containing local appearance cues regarding color, texture and edges of an image patch in several scales and of its 24 neighboring patches in the lowest scale. The system was used successfully to pick up objects from a dishwasher after it had been specifically trained for this scenario. However, if more complex goals are considered that require subsequent actions, e.g., pouring something from one container into another, semantic knowledge about the object and about suitable grasps regarding their functionality becomes necessary [38, 57, 94]. Then, representing only graspable points without the conception of objectness [120][9] is not sufficient. Another example of a system involving 2D data and grasp experience is presented by Stark et al. [211]. Here, an object is represented by a composition of prehensile parts. These so-called affordance cues are obtained by observing the interaction of a person with a specific object. Grasp hypotheses for new stimuli are inferred by matching features of that object against a codebook of learnt affordance cues that are stored along with relative object position and scale. However, how exactly to grasp these detected prehensile parts is not yet solved since hand orientation and finger configuration are not inferred from the affordance cues. More successful in terms of the inference of full grasp configurations are Morales et al. [155] who use visual feedback to even predict fingertip positions. The authors also take the hand kinematics into consideration when selecting a number of planar grasp hypotheses directly from 2D object contours.
To predict which of these grasps is the most stable one, a KNN approach is applied in connection with a grasp experience database. The experience database is built through a trial-and-error approach applied in the real world. Grasp hypotheses are ranked dependent on their outcome. However, the approach is restricted to planar objects.

7.1.3.3 Integrating 2D and 3D Data In Figure 7.1, one aspect that influences the selection of grasp hypotheses is the employed modality of the object representation. We believe that systems in which both 2D and 3D data are integrated are most promising in terms of dealing with sensor noise and removing assumptions about object shape or applicable grasps. In Figure 7.1, this case is represented by the edge connecting 2D and 3D object representations. In the following, we review two different kinds of these approaches. The first kind employs relatively low-level features for grasp knowledge transfer. The second kind tries to generalize over an object category.

Grasping using Low-Level Representations Saxena et al. [201] apply two depth sensors to obtain a point cloud of a tabletop scene with several objects. The authors extend their previous work on inferring 2D grasping point hypotheses. Here, the shape of the point cloud within a sphere centered around a hypothesis is analysed with respect to the hand kinematics. This enhances the prediction of a stable grasp and also allows for the inference of grasp parameters like approach vector and finger spread. In the earlier work by Saxena et al. [200], only downward or outward grasps were possible with the manipulators in a fixed pinch grasp configuration. Speth et al. [210] showed that their earlier 2D based approach [155] is also applicable when considering 3D objects. The camera is used to explore the object to retrieve crucial information like height, 3D position and pose. However, all this additional information is not applied in the inference and final selection of a suitable grasp configuration. We introduced the work by Rao et al. [180] already in the previous section on grasping unknown objects. We referred to their strategy of choosing fingertip positions given a segmented object hypothesis. Here, we want to emphasize how the authors distinguish between graspable and non-graspable object hypotheses in a machine learning framework. Using a combination of 2D and 3D features, an SVM is trained on labeled data of segmented objects. Interestingly, among those features are, for example, the variance in depth and height as well as the variance of the three channels in the Lab color space. This is similar to the entropy features used in Chapter 5 for classifying object hypotheses as being correct or not. Rao et al. [180] achieve good classification rates on object hypotheses formed by segmentation on color and depth cues. Similar to the work in [180], Le et al. [132] consider grasp hypotheses consisting of two contact points. They, however, apply a learning approach to rank a sampled set of fingertip positions according to graspability. The feature vector consists of a combination of 2D and 3D cues such as the gradient angle or the depth variation along the line connecting the two grasping points. In Section 7.4, we also propose an approach that integrates 2D and 3D information. Similar to [201], we see the result of the 2D based grasp selection as a way to search a 3D object representation for a full grasp configuration. Here, we focus on the development of the 2D method and demonstrate its applicability for searching in a minimal 3D object representation. The proposed approach from Section 7.4 was extended to work on a sparse edge-based representation of objects [1]. We showed that integrating 3D and 2D based methods for grasp hypothesis generation leads to a sparser set of grasps of good quality.
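The graspability classification described for [180] follows the standard pattern of training a classifier on a combined 2D/3D feature vector per segment. The following scikit-learn sketch only illustrates this pattern; the two feature groups mirror the ones mentioned above, but the exact features, kernel and parameters of the cited work are not reproduced here.

import numpy as np
from sklearn.svm import SVC

def segment_features(points, colors):
    """Combined 2D/3D feature vector of a segmented object hypothesis.

    points : (N, 3) array with x, y, z coordinates of the segment
    colors : (N, 3) array with the Lab color channels of the same points
    """
    return np.hstack([
        points.var(axis=0),   # variance in x, y and depth/height
        colors.var(axis=0),   # variance of the three Lab channels
    ])

def train_graspability_classifier(segments, labels):
    """segments: list of (points, colors) tuples; labels: 1 = graspable, 0 = not."""
    X = np.array([segment_features(p, c) for p, c in segments])
    clf = SVC(kernel='rbf', C=1.0)
    clf.fit(X, np.asarray(labels))
    return clf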

Category-based Grasping The previous approaches do not explicitly consider any high-level information such as, for example, the object category. In the following, we review methods that try to categorize objects and are motivated by a grasping scenario. Marton et al. [146] use different 3D sensors and a thermal camera for object categorization. Features of the segmented point cloud and the segmented image region are extracted to train a Bayesian Logic Network for classifying object hypotheses as either boxes, plates, bottles, glasses, mugs or silverware. In [146] no grasping results are shown. A modified approach is presented in [149]. A layered 3D object descriptor is used for categorization and a SIFT-based approach is applied for view-based object recognition. To increase the robustness of the categorization, the examination methods are run iteratively on the object hypotheses. A list of potentially matching objects is kept and re-used for verification in the next iteration. Objects for which no matching model can be found in the database are labeled as novel. Given that an object has been recognized, associated grasp hypotheses can be re-used. These have been generated using the technique presented in [147].

Lai et al. [128] perform object category and instance recognition. The authors learn an instance distance using the database presented in [129]. A combination of 3D and 2D features is used. However, the focus of this work is not grasping. There is one interesting concluding remark that we can make. Recently, an increasing number of approaches has been developed that try to recognize or categorize objects in point clouds. However, none of these approaches has yet been shown to provide a suitable representation for transferring grasping experience between objects of the same class. Instead, this is solved by finding the (best) matching object instance on which grasp hypotheses have been generated offline. Generalizing from already seen objects to familiar objects in the real world has until now only been shown to be possible for low-level object representations.

7.1.4 Hybrid Systems There are a few empirical grasp synthesis systems that cannot clearly be classified as using only one kind of prior knowledge. One of these approaches has been proposed by Brook et al. [45]. Different grasp planners are integrated to reach a consensus on how to grasp a segmented point cloud. The authors show results using the planner presented in [99] for unknown objects in combination with grasp hypotheses generated through fitting known objects to point cloud clusters as described in [52]. Another example of a hybrid approach is the work by Marton et al. [149]. A set of very simple shape primitives like boxes, cylinders and more general rotational objects is considered. They are reconstructed from segmented point clouds by analysis of their footprints. Parameters such as the circle radius and the side lengths of rectangles are varied; curve parameters are estimated to reconstruct more complex rotationally symmetric objects. Given these reconstructions, a look-up is made in a database of already encountered objects for re-using successful grasp hypotheses. In case no similar object is found, new grasp hypotheses are generated using the technique presented in [147]. For object hypotheses that cannot be represented by the simple shape primitives mentioned above, a surface is reconstructed through triangulation. Grasp hypotheses are then generated using the planner presented in [99].

7.1.5 Summary In this section, we reviewed empirical grasp synthesis methods. We subdivided them according to their assumed prior knowledge about an object hypothesis. In the following, we propose grasp synthesis methods that go beyond the current state of the art. For grasping known objects, we demonstrate object recognition and pose estimation methods as already partially referred to in this section. This is followed by a study on whether completed point clouds can improve the accuracy and robustness of these approaches. For grasping unknown objects, we present an approach based on point cloud completion through a symmetry assumption as presented in Chapter 6. We show different methods for generating grasp hypotheses on these completed models.

And finally, for grasping familiar objects, we present a method in which the global object shape in a 2D image is used as a feature vector in a learning scheme. Detected 2D grasping points in conjunction with 3D shape cues are used for inferring a full grasp pose.

7.2 Grasping Known Objects

In this section, we will show four example systems that are based on the active vision system presented in Section 2.2.1. The objects in front of it are considered to be known. The goal of these systems is to determine where the objects are in the scene and in what pose. The first two use a scene model as input in which a 3D point cloud is segmented into a number of hypotheses. The segments are then further analysed to obtain the object's identity and pose. In the third example system, no prior segmentation of the point cloud into object hypotheses is necessary. Object recognition and pose estimation are performed simultaneously. We extend these approaches in the fourth method by considering not only the sensory data from the scene. Instead, we also study how a completed point cloud can help to increase the robustness of object recognition and pose estimation. Given this information, the scene model containing known models can be fed into a simulator where grasps from an experience database are tested for reachability. This will be detailed in Chapter 8.

7.2.1 Recognition The identities of the object hypotheses detected in a scene are sought in a database of 25 known objects. Two complementary cues are used for recognition: SIFT and CCH. The cues are complementary in the sense that SIFT features rely on objects being textured, while CCH works best for objects of a few, but distinct, colors. An earlier version of this system, with experiments on a significantly larger database and including a study on the benefits of segmentation for recognition, can be found in [33]. Here, the total recognition score is a weighted sum of the single scores. An example of the recognition results is shown in Figure 7.3a.
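The fusion of the two cues can be illustrated as a weighted sum over per-object matching scores. This sketch is only schematic: the weights below are placeholders and the score normalisation used in the actual system is not specified here.

def recognize(sift_scores, cch_scores, w_sift=0.6, w_cch=0.4):
    """Combine per-object SIFT and CCH matching scores into a total score.

    sift_scores, cch_scores : dicts mapping object identity -> normalised score in [0, 1]
    Returns the identity with the highest combined score and the full ranking.
    """
    total = {obj: w_sift * sift_scores.get(obj, 0.0) + w_cch * cch_scores.get(obj, 0.0)
             for obj in set(sift_scores) | set(cch_scores)}
    ranked = sorted(total.items(), key=lambda item: -item[1])
    return ranked[0][0], ranked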

7.2.2 Pose Estimation Once we know the identity of an object hypothesis, we can use the associated model to estimate its pose. In the following, we demonstrate a system that uses a simplified object model and a system that retrieves the full 6D pose of an object using a 3D mesh model.

7.2.2.1 Simplified Pose Estimation If an object is identified and it is known to be either rectangular or cylindrical, the pose is estimated using the cloud of extracted 3D points projected onto the 2D table plane.

Figure 7.3: Example Output for Recognition and Pose Estimation. (a) Segmented image and recognition result; the five best matching objects are shown in the table. (b) Estimated pose of the mango can in Figure 2.4b. (c) Estimated pose of the toy tiger in Figure 2.4a.

Object dimensions are looked up in the database. This means that instead of searching for the full 6 DoF pose, we reduce the search space to the position (x, y) of either a rectangle or a circle on the table. In the case of the rectangle, the additional orientation parameter θ is sought. This is done using the method already described in Section 4.2.2.2. Essentially, it is based on RANSAC in combination with least-median optimisation for either a rectangular or a circular shape. Figures 7.3b and 7.3c show an example for each shape. There are several drawbacks of this simple method. First of all, it will not be able to correctly estimate the pose of an object that does not rest on one of its sides. These kinds of situations occur when objects are lying in a pile or in a grocery bag. Secondly, this approach can probably not deal very well with the case of strong occlusions. And lastly, because the object models are very simple, motion planning for collision avoidance will also not be very accurate. However, as will be shown in Section 8.2.2.2, this simple approach works well in combination with position-based visual servoing.
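As an illustration of the footprint-based search, the following Python sketch estimates the centre of a circle of known radius from the projected 2D points with a simple RANSAC loop. It is a minimal example of the general idea and not the implementation from Section 4.2.2.2; it scores hypotheses by an inlier count rather than a least-median criterion, and the thresholds are arbitrary.

import numpy as np

def ransac_circle_position(points, radius, n_iter=200, inlier_thresh=0.005):
    """Estimate the (x, y) centre of a circle of known radius on the table plane.

    points : (N, 2) array, the 3D object points projected onto the table
    radius : known cylinder radius looked up in the object database
    """
    best_centre, best_inliers = None, 0
    for _ in range(n_iter):
        i, j = np.random.choice(len(points), 2, replace=False)
        p, q = points[i], points[j]
        dq = q - p
        d = np.linalg.norm(dq)
        if d < 1e-6 or d > 2.0 * radius:
            continue                                   # the two points cannot lie on the circle
        mid = 0.5 * (p + q)
        perp = np.array([-dq[1], dq[0]]) / d           # perpendicular to the chord p-q
        h = np.sqrt(radius ** 2 - (d / 2.0) ** 2)      # distance from chord midpoint to centre
        for centre in (mid + h * perp, mid - h * perp):
            residuals = np.abs(np.linalg.norm(points - centre, axis=1) - radius)
            inliers = np.sum(residuals < inlier_thresh)
            if inliers > best_inliers:
                best_centre, best_inliers = centre, inliers
    return best_centre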

7.2.2.2 Full Pose Estimation For estimating the full 6 DoF pose of an object given a segmented point cloud and the object's identity, we applied the approach presented in Papazov and Burschka [165]. A 3D mesh model is assumed to be available for each object in the database². The problem of pairwise rigid registration is approached with a stochastic method for global minimisation of a proposed cost function. The method has been shown to be robust against noise and outliers. Although it does not rely on a good initial alignment of the source to the target, prior segmentation of object hypotheses improves its performance. Figure 7.4 shows an example result of this method given a segmented point cloud from the real-time vision system described in Section 2.2.1.

² The models were obtained from http://i61p109.ira.uka.de/ObjectModelsWebUI/

Figure 7.4: Example Output for 6 DoF Pose Estimation. (a) Peripheral image of the scene. (b) Segmented object in the foveal image. (c) Model point cloud (green) fit to target point cloud (red).

Figure 7.5: Example Output for Simultaneous Object Recognition and Pose Estimation in a Point Cloud. (a) Peripheral image of the scene. (b) Point cloud of the scene with fitted models (red). (c) Representation of the scene in OpenRave.

7.2.3 Simultaneous Recognition and Pose Estimation of Several Targets

We also demonstrated the approach by Papazov and Burschka [166] for simultaneous recognition and pose estimation of rigid objects in a 3D point cloud of a scene. No prior segmentation is necessary. It relies on the combination of a robust geometric descriptor of oriented 3D points, a hashing technique and a RANSAC-based sampling technique. It is robust to noise, outliers and sparsity. Example results of the approach applied to merged point clouds from the real-time vision system are shown in Figure 7.5.

7.2.4 Using Completed Point Clouds for Object Recognition and Pose Estimation The previous methods for object recognition and pose estimation use as input the raw stereo-reconstructed point cloud that is potentially segmented into object hypotheses. In this section, we analyse whether a completed point cloud helps to improve the robustness of registering known object models to it. This completion is performed with the method proposed in Chapter 6. The hidden or occluded part of an object is estimated based on a symmetry assumption. Visibility constraints are employed to assign a plausibility measure to each point in the completed point cloud. In the following, we will pose the registration problem as the classic Orthogonal Procrustes problem introduced by Hurley and Cattell [103]. We will briefly introduce the iterative closest point (ICP) algorithm and its weighted formulation following Besl and McKay [30] and Maurer et al. [150]. The latter will be useful to explicitly consider the plausibility information of each point.

7.2.4.1 Orthogonal Procrustes Problem The registration of an object model to a point cloud can usually be described as a rigid body transformation

T(a) = R \cdot a + t \qquad (7.1)

where R is a 3 × 3 rotation matrix, t a three-dimensional translation vector and a a three-dimensional point of a point cloud. Let A be a set containing points a_j with j = 1 ... N_a to be registered to a different set B with points b_j with j = 1 ... N_b, where N = N_a = N_b. Each point a_j corresponds to the point b_j with the same index. The Orthogonal Procrustes problem is described as finding the rigid-body transformation T that minimises the following cost function:

d(T) = \sqrt{\frac{1}{N} \sum_{j=1}^{N} \| b_j - T(a_j) \|^2} \qquad (7.2)

Arun et al. [15] have shown that this problem can be solved in closed form using, for example, the Singular Value Decomposition (SVD).
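The closed-form solution of Arun et al. [15] can be written down compactly in numpy. The following sketch assumes two equally sized point sets with known correspondences and is meant only as an illustration of Equations 7.1 and 7.2.

import numpy as np

def procrustes_align(A, B):
    """Closed-form rigid alignment of corresponding point sets A and B (both (N, 3)).

    Returns R and t such that R @ A[j] + t is as close as possible to B[j]
    in the least-squares sense of Equation 7.2.
    """
    a_mean, b_mean = A.mean(axis=0), B.mean(axis=0)
    H = (A - a_mean).T @ (B - b_mean)      # 3x3 cross-covariance matrix
    U, S, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:               # correct an improper rotation (reflection)
        Vt[-1, :] *= -1.0
        R = Vt.T @ U.T
    t = b_mean - R @ a_mean
    return R, t

def residual(A, B, R, t):
    """Root-mean-square error of Equation 7.2 for the estimated transformation."""
    return np.sqrt(np.mean(np.sum((B - (A @ R.T + t)) ** 2, axis=1)))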

7.2.4.2 ICP and Weighted ICP The problem of registering an object model to a point cloud usually suffers from the correspondences between the two point sets being unknown. In that case, no closed-form solution is available. The general approach is then to search iteratively for the rigid-body transformation that minimises

d(T) = \sqrt{\frac{1}{N_a} \sum_{j=1}^{N_a} \| y_j - T(a_j) \|^2} \qquad (7.3)

where

y_j = c(T(a_j), B) \qquad (7.4)

is the point in the set B that corresponds to point a_j. This relation is determined through the correspondence function c(·). Besl and McKay [30] proposed ICP, in which c(·) is the closest-point operator. They showed that their method converges to a local minimum of the cost function in Equation 7.3. Maurer et al. [150] proposed a weighted extension of the original ICP method. Let W be a set of weights w_j with j = 1 ... N_a associated with the points in A. The authors show that their weighted ICP formulation minimises the following cost function:

d(T) = \sqrt{\frac{1}{N_a} \sum_{j=1}^{N_a} w_j \| y_j - T(a_j) \|^2} \qquad (7.5)

where the function c(·) is defined as in Equation 7.4.

ICP for Completed Point Clouds In this section, we propose to register the true object model to the completed point cloud instead of to the partial one. Our hypothesis is that this increases the robustness of fitting because more points that constrain the pose of the true object are available. We use ICP as implemented in [193], which iteratively minimises Equation 7.3. The set B consists of the completed point cloud. A contains point samples from the surface of the ground truth object. The mesh is sampled using the method described in Roy et al. [189].

Using Plausibility Information for Registration The quality of the ICP-based registration of the object model to the completed point cloud depends on the quality of the symmetry plane estimation. The resulting point cloud might diverge strongly from the true shape. Therefore, we want to ensure that the registered model complies more accurately with the originally reconstructed points than with the hypothesized ones. We approach this by using the weighted ICP scheme proposed in Maurer et al. [150] that minimises the cost function in Equation 7.5. B consists of the completed point cloud and A of the sampled ground truth mesh. As weights w_j, we choose the plausibility plaus(b_j) of each point b_j in the set B, as defined in Equations 6.3 and 6.4. We implemented this approach by adapting the ICP from [193] as shown in Algorithm 3.

7.2.4.3 Experiments In this section, we want to evaluate the proposed approach of using the completed point clouds as targets for the registration of ground truth object models. Our hypothesis is that this increases the robustness of pose estimation.

Algorithm 3: One Loop of the Weighted ICP Algorithm
Data: Initial transformation T
      Subset P = {p_i} of A = {a_j} (target point cloud)
      Subset X = {x_i} of B = {b_j} (model point cloud) where x_i = c(T(p_i), B)
      Subset V = {v_i} of W with v_i = plaus(p_i), i ∈ {1 ... N}
Result: New T minimising d(T) in Equation 7.5
begin
    // Compute weighted centroids
    p̄ = ( Σ_{i=1}^{N} v_i p_i ) / ( Σ_{i=1}^{N} v_i )
    x̄ = ( Σ_{i=1}^{N} v_i x_i ) / ( Σ_{i=1}^{N} v_i )
    // Translate each point
    foreach i ∈ {1 ... N} do
        p'_i = p_i − p̄
        x'_i = x_i − x̄
    end
    // 3 × 3 weighted covariance matrix
    H = Σ_{i=1}^{N} v_i p'_i x'_i^T
    // Factorisation of H through Singular Value Decomposition
    U Λ V^T = SVD(H)
    // Computation of rotation matrix R
    R = V U^T
    if det R == −1 then
        R = V' U^T
    end
    // Computation of translation vector
    t = x̄ − R p̄
    // Whole transformation
    T = [R | t]
end
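A compact numpy version of this loop could look as follows. It is a sketch only: correspondences are found by a brute-force nearest-neighbour search over the completed cloud, each correspondence is weighted by the plausibility of the matched completed-cloud point, and it is not the adapted implementation of [193].

import numpy as np

def weighted_icp_step(T, P, B, plaus):
    """One iteration of weighted ICP in the spirit of Algorithm 3.

    T     : (R, t) tuple with a 3x3 rotation and a 3-vector translation
    P     : (N, 3) samples from the model surface (the set A)
    B     : (M, 3) completed target point cloud
    plaus : (M,) plausibility of each point in B
    """
    R, t = T
    P_trans = P @ R.T + t
    # Closest-point correspondences c(T(p_i), B), brute force (O(N*M), fine for a sketch).
    idx = np.argmin(np.linalg.norm(P_trans[:, None, :] - B[None, :, :], axis=2), axis=1)
    X, v = B[idx], plaus[idx]
    # Weighted centroids and centred points.
    p_bar = (v[:, None] * P).sum(axis=0) / v.sum()
    x_bar = (v[:, None] * X).sum(axis=0) / v.sum()
    Pc, Xc = P - p_bar, X - x_bar
    # Weighted 3x3 covariance matrix and its SVD.
    H = (v[:, None] * Pc).T @ Xc
    U, S, Vt = np.linalg.svd(H)
    R_new = Vt.T @ U.T
    if np.linalg.det(R_new) < 0:           # handle the reflection case
        Vt[-1, :] *= -1.0
        R_new = Vt.T @ U.T
    t_new = x_bar - R_new @ p_bar
    return R_new, t_new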

For registration, we compare two methods: i) standard ICP in which all points in the completed point cloud are taken as equally important and ii) weighted ICP in which the plausibility of each point in the completed cloud is explicitly taken into account. As a baseline, we use standard ICP in which a ground truth model is fitted to the original partial point cloud. For evaluation, we use the same database as in Chapter 6. It consists of 10 distinct objects in different poses, yielding 14 datasets for which a ground truth object pose is available. Each dataset contains segmented 3D point clouds reconstructed from the objects in 8 different orientations (see Figures 6.5 and 6.6). The completed point clouds that result from applying the method proposed in Chapter 6 form the input to the different registration methods. In the following, we test our hypothesis in two different scenarios.

[Figure 7.6 shows two bar plots of the error measure d(T) per object orientation (0° to 315°) for the methods NoMirror, NoPlaus and WithPlaus.]

Figure 7.6: Error Measures after Fitting Object Models to Point Cloud. Left) Comparison of the mean squared error for the best fit averaged over all objects for each orientation. The performance of the different methods is not significantly different. Right) Comparison of the weighted mean squared error for the best fit. The weighted ICP (WithPlaus) optimises this measure and should therefore produce lower values than the standard ICP (NoPlaus). However, the data does not show a significant difference between the two approaches.

Pose Estimation In the first scenario, we assume that we know the identity of the segmented object. The goal is to find its 6 DoF pose. Since ICP is a method that converges to a local minimum, we need to initialise it with an object pose that is relatively close to the ground truth, i.e., the global minimum. This could be achieved by methods such as the one proposed by Rusu et al. [194]. Here, we make use of the fact that in our dataset the object pose with a ±45° difference to the ground truth pose is available. We compare the three different registration methods in terms of the mean squared error as defined in Equation 7.3. The results are shown in Figure 7.6. They do not show a significant difference between the proposed methods. This is probably due to the good segmentation of the object models, the low number of outliers and the good initial pose.

Recognition through Registration In the second scenario, we want to analyse whether using the completed point cloud facilitates a more robust joint recognition and pose estimation.

Experimental Setup Given a point cloud B, we want to find the object model o* ∈ O in pose T that minimises

o^* = \arg\min_{o \in O} d(T) \qquad (7.6)

where the set of objects O is depicted in Figure 6.5. Dependent on the input, d(T) is defined in two ways. If B is a completed point cloud with plausibility information, then it is computed as in Equation 7.5. If B is a completed point cloud without plausibility information or the original point cloud, d(T) is defined as in Equation 7.3. As model point cloud A, we again use the sampled meshes in initial poses of ±45° or 0° deviation in orientation from the ground truth. As in the previous experiment, we compare standard ICP using the original point cloud with ICP on the completed point cloud and weighted ICP on the completed point cloud with plausibility information.

Divergence from Ground Truth    Original   Unweighted   Weighted
±45° and 0°                     50.42%     49.58%       52.08%
0°                              55.00%     47.50%       51.25%
±45°                            48.12%     50.62%       52.50%

Table 7.1: Summary of total recognition results for joint recognition and pose estimation. Original: standard ICP on original partial point clouds. Unweighted: standard ICP on completed point clouds. Weighted: weighted ICP on completed point clouds with plausibility information.
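Joint recognition and pose estimation according to Equation 7.6 amounts to running the chosen ICP variant once per object model and keeping the model with the lowest residual. The following sketch assumes a user-supplied icp_fit function (for example built from the weighted step above) that returns the final pose and cost; it is illustrative only.

def recognize_through_registration(target_cloud, model_clouds, initial_poses, icp_fit):
    """Return the object identity and pose with the lowest registration cost.

    target_cloud  : the (completed) point cloud B of the segmented hypothesis
    model_clouds  : dict mapping object identity -> sampled model point cloud A
    initial_poses : dict mapping object identity -> initial transformation T
    icp_fit       : function (A, B, T_init) -> (T_final, cost), e.g. an ICP variant
    """
    best = None
    for obj, model in model_clouds.items():
        T_final, cost = icp_fit(model, target_cloud, initial_poses[obj])
        if best is None or cost < best[2]:
            best = (obj, T_final, cost)
    return best  # (identity, pose, cost)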

Results Averaging over all objects, orientations and initial model poses, we achieve the following total recognition rates: Original 50.42%, Unweighted 49.58% and Weighted 52.08%. When using the weighted ICP method that takes plausibility information explicitly into account, we achieve the best recognition results. Standard ICP on the completed point cloud performs on average even worse than the approach on partial point clouds. This indicates that inaccuracies in the object shape prediction through the symmetry assumption affect the registration result negatively. We can gain more insight if we split these results into the case where the initial pose is the same as the ground truth pose and the case where the initial pose has a ±45° deviation in orientation. The corresponding full confusion matrices are given in Figures 7.7 and 7.8. An overview is given in Table 7.1. For the case where the initial pose is equal to the ground truth, the standard method on partial point clouds performs clearly better than the methods using the completed point clouds. However, when the initial pose of the object models is different from the ground truth, using the completed point clouds in combination with weighted ICP results in a better recognition accuracy than the standard approach. This indicates that in the common case of the initial pose being only approximately known, completed point clouds provide more robustness in the task of joint object recognition and pose estimation. The alignment method we used in this section is based on the widely applied ICP, which is guaranteed to converge to a local minimum. However, there is a set of alignment methods that aim at finding the global minimum in the space of all object poses [165, 166, 194]. It would be interesting to evaluate whether these approaches could benefit from predicted object shapes.

[Figure 7.7 shows three 14 × 14 confusion matrices (predicted versus actual) over the objects amicelli, burti, greenCup, mangoLy, mangoStand, rusk, saltBoxLy, saltBoxStand, saltCylLy, saltCylStand, spray, tomatoSoup, tomatoSoupLy and whiteCup, together with per-class F1 scores: (a) NoMirror with a total recognition rate of 55.00%, (b) Unweighted with 47.50% and (c) Weighted with 51.25%.]

Figure 7.7: Confusion matrices for the task of recognition of objects represented by point clouds. Source point clouds are 0° from ground truth. The total recognition rate is computed by taking into account object duplicates, e.g., the standing or lying mango can. Using the original point cloud as target yields the best recognition rate.

[Figure 7.8 shows the corresponding three confusion matrices for source point clouds that are ±45° off from ground truth: (a) NoMirror with a total recognition rate of 48.12%, (b) Unweighted with 50.62% and (c) Weighted with 52.50%.]

Figure 7.8: Confusion matrices for the task of simultaneous recognition and pose estimation of objects represented by point clouds. Source point clouds are ±45° off from ground truth. The total recognition rate is computed by taking into account object duplicates, e.g., the standing or lying mango can. In the very common case of the object pose being only approximately known, using the plausibility information leads to increased robustness of the recognition task.

7.2.5 Generation of Grasp Hypotheses In the previous section, we studied different methods for estimating the identity and pose of a set of objects in the scene. Given this information, grasp hypotheses can be selected for each of them. For the approaches relying on 3D mesh models, experience databases can be generated offline using, for example, OpenRave [66]. Figure 7.5c shows an example scene as recognized by the method in [166] visualized in OpenRave. For each of these objects, the best grasps from the database can be retrieved and tested for reachability using motion planners like, for example, Rapidly-exploring Random Trees (RRTs) [130] in combination with Jacobian-based gradient descent methods. For the simplified object models consisting either of a box or a cylinder with a specific height, we choose to perform top grasps that approach the objects at their center. The wrist is aligned with the smallest dimension of the box. This approach is not combined with a motion planner. In Section 8.2.2.2, we will demonstrate and discuss this approach for table-top scenes in combination with position-based visual servoing.
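For the simplified models, the top-grasp computation reduces to a few lines. The sketch below assumes the fitted rectangle is given by its centre, its orientation θ on the table and its side lengths; it illustrates the described strategy rather than the exact implementation used in Section 8.2.2.2.

import numpy as np

def top_grasp_for_box(center_xy, theta, width, depth, height):
    """Top grasp for a box standing on the table.

    center_xy    : (x, y) centre of the fitted rectangle on the table plane
    theta        : orientation of the rectangle's first side in radians
    width, depth : side lengths of the rectangle; height : object height
    Returns the grasp position and the horizontal closing direction of the gripper.
    """
    position = np.array([center_xy[0], center_xy[1], height])   # approach from above
    # Align the wrist so that the gripper closes across the smallest dimension.
    if width <= depth:
        closing_dir = np.array([np.cos(theta), np.sin(theta), 0.0])
    else:
        closing_dir = np.array([-np.sin(theta), np.cos(theta), 0.0])
    return position, closing_dir

def top_grasp_for_cylinder(center_xy, height):
    """Top grasp for a cylinder: approach from above at the centre; wrist orientation is free."""
    return np.array([center_xy[0], center_xy[1], height])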

7.3 Grasping Unknown Objects

For grasping unknown objects, i.e., objects for which no matching 3D model is available, no database of good grasps can be set up a priori. Grasp hypotheses have to be generated online and then tested for reachability. In Chapter 6, we proposed a method that estimates the full object shape from a partial point cloud of an unknown object and showed that the estimated shape closely approximates the true one. In this section, we want to evaluate how the prediction of object shape can help grasping and manipulation. In the first part, we consider a very simple grasp strategy that approaches the object centroid. Estimating this centroid is difficult when only a partial object model is known. In Section 7.3.1, we evaluate quantitatively how much better the centroid can be estimated when using completed point clouds. In the second part, we demonstrate approaches that generate a whole set of grasp hypotheses and rank them based on simulated grasp planning.

7.3.1 Evaluation of Object Centroid

A very simple but effective strategy for grasping unknown objects is to approach the object at its center. However, estimating the centroid is not trivial when the object is unknown. Reactive grasping strategies have been proposed to cope with the uncertainty in the object information during the grasp, for example by Hsiao et al. [99] and Felip and Morales [78]. However, contact between the object and the manipulator can move the object and change what kind of grasp can be applied. The more accurate the initial estimate of the object centroid, the fewer unnecessary contacts with the object occur. In this section, we therefore evaluated the accuracy of the object center estimation as a simple proxy for grasp quality. We compared the center of mass of the Delaunay mesh (the baseline described in Section 6.2.1.2) and of the mirrored mesh with the ground truth center. To render this comparison independent of the distribution of vertices (especially for the ground truth meshes), we applied the same uniform sampling of the mesh surface as in [189]. The center of mass is then the average over all samples.

Figure 7.9: Deviation of the estimated object centroid from the ground truth centroid (legend: Del, NoPlaus, WithPlaus; y-axis: average deviation from centroid [%]; x-axis: object orientation, 0° to 315°).

Figure 7.9 shows the error between the estimated and real centroid of an object per viewing direction, averaged over all objects. The deviation is normalized by the length of the diagonal of the oriented object bounding box. We can observe that the deviation ranges from approximately 5% to 10% of the total object size. However, using plausibility information does not improve the estimation of the centroid significantly. These results indicate that the simple top-grasp strategy would have a larger success rate due to a more accurate estimation of the object centroid. In combination with reactive grasping approaches, we expect that fewer corrective movements would be necessary.
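As an illustration of this evaluation protocol, the sketch below estimates a mesh centroid by uniform surface sampling and reports the deviation normalized by the bounding-box diagonal. It is a simplified stand-in: meshes are assumed to be plain (vertices, faces) numpy arrays, and an axis-aligned rather than oriented bounding box is used for normalization.

# Minimal sketch of the centroid evaluation; function names are illustrative only.
import numpy as np

def sample_surface(vertices, faces, n=10000, rng=np.random.default_rng(0)):
    """Sample points uniformly on a triangle mesh surface (area-weighted)."""
    tri = vertices[faces]                               # (F, 3, 3)
    areas = 0.5 * np.linalg.norm(
        np.cross(tri[:, 1] - tri[:, 0], tri[:, 2] - tri[:, 0]), axis=1)
    idx = rng.choice(len(faces), size=n, p=areas / areas.sum())
    # Uniform barycentric coordinates on each selected triangle.
    u, v = rng.random(n), rng.random(n)
    flip = u + v > 1.0
    u[flip], v[flip] = 1.0 - u[flip], 1.0 - v[flip]
    t = tri[idx]
    return t[:, 0] + u[:, None] * (t[:, 1] - t[:, 0]) + v[:, None] * (t[:, 2] - t[:, 0])

def centroid_deviation(est_mesh, gt_mesh):
    """Deviation between estimated and ground-truth centroid in percent,
    normalized by the ground-truth bounding-box diagonal (axis-aligned here,
    whereas the thesis uses the oriented bounding box)."""
    c_est = sample_surface(*est_mesh).mean(axis=0)
    gt_samples = sample_surface(*gt_mesh)
    c_gt = gt_samples.mean(axis=0)
    diag = np.linalg.norm(gt_samples.max(axis=0) - gt_samples.min(axis=0))
    return 100.0 * np.linalg.norm(c_est - c_gt) / diag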

7.3.2 Generating Grasp Hypotheses

In this section, we propose two methods for generating a set of grasp hypotheses on the estimated surface mesh. In Chapter 8 we will show how these are used in conjunction with motion planning to select a suitable grasp and a reaching trajectory for execution by the robot.

Figure 7.10: Example of the approach vectors generated for a spray bottle by (left) the OpenRAVE grasper plugin, (middle) the algorithm proposed at UJI using the object's centroid and (right) the algorithm proposed at KTH using the grasp points shown in blue.

OpenRAVE [66] provides a default algorithm for generating a set of approach vectors. It first creates a bounding box of the object and samples its surface uniformly. For each sample, a ray is intersected with the object. At each intersection, i.e., the grasping point, an approach vector is created that is aligned with the normal of the object's surface at this point. An example output is shown in Figure 7.10 (left). Dependent on the choice of parameters such as the distance between hand palm and grasping point, the wrist alignment or the pre-shape, the time to simulate all the corresponding grasps can vary from a few minutes to more than an hour. These execution times are acceptable for objects that are known beforehand because the set of grasp candidates can be evaluated off-line. When the objects are unknown, this process has to be executed on-line and long waiting times are not desirable. For this reason, we use two methods to reduce the number of approach vectors. The first one computes two contact points as described in Richtsfeld and Vincze [184]. To grasp the object at these points, there is an infinite number of approach vectors on a circle whose normal is the vector between the two contact points. We sample a given number of these, typically between 5 and 10, between 0° and 180°. Figure 7.10 (right) shows the detected grasping points along with the generated approach vectors. The second method calculates the approach vectors in a similar way, only that the center of the circle is aligned with the object's centroid and the major eigenvector e_a of its footprint, i.e., the projection of the point cloud onto the table. Another circle, perpendicular to the first one, is added in order to compensate for the possible loss of vector quality due to the lack of grasp points. Figure 7.10 (middle) shows an example. We applied these grasp hypotheses in a real-world scenario on different robotic platforms. For details on the execution, we refer to Chapter 8. For a quantitative evaluation, one could compare grasp hypotheses generated on the completed point clouds to those planned on partial point clouds. This would involve executing these grasps on the corresponding ground truth meshes in simulation and evaluating common grasp stability measures.
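The circle-based reduction of approach vectors can be sketched as follows; the contact points, the number of samples and the sign convention for the approach direction are illustrative assumptions of this sketch, not values taken from the implementation.

# Sketch: sample approach vectors on a half circle whose normal is the axis
# between two contact points (the second method uses the footprint eigenvector
# through the centroid instead of the contact-point axis).
import numpy as np

def approach_vectors(p1, p2, n_samples=7):
    """Approach directions on a half circle (0 to 180 deg) perpendicular to p1-p2."""
    axis = (p2 - p1) / np.linalg.norm(p2 - p1)
    # Build an orthonormal basis (u, v) spanning the plane of the circle.
    helper = np.array([1.0, 0.0, 0.0])
    if abs(np.dot(helper, axis)) > 0.9:
        helper = np.array([0.0, 1.0, 0.0])
    u = np.cross(axis, helper)
    u /= np.linalg.norm(u)
    v = np.cross(axis, u)
    angles = np.linspace(0.0, np.pi, n_samples)
    # Each direction points from the circle towards the grasp axis.
    return np.array([-(np.cos(a) * u + np.sin(a) * v) for a in angles])

# Usage with two hypothetical contact points on an object surface:
p1 = np.array([0.02, 0.00, 0.10])
p2 = np.array([-0.02, 0.00, 0.10])
for a in approach_vectors(p1, p2):
    print(a)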

7.4 Grasping Familiar Objects

As already outlined in Section 7.1, another approach towards grasp inference is to re-use prior grasping experience on familiar objects. A similarity metric between already encountered objects and new ones has to be developed that takes grasp-relevant features into account. In the approach proposed in this section, a grasping point is detected based on the global shape of an object in a single image. This object feature serves as the basis of comparison to previously encountered objects in a machine learning framework. Research in the area of neuropsychology emphasizes the influence of global shape when humans choose a grasp [92, 58, 87]. Matching between stereo views is then used to infer the approach vector and wrist orientation for the robot hand. We further demonstrate how a supervised learning methodology can be used for grasping of familiar objects. The contributions of our approach are:

i) We apply the concept of shape context as described in Belongie et al. [26] to the task of robotic grasping, which to the best of our knowledge has not been done before. The approach is different from the one taken in Saxena et al. [200] or Stark et al. [211] where only local appearance is used instead of global shape.

ii) We infer grasp configurations for arbitrarily shaped objects from a stereo image pair. This is the main difference to the work presented in Morales et al. [155] and Speth et al. [210] where either only planar objects are considered or three views of an object have to be obtained by moving the camera.

iii) We analyse how stable our algorithm is in realistic scenarios including background clutter, without training on scenario examples as in Saxena et al. [200].

iv) We apply a supervised learning algorithm trained on synthetic labelled images from the database provided by [200]. We compare the classification performance when using a linear classifier (logistic regression) and a non-linear classifier (Support Vector Machines (SVMs)).

In the following, we first describe how shape context is applied to grasping. We also describe and comment on the database that we used for training and give some background on the two different classification methods. The section concludes with a presentation of how a whole grasp configuration can be derived. In Section 7.4.2 we evaluate our method both on simulated and real data.

7.4.1 Using Shape Context for Grasping

A detailed flow chart of the whole system and the associated hardware is given in Figure 7.11. First, scene segmentation is performed based on stereo input, resulting in several object hypotheses. Shape context is then computed on each of the object hypotheses and 2D grasping points are extracted. The models of grasping points are computed beforehand through offline training on an image database.

Figure 7.11: Flow chart of the stereo-vision-based grasp inference system: the robot head observes the scene (peripheral view, foveal view and segmentation), followed by 2D grasp point inference and object shape approximation, which are combined into a full grasp configuration.

The points in the left and in the right image are associated with each other to infer a 3D grasping point via triangulation. In parallel to the grasping point detection, the segments are analysed in terms of a rough object pose. By integrating the 3D grasping point with this pose, a full grasp configuration can be determined and then executed. In the following sections, the individual steps of the system are explained in more detail.

(a) Segmentation based on zero disparities only.

(b) Segmentation based on zero disparities and table plane assumption.

(c) Segmentation based on zero disparities, table plane assumption and known hue. Figure 7.12: Segmentation results for: 1st column) one textured object; 2nd column) a cluttered table scene; 3rd column) a non-textured object; 4th column) two similarly colored objects; 5th column) occlusion.

7.4.1.1 Scene Segmentation

The system starts by performing figure-ground segmentation. Although this problem is still unsolved for general scenes, we have demonstrated in our previous work how simple assumptions about the environment help in the segmentation of table-top scenes [34, 122, 32]. The segmentation is based on the integration of stereo cues using foveal and peripheral cameras. This system is a predecessor of the real-time vision system presented in Section 2.2.1. Below, we briefly outline the different cues used for segmentation and how they influence the result.

Zero-Disparity The advantage of using an active stereo head lies in its capability to fixate on interesting parts of the scene. A system that implements such an attentional mechanism has been presented by Rasolzadeh et al. [182]. Once the system is in fixation, zero disparity is employed as a cue for figure-ground segmentation through different segmentation techniques, e.g., watershedding, Björkman and Eklundh [32]. The assumption is that neighboring pixels with a similar reconstructed depth belong to the same object. However, Figure 7.12 shows that such a simple assumption results in a poor separation of the object from the plane on which it is placed.

Planar Surfaces The environments in which service robots perform their tasks are dominated by planar surfaces. In order to overcome the above segmentation problem, we use the assumption of a dominant plane. In our examples, this plane represents the table top on which the objects are placed. For that purpose, we fit a planar surface to the disparity image. The probability of each pixel in the disparity image belonging to that plane depends on its distance to it. In that way, objects on a plane are well segmented. Problems can arise with non-textured objects when the disparity image has large hollow regions. When the table plane assumption is violated through, e.g., clutter, the segmentation of the object is more difficult. Examples are shown in Figure 7.12.

Uniform Texture and Color An additional assumption can be made on the constancy of object appearance properties, assuming either uniformly colored or uniformly textured objects. When introducing this cue in conjunction with the table plane assumption, the quality of the figure-ground segmentation increases. The probability that a specific hue indicates a foreground object depends on the foreground probability (including the table plane assumption) of the pixels in which it occurs. The same holds for the background probability of the hue. The color cue contributes to the overall estimate with the likelihood ratio between the foreground and background probability of the hue. Examples are shown in Figure 7.12. Judging from the examples and our previous work, we can obtain reasonable hypotheses of objects in the scene. In Section 7.4.2 and Section 7.4.2.3 we analyse the performance of the grasp point detection for varying qualities of segmentation.
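A possible implementation of the hue cue is sketched below, assuming a hue image and a per-pixel foreground probability from the table-plane cue; the bin count and the smoothing constant are arbitrary choices of this sketch, not the values used in the actual system.

# Illustrative sketch of the colour cue: per-hue likelihood ratio between
# foreground and background, given a per-pixel foreground probability.
import numpy as np

def hue_likelihood_ratio(hue, p_fg, n_bins=64, eps=1e-6):
    """hue: (H, W) hue image in [0, 1); p_fg: (H, W) foreground probability."""
    bins = np.minimum((hue * n_bins).astype(int), n_bins - 1)
    fg_hist = np.bincount(bins.ravel(), weights=p_fg.ravel(), minlength=n_bins)
    bg_hist = np.bincount(bins.ravel(), weights=(1.0 - p_fg).ravel(), minlength=n_bins)
    fg_hist /= fg_hist.sum() + eps
    bg_hist /= bg_hist.sum() + eps
    ratio = (fg_hist + eps) / (bg_hist + eps)
    return ratio[bins]        # per-pixel likelihood ratio of the observed hue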

7.4.1.2 Representing Relative Shape

Once the individual object hypotheses are formed, we continue with the detection of grasping points. In Section 7.4.1.5 we show how to further infer the approach vector and wrist orientation. Grasping an object depends to a large extent on its global shape. Our approach encodes this global property of an object with a local, image-based representation. Consider for example elongated objects such as pens. A natural grasp is in the middle, roughly at the centre of mass. The point in the middle divides the object into two relatively similar shapes. Hence, the shape relative to this point is approximately symmetric. In contrast, the shape relative to a point at one of the ends of the object is highly asymmetric. Associating a point on the object with its relative shape and the natural grasp is the central idea of our work. For this purpose we use the concept of shape context, commonly used for shape matching, object recognition and human body tracking [93, 69, 158]. In the following, we briefly summarize the main ideas of shape context. For a more elaborate description, we refer to [26]. The basis for the computation of shape context is an edge image of the object. N sample points are taken uniformly from the contours, considering both inner and outer contours. For each point we compute the vectors that lead to all other sample points. These vectors relate the global shape of the object to the considered reference point. We create a compact descriptor comprising this information for each point by a two-dimensional histogram with angle and radius bins.

Figure 7.13: Example of deriving the shape context descriptor for the image of a pencil. (a) Input image of the pencil. (b) Contour of the pencil derived with the Canny operator. (c) Sampled points of the contour with gradients. (d) All vectors from one point to all other sample points. (e) Log-polar histogram (angle θ, log radius log r) with four angle and five log-radius bins comprising the vectors depicted in (d).

In [26] it is proposed to use a log-polar coordinate system in order to emphasize the influence of nearby samples. An example of the entire process is shown in Figure 7.13. A big advantage of shape context is its invariance to several transformations. Invariance to translation is intrinsic since both the angle and the radius values are determined relative to points on the object. To achieve scale invariance, [26] proposed to normalise all radial distances by the median distance between all N² point pairs in the shape. Rotation invariance can be achieved by measuring the angles relative to the gradient of the sample points. In the following, we describe how to apply this relative shape representation to form a feature vector that can later be classified as either graspable or not.
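A minimal sketch of such a descriptor for a single reference point is given below, using the twelve angle and five log-radius bins mentioned later in the experiments; the radial bin limits and the exact normalisation details are assumptions of this sketch.

# Sketch of a log-polar shape context histogram for one contour point.
import numpy as np

def shape_context(points, ref_idx, n_angle=12, n_radius=5):
    """points: (N, 2) sampled contour points; returns a flattened histogram."""
    pair_d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    med = np.median(pair_d[pair_d > 0])        # median pairwise distance -> scale invariance
    ref = points[ref_idx]
    d = np.delete(points, ref_idx, axis=0) - ref
    r = np.linalg.norm(d, axis=1) / (med + 1e-12)
    theta = np.arctan2(d[:, 1], d[:, 0])
    # Log-spaced radial bins emphasise nearby contour samples.
    r_edges = np.logspace(np.log10(0.125), np.log10(2.0), n_radius + 1)
    r_bin = np.clip(np.digitize(r, r_edges) - 1, 0, n_radius - 1)
    a_bin = np.floor((theta + np.pi) / (2 * np.pi) * n_angle).astype(int) % n_angle
    hist = np.zeros((n_radius, n_angle))
    np.add.at(hist, (r_bin, a_bin), 1.0)
    return hist.ravel()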

Feature Vector In the segmented image, we compute the contour of the object by applying the Canny edge detector. This raw output is then filtered to remove spurious edge segments that are either too short or have a very high curvature. The result serves as the input for computing shape context as described above. We start by subdividing the image into rectangular patches of 10 × 10 pixels. A descriptor for each patch serves as the basis to decide whether it represents a grasping point or not. This descriptor is simply composed of the accumulated histograms of all sample points on the object's contour that lie in that patch. Typically, only a few sample points will fall into a 10 × 10 pixel window. Furthermore, comparatively small shape details that are less relevant for making a grasp decision will be represented in the edge image. We therefore calculate the accumulated histograms at three different scales centered at the current patch. The edge image of the lowest scale then contains only the major edges of the object. The three histograms are concatenated to form the final feature descriptor of dimension 120.

7.4.1.3 Classification

The detection of grasping points applies a supervised classification approach utilizing the feature vector described in the previous section. We examine two different classification methods: a linear one (logistic regression) and a non-linear one (SVMs) [31]. We describe these briefly below.

Logistic Regression Let g_i denote the binary variable for the i-th image patch in the image. It can either carry the value 1 or 0 for being a grasping point or not. The posterior probability for the former case will be denoted as p(g_i = 1 | f_i), where f_i is the feature descriptor of the i-th image patch. For logistic regression, this probability is modelled as the sigmoid of a linear function of the feature descriptor:

p(g_i = 1 \mid f_i) = \frac{1}{1 + e^{-w f_i}}    (7.7)

where w is the weight vector of the linear model. These weights are estimated by maximum likelihood:

w^* = \arg\max_{w'} \prod_i p(g_i = 1 \mid f_i, w')    (7.8)

where g_i and f_i are the labels and feature descriptors of our training data, respectively.

Support Vector Machines SVMs produce arbitrary decision functions in feature space by a linear separation in a space of higher dimension than the feature space. The mapping of the input data into that space is accomplished by a non-linear kernel function k. In order to obtain the model for the decision function when applying SVMs, we solve the following optimisation problem:

\max_{\alpha} \; \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j g_i g_j k(f_i, f_j)    (7.9)

subject to 0 \le \alpha_i \le C and \sum_i \alpha_i g_i = 0, with the solution w = \sum_{i=1}^{N_s} \alpha_i g_i f_i. As a kernel we have chosen a Radial Basis Function (RBF):

k(f_i, f_j) = e^{-\gamma \|f_i - f_j\|^2}, \quad \gamma > 0, \quad \gamma = \frac{1}{2\sigma^2}    (7.10)

The two parameters C and σ are determined by a grid search over the parameter space. In our implementation, we use the package libsvm [50].
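For illustration, the sketch below trains such an RBF-kernel SVM with a grid search over C and γ using scikit-learn (which wraps libsvm) instead of the libsvm package itself; the data, the grid ranges and the cross-validation setup are placeholders.

# Hedged sketch: RBF-SVM with grid search over C and gamma (gamma = 1/(2*sigma^2)).
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# X: (n_patches, 120) shape context descriptors, y: 1 = grasping point, 0 = not.
X = np.random.rand(200, 120)          # placeholder training data
y = np.random.randint(0, 2, 200)      # placeholder labels

param_grid = {'C': 2.0 ** np.arange(-3, 8),
              'gamma': 2.0 ** np.arange(-7, 2)}
search = GridSearchCV(SVC(kernel='rbf', probability=True), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)

# Probabilistic output p(g_i = 1 | f_i) for new patches:
p = search.best_estimator_.predict_proba(X[:5])[:, 1]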

Training Database For training the different classifiers we use the database developed by Saxena et al. [200], containing ca. 12000 synthetic images of eight object classes, depicted along with their grasp labels in Figure 7.14. One drawback of the database is that the choice of grasping points is not always consistent with the object category. As an example, a cup is labelled at two places on its rim although all the points on the rim are equally well suited for grasping. The eraser is a quite symmetric object; neither the local appearance nor the relative shape of its grasping points are discriminative descriptors. This will be further discussed in Section 7.4.2 where our method is compared to that of [200].

Figure 7.14: One example picture for each of the eight object classes used for training along with their grasp labels (in yellow). Depicted are a book, a cereal bowl, a white board eraser, a martini glass, a cup, a pencil, a mug and a stapler. The database is adopted from Saxena et al. [200].

7.4.1.4 Approximating Object Shape

Shape context provides a compact 2D representation of objects in monocular images. However, grasping is an inherently three-dimensional process. Our goal is to apply a pinch grasp on the object using the 3D coordinate of the grasping point and a known position and orientation of the wrist of the robot. In order to infer the 6D grasp configuration, we integrate the 2D grasping point detection with the 3D reconstruction of the object that results from the segmentation process. We approach the problem by detecting the dominant plane

\hat{\Pi}^D : d^D = a^D x + b^D y + c^D d

in the disparity space d = I^D(x, y). The assumption is that the object can be represented as a constellation of planes in 3D, i.e., a box-like object commonly has three sides visible to the camera, while for a cylindrical object the rim or lid generates the most likely plane. We use RANSAC to estimate the dominant plane hypothesis \hat{\Pi}^D and we also determine its centroid \mu^D by calculating the mean over all the points in the plane \{(x, y) \mid e(x, y) > \theta\}, where

e(x, y) = I^D(x, y) - \frac{d^D - a^D x - b^D y}{c^D},    (7.11)

the error between the estimated and measured disparity of a point, and \theta a threshold. By using the standard projective equations between image coordinates (x, y), disparity d, camera coordinates (x^C, y^C, z^C), baseline b and focal length f,

x = f \frac{x^C}{z^C}, \quad y = f \frac{y^C}{z^C}, \quad d = \frac{b f}{z^C},    (7.12)

we can transform \hat{\Pi}^D into the camera coordinate frame:

\hat{\Pi}^C : -c^D b f = a^D f x^C + b^D f y^C - d^D z^C.    (7.13)

The normal of this plane is then defined as

n^C = (a^C, b^C, c^C) = (a^D f, b^D f, -d^D).    (7.14)

Equations 7.12 are also used to convert the plane centroid \mu^D from disparity space to \mu^C in the camera coordinate frame.
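The plane estimation and its conversion to the camera frame can be sketched as follows; the RANSAC loop is a generic, simplified version, and the iteration count and inlier threshold are arbitrary assumptions of this sketch.

# Sketch: fit a^D x + b^D y + c^D d = d^D to the disparity image and convert
# its parameters to the camera frame via Equations 7.12-7.14.
import numpy as np

def fit_disparity_plane(xs, ys, ds, iters=200, thresh=1.0, rng=np.random.default_rng(0)):
    """xs, ys: pixel coordinates; ds: disparities. Returns (a^D, b^D, c^D, d^D)."""
    best, best_inliers = None, 0
    P = np.column_stack([xs, ys, ds])
    for _ in range(iters):
        p0, p1, p2 = P[rng.choice(len(P), 3, replace=False)]
        n = np.cross(p1 - p0, p2 - p0)                 # (a^D, b^D, c^D)
        if abs(n[2]) < 1e-9:
            continue
        dD = np.dot(n, p0)
        # Disparity error of Eq. 7.11: measured d minus d predicted by the plane.
        e = ds - (dD - n[0] * xs - n[1] * ys) / n[2]
        inliers = np.sum(np.abs(e) < thresh)
        if inliers > best_inliers:
            best, best_inliers = (n[0], n[1], n[2], dD), inliers
    return best

def plane_normal_camera(aD, bD, cD, dD, f):
    """Normal of the plane in camera coordinates, n^C = (a^D f, b^D f, -d^D)."""
    return np.array([aD * f, bD * f, -dD])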

7.4.1.5 Generation of Grasp Hypotheses

In the following, we describe how the dominant plane provides the structural support for inferring a full grasp configuration.

3D Grasping Point The output of the classifier is a set of candidate grasping points in each of the stereo images. These then need to be matched to estimate their 3D coordinates. For this purpose we create a set B_l = {b_(i,l) | i = 1 ... M} of M image patches b_(i,l) in the left image representing local maxima of the classifier output p(g_i = 1 | f_i) whose adjacent patches in the 8-neighborhood carry values close to that of the centre patch. We apply stereo matching to obtain the corresponding patches B_r = {b_(i,r) | i = 1 ... M} in the right image. Let p(b_(i,l) | f_(i,l)) and p(b_(i,r) | f_(i,r)) be the probability for each image patch in the sets B_l and B_r to be a grasping point given the respective feature descriptors f_(i,l) or f_(i,r). Assuming naïve Bayesian independence between corresponding patches in the left and right image, the probability p(b_i | f_(i,l), f_(i,r)) for a 3D point b_i to be a grasping point is modelled as

p(b_{(i,l)} \mid f_{(i,l)}, f_{(i,r)}) = p(b_{(i,l)} \mid f_{(i,l)}) \cdot p(b_{(i,r)} \mid f_{(i,r)}).    (7.15)

As already mentioned in the previous section, the approach vector and wrist orientation are generated based on the dominant plane. Therefore, the choice of the best grasping point is also influenced by the detected plane. For this purpose, we use the error e(b_(i,l)) as defined in Equation 7.11 as a weight w_i in the ranking of the 3D grasping points. The best patch is then

b^* = \arg\max_i \; w_i \, p(b_{(i,l)} \mid f_{(i,l)}, f_{(i,r)}).    (7.16)
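In code, the ranking of Equations 7.15 and 7.16 reduces to a few lines; following the text, the weight is the plane error itself, and all inputs are assumed to be precomputed arrays for the matched patch pairs.

# Illustrative sketch of Eqs. 7.15 and 7.16: combine left/right patch
# probabilities and weight by the plane error to pick the best 3D point.
import numpy as np

def best_grasping_point(p_left, p_right, plane_error):
    """p_left, p_right: (M,) patch probabilities; plane_error: (M,) e(b_(i,l))."""
    w = np.asarray(plane_error, dtype=float)          # weight w_i = e(b_(i,l)), as in the text
    score = w * np.asarray(p_left) * np.asarray(p_right)
    return int(np.argmax(score))                      # index of b*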

Orientation of the Schunk Hand Given a 3D grasping point b^*, the dominant plane \hat{\Pi}^C with its centroid \mu^C and normal n^C, there are two possibilities for the approach vector a:

1. a = v^C, where v^C = \mu^C - b^\Pi and b^\Pi is the grasping point projected onto \hat{\Pi}^C along n^C, or

2. a = n^C.

This is illustrated in Figure 7.15. Which of the two is chosen depends on the magnitude of v^C. If |v^C| > φ, the wrist of the hand is aligned with the normal of the plane. If |v^C| < φ, such that b^\Pi is very close to the centroid of the plane, we choose n^C as the approach vector, i.e., the hand approaches the dominant plane perpendicularly. In this case, we choose the wrist orientation to be parallel to the ground plane. In Section 8.1.2 we present results of object grasping on our hardware.

(a) The approach vector a is equal to \mu^C - b^\Pi; the wrist is aligned with n^C. (b) The approach vector a is equal to the normal n^C; the wrist is aligned with the horizon.
Figure 7.15: Visualisation of the 6 DoF grasp configuration with respect to the estimated dominant plane \hat{\Pi}.
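The selection rule can be sketched as follows; the threshold value φ is a placeholder, and the parallel-to-ground wrist orientation of the second case is not computed here.

# Sketch of the two-case approach-vector rule; inputs are in the camera frame.
import numpy as np

def approach_vector(b_star, mu_C, n_C, phi=0.03):
    """b_star: 3D grasping point, mu_C: plane centroid, n_C: plane normal."""
    n = n_C / np.linalg.norm(n_C)
    # Project the grasping point onto the dominant plane along its normal.
    b_plane = b_star - np.dot(b_star - mu_C, n) * n
    v_C = mu_C - b_plane
    if np.linalg.norm(v_C) > phi:
        # Case 1: approach along v^C, wrist aligned with the plane normal.
        return v_C / np.linalg.norm(v_C), n
    # Case 2: grasping point close to the centroid -> approach along the
    # plane normal; the wrist is kept parallel to the ground plane.
    return n, None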

7.4.2 Experimental Evaluation

We start by comparing our method to the one presented in [200]. The goal is to show the performance of the methods on synthetic images. This is followed by an in-depth analysis of our method. Finally, we investigate the applicability of our method in real settings.

7.4.2.1 Evaluation on Synthetic Images

In this section, we are especially interested in how well the classifiers generalize over global shape or local appearance given synthetic test images. For this purpose we used four different sets of objects to train the classifiers.

• Pencils are grasped at their centre of mass.

• Mugs & cups are grasped at their handles. They differ only slightly in global shape and local grasping point appearance.

• Pencils, white board erasers & martini glasses are all grasped approximately at their centre of mass at two parallel straight edges. However, their global shape and local appearance differ significantly.

• Pencils & mugs are grasped differently and also differ significantly in their shape.

We divided each set into a training and a test set. On the training sets we trained four different classifiers.

• Shape context & SVM (SCSVM). We employ twelve angle and five log-radius bins for the shape context histogram. We sample the contour with 300 points. The same parameters were applied by [26] and have proven to perform well for grasping point detection.

• Local appearance features & logistic regression (OrigLog) is the classifier by [200].

• Local appearance features & SVM (OrigSVM) applies an SVM instead of logistic regression.

• Shape context, local appearance features & SVM (SCOrigSVM) integrates shape context features with local appearance cues. The resulting feature vector is used to train an SVM.

Accuracy Each model was evaluated on the respective test sets. The results are shown as ROC curves in Figure 7.16 and as accuracy values in Table 7.2. Accuracy is defined as the sum of true positives and true negatives over the total number of examples. Table 7.2 presents the maximum accuracy over a varying threshold. The first general observation is that SVM classification outperforms logistic regression. On average, the classification performance for each set of objects rose by about 4.32% when comparing OrigSVM with OrigLog. A second general observation is that classifiers that employ global shape (whether or not integrated with appearance cues) have the best classification performance for all training sets. In the following we discuss the results for each set.

• Pencils. The local appearance of a pencil does not vary much at different positions along its surface, whereas the relative shape does. Therefore, local appearance based features are not discriminative enough. This is confirmed for the models that are only trained on images of pencils. SCSVM performs slightly better than OrigSVM. The classification performance grows when applying an integrated feature vector.

• Mugs & Cups. These objects are grasped at their handle, which is characterized by a local structure that is rather constant even when the global shape changes. Thus, OrigSVM slightly outperforms the classifier that applies shape context only. However, an integration of both features leads to an even better performance.

• Pencils, white board erasers & martini glasses. For this set of objects the position of the grasp is very similar when considering their global shape, whereas the local appearance of the grasping points differs greatly. Here, the models based on shape context perform best. Feature integration degrades the performance.

• Pencils & mugs. The performance of the different classifiers on the previous set of objects is a first indication of a weaker generalization capability of OrigSVM and OrigLog over varying local appearance compared to SCSVM and SCOrigSVM. This is further confirmed for the last set, where not only the local appearance but also the global shape changes significantly. SCSVM improves the performance over OrigSVM by about 6.75% even though the grasping points are very different when related to global object shape. Feature integration increases the performance only moderately.

Repeatability Our goal is to make a robot grasp arbitrary and novel objects. Thus, we are also interested in discovering whether the best grasping point hypotheses correspond to points that in reality afford stable grasps. Our second experiment therefore evaluates whether the best hypotheses are located on or close to the labelled points. We constructed a set of 80 images from the synthetic image database with ten randomly selected images of each of the eight object classes (Figure 7.14). Thus, also novel objects that were not used for training the different classifiers are considered. On every image we ran all the aforementioned models and for each one picked the best ten grasping points b_i. In the database, a label is not a single point but covers a certain area. We evaluated the Euclidean distance d_i of each of the ten grasping points measured from the border of this ground truth label at position p_j and normalized with respect to the length l_j of its major axis. This way, the distance depends on the scale of the object in the image. In case there is more than one label in the image, we choose the one with the minimum distance. If a point b_i lies directly on the label, the distance d_i = 0. If a point lies outside of the label, the distance d_i is weighted with a Gaussian function (σ = 1, µ = 0) multiplied by √2π. The number of hits h_m of each model m on the picture set is counted as follows:

h_m = \sum_{k=1}^{K} \sum_{i=1}^{N_k} e^{-\frac{d_{(i,k)}^2}{2}} \quad \text{with} \quad d_{(i,k)} = \min_{j=1 \dots M_k} \frac{\mathrm{dist}(b_{(i,k)}, p_{(j,k)})}{2\, l_{(j,k)}}

where K is the number of images in the set, M_k is the number of grasp labels in that picture and N_k is the number of detected grasping points. Grasping points whose distance d_i exceeds a value of 3σ are considered outliers. Figure 7.17 shows the number of hits, that is, the number of good grasps for each model. Apart from the model trained on cups and mugs, the SVM trained only on shape context always performs best. The performance drop for the second object set can be explained in the same way as in the previous section: handles have a very distinctive local appearance and are therefore easily detected with features that capture this. In general, this result indicates that classifiers based on shape context detect grasping points with a better repeatability. This is particularly important for the inference of 3D grasping points, in which two 2D grasping points in the left and right image of a stereo camera have to be matched.
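A direct transcription of this hit count is sketched below; as a simplification, the distance is measured to the label centre rather than to its border, and the inputs are assumed to be plain arrays of pixel coordinates.

# Sketch of the hit-count measure h_m over a picture test set.
import numpy as np

def hit_count(detections_per_image, labels_per_image, outlier_sigma=3.0):
    """detections_per_image: list of (N_k, 2) arrays; labels_per_image: list of
    lists of (p_j, l_j) tuples with p_j a 2D label position and l_j its major-axis length."""
    h = 0.0
    for dets, labels in zip(detections_per_image, labels_per_image):
        for b in dets:
            d = min(np.linalg.norm(b - p) / (2.0 * l) for p, l in labels)
            if d <= outlier_sigma:                 # points beyond 3*sigma count as outliers
                h += np.exp(-d ** 2 / 2.0)
    return h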

Figure 7.16: ROC curves for models trained on different objects. Panels: ROC for the Pencil model, the TeaTwo & Mug model, the Pencil, Martini Glass & Eraser model, and the Pencil & Mug model; each panel compares OrigLog, OrigSVM, SCSVM and SCOrigSVM.

Table 7.2: Accuracy of the models trained on different objects.

                           SCOrigSVM   SCSVM    OrigSVM   OrigLog
Pencil                       84.45%    82.55%   80.16%    77.07%
Cup & Mug                    90.71%    88.01%   88.67%    83.85%
Pencil, Martini & Eraser     84.38%    85.65%   80.79%    74.92%
Pencil & Mug                 85.71%    84.64%   77.80%    74.32%


Summary of Results We draw several conclusions from the experimental results on synthetic images. First, independent of which feature representation is chosen, SVM outperforms logistic regression. Secondly, our simple and compact feature descriptor that encodes relative object shape improves the detection of grasping points both in accuracy and repeatability in most cases. In the case of very distinct local features, both representations are comparable. Integration of the two representations leads only to moderate improvements or even decreases the classification performance.

Figure 7.17: Evaluation of the best ten grasping points of each model on a picture test set containing in total 80 pictures of familiar and novel objects (see Figure 7.14).

7.4.2.2 Opening the Black Box

In the previous section, we presented evidence that the shape-context-based models detect grasping points more accurately than the models trained on local appearance features. As argued in Section 7.4.1, we see relative shape as a better cue for graspability than local appearance. In this section, we would like to confirm this intuition by analyzing what the different grasping point models encode. We are especially interested in whether there are prototypical visual features on which the two different models fire. We conduct this analysis by applying the Trepan algorithm by Craven and Shavlik [56] to the learnt classifiers. This algorithm builds a decision tree that approximates a concept represented by a given black-box classifier. Although originally proposed for neural networks, Martens et al. [144] showed that it is also applicable to SVMs. We use the same sets of objects as in the previous section. The extracted trees are binary with leaves that classify feature vectors as either graspable or non-graspable. The decisions at the non-leaf nodes are made based on one or more components of the feature vector. We consider each positive leaf node as encoding a prototypical visual feature that indicates graspability. As previously mentioned, the extracted trees are only approximations of the actual learnt models. Thus, the feature vectors that end up at a specific leaf of the tree will be of three different kinds:

• Ground truth. Features that are graspable according to the ground truth labels in the database.

• False positives by model. Features that are not graspable according to the labels but are so according to the classifier.

• False positives by tree. Features that are neither labelled in the database nor classified by the model to be graspable, but are considered to be so by the tree.

We analyse these samples separately and also rate the trees by stating their fidelity and accuracy. Fidelity is a measure of how well the extracted trees approximate the considered models. It states the fraction of feature vectors whose classification is compliant with the classification of the approximated model. Accuracy measures the classification rate of either the tree or the model when run on a test set. The analysis of these samples is conducted using PCA. The resulting eigenvectors form an orthonormal basis with the first eigenvector representing the direction of the highest variance, the second one the direction of the second largest variance, etc. In the following sections we visualise only those eigenvectors whose energy is above a certain threshold, and at most ten of them. The energy e_i of an eigenvector \mathbf{e}_i is defined as

e_i = \frac{\sum_{j=1}^{i} \lambda_j}{\sum_{j=1}^{K} \lambda_j}    (7.17)

where \lambda_j is the eigenvalue of eigenvector \mathbf{e}_j and K is the total number of eigenvectors. As a threshold we use \theta = 0.9. The remainder of this section is structured as follows. First, we visualise the prototypical features for the local appearance method by applying PCA to the samples at the positive nodes. Then we do the same for the relative shape-based representation.
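The eigenvector selection by cumulative energy can be sketched as a few lines of plain PCA; the input is assumed to be a matrix of feature vectors collected at one positive leaf node.

# Sketch of Eq. 7.17: retain eigenvectors until the cumulative energy
# exceeds theta = 0.9, and at most ten of them.
import numpy as np

def principal_components(samples, theta=0.9, max_components=10):
    """samples: (n, d) feature vectors; returns the retained eigenvectors as columns."""
    X = samples - samples.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]            # sort by decreasing variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    energy = np.cumsum(eigvals) / np.sum(eigvals)
    k = int(np.searchsorted(energy, theta) + 1)  # smallest k with energy >= theta
    return eigvecs[:, :min(k, max_components)]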

Local Appearance Features Saxena et al. [200] applied a filter bank to 10 × 10 pixel patches at three spatial scales. The filter bank contains edge, texture (Laws' masks) and color filters. In this section, we depict samples of these 10 × 10 pixel patches at the largest scale. They are taken from every positive node of each tree trained for a specific object set. All feature vectors that end up at one of these positive nodes are used as input to PCA. The first set we present consists of images of a pencil (see Figure 7.14) labelled at its centre of mass. The built tree is rather shallow: it has only four leaf nodes, of which one is positive. The decisions at the non-leaf nodes are made based on the output of the texture filters only. Neither color nor edge information is considered. This means that this part of the feature vector is not necessary to achieve a classification performance of 75.41% (see Table 7.3). Ten random samples from the positive node are shown in Figure 7.18(a)-(c), subdivided dependent on whether they are graspable according to the ground truth labels from the database or only according to the model or the tree, respectively.

Figure 7.18: Pencils: (a)-(c) Ten samples and PCA components of the positive node of the decision tree, for (a) ground truth, (b) false positives by the model and (c) false positives by the tree. (d) Feature vectors projected into the three-dimensional space spanned by the three eigenvectors of the sample set of true grasping points with the highest variance.

In order to visualise to which visual cues these grasping point models actually respond, we run PCA on the set of feature vectors that ended up at that node. The resulting principal components, selected according to Equation 7.17, are also depicted in Figure 7.18(a)-(c). Encoded are close-ups of the body of the pencil and perspective distortions. However, the majority of the pencil complies with these components. Because of that, the samples from the set of false positives are very similar to the ground truth samples. The appearance of the centre of mass is not that different from the rest of the pencil. This is further clarified by Figure 7.18(d), where the false positives by the model and tree are projected into the space spanned by the first three principal components from the ground truth: they strongly overlap. We will show later that, given our relative shape based representation, these first three principal components are already enough to define a space in which graspable points can be better separated from non-graspable points. For the other sets of objects we applied the same procedure. The principal components of the samples at each positive node are shown in Figures 7.19, 7.20 and 7.21. In Table 7.3, the fidelity of the respective trees in relation to the model and their accuracies are given.

Relative Shape In this section we evaluate the performance of the shape context representation in the same manner as the local appearance features were tested in the previous section.

Table 7.3: Accuracy of the models trained on different objects given the local appearance representation.

                 Pencil    Cups     Elongated   Pencil & Mug
Fidelity         86.78%    83.97%   87.55%      89.29%
Accuracy Tree    75.41%    82.61%   72.12%      73.30%
Accuracy Model   77.07%    83.85%   74.92%      74.32%

Figure 7.19: Cups: Ten samples and PCA components for each of the three positive nodes of the decision tree.

Figure 7.20: Elongated objects: Ten samples and PCA components for each of the two positive nodes of the decision tree.

Figure 7.21: Pencils and mugs: Ten samples and PCA components for the positive node of the decision tree.

Table 7.4: Accuracy of the models trained on different objects given the relative shape representation.

                 Pencil    Cups     Elongated   Pencil & Mug
Fidelity         78.97%    79.66%   78.79%      80.82%
Accuracy Tree    71.38%    76.89%   73.40%      73.41%
Accuracy Model   82.55%    88.01%   85.56%      84.64%

The process includes:

1. extracting the contour with the Canny edge detector,

2. filtering out spurious edge segments,

3. subsampling the contour,

4. normalizing the sampled contour with the median distance between contour points,

5. rotating the whole contour according to the average tangent direction of all the contour points falling into the patch that is currently considered by the classifier, and

6. finally plotting the resulting contour on a 20 × 20 pixel patch with the grasping point in the centre.

The output of this procedure forms the input for PCA. The sample feature vectors for each node are depicted not as patches but as red squared labels located at the grasping point on the object. Each of the induced trees in this section is of slightly worse quality in terms of fidelity when compared with the trees obtained for the logistic regression method (see Table 7.3). We reason that this is due to the performance of the Trepan algorithm when approximating SVMs. Nevertheless, the purpose of this section is the visualisation of prototypical grasping point features rather than impeccable classification. This performance is therefore acceptable. The results for the induced trees are given in Table 7.4. We start by analyzing the model trained on the set of pencils. The induced decision tree has one positive node. The samples from this node are depicted in Figure 7.22 along with the most relevant PCA components, to which we will refer in the remainder of this section as eigencontours. These components do not encode the local appearance but clearly the symmetric relative shape of the grasping point. One interesting observation is that the feature vectors projected into the space spanned by the three best principal components of the ground truth samples are quite well separable, even with a linear decision boundary.

Figure 7.22: Pencil: Ten samples and PCA components of the positive node of the decision tree, for (a) ground truth, (b) false positives by the model and (c) false positives by the tree.

Figure 7.23: (a) Pencil: Feature vectors projected into the three-dimensional space spanned by the three eigenvectors of the sample set of true grasping points with the highest variance. (b) Fisher's discriminant ratio and (c) volume of the overlap region measure linear separability for models trained on different training sets and with different classification methods. Dark: OrigLog. Bright: SCSVM.

There is almost no overlap between the false positives produced by the tree and the ground truth features, and little overlap between the false positives produced by the model and the true graspable features. This result is shown in Figure 7.23. We applied the same procedure to the models trained on the other sets of objects. The eigencontours for these are shown in Figures 7.24, 7.25 and 7.26. For the sets consisting of different objects, each positive node in the decision tree is mainly associated with one of the objects and encodes where it is graspable.

Figure 7.24: Cups: Ten samples and PCA components for each of the three positive nodes of the decision tree.

Figure 7.25: Elongated objects: Ten samples and PCA components for each of the three positive nodes of the decision tree.

Furthermore, we can observe a better separability compared to the models trained on local appearance. In order to quantify this observation, we analysed the distribution of the samples in the three-dimensional PCA space in terms of linear separability. As measures we employed Fisher's discriminant ratio and the volume of the overlap region. Figure 7.23 (b) and (c) show a comparative plot of these two measures for all the models considered in this section.

Summary of Results The evaluation provided valuable insight into the different feature representations. We observed that our compact feature descriptor based on relative shape is more discriminative than the feature descriptor that combines the output of a filter bank. The dimensionality of our descriptor is almost four times smaller, which also has implications for the time needed to train an SVM. The classification performance achieved with an SVM could even be improved by finding a decision boundary in the space spanned by the first three principal components of a set of ground truth prototypical features.

Figure 7.26: Pencils and mugs: Ten samples and PCA components for each of the three positive nodes of the decision tree.

7.4.2.3 Evaluation on Real Images

In the previous section, we showed that the performance of the relative-shape-based classifier is better than that of a method that applies local appearance. In these synthetic images no background clutter was present. However, in a real-world scenario we need to cope with clutter, occlusions, etc. One example is presented by Saxena et al. [200] in which a robot is emptying a dishwasher. In order to cope with the visual clutter occurring in such a scenario, the grasping-point model was trained on hand-labelled images of the dishwasher. Although the dishwasher was emptied successfully, for a new scenario the model has to be re-trained to cope with new backgrounds. We argue that we need a way to cope with backgrounds based on more general assumptions. As described earlier in Section 7.4.1.1, our method relies on scene segmentation. In this section, we evaluate how the relative shape based representation is affected by different levels of segmentation. For that purpose, we collected images of differently textured and texture-less objects, e.g., boxes, cans, cups, elongated objects, or toys, composed in scenes of different levels of complexity. The example scenes range from single objects on a table to several objects occluding each other. These scenes were segmented with the three different techniques described in Section 7.4.1.1. Ideally, we would like to achieve two things. The first is repeatability: the grasping points detected for the same object given different qualities of segmentation have to match. The second is robustness: the grasping points should be minimally affected by the amount of clutter. Regarding the latter point, a quantitative evaluation can only be performed by applying the inferred grasps in practice. Thus, we demonstrate our system on real hardware in Section 8.1.2 and present here some representative examples of the grasping point inference methods when applied to different kinds of objects situated in scenes of varying complexity.

Examples for Grasping Point Detection In Figure 7.27, we show the results of the grasping point classification for a teapot. The left column shows the segmented input, of which the first one is always the ground truth segment. The middle column shows the result of the grasping point classification when applying the local appearance based descriptor by [200], and the right one the result of the classification when using the relative shape based descriptor. The red dots label the detected grasping points. They are the local maxima of the resulting probability distribution. At most the ten highest-valued local maxima are selected. The figure shows the grasping point classification when the pot is the only object in the scene and when it is partially occluded. Note that in the case of local appearance based features, the segmentation only influences which patches are considered for the selection of grasping points. In the case of the relative shape based descriptor, the segmentation also influences the classification by determining which edge points are included in the shape context representation. Nevertheless, we can observe that the detection of grasping points with the representation proposed in this approach is quite robust. For example in Figure 7.27b (last row), even though there is now a second handle in the segmented region, the rim of the teapot is still detected as graspable and the resulting grasping point distribution looks similar to the cases in which the handle was not yet in the segment. This means that the object the vision system is currently fixating on, the one that dominates the scene, produces the strongest responses of the grasping point model even in the presence of other graspable objects. In Figure 7.28a, we applied the models trained on mugs and cups to images of a can and a cup. The descriptor based on local appearance responds very strongly to textured areas, whereas the relative shape based descriptor does not get distracted by texture since the whole object shape is included in the grasping point inference. Finally, in Figure 7.28b, we show an example of an object that is not similar to any object that the grasping point models were trained on. In the case of the local appearance based descriptor, the grasping point probability is almost uniform and very high-valued. In the case of shape context there are some clear peaks in the distribution. This suggests that the ability of these models to generalize over different shapes is higher than for local appearance-based models.

Repeatability of the Detection One of the goals of the method is the repeatability of the grasping point detection. In order to evaluate this, we measured the difference between the grasping points detected in the differently segmented images. For real images, we do not have any ground truth labels available as in the case of synthetic data. Thus, we cannot evaluate the grasp quality as done in Section 7.4.2.1. Instead, we use the grasping points detected in a manually segmented image as a reference to quantify the repeatability of the grasping point detection. We have a set B = {b_i | i = 1 ... N} of pictures and three different cues based on which they are segmented: zero disparity, a dominant plane and hue. We want to measure the difference d_{b_i} between the set of grasping points G_{b_i} = {g_(b_i,j) | j = 1 ... M} and the set of reference points Ĝ_{b_i} = {ĝ_(b_i,r) | r = 1 ... R} for a specific

kind of segmentation of the image b_i. Then

d_{b_i} = \frac{1}{K} \sum_{j=1}^{M} e^{-\frac{d_j^2}{2}}    (7.18)

where

d_j = \min_{r=1 \dots R} \mathrm{dist}(\hat{g}_{(b_i,r)}, g_{(b_i,j)}),    (7.19)

dist is the Euclidean distance and K the length of the image diagonal³. The mean and standard deviation of d_{b_i} over all images in the set B that are segmented with a specific cue is then our measure of the deviation of the detected from the reference grasping points. In Figure 7.29 we show this measure for a representative selection of objects and models. As already mentioned, ideally we would like to see no difference between the detected grasping points when facing different qualities of segmentation. In practice, we observe a flat slope. As expected for both methods, the grasping points detected in the image segmented with the zero-disparity cue alone deviate the most from the reference points. Although the selection of points that are included in our representation is directly influenced by the segmentation, the difference between detected and reference grasping points is not always bigger than for the appearance based method. In fact, it is sometimes even smaller. This holds for example for the models trained on mugs and cups, for which both methods show a similar accuracy on synthetic data (Figure 7.29 (a) and (b)). If the models are applied to novel objects, as can be observed in Figure 7.29 (c), our descriptor shows a better repeatability. This again suggests a better capability of the models to generalize across different relative shapes. In general, we can say that both methods are comparable in terms of repeatability.
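For illustration, the measure can be computed as sketched below, assuming the detected and reference grasping points are given as 2D arrays in patch coordinates and K is passed in as a constant.

# Sketch of Eqs. 7.18 and 7.19: deviation of grasping points detected in an
# automatically segmented image from those detected in the manual reference.
import numpy as np

def segmentation_deviation(detected, reference, K=80.0):
    """detected: (M, 2) grasp points; reference: (R, 2) reference grasp points;
    K: normalisation constant (length of the image diagonal in patch units)."""
    d = 0.0
    for g in detected:
        dj = np.min(np.linalg.norm(reference - g, axis=1))
        d += np.exp(-dj ** 2 / 2.0)
    return d / K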

Summary of Results In this section, we evaluated the performance of our approach on real images. Due to the encoding of global shape, the method is robust against occlusions and strong texture. Although our representation strongly depends on the segmentation, we observe that the repeatability of the grasping points is comparable to the local appearance based method even when facing imperfect segmentation. The analysis included images of varying qualities of segmentation as well as occlusion and clutter.

7.5 Discussion

In this chapter, we reviewed approaches towards generating grasp hypotheses given different kinds of prior object knowledge. Our proposed taxonomy is inspired by research in neuropsychology and classifies them dependent on whether they assume known, familiar or unknown objects.

³ In our case K = 80 since we are evaluating 10 × 10 pixel patches in images of size 640 × 480 pixels.

(a) Single teapot. (b) Partly occluded teapot. Figure 7.27: Grasping point model trained on mugs and pencils applied to a textureless teapot. The darker a pixel, the higher the probability that it is a grasping point.

(a) Grasping point model trained on mugs and cups applied to a textured cup. (b) Grasping point model trained on pencils, martini glasses and a whiteboard eraser applied to a novel object. Figure 7.28: The darker a pixel, the higher the probability that it is a grasping point.

(a) Set of cans. (b) Set of textureless teapots. (c) Set of novel objects. Figure 7.29: Comparing the stability of grasp point detection of SCSVM and OrigLog for different sets of objects and different grasping point models when facing imperfect segmentation.

For each of these cases, we have proposed our own methods. As input, most of them rely on a set of segmented object hypotheses. For the case of known objects, we showed three different methods that fit object models to the sensory data. They are presented in order of a decreasing amount of assumptions; thereby, they can cope with increasing scene complexity. Furthermore, we found that using completed object models leads to increased robustness for joint object recognition and pose estimation. For unknown objects, we used the results from Chapter 6 in which the 3D model of an unknown object is estimated from a partial point cloud. Grasp hypotheses are generated based on the completed object model. For familiar objects, we proposed an approach that combines 2D and 3D shape cues of an object. A machine learning approach is used to transfer grasping experience from objects encountered earlier. In the next chapter, we will show how the inferred grasp hypotheses can be executed by a real robot. The majority of the grasp synthesis systems that we reviewed in this chapter, including our own approaches, can be easily positioned within the proposed taxonomy. In turn this means that, given an object hypothesis, a prior assumption is made that either there is a corresponding object model in the database that just needs to be found (known objects), that a similar object has been grasped before and the grasp experience can be transferred (familiar objects), or that the object is unknown and dissimilar to any of the objects previously encountered. Only a few systems consider the problem of deciding in a flexible way which kind of method should be applied. Among the systems that we reviewed in this thesis, these are the ones developed by Ciocarlie et al. [52] and Marton et al. [149], where a grasp planner for unknown objects is used for the object hypotheses that do not match previously encountered objects.

8 Grasp Execution

Many of the approaches that we reviewed in Section 7.1 for generating grasp hypotheses have not been demonstrated on a robot. Of those that have, most employ an open-loop controller. An open-loop controller computes its actions taking only the current state into account, without observing the outcome of these actions and whether they reached the desired goal. However, realistic applications require going beyond open-loop execution to deal with the different types of errors occurring in an integrated robotic system. One kind of error is systematic and repeatable, introduced mainly by inaccurate kinematic models. These errors can be minimized offline through precise calibration. The second kind are random errors introduced by the limited repeatability of the motors or by sensor noise. These have to be compensated online, for example through closed-loop controllers. In a grasping system, this can be achieved through, for example, visual servoing or reactive grasping approaches using force-torque or tactile sensors. In these cases, the outcome of actions is observed during execution and the actions are adapted to achieve the desired goal. In this chapter, we will first demonstrate open-loop grasping given grasp hypotheses from either familiar or unknown objects. This will serve as an example for motivating techniques for closed-loop execution of grasps. Different methods will be reviewed. This will be followed by a demonstration of different visual servoing techniques applied in our grasping pipeline for known and unknown objects.

8.1 Open Loop Grasp Execution

As an open-loop grasp controller, we consider a system that

• uses any of the grasp synthesis systems reviewed in the previous chapter to compute a set of grasp hypotheses for an object,

• computes a trajectory to reach the desired grasp pose and

• executes this trajectory without any adaptation to, e.g., changes in the environment.

We employ such a controller to execute grasps for unknown and familiar objects.


Algorithm 4: Pseudo Code for a Table Top Scenario
Data: Embodiment, Table Plane
Result: Cleaned Table Top
begin
    S = GetObjectHypotheses()
    // Grasp Cycle per Object Hypothesis
    for i = 0; i < |S|; i++ do
        s*_i = PredictObjectShape(s_i)
        s*_i = ConvexDecomposition(s*_i)
        G = GetGraspCandidates(s*_i)
        for j = 0; j < |G|; j++ do
            success = PlanGrasp(g_j)
            if success then
                // Grasp Candidate gives Stable Grasp
                t_(i,j) = PlanArmTrajectory(g_j)
                if t_(i,j) ≠ ∅ then
                    // Collision-Free Trajectory Exists
                    T = InsertTrajectory(T, t_(i,j), g_j)
                end
            end
        end
        k = 0
        repeat
            success = ExecuteGrasp(T, k)
            k++
        until success
    end
end

8.1.1 Executing Grasps on Unknown Objects

In this section, we consider the scenario of grasping unknown objects from a table top. We apply the method of predicting unknown object shape from partial point clouds as proposed in Chapter 6. One grasp cycle for this scenario is outlined in Algorithm 4.

As a step prior to the shape analysis of an object and grasping it, the robot needs to segregate potential objects from the background. The function GetObjectHypotheses to obtain a segmented point cloud is implemented as explained in Section 2.2.1. The function PredictObjectShape refers to the shape estimation as described in Chapter 6. With the complete object shape as input, GetGraspCandidates implements the techniques outlined in Section 7.1.2. These grasp candidates are then simulated on the predicted object shape and a collision-free path for the arm is planned.

The simulation platform chosen to perform the experiments is OpenGRASP [7], a simulation toolkit for grasping and dexterous manipulation. It is based on OpenRAVE [66], an open architecture targeting a simple integration of simulation, visualization, planning, scripting and control of robot systems. It enables the user to easily extend its functionality by developing custom plugins. The simulator is used to perform the grasp before the real robot makes an attempt. It allows for testing different alternatives and choosing the one with the highest probability of success. This not only takes considerably less time than performing the grasps with the real hardware but also prevents damaging the robot by avoiding collisions with the environment.

For the collision detection necessary for motion planning, the simulator needs complete object models. When known objects are used, an accurate model of the objects can be created off-line using different technologies, like laser scans. In the case of unknown objects, an approximate model has to be created on-line. As mentioned before, in this section we propose to use the approach described in Chapter 6 to estimate a complete 3D mesh model. The obtained triangular mesh has thousands of vertices, which makes the collision detection process computationally expensive. To increase efficiency, we pre-process the mesh with ConvexDecomposition, a library that was originally created by Ratcliff [183] and that is implemented in OpenRAVE. It approximates a triangular mesh as a collection of convex components. This process takes only a few seconds and drastically speeds up grasp and motion planning.
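To make the control flow concrete, the following Python sketch mirrors one pass of Algorithm 4. All callables passed into the function are hypothetical stand-ins for the modules described above (segmentation, shape prediction, convex decomposition, grasp and motion planning); none of them refer to an existing API.

```python
# Minimal sketch of the grasp cycle in Algorithm 4. The callables passed in
# are hypothetical placeholders for the modules described in the text;
# they are not part of any existing library.

def clean_table_top(scene, get_object_hypotheses, predict_object_shape,
                    convex_decomposition, get_grasp_candidates,
                    plan_grasp, plan_arm_trajectory, execute_grasp):
    for s_i in get_object_hypotheses(scene):          # segmented hypotheses (Sec. 2.2.1)
        s_star = predict_object_shape(s_i)            # shape completion (Chapter 6)
        s_star = convex_decomposition(s_star)         # speeds up collision checks
        trajectories = []                             # T in Algorithm 4
        for g_j in get_grasp_candidates(s_star):      # candidates (Sec. 7.1.2)
            if not plan_grasp(g_j, s_star):           # force-closure test in simulation
                continue
            t_ij = plan_arm_trajectory(g_j)           # RRT + Jacobian-based descent
            if t_ij is not None:                      # collision-free trajectory exists
                trajectories.append((t_ij, g_j))
        for t_ij, g_j in trajectories:                # try stored grasps until one succeeds
            if execute_grasp(t_ij, g_j):
                break
```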

8.1.1.1 Generation of Grasp Candidates

For selecting an appropriate grasp from the set of grasp hypotheses generated through GetGraspCandidates, each of them is simulated: the end-effector is moved until it collides with the object, then the fingers close around it and finally the contacts are used to test for force closure. In Algorithm 4, this process is referred to as PlanGrasp. Each grasp hypothesis has the following parameters: the approach vector, the hand pre-shape, the approach distance and the end-effector roll. The number of different values that these parameters can take has to be chosen considering the time constraints imposed by the on-line execution of PlanGrasp. As hand pre-shape, we defined a pinch grasp for each hand. The approach distance is varied between 0 and 20 cm. The roll is chosen dependent on the two grasping points (such that the fingers are aligned with them) or on the orientation of the circle on which the selected approach vector is defined. Compared to sampling grasp candidates based on normal directions, the amount of time taken to generate and save the set of grasp candidates is reduced significantly to less than a minute.
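As an illustration of this parameterization, the sketch below enumerates grasp candidates as combinations of approach vector, hand pre-shape, approach distance and end-effector roll. The concrete sampled values are made up for illustration; in the actual system they come from the methods of Section 7.1.2 and the two detected grasping points.

```python
import itertools
import numpy as np

# Hypothetical, coarse discretization of the grasp-candidate parameters.
approach_vectors   = [np.array(v) for v in [(0, 0, -1), (1, 0, 0), (0, 1, 0)]]
hand_preshapes     = ["pinch"]                      # one pinch pre-shape per hand
approach_distances = np.linspace(0.0, 0.20, 5)      # 0 to 20 cm
rolls              = np.deg2rad([0, 45, 90, 135])   # wrist roll about the approach axis

grasp_candidates = [
    {"approach": a, "preshape": p, "distance": d, "roll": r}
    for a, p, d, r in itertools.product(
        approach_vectors, hand_preshapes, approach_distances, rolls)
]
print(len(grasp_candidates), "candidates to simulate with PlanGrasp")
```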

8.1.1.2 Grasp Execution Using Motion Planners

The next step in Algorithm 4, PlanArmTrajectory, consists of selecting a stable grasp from the set of grasp candidates that can be executed with the current robot configuration without colliding with obstacles. For each stable grasp, it first moves the robot to the appropriate grasp pre-shape, then uses RRT and Jacobian-based gradient descent methods [225] to move the hand close to the target object, closes the fingers, grabs the object, moves it to the destination while avoiding obstacles and releases it.

Figure 8.1: Example of the grasp performed by the simulated robot and the real one, using (top) the KTH platform and (bottom) the UJI platform.

If the robot successfully grabs the object and moves it to the destination, the stable grasp is returned for execution with the real robot. Otherwise, the next stable grasp from the set is tried. Figure 8.1 shows snapshots of the grasp execution using the simulator and our robot. This work has been done in collaboration with the Universitat Jaume I (UJI). How this grasp cycle is implemented on the very different UJI platform is detailed in [4]; an example execution on that platform is also shown in Figure 8.1.

8.1.2 Executing Grasps on Familiar Objects

In Section 7.4, we proposed a method for generating grasp hypotheses on familiar objects in which 2D and 3D shape cues of an object hypothesis were combined. A machine learning approach was used to transfer grasping experience from objects encountered previously. 2D grasping points detected in this way were used to search for full grasp poses given a minimal 3D representation. Since no mesh model was built of the object, the simulator was not used for grasp and motion planning. Instead, we inferred one grasp hypothesis and directly executed it on the robot.

In Figure 8.2, video snapshots of the robot grasping three different objects are given along with the segmented input image, the inferred grasping point distribution and the detected dominant plane.¹ For this demonstration, we rejected grasping points for which the approach vector would result in a collision with the table. In general, we can observe that the generated grasp hypotheses are reasonable selections from the huge amount of potentially applicable grasps. Failed grasps are due to the fact that there is no closed-loop control implemented, either in terms of visual servoing or hand movements. Grasps also fail due to slippage or collision.

8.2 Closed-Loop Grasp Execution

In contrast to an open-loop grasp controller, a closed-loop controller does not simply execute the desired trajectory without taking feedback into account. Instead, the outcome of the execution is monitored constantly and the actions are adapted to ensure that the desired goal is reached. Different closed-loop systems have been employed for executing grasps. Visual servoing techniques are mainly used for controlling the reaching trajectory. Reactive grasp controllers employ force-torque or haptic sensors to ensure that the grasp is successful.

After outlining the error sources in our robotic system in more detail, we will review existing approaches for closed-loop control in grasping. This is followed by examples of how we use different visual servoing techniques to grasp known and unknown objects as described in our previous work [10].

8.2.1 Error Sources in Our Robotic System

The grasp cycle outlined in Algorithm 4 relies on the assumption that all the parameters of the system are perfectly known. This includes, e.g., the internal and external camera parameters as well as the pose of the head, arm and hand in a globally consistent coordinate frame. In reality, however, two different types of errors are inherent to the system. One type comprises the systematic errors that are repeatable and arise from inaccurate calibration or inaccurate kinematic models. The other group comprises random errors originating from noise in the motor encoders or in the cameras. These errors propagate and deteriorate the relative alignment between hand and object.

The most reliable component in our system is the Kuka arm, which has a repeatability of less than 0.03 mm [126]. Our stereo calibration method uses the arm as world reference frame and is described in more detail in [10]. We achieve an average re-projection error of 0.1 pixels. Since the peripheral and foveal cameras of the active head are calibrated simultaneously, we also obtain the transformation between them. In addition to the systematic error in the camera parameters, other effects such as camera noise and specularities lead to random errors in the stereo matching.

¹ A number of videos can be downloaded from www.csc.kth.se/~bohg.

Figure 8.2: Generating grasps for different objects. (a) Grasping a Cup - Processed Visual Input. (b) Grasping a Cup - Video Snapshots. (c) Grasping a Sprayer Bottle - Processed Visual Input. (d) Grasping a Sprayer Bottle - Video Snapshots. (e) Grasping a Salt Box - Processed Visual Input. (f) Grasping a Salt Box - Video Snapshots. Left: Grasping Point, Projected Grasping Point and Plane Centroid. Middle: Grasping Point Probabilities. Right: Probabilities of Pixels Belonging to the Dominant Plane.

Since we are using the arm to calibrate the stereo system, the error in the transformation between the arm and the cameras is proportional to the error of the stereo calibration.

We are using the robot head to actively explore the environment through gaze shifts and fixation on objects. This involves dynamically changing the epipolar geometry between the left and right camera system. Also the camera position relative to the head origin changes. The internal parameters remain static. Only the last two joints of the head, the left and right eye yaw, are used for fixation and thereby affect the stereo calibration. The remaining joints are actuated for performing gaze shifts. In order to accurately detect objects with respect to the cameras, the relation between the two camera systems as well as between the cameras and the head origin needs to be determined after each movement. Ideally, these transformations should be obtainable just from the known kinematic chain of the robot and the readings from the encoders. In reality, these readings are erroneous due to noise and inaccuracies in the kinematic model.

To reduce the systematic errors, we are extending our stereo calibration method to calibrate not just for one head pose, but for several. This is described in more detail in [10][13]. Regarding random errors, the last five joints in the kinematic chain of the active head achieve a repeatability in the range of ±0.025 degrees [17]. The neck pitch and neck roll joints have a lower repeatability in the range of ±0.13 and ±0.075 degrees, respectively.

8.2.2 Related Work

The systematic and random errors in the robotic system lead to an erroneous alignment of the end effector with an object. A system that executes a grasp in an open-loop fashion without taking any sensory feedback into account is likely to fail. In the following, we will review existing approaches towards calibrating the robot and towards closed-loop control.

8.2.2.1 Calibration Methods

In [222], the authors presented a method for calibrating the active stereo head. The correct depth estimation of the system was demonstrated by letting it grasp an object held in front of its eyes. No dense stereo reconstruction was shown in this work. Similarly, Welke et al. [229] present a procedure for calibrating the Karlsruhe robotic head. Our calibration procedure [10] is similar to the one described in those papers, with a few differences. We extend it to the calibration of all joints, thus obtaining the whole kinematic chain. Also, the basic calibration method is modified to use an active pattern instead of a fixed checkerboard.

Pradeep et al. [173] present a system in which a robot moves a checkerboard pattern in front of its cameras to several different poses. Similar to our approach, it can use its arms as kinematic chain sensors and automate the otherwise tedious calibration process. However, the poses of the robot are hand-crafted such that the checkerboard is guaranteed to be visible to the sensor that is to be calibrated. In our approach, we use visual servoing to automatically generate a calibration pattern that is uniformly distributed in camera space.

8.2.2.2 Visual Servoing for Grasp Execution

Some grasping systems make use of visual feedback to correct a wrong alignment before contact between the manipulator and the object is established. Examples are proposed by Huebner et al. [102] and Ude et al. [223], who use a robotic platform similar to ours, including an active head.

In [102], the Armar III humanoid robot is enabled to grasp and manipulate known objects in a kitchen environment. Similar to our system [13], a number of perceptual modules are at play to fulfill this task. Attention is used for scene search. Objects are recognized and their pose estimated with the approach originally proposed by Azad et al. [19]. Once the object identity is known, a suitable grasp configuration can be selected from an offline constructed database. Visual servoing based on a spherical marker attached to the wrist is applied to bring the robotic hand to the desired grasp position, as described in Vahrenkamp et al. [224]. Different from our approach, absolute 3D data is estimated by fixing the 3 DoF of the eyes to a position for which a stereo calibration exists. The remaining degrees of freedom controlling the neck of the head are used to keep the target and the current hand position in view. In our approach, we reconstruct the 3D scene by keeping the eyes of the robot in constant fixation on the current object of interest. This ensures that the left and right visual fields overlap as much as possible, thereby maximizing, e.g., the amount of 3D data that can be reconstructed. However, the calibration process becomes much more complex.

In the work by Ude et al. [223], fixation plays an integral part in the vision system. Their goal is, however, somewhat different from ours. Given that an object has already been placed in the hand of the robot, the robot moves it in front of its eyes through closed-loop vision-based control. By doing this, it gathers several views of the currently unknown object for extracting a view-based representation that is suitable for recognizing it later on. Different from our work, no absolute 3D information is extracted for the purpose of object representation. Furthermore, the problem of aligning the robotic hand with the object is circumvented.

8.2.2.3 Reactive Grasping Approaches

The main problem of open-loop grasping is that uncertainties introduced by the perception and motor systems are not taken into account during execution. Several approaches deal with this by adapting the grasps online, by re-grasping or by taking uncertainties directly into account when learning appropriate grasps.

In the following, we will review existing approaches grouped according to the assumed prior knowledge about the object.

Known Objects When the task of the robot is to grasp a known object, the main source of uncertainty lies in pose estimation. Bekiroglu et al. [25] use a combination of visual pose tracking and tactile features to assess grasp quality. The problem is approached through supervised learning based on labeled training data to decide before lifting whether a grasp is going to be successful or not. Hsiao et al. [100] approach the problem of grasping under pose uncertainty in a decision-theoretic manner. Based on the current belief state, the system decides between grasping and information-gathering actions using tactile input. Once the object pose is known with sufficient certainty, the grasp is executed. Tegin et al. [216] develop a reactive grasp controller in simulation in a framework of human-to-robot mapping of grasps. The approach is shown to be robust against pose errors introduced by visual pose estimation. Romero et al. [186] present a similar system based on visual hand pose estimation instead of a magnetic tracker. Not only uncertainties in object pose but also inaccuracies in the mapping are taken into account. Although the object is not explicitly modelled, it is assumed that the grasp to be imitated by the robot is performed on the same object in the same pose.

The approaches in [124, 167] model the whole reaching and grasping movement. Pastor et al. [167] adjust movement primitives online to take pose uncertainties into account. Expected sensor readings during grasping of an object are recorded from an ideal demonstration. During autonomous execution, the movement primitive is adapted online in case of divergences between the expected and actual sensor readings. It is shown that this approach leads to an increased grasp success rate for different objects. An open question remains how the system can select among a set of object-specific expected sensor readings when executing a grasping action. Kroemer et al. [124] apply a reinforcement learning approach in which a low-level reactive controller provides the high-level learner with rewards dependent on how much the fingers moved during the execution of the grasp. Primitives modeling the whole reaching movement are also adapted using a sparse visual representation of the environment to avoid collisions. This approach has been demonstrated for different objects, with the learnt policy converging to high-reward grasps.

Unknown Objects The challenge in the case of grasping unknown objects is different from the previous case. Instead of an unknown relative pose between hand and object, here we have incomplete and noisy information about the object geometry. The following approaches assume that there is a system that provides them with an initial grasp hypothesis. This can be any of those presented in Section 7.1.2. However, these hypotheses are usually inferred from either incomplete object models or from completed but not necessarily accurate object models.

Felip and Morales [78] developed so-called grasp primitives as indivisible actions defined by an initial hand pre-shape, a sensor-based controller and an end condition. The goal of these primitives is to achieve an alignment between object and hand that ensures a stable grasp. This is achieved by a sequence of re-grasping actions. Mainly force-torque and finger joint readings are used. Hsiao et al. [99] propose a different reactive controller that uses contact information from tactile sensors. The hand position is adapted based on the readings until contact with the palm is detected and a compliant grasp is executed. Maldonado et al. [142] propose a similar approach using contact sensing in the fingertips during the approach towards an object. Sensed obstacles are included in the representation of the environment and subsequently used to adapt the grasp hypothesis.

All these approaches adapt the initial grasp hypothesis based on force-torque or tactile sensor readings. One of their disadvantages is that during contact the previous model of the scene might change. Previously inferred grasp hypotheses can then be inappropriate or even impossible. A new model of the world has to be acquired, which is an expensive operation. A different reactive grasping approach is taken by Hsiao et al. [98] in which optical proximity sensors are used. Contact with the object is avoided while enough information can be sensed to adapt the initial grasp strategy to one that has a much higher likelihood to succeed.

8.2.3 Visual Servoing for Grasping Known Objects

In this section, we will demonstrate a simple position-based visual servoing approach to perform grasping of known objects from a table top. We assume that the object to be grasped has been segmented and reconstructed using the vision system described in Section 2.2.1. Furthermore, we assume that the object has been recognized and its pose estimated using the approach described in Section 7.2.2.1. Due to the use of a fixating system, we also know that the object is in the center of both the left and the right foveal camera.

From this information, we can infer all the parameters for a grasp. As approach vector we choose the negative gravity vector, i.e., in this scenario we will only perform simple top grasps. To determine the wrist orientation of the hand, we use the information from pose estimation. In case of a roughly rectangular shape, the minor dimension is aligned with the vector between thumb and finger of the Schunk hand. For a circular shape, the wrist is left in its default orientation. As a fixed grasping point, we choose the intersection of the principal axes of the left and right foveal cameras in the arm base coordinate frame, which we know to be lying on the object.

8.2.3.1 Position-Based Visual Servoing for Alignment

In this scenario, we will use position-based visual servoing to align the hand with the object prior to the actual grasp. The goal is to align the end effector at position $x^A_m$ with a point $\hat{x}^A_g$ above the grasping point, both defined in the arm base coordinate frame $A$. Please note that the goal position is labeled with a hat to indicate that it is only an estimate of the true position. This is due to remaining errors in the calibration of the cameras and the kinematic chain of the active head. The position of the end effector is known with an accuracy of 0.3 mm. However, we want to minimise the distance between the current end effector position and the goal position with respect to the camera frame. Therefore, we are using a fiducial marker, an LED rigidly attached to the Schunk hand, to determine the position of the end effector relative to the camera coordinate frame. The LED can be seen in Figure 8.3.

As defined in [104], such a positioning task can be represented by a function $e: \mathcal{T} \rightarrow \mathbb{R}^N$ that maps a pose in the robot's task space $\mathcal{T}$ to an $N$-dimensional error vector. This function is referred to as kinematic error function. In our simple scenario $\mathcal{T} = \mathbb{R}^3$ and therefore $N = 3$, since we will only change the position of the hand but keep the orientation fixed. In this way, we ensure the visibility of the LED. We consider the positioning task as fulfilled when $e(x^A_m) = 0$.

8.2.3.2 Computing the Goal Point

Let us define $\hat{x}_{(g,l)}$ and $\hat{x}_{(g,r)}$ as the projections of the goal point $\hat{x}^A_g$ onto the left and right image. The parameters of the calibrated stereo camera and robot kinematics are available through the calibration procedure described in [10]. We refer to the complete transformation from image space $I$ to arm space $A$ as $\hat{T}^A_I$. Note that we label this transform with a hat to indicate that it is only an estimate of the true transform. Using this, the centers of both the left and right image can be transformed into the arm coordinate system, yielding

$$\hat{x}^A_c = \hat{T}^A_I(\hat{x}_{(c,l)}, \hat{x}_{(c,r)}). \quad (8.1)$$

Since we are considering grasping of known objects, we shift this point according to the height of the object and the known length of the fingers. Back-projecting the result gives us the desired goal positions $\hat{x}_{(g,l)}$ and $\hat{x}_{(g,r)}$ in the stereo images:

$$(\hat{x}_{(g,l)}, \hat{x}_{(g,r)}) = \hat{T}^I_A(\hat{x}^A_c + (0, 0, h)^T). \quad (8.2)$$
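Assuming the calibrated transforms between image and arm space are available as functions (the names triangulate_to_arm_frame and project_to_images below are placeholders, not an actual API), the goal point computation of Equations 8.1 and 8.2 amounts to a triangulation followed by a vertical offset and a re-projection:

```python
import numpy as np

def goal_point_above_object(triangulate_to_arm_frame, project_to_images,
                            x_c_left, x_c_right, object_height, finger_length):
    """Sketch of Eqs. (8.1)-(8.2): triangulate the fixated image centers,
    lift the result by the object height plus finger length, and re-project
    it into both images. The two function arguments stand in for the
    calibrated transforms T^A_I and T^I_A, which are not spelled out here."""
    x_c_arm = triangulate_to_arm_frame(x_c_left, x_c_right)       # Eq. (8.1)
    h = object_height + finger_length                             # assumed offset
    x_g_arm = x_c_arm + np.array([0.0, 0.0, h])
    x_g_left, x_g_right = project_to_images(x_g_arm)              # Eq. (8.2)
    return x_g_arm, x_g_left, x_g_right
```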

8.2.3.3 Control Law

We want to compute the desired translational velocity $u$ to move the end effector such that the LED at position $\hat{x}^A_m$ is aligned with $\hat{x}^A_g$. This means in turn that the projections of these two points have to be aligned in the images. The positions of the LED $\hat{x}_{(m,l)}$ and $\hat{x}_{(m,r)}$ projected to the left and right image are detected with subpixel precision. Given this, we can compute $\hat{x}^A_m$ using the same transformation $\hat{T}^A_I$. Then the translational velocity is computed as

$$u = -k\, e_{pp}(\hat{x}^A_m;\ \hat{T}^A_I(\hat{x}_{(g,l)}, \hat{x}_{(g,r)}),\ \hat{T}^A_I(\hat{x}_{(m,l)}, \hat{x}_{(m,r)})) \quad (8.3)$$
$$= -k\, e_{pp}(\hat{x}^A_m;\ \hat{x}^A_g,\ \hat{x}^A_m) \quad (8.4)$$
$$= -k\, (\hat{x}^A_m - \hat{x}^A_g) \quad (8.5)$$

where the argument left of the semicolon is the parameter to be controlled (the end effector position), and the arguments right of the semicolon parameterize the positioning task. $k$ is the gain parameter. Once the robot hand is hovering over the object, we align the wrist correctly and perform the actual grasp, lifting and transporting the object in an open-loop manner. Example snapshots from a grasping action executed on the toy tiger are shown in Figure 8.3.
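A minimal sketch of the proportional control law in Equation 8.5, assuming the LED and goal positions have already been triangulated into the arm base frame; the gain value is illustrative only:

```python
import numpy as np

def pbvs_step(x_m_arm, x_g_arm, k=0.5):
    """Proportional position-based visual servoing step (Eq. 8.5):
    translational velocity driving the LED position x_m towards the goal
    position x_g, both expressed in the arm base coordinate frame."""
    return -k * (np.asarray(x_m_arm) - np.asarray(x_g_arm))

# Example: LED 5 cm to the side and 10 cm below the goal point
u = pbvs_step([0.45, 0.05, 0.30], [0.40, 0.05, 0.40], k=0.5)
print(u)   # [-0.025  0.     0.05 ], a command pointing towards the goal
```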

8.2.4 Virtual Visual Servoing for Grasping Unknown Objects

For the accurate control of the robotic manipulator using visual servoing, it is necessary to know its position and orientation (pose) with respect to the camera. In the system exemplified in the previous section, the end effector of the robot is tracked using an LED. Such fiducial markers are common solutions to the tracking problem. In Section 8.2.2.2, we mentioned other approaches using colored spheres or Augmented Reality tags. A disadvantage of this marker-based approach is that the mobility of the robot arm is constrained to keep the marker always in view. Furthermore, the position of the marker with respect to the end effector has to be known exactly. For these reasons, we propose in [10] to track the whole manipulator instead of only a marker. Thereby, we alleviate the problem of constrained arm movement. Additionally, collisions with the object or other obstacles can be avoided in a more precise way.

8.2.4.1 Image-Based Visual Servoing for Aligning the Virtual with the Real Arm

Our approach to estimating and tracking the robot pose is based on virtual visual servoing (VVS), in which a rendered model of a virtual robot is aligned with the current image of its real counterpart. Figure 8.4 shows a comparison of the robot localisation in the image using only the calibration with the case where this localisation has been corrected using VVS. In [10], this is achieved by extracting features from the rendered image and comparing them with the features extracted from the current camera image. Based on the extracted features, we define an error vector

$$s(t) = [d_1, d_2, \ldots, d_N]^T \quad (8.6)$$

between the desired values for the features and the current ones. Let $\dot{s}$ be the rate of change of these distances given a change in pose of the end effector. This change

Figure 8.3: Example Grasp for the Recognized Tiger. (a) Positioning Arm Above Object. (b) Moving Arm Down to Pre-Grasp Position. (c) Grasping the Object. (d) Lifting Object. (e) Moving Object to Goal Position. (f) Release Object at Goal Position.

Figure 8.4: Comparison of robot localization with (right) and without (left) the Virtual Visual Servoing correction.

can be described by a translational velocity $t = [t_x, t_y, t_z]^T$ and a rotational velocity $\omega = [\omega_x, \omega_y, \omega_z]^T$. Together, they form a velocity screw:

$$\dot{r}(t) = [t_x, t_y, t_z, \omega_x, \omega_y, \omega_z]^T. \quad (8.7)$$

We can then define the image Jacobian $J$ so that

$$\dot{s} = J\dot{r} \quad (8.8)$$

where

$$J = \frac{\partial s}{\partial r} = \begin{bmatrix} \frac{\partial d_1}{\partial t_x} & \frac{\partial d_1}{\partial t_y} & \frac{\partial d_1}{\partial t_z} & \frac{\partial d_1}{\partial \omega_x} & \frac{\partial d_1}{\partial \omega_y} & \frac{\partial d_1}{\partial \omega_z} \\ \frac{\partial d_2}{\partial t_x} & \frac{\partial d_2}{\partial t_y} & \frac{\partial d_2}{\partial t_z} & \frac{\partial d_2}{\partial \omega_x} & \frac{\partial d_2}{\partial \omega_y} & \frac{\partial d_2}{\partial \omega_z} \\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\ \frac{\partial d_N}{\partial t_x} & \frac{\partial d_N}{\partial t_y} & \frac{\partial d_N}{\partial t_z} & \frac{\partial d_N}{\partial \omega_x} & \frac{\partial d_N}{\partial \omega_y} & \frac{\partial d_N}{\partial \omega_z} \end{bmatrix} \quad (8.9)$$

which relates the motion of the (virtual) manipulator to the variation in the features. Our goal, however, is to move the robot by $\dot{r}$ such that $s$ becomes equal to 0. Therefore, we need to compute $\dot{r}$ given $\dot{s}$. In our case, the Jacobian is non-square and therefore not invertible. Hence, we compute

$$\dot{r} = J^{+}\dot{s} \quad (8.10)$$

where $J^{+}$ is the pseudoinverse of $J$. Let us denote the pose of the virtual robot as $T^A$; then the kinematic error function $e_{im}$ can be defined as

$$e_{im}(T^A; s, 0) = 0 - s \quad (8.11)$$

which leads to the simple proportional control law

$$u = -K J^{+} s \quad (8.12)$$

where $K$ is the gain parameter. For details on how the Jacobian and the features are computed, we refer to [10].
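Numerically, the control law of Equations 8.10 to 8.12 reduces to a pseudoinverse step. The sketch below assumes that the feature error vector s and the N x 6 image Jacobian J are provided by the rendering and feature-extraction pipeline of [10]; the toy values only illustrate the shapes involved:

```python
import numpy as np

def vvs_step(J, s, gain=0.3):
    """One virtual-visual-servoing update (Eqs. 8.10-8.12): map the
    N-dimensional feature error s to a 6-DoF velocity screw of the
    virtual arm via the pseudoinverse of the N x 6 image Jacobian J."""
    return -gain * np.linalg.pinv(J) @ s   # [tx, ty, tz, wx, wy, wz]

# Toy example with 8 feature distances and a random Jacobian
rng = np.random.default_rng(0)
J = rng.standard_normal((8, 6))
s = rng.standard_normal(8)
print(vvs_step(J, s))
```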

8.2.4.2 Incorporating Pose Corrections for Grasping Unknown Objects

Given the corrected pose of the robot arm relative to the camera, we have a corrected transformation $T^A_I$ converting pixel coordinates in the left and right image to coordinates in the arm base frame. Therefore, given goal positions $x_{(g,l)}$ and $x_{(g,r)}$ of the robot's end effector in the left and right image, we can convert them to a goal position $x^A_g$. This position is computed based on an already corrected pose estimate. We therefore send the end effector to this position in open loop to correctly align the hand with the object.

Similar to the previous section, the goal position of the robot arm is above the object to be grasped. This point can be inferred using any of the methods proposed in Chapter 7. However, in [10] we demonstrated grasping without assuming any knowledge about the object. Given a segmented point cloud, we project it onto the table and use its centre of mass to define the approach vector for a top grasp. The wrist is aligned with the minor eigenvector of the projection. Example snapshots of one execution can be found in Figure 8.5.
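A sketch of this heuristic, under the assumption that the segmented object points are given in the arm base frame with the z-axis pointing up and the table parallel to the x-y plane; the function name is hypothetical:

```python
import numpy as np

def top_grasp_from_point_cloud(points):
    """points: (N, 3) array of a segmented object in the arm base frame,
    z pointing up. Returns an approach point above the object and the wrist
    direction (minor eigenvector of the table-plane projection)."""
    points = np.asarray(points)
    xy = points[:, :2]                         # project onto the table plane
    centroid = xy.mean(axis=0)                 # centre of mass of the projection
    cov = np.cov(xy - centroid, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigenvalues in ascending order
    minor_axis = eigvecs[:, 0]                 # direction of smallest extent
    grasp_point = np.array([centroid[0], centroid[1], points[:, 2].max()])
    return grasp_point, minor_axis
```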

8.3 Discussion

In this chapter, we have first reviewed different closed-loop control schemes for grasp execution. We distinguished between vision-based and tactile-based techniques. Visual servoing is usually applied to bring the end effector into an appropriate pre-grasp position. Reactive grasping approaches usually start at this point and execute the grasp using force-torque or haptic sensors to ensure grasp success.

We then presented open-loop grasping of unknown or familiar objects and demonstrated its limitations in coping with erroneous calibration or imprecise kinematic models. Two applications of visual servoing were shown. The first one was a simplified position-based visual servoing scheme based on tracking a fiducial marker. Since this approach limits the mobility of the end effector, we also presented virtual visual servoing based grasping of unknown objects. This method is advantageous when the goal is to precisely align the fingertips of the robotic hand with the object.

We have demonstrated this approach for simple top grasps only. As a next step, more complex grasps in potentially cluttered environments need to be considered. Furthermore, a reactive grasping approach would be advantageous to cope with the remaining errors in the relative pose estimate between hand and object.

Figure 8.5: Example Grasp Using Virtual Visual Servoing. (a) Saliency Map of Scene. (b) Object Segmentation and Grasp Pose Calculation. (c) VVS Robot Localization. (d) Hand-Object Alignment. (e) Closing the Hand. (f) Release Object at Goal Position.

9 Conclusions

A long-term goal in robotics research is to build an intelligent machine that can be our companion in everyday life. It is thought to help us with tasks that are, for example, difficult, dangerous or bothersome. However, for many of these tasks that we perform effortlessly, current robotic systems still lack the robustness and flexibility to carry them out equally well.

In this thesis, we considered a household scenario in which one of the key capabilities of a robot is grasping and manipulation of objects. Specifically, we focussed on the problem of scene understanding for object grasping. We followed the approach of introducing prediction into the perceive-act loop. We showed how this helps to efficiently explore a scene but also how it can be exploited to improve manipulation.

We studied different scene representations as containers for knowledge about the geometric structure of the environment. They are of varying complexity and utility, ranging from a classical occupancy grid map over a height map to a 3D point cloud. While the occupancy grid map allows estimating free and occupied space in the scene, a height map provides additional geometric information essential to manipulation. A point cloud retains scene geometry most accurately while not providing the efficiency of the other two approaches.

All these representations were studied considering the fusion of multi-modal sensory data. Visual sensing through monocular and stereo cameras is the most important modality used in this thesis. Haptic sensors provided information about areas in the scene that could not be visually observed. Furthermore, we used object recognition and pose estimation as a virtual sensor. Finally, we showed how information from a human operator enhances object segmentation.

In this thesis, prior knowledge about objectness or common scene structure was exploited for predicting unexplored, unobserved or uncertain parts of the scene. In Chapter 4, we used Gaussian Processes to predict scene geometry based on parts that have already been observed. The prior knowledge is constituted by a kernel function that reflects common object shape. In Chapter 5, we made use of the observation that objects are relatively homogeneous in their attribute space. This prior helped the robot to autonomously evaluate segmentation quality. In Chapter 6, we proposed an approach for estimating the backside of unknown objects.

This approach is based on the prior knowledge that man-made objects are commonly symmetric.

In addition to estimating the unobserved parts of the scene, we showed how we can quantify the uncertainty about the prediction. This becomes useful for choosing the most beneficial exploratory actions. In Chapter 4, these actions were performed with the robot hand equipped with haptic sensor matrices. In Chapter 5, a human was asked to confirm whether an uncertain object hypothesis had to be corrected or not.

Apart from selecting exploratory actions to refine a scene model, we have also shown how predictions can improve grasp synthesis. Dependent on whether the objects are known, familiar or unknown, different methodologies for grasp inference apply. In Chapter 7, we reviewed related approaches according to this taxonomy. For each case, we proposed a novel method extending the state of the art. For known objects, we showed that using a completed point cloud renders pose estimation more robust. For unknown objects, we showed how completed point clouds help to infer more robust grasps. And finally, for familiar objects, we showed that given an accurate object segmentation, good grasp points can be inferred based on the object's global shape. We demonstrated these approaches in real-world table-top scenes, thereby showing their feasibility.

In Section 1.1, we showed an example scenario in which a robot is asked to clean a very messy office desk. We listed several scientific challenges that have to be solved before a real robot will be able to fulfill such a task satisfactorily. In this thesis, we made contributions to the subtasks of scene understanding and robotic grasping. Given different scene representations, sensors and prior knowledge, we proposed several prediction mechanisms that provide the robot with information beyond what it has directly observed. However, many scientific challenges still remain before a robot can autonomously clean a desk as envisioned in the introduction of this thesis. In the following, we outline the parts that we believe have to be solved next.

Independence of Table-Top Detection Almost all of the proposed methods for scene understanding rely on knowing the dominant plane on which the objects were arranged. This allowed us to use efficient scene representations and to reduce the search space for object shape prediction. However, extracting the table top might not be trivial due to clutter. Hence, the proposed representations have to be made independent of this knowledge while remaining efficiently computable. If the dominant plane can be detected, however, the resulting possibility to reduce the search space should still be exploited.

Grasp-Oriented Active Perception Considering the tremendous amount of memory that would be required to store an accurate and detailed 3D point cloud of a large-scale environment, we need to find a more efficient way to represent it. This representation has to retain the amount of information that makes it possible for the robot to roughly plan its course of actions. We believe that the fine-detail information necessary, for example, for collision avoidance, opening doors or grasp planning should be obtained rapidly and only on demand in the context of specific behaviors and tasks.

Learning Priors In all chapters, we have used prior assumptions about scene structure and objectness. These range from blob-like kernel functions over homogeneity in specific object attributes to the symmetric shape of objects. We showed that these priors are useful in predicting the structure of unknown space. While we believe that it is adequate to make assumptions about some low-level priors, the question remains how additional priors can be learnt autonomously and adapted iteratively over a much longer time span (potentially life-long).

Transferring Affordances As has been shown in Chapter 7, there is quite a large body of work on generating grasp hypotheses given an object hypothesis. One kind of these methods aims at transferring grasp experience from previously encountered objects to new ones. Many of these methods rely on relatively low-level features like the distribution of edges, color or texture. Only very few make use of higher-level information like object identity or category. However, we believe that only such information allows linking the generated grasp hypotheses to tasks.

Adaptation of Grasp Hypotheses The prediction mechanisms that we proposed in this thesis have mainly been used to trigger exploratory actions. This in turn led to an improved scene model with correct object hypotheses. However, only the result of the method proposed in Chapter 6 was used to improve grasp hypothesis selection. The robustness of grasping could be improved if the remaining uncertainty in the scene model were explicitly included in the grasp synthesis process.

Multi-Modal Closed-Loop Control In Chapter 8, we showed how visual servoing and virtual visual servoing can help to improve grasp execution in a closed-loop fashion. However, these techniques were so far only used for aligning the robot hand with the object. The actual grasping, i.e., closing the fingers around the object and lifting it, was performed open-loop only. To enable the robot to grasp objects robustly and adapt its behavior when facing unexpected sensor readings, we need to adopt a reactive grasping approach. This should not only involve haptic sensor data from tactile or force-torque sensors. It should also include visual tracking of the object that is manipulated. Furthermore, the uncertainty about the geometry of the object or the scene should influence the execution of the grasp as well. It might, for example, influence grasp parameters such as the pre-shape or the velocity of the end effector.

A Covariance Functions

Throughout this thesis, we use different learning techniques to enable a robot to extrapolate from prior experience. For example, in Chapter 4 the robot has to predict the geometry of a scene from partial visual observations. These observations can be formally seen as a set of inputs $\mathcal{X}$ for which we already know the correct target values $\mathcal{Y}$. For new inputs $x'$ for which we do not yet know the appropriate target value, we use the concept of similarity: new inputs that are close to already known ones should have similar target values.

In the framework of Gaussian Processes, the similarity is defined through covariance functions. In the following, we present the covariance functions used in Chapter 4 in a bit more detail. Figure A.1 visualises these different functions. For a more detailed review, we refer to [181].

Squared Exponential Covariance Function Most of the covariance functions presented here are stationary. They are functions of the difference $r = x - x'$ between input points. These functions are therefore invariant to translations. The squared exponential covariance function (SE) is defined as

$$k_{SE}(r) = \exp\!\left(-\frac{r^2}{2l^2}\right). \quad (A.1)$$

Dependent on the hyperparameter $l$, the length scale, this function places a more or less strong smoothing on the input signal. It is visualized for different length scales in Figure A.1a.
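For reference, the SE covariance of Equation A.1 can be written as a one-line function of the distance r and the length scale l:

```python
import numpy as np

def k_se(r, l=1.0):
    """Squared exponential covariance (Eq. A.1) as a function of the
    distance r = |x - x'| and the length scale l."""
    r = np.asarray(r, dtype=float)
    return np.exp(-r**2 / (2.0 * l**2))

print(k_se([0.0, 1.0, 2.0], l=1.0))   # [1.0, 0.6065..., 0.1353...]
```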

Matérn Class of Covariance Functions The strong smoothing of the SE kernel has been argued to be unrealistic for most physical processes [181]. The Matérn class is more flexible and allows varying degrees of smoothness. It depends on the two parameters $\nu$ and $l$. In this thesis, we only looked at the cases where $\nu$ is a half-integer, for which the covariance function simplifies significantly. With $\nu = 1/2$, the covariance function is equal to the exponential function

$$k_{Mat1}(r) = \exp(-r/l). \quad (A.2)$$

We also tested the cases $\nu = 3/2$ and $\nu = 5/2$, which are claimed by Bishop [31] to be the most interesting for machine learning applications:

$$k_{Mat3}(r) = \left(1 + \frac{\sqrt{3}r}{l}\right)\exp\!\left(-\frac{\sqrt{3}r}{l}\right) \quad (A.3)$$
$$k_{Mat5}(r) = \left(1 + \frac{\sqrt{5}r}{l} + \frac{5r^2}{3l^2}\right)\exp\!\left(-\frac{\sqrt{5}r}{l}\right). \quad (A.4)$$

These cases of the Matérn class are visualized in Figures A.1b, A.1c and A.1d. With $\nu \rightarrow \infty$, the Matérn covariance function becomes the squared exponential function.

Rational Quadratic Covariance Function This covariance function depends on the two hyperparameters $l$ and $\alpha$. It is defined as

$$k_{RQ}(r) = \left(1 + \frac{r^2}{2\alpha l^2}\right)^{-\alpha}. \quad (A.5)$$

Fixing the length scale $l = 1$, the RQ kernel is visualized for different $\alpha$ in Figure A.1e. It can be seen as a scale mixture of SE kernels with different characteristic length scales. With $\alpha \rightarrow \infty$, the RQ function becomes an SE function with length scale $l$.
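The half-integer Matérn cases and the rational quadratic covariance translate directly from Equations A.2 to A.5; the helper names below are ours, not taken from any particular GP library:

```python
import numpy as np

def k_matern12(r, l=1.0):
    return np.exp(-np.asarray(r) / l)                               # Eq. (A.2)

def k_matern32(r, l=1.0):
    a = np.sqrt(3.0) * np.asarray(r) / l
    return (1.0 + a) * np.exp(-a)                                   # Eq. (A.3)

def k_matern52(r, l=1.0):
    r = np.asarray(r)
    a = np.sqrt(5.0) * r / l
    return (1.0 + a + 5.0 * r**2 / (3.0 * l**2)) * np.exp(-a)       # Eq. (A.4)

def k_rq(r, l=1.0, alpha=1.0):
    return (1.0 + np.asarray(r)**2 / (2.0 * alpha * l**2)) ** (-alpha)   # Eq. (A.5)
```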

SE with Automatic Relevance Detection Although the previous RQ kernel is flexible through a mixture of SE kernels of different length scales, it still treats each input dimension in the set $\mathcal{X}$ as equally important. If the dimension of the input space is $D > 1$, then the SE covariance function can also be written as

$$k_{SEARD}(r) = \sigma_f^2 \exp\!\left(-\tfrac{1}{2} r^T M r\right) \quad (A.6)$$

where $M = \mathrm{diag}(l)^{-2}$. The vector $l$ is of dimension $D$. Each component equals the characteristic length scale of the corresponding dimension of the input space. Therefore, if one of the dimensions has a very large length scale, it will become irrelevant for inference.
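A sketch of the ARD variant of Equation A.6 with a vector of per-dimension length scales; the example values are arbitrary and only illustrate that a dimension with a very large length scale barely affects the covariance:

```python
import numpy as np

def k_se_ard(x, x_prime, lengthscales, sigma_f=1.0):
    """Squared exponential covariance with automatic relevance detection
    (Eq. A.6). lengthscales has one entry per input dimension, and
    M = diag(lengthscales)^-2."""
    r = np.asarray(x, dtype=float) - np.asarray(x_prime, dtype=float)
    m_diag = 1.0 / np.asarray(lengthscales, dtype=float)**2
    return sigma_f**2 * np.exp(-0.5 * np.sum(m_diag * r**2))

# The second dimension (length scale 100) is effectively ignored:
print(k_se_ard([1.0, 2.0], [1.5, 7.0], lengthscales=[0.5, 100.0]))
```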

Neural Network Covariance Function This covariance function is non-stationary, i.e., it is not invariant to translations. It is derived from a neural network architecture with $N$ units that implement a specific hidden transfer function $h(x; u)$ with $u$ as input-to-hidden weights. The network takes an input $x$ and combines the outputs of the units linearly with a bias to evaluate the function $f(x)$. Setting the transfer function to $h(x; u) = \mathrm{erf}(u_0 + \Sigma_{j=1}^{D} u_j x_j)$, the covariance function is defined as follows

$$k_{NN}(x, x') = \frac{2}{\pi}\arcsin\!\left(\frac{2\tilde{x}^T \Sigma \tilde{x}'}{\sqrt{(1 + 2\tilde{x}^T \Sigma \tilde{x})(1 + 2\tilde{x}'^T \Sigma \tilde{x}')}}\right) \quad (A.7)$$

where $\tilde{x} = (1, x_1, \cdots, x_D)$ is the augmented input vector. It is visualized in Figure A.1f for $\Sigma = \mathrm{diag}(\sigma_0^2, \sigma^2)$ with $\sigma_0^2 = \sigma^2 = 10$.

Figure A.1: Visualisation of Covariance Functions, shown for length scales l = 0.7, 1.0, 1.3 (panels a-d) and for α = 0.5, 1.0, 1.5 (panel e). (a) Squared Exponential. (b) Matérn Covariance with ν = 1/2. (c) Matérn Covariance with ν = 3/2. (d) Matérn Covariance with ν = 5/2. (e) Rational Quadratic. (f) Neural Network.

Bibliography

[15] K. S. Aarun, T. S. Huang, and S. D. Blostein. Least-squares fitting of two 3-d point sets. IEEE Trans. on Pattern Analysis and Machine Intelligence (PAMI), 9:698–700, 1987.

[16] T. Asfour, K. Regenstein, P. Azad, J. Schröder, A. Bierbaum, N. Vahrenkamp, and R. Dillmann. ARMAR-III: An Integrated Humanoid Platform for Sensory-Motor Control. In IEEE/RAS Int. Conf. on Humanoid Robots (Humanoids), pages 169–175, 2006.

[17] T. Asfour, K. Welke, P. Azad, A. Ude, and R. Dillmann. The Karlsruhe Humanoid Head. In IEEE/RAS Int. Conf. on Humanoid Robots (Humanoids), pages 447–453, Daejeon, Korea, December 2008.

[18] A. Aydemir, M. Göbelbecker, A. Pronobis, K. Sjöö, and P. Jensfelt. Plan-based object search and exploration using semantic spatial knowledge in the real world. In European Conf. on Mobile Robots (ECMR), Örebro, Sweden, Sept. 2011.

[19] P. Azad, T. Asfour, and R. Dillmann. Stereo-based 6d object localization for grasping with humanoid robot systems. In IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (IROS), pages 919–924, 2007.

[20] R. Bajcsy. Active perception. Proceedings of the IEEE, 78(8):966–1005, 1988.

[21] R. Balasubramanian, L. Xu, P. D. Brook, J. R. Smith, and Y. Matsuoka. Human-guided grasp measures improve grasp robustness on physical robot. In IEEE Int. Conf. on Robotics and Automation (ICRA), pages 2294–2301. IEEE, 2010.

[22] D. H. Ballard. Animate vision. Artif. Intell., 48:57–86, February 1991.

[23] L. W. Barsalou. Grounded cognition. Annual Review of Psychology, 59(1): 617–645, 2008.


[24] M. Beetz, U. Klank, I. Kresse, A. Maldonado, L. Mösenlechner, D. Pangercic, T. Rühr, and M. Tenorth. Robotic roommates making pancakes. In IEEE/RAS Int. Conf. on Humanoid Robots (Humanoids), Bled, Slovenia, October 26–28 2011.

[25] Y. Bekiroglu, R. Detry, and D. Kragic. Joint observation of object pose and tactile imprints for online grasp stability assessment. In Manipulation Under Uncertainty (Workshop at IEEE ICRA 2011), 2011.

[26] S. Belongie, J. Malik, and J. Puzicha. Shape Matching and Object Recognition Using Shape Contexts. IEEE Trans. on Pattern Analysis and Machine Intelligence (PAMI), 24(4):509–522, 2002.

[27] D. Berenson, R. Diankov, K. Nishiwaki, S. Kagami, and J. Kuffner. Grasp planning in complex scenes. In IEEE/RAS Int. Conf. on Humanoid Robots (Humanoids), pages 42–48, December 2007.

[28] N. Bergström, C. H. Ek, M. Björkman, and D. Kragic. Scene understanding through autonomous interactive perception. In Proceedings of the 8th International Conference on Computer Vision Systems, ICVS'11, pages 153–162, Berlin, Heidelberg, 2011. Springer-Verlag.

[29] N. Bergström, C. H. Ek, M. Björkman, and D. Kragic. Generating object hypotheses in natural scenes through human-robot interaction. In IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (IROS), pages 827–833, San Francisco, USA, September 2011.

[30] P. J. Besl and N. D. McKay. A method for registration of 3-d shapes. IEEE Trans. on Pattern Analysis and Machine Intelligence (PAMI), 14:239–256, 1992.

[31] C. M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer, 1 edition, 2007.

[32] M. Björkman and J.-O. Eklundh. Foveated Figure-Ground Segmentation and Its Role in Recognition. British Machine Vision Conf., 2005.

[33] M. Björkman and J.-O. Eklundh. Vision in the real world: Finding, attending and recognizing objects. Int. Jour. of Imaging Systems and Technology, 16(5):189–209, 2006.

[34] M. Björkman and D. Kragic. Combination of foveal and peripheral vision for object recognition and pose estimation. IEEE Int. Conf. on Robotics and Automation (ICRA), 5:5135–5140, April 2004.

[35] M. Björkman and D. Kragic. Active 3d scene segmentation and detection of unknown objects. In IEEE Int. Conf. on Robotics and Automation (ICRA), pages 3114–3120, 2010.

[36] N. Blodow, R. B. Rusu, Z. C. Marton, and M. Beetz. Partial View Modeling and Validation in 3D Laser Scans for Grasping. In IEEE/RAS Int. Conf. on Humanoid Robots (Humanoids), pages 459 – 464, Paris, France, December 7-10 2009.

[37] G. M. Bone, A. Lambert, and M. Edwards. Automated Modelling and Robotic Grasping of Unknown Three-Dimensional Objects. In IEEE Int. Conf. on Robotics and Automation (ICRA), pages 292–298, Pasadena, CA, USA, May 2008.

[38] A. Borghi. Grounding Cognition: The Role of Perception and Action in Memory, Language, and Thinking, chapter Object Concepts and Action. Cambridge University Press, 2005.

[39] C. Borst, M. Fischer, and G. Hirzinger. Grasping the Dice by Dicing the Grasp. In IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (IROS), pages 3692–3697, 2003.

[40] Y. Boykov and V. Kolmogorov. An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision. IEEE Trans. on Pattern Analysis and Machine Intelligence (PAMI), pages 1124–1137, 2004.

[41] Y. Boykov, O. Veksler, and R. Zabih. Markov random fields with efficient approximations. In IEEE Computer Society Conf. on Computer Vision and Pattern Recognition (CVPR), pages 648–655, 1998.

[42] Y. Boykov, O. Veksler, and R. Zabih. Fast approximate energy minimization via graph cuts. IEEE Trans. on Pattern Analysis and Machine Intelligence (PAMI), pages 1222–1239, 2001.

[43] G. Bradski. The OpenCV Library. Dr. Dobb’s Journal of Software Tools, 2000.

[44] T. P. Breckon and R. B. Fisher. Amodal volume completion: 3d visual completion. Computer Vision and Image Understanding, 99(3):499–526, 2005.

[45] P. Brook, M. Ciocarlie, and K. Hsiao. Collaborative grasp planning with multiple object representations. In IEEE Int. Conf. on Robotics and Automation (ICRA), ICRA'11, pages 2851–2858, 2011.

[46] R. A. Brooks. Elephants don’t play chess. Robotics and Autonomous Systems, 6(1&2):3–15, June 1990.

[47] R. A. Brooks. Intelligence without representation. Artificial Intelligence, 47: 139–159, 1991.

[48] U. Castiello and M. Jeannerod. Measuring Time to Awareness. Neuroreport, 2(12):797–800, 1991.

[49] K. Chaloner and I. Verdinelli. Bayesian experimental design: A review. Statistical Science, 10:273–304, 1995.

[50] C.-C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

[51] H. I. Christensen, A. Sloman, G.-J. M. Kruijff, and J. Wyatt, editors. Cognitive Systems. Springer Verlag, 2009.

[52] M. Ciocarlie, K. Hsiao, E. G. Jones, S. Chitta, R. B. Rusu, and I. A. Sucan. Towards reliable grasping and manipulation in household environments. In Int. Symposium on Experimental Robotics (ISER), New Delhi, India, December 2010.

[53] M. Ciocarlie, C. Goldfeder, and P. Allen. Dexterous Grasping via Eigengrasps: A Low-Dimensional Approach to a High-Complexity Problem. Robotics: Science and Systems Manipulation Workshop, 2007.

[54] D. Cole, A. Harrison, and P. Newman. Using naturally salient regions for SLAM with 3D laser data. In Int. Conf. on Robotics and Automation, SLAM Workshop, 2005.

[55] A. Collet Romea, D. Berenson, S. Srinivasa, and D. Ferguson. Object recognition and full pose registration from a single image for robotic manipulation. In IEEE Int. Conf. on Robotics and Automation (ICRA), pages 48–55, May 2009.

[56] M. Craven and J. Shavlik. Extracting tree-structured representations of trained networks. In Advances in Neural Information Processing Systems (NIPS-8), pages 24–30. MIT Press, 1995.

[57] S. H. Creem and D. R. Proffitt. Grasping Objects by Their Handles: A Necessary Interaction between Cognition and Action. Jour. of Experimental Psychology: Human Perception and Performance, 27(1):218–228, 2001.

[58] R. H. Cuijpers, J. B. J. Smeets, and E. Brenner. On the Relation Between Object Shape and Grasping Kinematics. Jour. of Neurophysiology, 91:2598– 2606, 2004.

[59] N. Curtis and J. Xiao. Efficient and effective grasping of novel objects through learning and adapting a knowledge base. In IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (IROS), pages 2252–2257, Sept. 2008.

[60] C. Dance, J. Willamowski, L. Fan, C. Bray, and G. Csurka. Visual categorization with bags of keypoints. In ECCV Int. Workshop on Statistical Learning in Computer Vision, 2004.

[61] E. Decker. Paik, Video. DuMont, Köln, 1988.

[62] R. Detry, E. Başeski, N. Krüger, M. Popović, Y. Touati, O. Kroemer, J. Peters, and J. Piater. Learning object-specific grasp affordance densities. In IEEE Int. Conf. on Development and Learning, pages 1–7, 2009.

[63] R. Detry, N. Pugeault, and J. Piater. A probabilistic framework for 3D visual object representation. IEEE Trans. on Pattern Analysis and Machine Intelligence (PAMI), 31(10):1790–1803, 2009.

[64] R. Detry, D. Kraft, A. G. Buch, N. Krüger, and J. Piater. Refining grasp affordance models by experience. In IEEE Int. Conf. on Robotics and Automation (ICRA), pages 2287–2293, 2010.

[65] R. Diankov. Automated Construction of Robotic Manipulation Programs. PhD thesis, Carnegie Mellon University, Robotics Institute, August 2010.

[66] R. Diankov and J. Kuffner. Openrave: A planning architecture for autonomous robotics. Technical Report CMU-RI-TR-08-34, Robotics Institute, Pittsburgh, PA, July 2008.

[67] C. Dunes, E. Marchand, C. Collowet, and C. Leroux. Active Rough Shape Estimation of Unknown Objects. In IEEE Int. Conf. on Intelligent Robots and Systems (IROS), pages 3622–3627, 2008.

[68] R. Eidenberger and J. Scharinger. Active perception and scene modeling by planning with probabilistic 6d object poses. In IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (IROS), pages 1036–1043, Taipei, Taiwan, October 2010.

[69] C. H. Ek, P. H. Torr, and N. D. Lawrence. Gaussian process latent variable models for human pose estimation. In 4th Joint Workshop on Multimodal Interaction and Related Machine Learning Algorithms (MLMI 2007), volume LNCS 4892, pages 132–143, Brno, Czech Republic, Jun. 2007. Springer-Verlag.

[70] S. Ekvall and D. Kragic. Learning and Evaluation of the Approach Vector for Automatic Grasp Generation and Planning. In IEEE Int. Conf. on Robotics and Automation (ICRA), pages 4715–4720, 2007.

[71] S. Ekvall, P. Jensfelt, and D. Kragic. Integrating active mobile robot object recognition and slam in natural environments. In IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (IROS), pages 5792–5797, 2006.

[72] S. El-Khoury and A. Sahbani. Handling Objects By Their Handles. In IROS-2008 Workshop on Grasp and Task Learning by Imitation, 2008.

[73] A. Elfes. Using occupancy grids for mobile robot perception and navigation. Computer, 22:46–57, 1989.

[74] J. A. Etzel, V. Gazzola, and C. Keysers. Testing simulation theory with cross-modal multivariate classification of fmri data. PLoS ONE, 3(11):1–6, 11 2008.

[75] Q. Fan, A. Efrat, V. Koltun, S. Krishnan, and S. Venkatasubramanian. Hardware-assisted natural neighbor interpolation. In Workshop on Algorithm Engineering and Experiments (ALENEX), 2005.

[76] D. R. Faria, R. Martins, J. Lobo, and J. Dias. Extracting data from human manipulation of objects towards improving autonomous robotic grasping. Robotics and Autonomous Systems, Aug. 2011.

[77] J. Feldman, K. Pingle, T. Binford, G. Falk, A. Kay, R. Paul, R. Sproull, and J. Tenenbaum. The use of vision and manipulation to solve the "instant insanity" puzzle. In Int. Joint Conf. on Artificial Intelligence (IJCAI), pages 359–364, San Francisco, CA, USA, 1971. Morgan Kaufmann Publishers Inc.

[78] J. Felip and A. Morales. Robust sensor-based grasp primitive for a three-finger robot hand. In IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (IROS), pages 1811–1816, Piscataway, NJ, USA, 2009. IEEE Press.

[79] V. Ferrari, L. Fevrier, F. Jurie, and C. Schmid. Groups of adjacent contour segments for object detection. IEEE Trans. on Pattern Analysis and Machine Intelligence (PAMI), 30(1):36–51, 2008.

[80] M. Fischler and R. Bolles. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Comm. of the ACM, 24:381–395, June 1981.

[81] S. Foix, G. Alenyà, J. Andrade-Cetto, and C. Torras. Object modeling using a tof camera under an uncertainty reduction approach. In IEEE Int. Conf. on Robotics and Automation (ICRA), pages 1306–1312, Anchorage, Alaska, USA, May 2010.

[82] S. Furman. Rad Robots. Kyle Cathie Limited, London, UK, 2009.

[83] Y. Gabriely and E. Rimon. Spanning-tree based coverage of continuous areas by a mobile robot. Annals of Mathematics and Artificial Intelligence, 31: 77–98, May 2001.

[84] V. Gallese and A. Goldman. Mirror neurons and the simulation theory of mind-reading. Trends in Cognitive Sciences, 2(12):493–501, Dec. 1998.

[85] D. Gallup, J.-M. Frahm, and M. Pollefeys. A Heightmap Model for Efficient 3D Reconstruction from Street-Level Video. In IEEE Conf. on Computer Vision and Pattern Recognition, 2010.

[86] H. M. Geduld. Robots, Robots, Robots, chapter Genesis II: The Evolution of Synthetic Man. New York Graphic Society, 1978.

[87] M. Gentilucci. Object Motor Representation and Reaching-Grasping Control. Neuropsychologia, 40(8):1139–1153, 2002.

[88] T. Gevers and A. Smeulders. Colour based object recognition. Pattern Recog- nition, 32:453–464, March 1999.

[89] J. Gibson. The Ecological Approach to Visual Perception. Lawrence Erlbaum Associates, 1979.

[90] C. Goldfeder, P. K. Allen, C. Lackner, and R. Pelossof. Grasp Planning Via Decomposition Trees. In IEEE Int. Conf. on Robotics and Automation (ICRA), pages 4679–4684, 2007.

[91] M. Goodale. Separate Visual Pathways for Perception and Action. Trends in Neurosciences, 15(1):20–25, 1992.

[92] M. A. Goodale, J. P. Meenan, H. H. Bülthoff, D. A. Nicolle, K. J. Murphy, and C. I. Racicot. Separate Neural Pathways for the Visual Analysis of Object Shape in Perception and Prehension. Current Biology, 4(7):604–610, 1994.

[93] K. Grauman and T. Darrell. The pyramid match kernel: discriminative classification with sets of image features. In IEEE Int. Conf. on Computer Vision (ICCV), volume 2, pages 1458–1465, 2005.

[94] J. Grezes and J. Decety. Does Visual Perception of Object Afford Action? Evidence from a Neuroimaging Study. Neuropsychologia, 40(2):212–222, 2002.

[95] T. Grundmann, M. Fiegert, and W. Burgard. Probabilistic rule set joint state update as approximation to the full joint state estimation applied to multi object scene analysis. In IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (IROS), pages 2047–2052. IEEE, 2010.

[96] S. Harnad. The symbol grounding problem. Physica D: Nonlinear Phenomena, 42:335–346, 1990.

[97] H. Hoppe, T. DeRose, T. Duchamp, J. McDonald, and W. Stuetzle. Surface reconstruction from unorganized points. Computer Graphics (SIGGRAPH ’92), 26(2):71–78, 1992.

[98] K. Hsiao, P. Nangeroni, M. Huber, A. Saxena, and A. Y. Ng. Reactive grasping using optical proximity sensors. In IEEE Int. Conf. on Robotics and Automation (ICRA), pages 2098–2105, 2009.

[99] K. Hsiao, S. Chitta, M. Ciocarlie, and E. G. Jones. Contact-reactive grasping of objects with partial shape information. In IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (IROS), pages 1228–1235, Taipei, Taiwan, October 2010.

[100] K. Hsiao, L. Kaelbling, and T. Lozano-Perez. Task-driven tactile exploration. In Robotics: Science and Systems (RSS), Zaragoza, Spain, June 2010.

[101] K. Hübner and D. Kragic. Selection of Robot Pre-Grasps using Box-Based Shape Approximation. In IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (IROS), pages 1765–1770, 2008.

[102] K. Huebner, K. Welke, M. Przybylski, N. Vahrenkamp, T. Asfour, D. Kragic, and R. Dillmann. Grasping known objects with humanoid robots: A box-based approach. In Int. Conf. on Advanced Robotics (ICAR), pages 1–6, 2009.

[103] J. R. Hurley and R. B. Cattell. The Procrustes Program: Producing Direct Rotation to test a hypothesized factor structure. Behavioral Science, 7:258–262, 1962.

[104] S. Hutchinson, G. Hager, and P. Corke. A tutorial on visual servo control. IEEE Trans. on Robotics and Automation, 12(5):651–670, Oct. 1996.

[105] iRobot. Roomba. www.irobot.com, Accessed Oct. 2011.

[106] L. Itti and C. Koch. Computational modeling of visual attention. Nature Reviews Neuroscience, 2:194–203, 2001.

[107] J. Glover, D. Rus, and N. Roy. Probabilistic models of object geometry for grasp planning. In Proceedings of Robotics: Science and Systems IV, Zurich, Switzerland, June 2008.

[108] I. Kamon, T. Flash, and S. Edelman. Learning to grasp using visual information. In IEEE Int. Conf. on Robotics and Automation (ICRA), pages 2470–2476, 1994.

[109] A. Kapoor, K. Grauman, R. Urtasun, and T. Darrell. Gaussian Processes for Object Categorization. Int. Jour. of Computer Vision, 88:169–188, 2010.

[110] D. Katz, A. Orthey, and O. Brock. Interactive perception of articulated objects. In Int. Symposium on Experimental Robotics (ISER), pages 1–15, 2010.

[111] L. E. Kavraki, P. Svestka, J. C. Latombe, and M. H. Overmars. Probabilistic roadmaps for path planning in high-dimensional configuration spaces. IEEE Trans. on Robotics and Automation, 12(4):566–580, 1996.

[112] M. Kazhdan. Poisson Surface Reconstruction (Version 2). http://www.cs.jhu.edu/~misha/Code/PoissonRecon/, 2006.

[113] M. Kazhdan, M. Bolitho, and H. Hoppe. Poisson surface reconstruction. In Eurographics Symposium on Geometry Processing (SGP), pages 61–70, Aire-la-Ville, Switzerland, 2006. Eurographics Association.

[114] J. Kenney, T. Buckley, and O. Brock. Interactive segmentation for manipulation in unstructured environments. In IEEE Int. Conf. on Robotics and Automation (ICRA), pages 1343–1348, Piscataway, NJ, USA, 2009. IEEE Press.

[115] H. Kjellström, J. Romero, and D. Kragic. Visual Object-Action Recognition: Inferring Object Affordances from Human Demonstration. Computer Vision and Image Understanding, pages 81–90, 2010.

[116] H. Kjellström, J. Romero, and D. Kragic. Visual object-action recognition: Inferring object affordances from human demonstration. Comput. Vis. Image Underst., 115:81–90, January 2011.

[117] E. Klingbeil, B. Carpenter, O. Russakovsky, and A. Y. Ng. Autonomous operation of novel elevators for robot navigation. In IEEE Int. Conf. on Robotics and Automation (ICRA), pages 751–758, Anchorage, AK, USA, May 2010.

[118] E. Klingbeil, D. Rao, B. Carpenter, V. Ganapathi, A. Y. Ng, and O. Khatib. Grasping with application to an autonomous checkout robot. In IEEE Int. Conf. on Robotics and Automation (ICRA), pages 2837–2844, 2011.

[119] V. Kolmogorov and R. Zabih. What energy functions can be minimized via graph cuts? IEEE Trans. on Pattern Analysis and Machine Intelligence (PAMI), 26(2):147–159, 2004.

[120] D. Kraft, N. Pugeault, E. Baseski, M. Popovic, D. Kragic, S. Kalkan, F. Wörgötter, and N. Krueger. Birth of the object: Detection of objectness and extraction of object shape through object action complexes. Int. Jour. of Humanoid Robotics, pages 247–265, 2009.

[121] D. Kragic. Visual Servoing for Manipulation: Robustness and Integration Issues. PhD thesis, KTH, Numerical Analysis and Computer Science, NADA, 2001.

[122] D. Kragic, M. Björkman, H. I. Christensen, and J.-O. Eklundh. Vision for robotic object manipulation in domestic settings. Robotics and Autonomous Systems, 52(1):85–100, 2005.

[123] M. Krainin, B. Curless, and D. Fox. Autonomous generation of complete 3d object models using next best view manipulation planning. In IEEE Int. Conf. on Robotics and Automation (ICRA), pages 5031–5037, May 2011.

[124] O. B. Kroemer, R. Detry, J. Piater, and J. Peters. Combining active learning and reactive control for robot grasping. Robotics and Autonomous Systems, 58:1105–1116, September 2010.

[125] J. Kuffner and S. LaValle. RRT-Connect: An efficient approach to single-query path planning. In IEEE Int. Conf. on Robotics and Automation (ICRA), pages 995–1001, Piscataway, NJ, USA, 2000. IEEE Press.

[126] KUKA. KR 5 sixx R850. www.kuka-robotics.com, Accessed Oct. 2011.

[127] L. Fei-Fei and P. Perona. A bayesian hierarchical model for learning natural scene categories. In IEEE Computer Society Conf. on Computer Vision and Pattern Recognition (CVPR), volume 2, pages 524–531, 2005.

[128] K. Lai, L. Bo, X. Ren, and D. Fox. Sparse distance learning for object recognition combining rgb and depth information. In IEEE Int. Conf. on Robotics and Automation (ICRA), pages 4007–4013, Shanghai, China, May 2011.

[129] K. Lai, L. Bo, X. Ren, and D. Fox. A Large-Scale Hierarchical Multi-View RGB-D Object Dataset. In IEEE Int. Conf. on Robotics and Automation (ICRA), pages 1817–1824, 2011.

[130] S. M. LaValle. Rapidly-exploring random trees: A new tool for path planning. Technical Report 98-11, Computer Science Dept, Iowa State University, 1998.

[131] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In IEEE Conf. on Computer Vision and Pattern Recognition, volume 2, pages 2169–2178, 2006.

[132] Q. V. Le, D. Kamm, A. F. Kara, and A. Y. Ng. Learning to grasp objects with multiple contact points. In IEEE Int. Conf. on Robotics and Automation (ICRA), pages 5062–5069, 2010.

[133] B. Leibe, A. Leonardis, and B. Schiele. An implicit shape model for combined object categorization and segmentation. In Toward Category-Level Object Recognition, pages 508–524, 2006.

[134] L.-J. Li, R. Socher, and L. Fei-Fei. Towards total scene understanding: classification, annotation and segmentation in an automatic framework. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages 2036–2043, 2009.

[135] Y. Li and N. Pollard. A Shape Matching Algorithm for Synthesizing Humanlike Enveloping Grasps. In IEEE/RAS Int. Conf. on Humanoid Robots (Humanoids), pages 442–449, Dec. 2005.

[136] Y. Li, J. Sun, C.-K. Tang, and H.-Y. Shum. Lazy snapping. In ACM SIGGRAPH 2004 Papers, SIGGRAPH ’04, pages 303–308, New York, NY, USA, 2004. ACM.

[137] C. Liu, J. Yuen, and A. Torralba. Nonparametric scene parsing: Label transfer via dense scene alignment. In IEEE Computer Society Conf. on Computer Vision and Pattern Recognition (CVPR), pages 1972–1979, Los Alamitos, CA, USA, 2009. IEEE Computer Society.

[138] W. Lorensen and H. Cline. Marching cubes: A high resolution 3D surface construction algorithm. In SIGGRAPH, pages 163–169, 1987.

[139] D. Lowe. Object recognition from local scale-invariant features. In IEEE Int. Conf. on Computer Vision (ICCV), pages 1150–1157, September 1999.

[140] G. Luo, N. Bergström, M. Björkman, and D. Kragic. Representing actions with kernels. In IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (IROS), pages 2028–2035, San Francisco, USA, September 2011.

[141] J. Maitin-Shepard, M. Cusumano-Towner, J. Lei, and P. Abbeel. Cloth grasp point detection based on Multiple-View geometric cues with application to robotic towel folding. In IEEE Int. Conf. on Robotics and Automation (ICRA), pages 2308–2315, 2010.

[142] A. Maldonado, U. Klank, and M. Beetz. Robotic grasping of unmodeled objects using time-of-flight range data and finger torque information. In IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (IROS), pages 2586–2591, Oct. 2010.

[143] T. Malisiewicz and A. A. Efros. Beyond categories: The visual memex model for reasoning about object relationships. In NIPS, December 2009.

[144] D. Martens, B. Baesens, T. V. Gestel, and J. Vanthienen. Comprehensible Credit Scoring Models using Rule Extraction from Support Vector Machines. European Jour. of Operational Research, 183(3):1466–1476, December 2007.

[145] Z. C. Marton, L. Goron, R. B. Rusu, and M. Beetz. Reconstruction and Verification of 3D Object Models for Grasping. In Int. Symposium on Robotics Research (ISRR), Lucerne, Switzerland, August 2009.

[146] Z. C. Marton, R. B. Rusu, D. Jain, U. Klank, and M. Beetz. Probabilistic categorization of kitchen objects in table settings with a composite sensor. In IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (IROS), pages 4777–4784, St. Louis, MO, USA, October 2009.

[147] Z.-C. Marton, D. Pangercic, N. Blodow, J. Kleinehellefort, and M. Beetz. General 3D Modelling of Novel Objects from a Single View. In IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (IROS), pages 3700–3705, Taipei, Taiwan, October 18–22, 2010.

[148] Z. C. Marton, D. Pangercic, R. B. Rusu, A. Holzbach, and M. Beetz. Hierarchical object geometric categorization and appearance classification for mobile manipulation. In IEEE/RAS Int. Conf. on Humanoid Robots (Humanoids), pages 365–370, Nashville, TN, USA, December 2010.

[149] Z. C. Marton, D. Pangercic, N. Blodow, and M. Beetz. Combined 2D-3D Categorization and Classification for Multimodal Perception Systems. Int. Jour. of Robotics Research (IJRR), 2011. Accepted.

[150] C. R. Maurer, G. B. Aboutanos, B. M. Dawant, R. J. Maciunas, and J. M. Fitzpatrick. Registration of 3-D Images Using Weighted Geometrical Features. IEEE Trans. on Medical Imaging, 15(6):836–846, December 1996.

[151] G. Metta and P. Fitzpatrick. Better Vision through Manipulation. Adaptive Behavior, 11(2):109–128, 2003.

[152] A. T. Miller and P. K. Allen. GraspIt! A versatile simulator for robotic grasping. Robotics & Automation Magazine, IEEE, 11(4):110–122, 2004.

[153] A. T. Miller, S. Knoop, H. I. Christensen, and P. K. Allen. Automatic Grasp Planning Using Shape Primitives. In IEEE Int. Conf. on Robotics and Automation (ICRA), pages 1824–1829, 2003.

[154] N. Mitra and A. Nguyen. Estimating surface normals in noisy point cloud data. In Annual Symposium on Computational Geometry, page 328. ACM, 2003.

[155] A. Morales, E. Chinellato, A. Fagg, and A. del Pobil. Using Experience for Assessing Grasp Reliability. Int. Jour. of Humanoid Robotics, 1(4):671–691, 2004.

[156] A. Morales, P. Azad, T. Asfour, D. Kraft, S. Knoop, R. Dillmann, A. Kargov, C. Pylatiuk, and S. Schulz. An Anthropomorphic Grasping Approach for an Assistant Humanoid Robot. In Int. Symposium on Robotics Research, pages 149–152, 2006.

[157] H. Moravec and A. E. Elfes. High resolution maps from wide angle sonar. In IEEE Int. Conf. on Robotics and Automation (ICRA), pages 116–121, March 1985.

[158] G. Mori, S. Belongie, and J. Malik. Efficient shape matching using shape contexts. IEEE Trans. on Pattern Analysis and Machine Intelligence (PAMI), 27:1832–1837, 2005.

[159] G. L. Murphy. The Big Book of Concepts (Bradford Books). The MIT Press, Mar. 2002.

[160] A. Newell and H. A. Simon. Computer science as empirical inquiry: symbols and search. Commun. ACM, 19:113–126, March 1976.

[161] V.-D. Nguyen. Constructing stable grasps. Int. Jour. on Robotics Research, 8(1):26–37, 1989.

[162] S. T. O’Callaghan, F. Ramos, and H. Durrant-Whyte. Contextual occupancy maps using Gaussian processes. In IEEE Int. Conf. on Robotics and Automation (ICRA), pages 3478–3485, Kobe, Japan, May 2009.

[163] K. Pahlavan and J.-O. Eklundh. A head-eye system – analysis and design. CVGIP: Image Understanding, 56(1):41–56, 1992.

[164] V. Papadourakis and A. Argyros. Multiple objects tracking in the presence of long-term occlusions. Comput. Vis. Image Underst., 114:835–846, July 2010.

[165] C. Papazov and D. Burschka. Stochastic optimization for rigid point set registration. In Int. Symposium on Advances in Visual Computing (ISVC), pages 1043–1054, Berlin, Heidelberg, 2009. Springer-Verlag.

[166] C. Papazov and D. Burschka. An efficient RANSAC for 3D object recognition in noisy and occluded scenes. In R. Kimmel, R. Klette, and A. Sugimoto, editors, ACCV (1), volume 6492 of Lecture Notes in Computer Science, pages 135–148. Springer, 2010.

[167] P. Pastor, L. Righetti, M. Kalakrishnan, and S. Schaal. Online Movement Adaptation based on Previous Sensor Experiences. In IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (IROS), pages 365–371, San Francisco, USA, September 2011.

[168] R. Pelossof, A. Miller, P. Allen, and T. Jebara. An SVM learning approach to robotic grasping. In IEEE Int. Conf. on Robotics and Automation (ICRA), pages 3512–3518, 2004.

[169] M. Persson, T. Duckett, and A. J. Lilienthal. Virtual sensors for human concepts - building detection by an outdoor mobile robot. Robotics and Autonomous Systems, 55(5):383–390, May 31 2007.

[170] L. Petersson, D. Austin, and D. Kragic. High-level control of a mobile manipulator for door opening. In IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (IROS), volume 3, pages 2333–2338, Oct. 2000.

[171] R. Pfeifer and J. C. Bongard. How the Body Shapes the Way We Think: A New View of Intelligence (Bradford Books). The MIT Press, Nov 2006.

[172] M. Popović, G. Kootstra, J. A. Jørgensen, D. Kragic, and N. Krüger. Grasping unknown objects using an early cognitive vision system for general scene understanding. In IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (IROS), pages 987–994, San Francisco, USA, September 2011.

[173] V. Pradeep, K. Konolige, and E. Berger. Calibrating a multi-arm multi-sensor robot: A bundle adjustment approach. In Int. Symposium on Experimental Robotics (ISER), New Delhi, India, December 2010.

[174] M. Prats. Robot Physical Interaction through the combination of Vision, Tactile and Force Feedback. PhD thesis, Robotic Intelligence Laboratory, Jaume-I University, June 2009.

[175] R. Prim. Shortest connection networks and some generalizations. Bell System Technical Journal, 36:1389–1401, November 1957.

[176] W. Prinz. Perception and action planning. European Jour. of Cognitive Psychology, 9(2):129–154, 1997.

[177] A. Pronobis. Semantic Mapping with Mobile Robots. PhD thesis, Royal Institute of Technology (KTH), Stockholm, Sweden, June 2011.

[178] A. Pronobis and P. Jensfelt. Hierarchical multi-modal place categorization. In European Conf. on Mobile Robots (ECMR’11), Örebro, Sweden, Sept. 2011.

[179] M. Przybylski and T. Asfour. Unions of balls for shape approximation in robot grasping. In IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (IROS), pages 1592–1599, Taipei, Taiwan, October 2010. IEEE.

[180] D. Rao, Q. V. Le, T. Phoka, M. Quigley, A. Sudsang, and A. Y. Ng. Grasping novel objects with depth segmentation. In IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (IROS), pages 2578–2585, Taipei, Taiwan, October 2010.

[181] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning). The MIT Press, December 2005.

[182] B. Rasolzadeh, M. Björkman, K. Huebner, and D. Kragic. An Active Vision System for Detecting, Fixating and Manipulating Objects in Real World. Int. J. of Robotics Research, 29(2-3), 2010.

[183] J. Ratcliff. Convex decomposition library, 2009. URL http://codesuppository.blogspot.com/2009/11/convex-decomposition-library-now.html.

[184] M. Richtsfeld and M. Vincze. Grasping of Unknown Objects from a Table Top. In ECCV Workshop on ’Vision in Action: Efficient strategies for cognitive agents in complex environments’, Marseille, France, September 2008.

[185] G. Rizzolatti and L. Craighero. The mirror-neuron system. Annual Review of Neuroscience, 27(1):169–192, 2004.

[186] J. Romero, H. Kjellström, and D. Kragic. Modeling and evaluation of human-to-robot mapping of grasps. In Int. Conf. on Advanced Robotics (ICAR), pages 228–233. IEEE, 2009.

[187] E. Rosch, C. B. Mervis, W. D. Gray, D. M. Johnson, and P. Boyes-Braem. Basic objects in natural categories. Cognitive Psychology, 8(3):382–439, 1976.

[188] C. Rother, V. Kolmogorov, and A. Blake. GrabCut: interactive foreground extraction using iterated graph cuts. ACM Trans. on Graphics, 23(3):309– 314, August 2004.

[189] M. Roy, S. Foufou, and F. Truchetet. Mesh Comparison Using Attribute Deviation Metric. Int. Jour. of Image and Graphics, 4:127–, 2004.

[190] S. D. Roy. Active recognition through next view planning: a survey. Pattern Recognition, 37(3):429–446, 2004.

[191] S. J. Russell and P. Norvig. Artificial Intelligence: A Modern Approach. Pearson Education, int. edition, 2003.

[192] R. B. Rusu. Semantic 3D Object Maps for Everyday Manipulation in Human Living Environments. PhD thesis, Computer Science department, Technische Universität München, Germany, October 2009.

[193] R. B. Rusu and S. Cousins. 3D is here: Point Cloud Library (PCL). In IEEE Int. Conf. on Robotics and Automation (ICRA), pages 1–4, Shanghai, China, May 2011.

[194] R. B. Rusu, N. Blodow, and M. Beetz. Fast point feature histograms (FPFH) for 3D registration. In IEEE Int. Conf. on Robotics and Automation (ICRA), pages 1848–1853, Piscataway, NJ, USA, 2009. IEEE Press.

[195] R. B. Rusu, A. Holzbach, G. Bradski, and M. Beetz. Detecting and segmenting objects for mobile manipulation. In Proceedings of IEEE Workshop on Search in 3D and Video (S3DV), held in conjunction with the 12th IEEE Int. Conf. on Computer Vision (ICCV), Kyoto, Japan, September 27 2009.

[196] R. B. Rusu, W. Meeussen, S. Chitta, and M. Beetz. Laser-based perception for door and handle identification. In Int. Conf. on Advanced Robotics (ICAR), pages 1–7, Munich, Germany, June 2009.

[197] R. B. Rusu, G. Bradski, R. Thibaux, and J. Hsu. Fast 3d recognition and pose using the viewpoint feature histogram. In IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (IROS), pages 2155–2162, Taipei, Taiwan, October 2010.

[198] A. Sahbani, S. El-Khoury, and P. Bidaud. An overview of 3D object grasp synthesis algorithms. Robotics and Autonomous Systems, Aug. 2011.

[199] P. Sakov. Natural Neighbours interpolation library. http://code.google.com/p/nn-c/, 2010.

[200] A. Saxena, J. Driemeyer, J. Kearns, and A. Y. Ng. Robotic Grasping of Novel Objects. Neural Information Processing Systems, 19:1209–1216, 2007.

[201] A. Saxena, L. Wong, and A. Y. Ng. Learning Grasp Strategies with Partial Shape Information. In AAAI Conf. on Artificial Intelligence, pages 1491–1494, 2008.

[202] D. Schlangen and G. Skantze. A General, Abstract Model of Incremental Dialogue Processing. In Conf. of the European Chapter of the Association for Computational Linguistics (EACL), pages 710–718, Athens, Greece, 2009.

[203] SCHUNK. SDH. www.schunk.com, Accessed Oct. 2011.

[204] K. Shimoga. Robot Grasp Synthesis Algorithms: A Survey. Int. Jour. of Robotic Research, 15(3):230–266, 1996.

[205] J. Shotton, J. Winn, C. Rother, and A. Criminisi. TextonBoost: Joint appearance, shape and context modeling for multi-class object recognition and segmentation. In European Conf. on Computer Vision (ECCV), pages 1–15, 2006.

[206] B. Siciliano and O. Khatib. Introduction. In Springer Handbook of Robotics, pages 1–4. Springer-Verlag, Berlin, Heidelberg, 2008.

[207] H. A. Simon. The Sciences of the Artificial. MIT Press, Cambridge, MA, USA, 3rd edition, 1996.

[208] G. Skantze. Jindigo: A Java-based Framework for Incremental Dialogue Systems, 2010. www.jidingo.net.

[209] D. Song, C. H. Ek, K. Hübner, and D. Kragic. Multivariate discretization for bayesian network structure learning in robot grasping. In IEEE Int. Conf. on Robotics and Automation (ICRA), pages 1944–1950, Shanghai, China, May 2011.

[210] J. Speth, A. Morales, and P. J. Sanz. Vision-Based Grasp Planning of 3D Objects by Extending 2D Contour Based Algorithms. In IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (IROS), pages 2240–2245, 2008.

[211] M. Stark, P. Lies, M. Zillich, J. Wyatt, and B. Schiele. Functional Object Class Detection Based on Learned Affordance Cues. In Int. Conf. on Computer Vision Systems (ICVS), volume 5008 of LNAI, pages 435–444. Springer-Verlag, 2008.

[212] A. Steinfeld, T. Fong, D. Kaber, M. Lewis, J. Scholtz, A. Schultz, and M. Goodrich. Common metrics for human-robot interaction. In HRI ’06: Proceedings of the 1st ACM SIGCHI/SIGART conference on Human-robot interaction, pages 33–40, New York, NY, USA, 2006. ACM.

[213] J. Stückler, R. Steffens, D. Holz, and S. Behnke. Real-Time 3D Perception and Efficient Grasp Planning for Everyday Manipulation Tasks. In European Conf. on Mobile Robots (ECMR), Örebro, Sweden, September 2011.

[214] F. Stulp, E. Theodorou, M. Kalakrishnan, P. Pastor, L. Righetti, and S. Schaal. Learning motion primitive goals for robust manipulation. In IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (IROS), pages 325–331, San Francisco, USA, September 2011.

[215] P. Sturgess, K. Alahari, L. Ladicky, and P. H. S. Torr. Combining appearance and structure from motion features for road scene understanding. In British Machine Vision Conf., 2009.

[216] J. Tegin, S. Ekvall, D. Kragic, B. Iliev, and J. Wikander. Demonstration based Learning and Control for Automatic Grasping. Jour. of Intelligent Service Robotics, 2(1):23–30, August 2008.

[217] J. P. Telotte. Replications – A Robotic History of the Science Fiction Film. University of Illinois Press, Urbana and Chicago, 1995.

[218] M. Tenorth, U. Klank, D. Pangercic, and M. Beetz. Web-enabled Robots – Robots that use the Web as an Information Resource. Robotics & Automation Magazine, 18(2):58–68, 2011.

[219] S. Thrun and B. Wegbreit. Shape from symmetry. In IEEE Int. Conf. on Computer Vision (ICCV), volume 2, pages 1824–1831, Los Alamitos, CA, USA, 2005. IEEE Computer Society.

[220] E. A. Topp. Human-Robot Interaction and Mapping with a Service Robot: Human Augmented Mapping. PhD thesis, Royal Institute of Technology, School of Computer Science and Communication, Stockholm, Sweden, September 2008.

[221] A. Turing. Computing machinery and intelligence. In N. Wardrip-Fruin and N. Montfort, editors, The New Media Reader, chapter 3, pages 50–66. The MIT Press, Cambridge and London, 2003.

[222] A. Ude and E. Oztop. Active 3-d vision on a humanoid head. In Int. Conf. on Advanced Robotics (ICAR), pages 1–6, Munich, Germany, 2009.

[223] A. Ude, D. Omrcen, and G. Cheng. Making Object Learning and Recognition an Active Process. Int. J. of Humanoid Robotics, 5(2):267–286, 2008.

[224] N. Vahrenkamp, S. Wieland, P. Azad, D. Gonzalez, T. Asfour, and R. Dillmann. Visual Servoing for Humanoid Grasping and Manipulation Tasks. In IEEE/RAS Int. Conf. on Humanoid Robots (Humanoids), pages 406–412, 2008.

[225] M. Vande Weghe, D. Ferguson, and S. Srinivasa. Randomized path planning for redundant manipulators without inverse kinematics. In IEEE/RAS Int. Conf. on Humanoid Robots (Humanoids), pages 477–482, Nov. 2007.

[226] G. Veruggio. EURON Roboethics Roadmap, January 2007.

[227] A. Vrečko, D. Skočaj, N. Hawes, and A. Leonardis. A computer vision inte- gration model for a multi-modal cognitive system. In IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (IROS), pages 3140–3147, St. Louis, MO, USA, October 2009.

[228] M. J. Webster, J. Bachevalier, and L. G. Ungerleider. Connections of Inferior Temporal Areas TEO and TE with Parietal and Frontal Cortex in Macaque Monkeys. Cerebral cortex, 4(5):470–483, 1994.

[229] K. Welke, M. Przybylski, T. Asfour, and R. Dillmann. Kinematic calibration for saccadic eye movements. Technical report, Institute for Anthropomatics, Universität Karlsruhe, 2008.

[230] S. Wenhardt, B. Deutsch, J. Hornegger, H. Niemann, and J. Denzler. An information theoretic approach for next best view planning in 3-d reconstruction. In Int. Conf. on Pattern Recognition, volume 1, pages 103–106, Los Alamitos, CA, USA, 2006. IEEE Computer Society.

[231] W. Wohlkinger and M. Vincze. Shape-based depth image to 3d model matching and classification with inter-view similarity. In IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (IROS), pages 4865–4870, San Francisco, USA, Sept 2011.

[232] C. Wojek, S. Roth, K. Schindler, and B. Schiele. Monocular 3d scene modeling and inference: understanding multi-object traffic scenes. In Proceedings of the 11th European conference on Computer vision: Part IV, ECCV’10, pages 467–481, Berlin, Heidelberg, 2010. Springer-Verlag.

[233] K. M. Wurm, A. Hornung, M. Bennewitz, C. Stachniss, and W. Burgard. OctoMap: A probabilistic, flexible, and compact 3D map representation for robotic systems. In ICRA 2010 Workshop on Best Practice in 3D Perception and Modeling for Mobile Manipulation, Anchorage, AK, USA, May 2010. Software available at http://octomap.sf.net/.

[234] YafaRay. http://www.yafaray.org/, Accessed Oct. 2011.