
USING GEOMETRIC PRIMITIVES TO UNIFY PERCEPTION AND ACTION FOR OBJECT-BASED MANIPULATION

by Hunter Brown

A thesis submitted to the faculty of The University of Utah in partial fulfillment of the requirements for the degree of

Master of Science

Department of Mechanical Engineering
The University of Utah
May 2018

Copyright © Hunter Brown 2018
All Rights Reserved

The University of Utah Graduate School

STATEMENT OF THESIS APPROVAL

The thesis of Hunter Brown has been approved by the following supervisory committee members:

Tucker Hermans, Chair    March 14, 2018    Date Approved

Mark Minor, Member    March 14, 2018    Date Approved

Jake Abbott, Member    March 14, 2018    Date Approved

and by Tim Ameel, Chair/Dean of the Department/College/School of Mechanical Engineering, and by David B. Kieda, Dean of The Graduate School.

ABSTRACT

In this work we consider task-based planning under uncertainty. To make progress on this problem, we propose an end-to-end method that moves toward the unification of perception and manipulation. Critical to this unification is the geometric primitive. A geometric primitive is a basic 3D shape that can be fit to a single view from a 3D camera. Geometric primitives are a consistent structure in many scenes, and by leveraging this, perceptual tasks such as segmentation, localization, and recognition can be solved. Sharing this information between these subroutines also makes the method computationally efficient.

Geometric primitives can also be used to define a set of actions the robot can use to influence the world. Leveraging the rich 3D information in geometric primitives allows the designer to develop actions with a high chance of success. In this work, we consider a pick-and-place action, parameterized by the object and scene constraints. The design of the perceptual capabilities and actions is independent of the task given to the robot, giving the robot more versatility to complete a range of tasks.

With a large number of available actions, the robot needs to select which action to perform. We propose a task-specific reward function to determine the next-best action for completing the task. A key insight that makes action selection tractable is how we reason about the occluded regions of the scene. Rather than reasoning about what could be in the occluded regions, we treat them as parts of the scene to explore. Defining reward functions that encourage this exploration while balancing progress on the given task gives the robot the versatility to perform many different tasks. Reasoning about occlusion in this way also makes actions in the scene more robust to scene uncertainty and increases the computational efficiency of the method overall.

In this work, we show results for segmentation of geometric primitives on real data and discuss problems with fitting their parameters. While positive segmentation results are shown, there are problems with fitting consistent parameters to the geometric primitives. We also present simulation results showing the action selection process solving a singulation task. We show that our method is able to perform this task in several scenes with varying levels of complexity. We compare against selecting actions at random, and show that our method consistently takes fewer actions to solve the scene.

For my partner Lucia, who is always supportive, and my Grandmother Linda Rae, who has always been there.

CONTENTS

ABSTRACT
LIST OF FIGURES
CHAPTERS

1. INTRODUCTION
2. RELATED WORKS
   2.1 Segmentation
   2.2 Recognition and Localization
   2.3 Robotic Approaches
3. METHODS
   3.1 Problem Definition
   3.2 Method Overview
   3.3 Geometric Primitive Segmentation
   3.4 Action Selection
   3.5 Actions
       3.5.1 Grasping
       3.5.2 Object Placement Locations
   3.6 Update State Estimation
   3.7 Object Singulation Task Formulation
   3.8 Put Away Groceries Task Formulation
4. EXPERIMENTS AND RESULTS
   4.1 Geometric Primitive Segmentation
   4.2 Geometric Primitive Manipulation Planner
5. CONCLUSION
6. FUTURE WORK
APPENDIX: ALGORITHMS
REFERENCES

LIST OF FIGURES

1.1 Examples of common tasks robots might need to complete. (A) An example of a bathroom cabinet with medicine the robot must find, (B) an example of a pantry with messy shelves the robot must organize, and (C) an example of a table with groceries the robot must put away.
1.2 An example of an over-segmentation created by using the geometric primitive. Each color represents a segment of the scene with a different geometric primitive fit to it.
3.1 Examples of a box (A), cylinder (B), and sphere (C) geometric primitive.
3.2 Given a point cloud of a drill (A) we can fit two cylinders and a box to represent this object (B).
3.3 Given an observation (A), possible true states (B).
3.4 An example of Geometric Primitive Segmentation results. (A) Initial 3D point cloud data. (B) Segmentation of the scene. Here each color represents all the pixels assigned to a given segment. (C) Geometric primitives fit to the different segments. Blue pixels have a box fit to them, green pixels have a cylinder, and red have a sphere.
3.5 An example of generated grasp candidates for (a) box, (b) sphere, and (c) cylinder.
3.6 Example of how to generate placement locations. (A) Possible state estimation the robot could encounter. The table is colored green, the objects purple, and occluded regions are black. If the robot tries to find the placement locations for the cylinder in the middle, (B) is the footprint of the scene without the cylinder, (C) is the footprint of the cylinder (scaled to be easily visible), and (D) is the convolution where green indicates placement locations without collision and red placement locations with collision.
3.7 An example scene for the put away groceries task. (A) Counter full of groceries for the robot to put away. (B) Cabinet the robot must place the groceries into.
4.1 Three examples of segmentations generated using Algorithm 2. The first column is an image of the scene. The second column is the segmentation, where each color is a unique segment in the scene. The third column indicates the type of geometric primitive that was assigned to each segment. Green is a cylinder, red a sphere, and blue a box.
4.2 Two different fits for the same data: (A) is an ideal fit and (B) is an error.
4.3 Examples of each scene type: (A) is a large scene and (B) is a small scene. The supporting surface is shown in green and the objects in red.
4.4 The mean and CI for the data as a function of the number of objects in the scene.
4.5 The mean and CI for the data as a function of the size of the scene.
4.6 The mean and CI for the data as a function of the method.
4.7 The mean and CI for the data as a function of the method and the scene size.
4.8 The mean and CI for the data as a function of the method and the number of objects in the scene.

CHAPTER 1

INTRODUCTION

Robots are becoming more relevant in domains beyond their industrial origin, such as domestic settings and assistive care. A robot in these domains has to handle a wide assortment of challenging tasks. For example, an assistive-care robot could face many different jobs throughout the day. In the morning it might need to search the bathroom cabinet for a medication (Figure 1.1 A), in the afternoon organize a messy shelf (Figure 1.1 B), and in the evening put away groceries (Figure 1.1 C). While the robot might deal with similar objects and use similar actions, planning to solve a given task may vary significantly between different scenes. For example, when finding the medicine, there may be several new objects in the cabinet, because the previous night someone bought a new toothbrush and deodorant, and the objects that were in the cabinet yesterday could have been moved to new locations. To complete many of these tasks the robot will need to manipulate the scene, such as moving the lotion to reveal the medicine behind it. This can be difficult for the robot to do without knocking objects out of the cabinet, running into unseen parts of the scene, or even destroying itself or other objects. For robots to be successful in these new domains, they must be able to safely handle a multitude of tasks in partially known and unstructured environments.

Figure 1.1. Examples of common tasks robots might need to complete. (A) An example of a bathroom cabinet with medicine the robot must find, (B) an example of a pantry with messy shelves the robot must organize, and (C) an example of a table with groceries the robot must put away.

While there is limited structure in object-based manipulation problems, it is important to leverage the structure that does exist. We can exploit this structure by giving the robot a set of basic perceptual skills. Differentiating objects from each other (i.e., segmenting objects) gives the robot the ability to determine which parts of the scene are objects. Given a segmentation of the scene, the robot should determine the location of the objects in relation to the robot (i.e., localize the objects). Segmenting and localizing objects gives the robot critical information about how to manipulate objects in the scene. Finally, being able to recognize familiar objects gives the robot the ability to leverage information about previously known objects. These base-level perceptual skills give the robot the ability to reason about the scene and its constraints.

The robot will also need actions to manipulate or interact with the scene. We consider only a pick-and-place action, but pushing and tapping are other possible action types. When the robot understands how an action will change the scene, it can begin to try to solve a task. This is done by chaining together a series of actions to move the scene from its current state to a state where the task is complete. It is therefore essential for robots to be able to perform actions and predict the result of a given action. Actions not only change the scene; they can also improve the robot's understanding of the scene by reducing scene uncertainty. For example, if there is a large box on a table, moving it would provide information about what is behind the object, and information about the full extent of the box. Selecting which actions to perform, and in what order, is critical for robots to perform any of the above tasks.

Alternatively, a learning-based approach could be used. A learning algorithm could take the robot's sensors as input and output joint torques or other basic motion primitives. Using primitive perception skills and robot actions has some appealing benefits in contrast to direct policy learning. First, we extract information from the perception skills that can be used in higher-level planning. For example, if a learned policy used some kind of color feature to distinguish objects, this information is lost to the higher-level planner. These methods also tend to suffer when trying to generalize, and generating the data required to ensure generalization can be costly or impossible.

Many of these perceptual skills have been dealt with in isolation, but how to use these methods together remains unclear. Dealing with cluttered scenes is often treated as a segmentation problem [5]. In robotics, the robot is often allowed to interact with the scene, and when this is done it is called interactive segmentation [20]. Many segmentation problems treat objects using a parts model [20].
In these models, objects are treated as collections of parts. When this is done, an over-segmentation, or a segmentation that results in more segments than objects, is used, and during a second phase the parts are grouped into objects [20], [17]. Sometimes the segmentation problem is instead solved by trying to singulate the objects. A common extension to the object segmentation problem is interactive object modeling, used to build object models [11], [14]. During interactive modeling, models of objects are built by the robot moving the objects. Often this is done on singulated or single objects. Creating models is treated as the goal of the robot, making it unclear how these methods could be used in conjunction with another goal. Many methods for object recognition and localization have also been proposed. While there is more literature on using localization in robotics, these methods are often treated as inputs to other task-specific methods.

Many of these methods suffer from throwing away information. They use color, geometric, or other features to guide the segmentation or modeling, and then the final segmentation is presented without those features. Often many or all of these features could be useful for manipulation or other parts of a larger task. These methods also suffer from treating the task and the subproblem as the same. In real environments any of these subproblems would be done in the context of a larger goal. This makes it unclear how many of these methods could be used to complete a larger task.

This motivates the main emphasis of our work: how can we synthesize the action and perception skills together, independent of the given task? Many tasks require this same set of perception skills and actions, such as object singulation, modeling, or search. We unify all the perception skills and actions through the use of the geometric primitive. A geometric primitive is a basic 3D shape fit to objects in the scene, such as a sphere, plane, or cylinder. Sharing this information at each stage of the process provides computational efficiency by not extracting the same or similar information at multiple stages.

By defining our actions as functions of the geometric primitive, we can create interactions with the scene that are robust to scene uncertainty. These actions will have higher chances of success because we can encode this geometric information into the action. For example, if we know how to grasp a sphere well, then grasping an object that is mostly spherical should be similar. We have also reasoned about the object's extent in the scene. This information gives us a better idea of important physical properties, such as the centroid of the object, and can be used during motion planning.

We perform segmentation by fitting geometric primitives to the scene. This creates an over-segmentation of the scene; an example is shown in Figure 1.2. As evidence is gained through interaction or physical constraints, the geometric primitives can be composed into rigid objects. This can be used to quickly localize moved or known objects, or to recognize previously seen objects. This representation also provides an estimate of regions that have not been directly viewed, because geometric primitives have extent beyond what the robot's sensor observes. Reasoning about these occluded regions makes robot interaction less likely to result in collisions or unintended scene disturbances.
Figure 1.2. An example of an over-segmentation created by using the geometric primitive. Each color represents a segment of the scene with a different geometric primitive fit to it.

In order to complete the task at hand, these perception skills and actions are placed into a partially observable Markov decision process (POMDP) [8]. This structure helps create a desired sequence of actions that can complete the assigned task. We show that the largest source of uncertainty in this problem is the occluded regions of the scene, and develop a way of reasoning about exploring these regions while still making progress toward the given task.

In this work we make several key contributions that progress task-based planning under uncertainty toward solving general object-based tasks. We first present the geometric primitive as a tool for sharing data between perceptual skills and planning robust actions. The geometric primitive provides methods for segmenting and localizing objects, while being useful during motion planning and action selection. We also present a novel approach to state estimation in cluttered scenes that handles occluded regions of the scene in a computationally efficient way that is robust to state uncertainty. Finally, we show our method performing the common object-based manipulation task of singulation on simulated data.

CHAPTER 2

RELATED WORKS

In this chapter we will discuss the common approaches to many of the perceptual skills outlined in the previous chapter, and how robots have solved common manipulation tasks. We will first discuss segmentation in Section 2.1. We will then discuss the problems of recognition and localization jointly in Section 2.2. We will then look at how robots use these methods to solve these and other tasks in Section 2.3.

2.1 Segmentation

Segmentation is the process of breaking an image into more meaningful segments. Many approaches fall into two broad categories: region-growing techniques and feature-point techniques. In region-growing techniques, the image is seeded and seeds are grown by adding neighboring pixels if they meet a similarity metric. A common example of this is presented by Felzenszwalb et al. [5]. These methods are attractive as they assign every pixel in the image to a segment, but the segments are hard to localize. Feature-point techniques instead locate points in the image that meet some criteria. These criteria are often related to the difficulty of relocating the points in future images. The premier feature point is the SIFT feature developed by David Lowe [13]. These types of features are often easy to track, but they are single points, and thus do not segment every pixel in the image.

These methods have been expanded to work on 3D data as well. For example, Rabbani et al. developed a region-growing technique in point clouds that uses a smoothness constraint to find locally flat regions [16]. Similarly, the Fast Point Feature Histograms developed by Rusu et al. provide feature points in 3D data [17]. These methods retain the same problems as their 2D counterparts.

Geometric primitives are a type of model, so a review of segmenting 3D data with model-fitting techniques is in order. Random Sample Consensus (RANSAC) is the standard model-fitting method, and it has been used in several 3D segmentation applications [6].

Rusu et al. mapped indoor scenes by fitting planes to the world using RANSAC and Fast Point Feature Histograms to segment the world into meaningful parts [18]. Schnabel et al. fit basic shapes to 3D models of art using an efficient RANSAC method [19]. RANSAC can be much slower than region-growing techniques, but it gives the full extent of the located models while still segmenting each point. Also, if the underlying models are inherent to the scene, relocating them is much easier. In this work we use RANSAC to fit more complex shapes than the Rusu et al. work [18]. This gives a richer representation of the world that can work in more complex environments. Much of the work fitting 3D shapes to scenes involves registering multiple views and then extracting the 3D shapes [6], [19]. In contrast, our work uses a single view. This is more representative of the information robots will have when performing tasks in new scenes.

2.2 Recognition and Localization

Object recognition is the process of determining if an object is in an image or scene. Localization is determining the pose of a known object in a scene. These problems are often solved together, but this is not always the case. For example, Donahue et al. used features from a pretrained neural network to perform object recognition on the Caltech-101 dataset [4]. Deep methods like this suffer from losing information: once the recognition is calculated, it is not clear how these features can be used to solve other subproblems. A common localization technique is to perform Iterative Closest Point (ICP), which does not alone require object recognition but often needs segmentation to be performed first [1]. ICP requires a good initial estimate of the pose to converge, which can be difficult to obtain in cluttered scenes.

Many algorithms are able to perform localization and recognition in a single step. This is sometimes referred to as object pose estimation. Hinterstoisser et al. are able to localize and recognize objects from RGB images using template-matching techniques [7]. This method suffers from requiring several views of the segmented object. Getting this data is costly and limits how this method can be used in scenes with new objects. Using a combination of standard feature points and their proposed color point pair feature, Choi et al. are able to perform object recognition and localization on complex 3D scenes [2].

Choi et al. made use of both color and 3D information. This method requires a mesh or point cloud model of the object to train the recognition model. To use this method in new scenes would require the robot to segment the object and build a full 3D model on-line, which is not always practical and would slow down completing any other tasks. In contrast to these methods, our proposed geometric primitive method provides object localization and recognition without the need for a mesh or images of the object a priori. The object model can be built entirely on-line, providing an ability to handle novel objects. The geometric primitive is also computed during the segmentation, meaning that these data can be shared to perform the localization and recognition.

2.3 Robotic Approaches

In contrast to the approaches previously discussed, robots are able to do more than just perceive the environment. Robots are able to interact with and influence the scene. When robots use this capability in conjunction with classic methods, it is termed interactive. For example, Hoof et al. performed interactive segmentation by finding SIFT features and using them to segment parts from the scene [20]. The robot then pushed parts in the scene and tracked which ones moved. Evidence was accumulated over time to determine all unique objects. This work only retains feature-point positions over time, so it is unclear how it could be used to manipulate the objects beyond pushing them. Ma et al. performed interactive localization, recognition, and model building by treating the problem with SLAM (simultaneous localization and mapping) techniques [14]. This method builds rich 3D models of objects, but it is unclear if it works in cluttered scenes, where the segmentation of the objects is not obvious. It is also unclear how it could be used in a framework solving another task. Katz et al. developed an approach for interactive modeling of articulated objects by tracking SIFT features and looking for patterns in their motion [9]. While it is able to solve the problem, it only works for textured objects. Krainin et al. perform interactive model building by using an ICP variant to localize the object [11]. They then perform next-best-view planning to determine the next best place to move the object to improve their model. While this method solves the task, it assumes the object is the only object in the scene and that it can be segmented.

All of these methods suffer from one or both of the following problems: they either have the robot work hard to solve one task, or they make an assumption about inputs that simplifies the task's perception requirements. This makes it difficult or impossible for these methods to work in general, cluttered scenes. If the framework to get a segmentation is complex and takes many steps, it can be time consuming for the robot to then perform a second task. In contrast, our method blends the perceptual requirements into solving the task. As the robot makes progress on a task, it is also improving its segmentation and localization of the objects in the scene.

In solving tasks beyond perception problems, selecting the next action can be quite difficult. Katz et al. weighted different actions based on geometry and color to lift irregular objects off a pile [10]. This type of framework does not generalize beyond lifting objects from a pile. Martin et al. proposed a method for on-line interactive perception of articulated objects by linking the low-level perception, such as the feature points, to the higher-level problems and using information at each level to inform lower levels [15]. While this makes the method robust to estimation error at each level, it is still focused on a single task. These methods all tailor the perception tools to the problem at hand. In contrast, our method develops perception and formulates the task around the capabilities of the robot. An example of this was shown by Dugar et al., who proposed a novel algorithm for searching for objects that are hidden behind other objects [3]. This method is computationally efficient and is defined independently of the perception capabilities of the robot. Solving problems in this way allows them to be placed into larger end-to-end frameworks easily.
This increases the versatility of robots in general. In this work we aim to provide this end-to-end framework.

CHAPTER 3

METHODS

We divide this chapter into several sections. First the problem will be clearly defined and formalized as a POMDP in Section 3.1. Following this we present an outline of our solution in Section 3.2. The method will be described in further detail in the remaining sections.

3.1 Problem Definition

We begin by defining the state space. We do this by first defining the geometric primitive. A geometric primitive is a basic 3D geometry defined by a parameter vector φ ∈ R^p and a 3D pose x_g ∈ SE(3). The length p is determined by the type of geometric primitive. The first element of φ indicates the type of primitive, and the remaining elements define the parameters of this primitive. For example, a sphere might have a first element equal to zero and a second element equal to the radius of the sphere. We define the set of all geometric primitives in the scene as G_scene. An example of the three geometric primitives used in this work is shown in Figure 3.1.

Geometric primitives generate objects. We define an object as a set of geometric primitives such that all geometric primitives have a constant pose in the object's coordinate frame.


Figure 3.1. Examples of a box (A), cylinder (B), and sphere (C) geometric primitive.

This gives us the following definition:

$$O_k = \{\, G_j \in G_{scene} \mid T^{o_k}_{G_j} = M_j \,\} \qquad (3.1)$$

where $T^{o_k}_{G_j}$ represents the transform from the k-th object's coordinate frame to the j-th geometric primitive's coordinate frame and $M_j$ is a constant transformation matrix. In other words, geometric primitives are rigidly attached to the object coordinate frame. An example of how a drill could be represented by geometric primitives is shown in Figure 3.2. We can now define the state space of our model:

$$X = \{\, (O_k, x_k) \mid k \in 1, 2, \dots, n;\ x_k \in SE(3),\ O_k \in O \,\} \qquad (3.2)$$

as a set of tuples, where n is the number of objects in the scene, O is the set of all objects that could be in the scene, and x_k is the pose of the k-th object. While we assume that there is a finite number of objects, there is still an infinite number of possible poses. Note that in our problem we do not know the number of objects n, or which objects are in the scene. We can, however, say that in general n ≪ ||O||.

In a POMDP, we need a set of actions, A. For our work we have one parameterized action that generates an infinite set of actions on any given scene. We call this action pick-and-place. Pick-and-place grasps an object, lifts it, and places it at a new location. Therefore, for a given state, the set of actions is defined as:

$$A = \{\, (O_k, x'_k) \mid O_k \in O;\ x'_k \in SE(3) \,\} \qquad (3.3)$$

where x'_k is the next location to move the object to. For a given state and action we need to define a probabilistic transition function:


Figure 3.2. Given a point cloud of a drill (A) we can fit two cylinders and a box to represent this object (B).

$$T(x, a) = \Pr(x' \mid x, a) \qquad (3.4)$$

where x ∈ X, a ∈ A, and x' ∈ X. This is a probability distribution over possible next states, conditioned on the current state and the action selected. This model can be as complex as desired, depending on the assumptions made. In the general case it would need to account for dropping the object along any part of the action, variation in the pose of the object when placed, the object falling over when placed, and, even more generally, how dropping an object would affect the other objects in the scene (for example, a dropped object knocking another object over). In our work, we assume that we can perform pick-and-place successfully, placing the object at the new position with only Gaussian error on the final pose. Next, we define a reward function as a function that maps state-action-next-state triples to real numbers, R(x, a, x').

We next define the observation space. In this work, we use a single depth camera. This generates a noisy point cloud P, which is a matrix of 3D points. We then define the set of possible observations as follows:

$$Z = \{\, P_k \mid P_k \in \mathbb{R}^{n \times m \times 3} \,\} \qquad (3.5)$$

where n is the width of the point cloud and m is its height. The camera is only able to view the scene from a single view, and thus significant portions of the scene are occluded.

Defining an observation function is difficult for this state space. The observation function maps the state to an observation. The error in the point cloud is a function of both the distance of a point from the camera and the surface normal at that point, because the 3D camera is a structured-light camera. The amount of occlusion in the scene makes this more difficult: many states will produce the same observation before sensor error is added. Attempting to define this analytically is difficult. We are more interested in obtaining an estimate of the maximum-likelihood state given an observation, and we treat our segmentation and localization algorithm as an approximation of this.

POMDPs are used to develop a policy π. The policy is a function that maps states to actions:

$$\pi(x) = a \qquad (3.6)$$

where x ∈ X and a ∈ A. The optimal policy π* will return the action that maximizes the expected long-term reward:

$$\pi^* = \arg\max_{\pi} \ \mathbb{E}\!\left[ \sum_{t=0}^{\infty} R(x_t, a_t, x_{t+1}) \,\middle|\, \pi \right] \qquad (3.7)$$

3.2 Method Overview

We now examine the tractability and complexity of common POMDP solvers for our problem. In the standard model, the robot performs an action and then receives a reward and an observation. The robot does not explicitly see the state but must estimate it from the observation. This is normally done by defining a belief, i.e., a probability distribution over the possible states of the world, b. We then define the probability of an observation conditioned on the next state and action, Z(z | x_{t+1}, a). We then update our belief after a given action:

$$b_{t+1}(x) = \eta\, Z(z \mid x_{t+1}, a) \int_{x \in X} T(x_{t+1} \mid x_t, a)\, b(x_t) \qquad (3.8)$$

where η is a normalizing constant. The integral term is used here, but our state space has both a continuous component and a discrete component, so for the discrete component the integral devolves into a summation. This changes Equation 3.7 to the following:

$$\pi^* = \arg\max_{\pi} \ \mathbb{E}\!\left[ \sum_{t=0}^{\infty} \int_{x \in X} b(x_{t+1})\, R(x_t, a_t, x_{t+1}) \,\middle|\, \pi \right] \qquad (3.9)$$

An example of an observation and possible true states is shown in Figure 3.3. Looking at this example, it is clear that tracking each possible state in the belief is unnecessary, as many of these states are unlikely. Further, due to the size of the state space, solving for the optimal policy is intractable without simplification. There are two key observations that suggest meaningful simplifications:

1. Different states can give the same observation, within some ε.

2. n ≪ ||O||, where n is the number of objects in the scene.

Point 1 happens when there is occlusion in the scene. In this case, any combination of objects in O, in any set of orientations, that fits in this occluded region represents a unique state, and in turn a state we are updating in the belief. Point 2 indicates that often


Figure 3.3. Given an observation (A), possible true states (B).

most of O is not in the scene. Together these indicate that many of the possible states in the belief are unlikely. Reformulating the problem to handle this robustly would more accurately represent the problem, but this can be difficult for general applications. Even with a better representation, calculating π* is still intractable, as integrating over the belief is combinatorially intractable.

Instead, we propose to not deal directly with the occluded regions. We do this by treating occluded regions as obstacles that we cannot interact with. Furthermore, in the estimation of geometric primitives, we assume Gaussian noise on the parameters and poses. Finally, we assume that when a pick-and-place action is performed, the object ends up at approximately the desired location plus some Gaussian noise. We then discretize the state space by representing the scene as a 3D voxel grid. This also discretizes the action space, as the next location for the object must be selected from a set of fixed locations. We do not represent the full belief, but instead approximate a maximum-likelihood estimate of the state, which we call x̃. The approximate state x̃ is represented by the voxel grid of the scene. The means of the Gaussian distributions over the geometric primitive parameters are inserted into the voxel grid. The voxel grid contains an integer label for each voxel, and each voxel can be one of three types (a minimal sketch of such a grid follows the list):

1. If the voxel overlaps a geometric primitive, it will contain a pointer to the geometric primitive.

2. If the voxel is occluded, it will be labeled occluded.

3. If the voxel has been viewed at any time, and does not currently contain a primitive, it will be marked free.
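As a concrete illustration, here is a minimal Python sketch of such a label grid. The class name `VoxelGrid`, the sentinel values, and the helper methods are assumptions made for illustration, not code from the thesis or its appendix algorithms.

```python
import numpy as np

FREE, OCCLUDED = -1, -2   # sentinel labels; non-negative values are primitive IDs


class VoxelGrid:
    """Approximate state: one integer label per voxel of the workspace."""

    def __init__(self, dims, resolution, origin):
        self.resolution = resolution                   # voxel edge length in meters
        self.origin = np.asarray(origin, dtype=float)  # world position of voxel (0, 0, 0)
        # every voxel starts occluded until a ray from the camera reaches it
        self.labels = np.full(dims, OCCLUDED, dtype=np.int32)

    def to_index(self, point):
        """Convert a world-frame point to an integer voxel index."""
        return tuple(((np.asarray(point) - self.origin) / self.resolution).astype(int))

    def mark_free(self, point):
        self.labels[self.to_index(point)] = FREE

    def insert_primitive(self, primitive_id, points):
        """Label every voxel touched by the primitive with its (non-negative) ID."""
        for p in points:
            self.labels[self.to_index(p)] = primitive_id

    def is_occluded(self, index):
        return self.labels[index] == OCCLUDED

    def occluded_voxel_count(self):
        return int(np.sum(self.labels == OCCLUDED))
```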

This approximation of the state keeps the problem computationally efficient, but it comes with a strong assumption. We assume that the scene is only being changed by the robot, so once a voxel is viewed and labeled free, it will remain free unless the robot places an object in that voxel.

The reward function directly defines π*. For handling any generic task, such as organizing a shelf or finding the remote, we can change the reward function to create higher-level behavior. Equation 3.7 contains an infinite sum, but in practice we would like the robot to perform actions until the task is complete. We therefore define a task as having two parts: a reward function and a task-complete function, both functions of x̃.

We present the Geometric Primitive Manipulation Planner in Algorithm 1 (see Appendix). We have an initial observation z0 and a given task T. In line 1, we create a voxel grid and insert geometric primitives fit to z0 using the segmentation and localization method discussed in Section 3.3. This returns the state approximation x̃. In line 4, we use the task's reward function and x̃ to select the next action, as discussed in Section 3.4. Next, in line 5, we perform the action on the robot and receive an observation, z. Finally, we update x̃ as described in Section 3.6. These steps are repeated until the condition shown in line 2 is met, or until the task's task-complete function returns true for x̃. We formulate the reward functions and task-complete functions for two tasks: a singulation task in Section 3.7, and putting away groceries in Section 3.8.
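A rough sketch of this outer loop is given below (Algorithm 1 itself is in the appendix and is not reproduced here). The functions `segment_primitives`, `select_action`, and `update_state` stand in for the routines of Sections 3.3, 3.4, and 3.6, and the `task`/`robot` interfaces are assumptions made for illustration.

```python
def geometric_primitive_manipulation_planner(z0, task, robot, max_steps=50):
    """Greedy task loop: segment, select an action, execute it, update the state."""
    x_tilde, primitives = segment_primitives(z0)             # Section 3.3 (Algorithm 2)
    for _ in range(max_steps):                               # safety bound on iterations
        if task.task_complete(x_tilde):                      # task-specific stopping test
            break
        action = select_action(x_tilde, task.reward)         # Section 3.4
        z = robot.execute(action)                            # pick-and-place, Section 3.5
        x_tilde, primitives = update_state(x_tilde, primitives, z)  # Section 3.6
    return x_tilde
```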

3.3 Geometric Primitive Segmentation

The geometric primitive segmentation method uses at its core a RANdom SAmple Consensus (RANSAC) variant to fit geometric primitives to the scene. RANSAC fits models by sampling a random set of points from the dataset, fitting a model to these points, and checking how many points in the dataset are inliers to this model [6]. RANSAC has the advantage of fitting models accurately in data with a large number of outliers. Objects at normal distances to the camera are often made up of only 500-1500 points, whereas a point cloud P is often around 300,000 points. As more iterations are performed, the probability of sampling the optimal model increases:

$$k = \frac{\log(1 - p)}{\log(1 - w^n)} \qquad (3.10)$$

where k is the number of iterations, p is the probability that after k iterations we have found the underlying model, w is the probability of sampling an inlier to the model, and n is the minimum number of points required to fit a model.

The relationship between the number of iterations and the probability of finding the underlying model is attractive, as we can now give a confidence that we have found the underlying geometric primitives in the scene. In this work, we assume that all geometric primitives are either cylinders, spheres, or boxes. From Equation 3.10 we can see that as the minimum number of points needed to describe a model increases, the number of iterations also increases. In order to limit this, for the sphere and the cylinder, the surface normals are used to fit the model as follows:

$$d = \| p - p_m \|^2 + \alpha\, n_p \cdot n_m \qquad (3.11)$$

where p is the point being tested, n_p is its surface normal, p_m is the closest point on the model, n_m is the model's surface normal at that point, and α is a weighting term. If d < ε, then the point is an inlier. For the box, the following distance is used:

$$d_{box} = \| p - p_m \|^2 \qquad (3.12)$$
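To make Equations 3.10-3.12 concrete, the following sketch computes the RANSAC iteration count and the two inlier tests as written above. The parameters α and ε, and the example values on the last line, are illustrative tuning choices, not values from the thesis.

```python
import numpy as np

def ransac_iterations(p, w, n):
    """Eq. 3.10: iterations k needed to find the model with probability p,
    given inlier probability w and minimal sample size n."""
    return int(np.ceil(np.log(1.0 - p) / np.log(1.0 - w ** n)))

def is_inlier_curved(point, normal, model_point, model_normal, alpha, eps):
    """Eq. 3.11: sphere/cylinder inlier test combining position and normal terms."""
    d = np.sum((point - model_point) ** 2) + alpha * np.dot(normal, model_normal)
    return d < eps

def is_inlier_box(point, model_point, eps):
    """Eq. 3.12: box inlier test using only the squared distance to the model."""
    return np.sum((point - model_point) ** 2) < eps

# e.g. 99% confidence with 20% inliers and a 3-point minimal sample
print(ransac_iterations(p=0.99, w=0.2, n=3))   # a few hundred iterations
```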

The parameter vector φ is different for each primitive. For the sphere, φ consists of the radius of the sphere. The orientation of a sphere is not defined, so it is assumed to be axis aligned, with the origin being the center of the sphere. The φ vector for the cylinder consists of the radius of the cylinder and its height. The origin of the cylinder is the center of the bottom face, with the direction of the cylinder's major axis defining the z-axis. For the box, the φ vector consists of the width, height, and depth of the box, with its origin defined as the centroid.

Geometric Primitive Segmentation is shown in Algorithm 2 (see Appendix). It requires a point cloud P as input and returns a set of geometric primitives, G_list, and an initialization of the state x̃. The incoming point cloud

P is first filtered in line 1. This filtering is done by using RANSAC to fit a plane to the table; all points on and below the table, and any points at a distance of z_max or more, are removed from P. Then lines 3-19 are repeated while ||P|| > minimum primitive size, where minimum primitive size is a parameter defining the minimum number of points that can be considered a geometric primitive. Results for this are shown in Figure 3.4.

Lines 3-18 fit one of each type of geometric primitive to the current point cloud P (line 6). This is done with replacement. RANSAC does not provide bounds on all of the parameters in φ; for example, the cylinder height is not defined in the RANSAC model. In order to define this, on line 7 we call largestConnectedComponents on the inliers found during RANSAC. This projects each point back into the image and keeps only the largest set of connected points. From this, the bounds not defined by RANSAC are calculated. For each primitive type, lines 8-12 ensure we keep only the parameterization with the largest number of inliers.



Figure 3.4. An example of Geometric Primitive Segmentation results. (A) Initial 3D point cloud data. (B) Segmentation of the scene. Here each color represents all the pixels assigned to a given segment. (C) Geometric primitives fit to the different segments. Blue pixels have a box fit to them, green pixels have a cylinder, and red have a sphere.

Lines 14-17 then check whether ||max inliers|| > minimum primitive size, and if so, add the primitive's parameters and pose to G_list. Finally, line 19 initializes x̃. The voxel grid is predefined to fit the workspace of the robot, and its origin is calibrated to the camera and robot. The init grid function adds each geometric primitive in G_list to the voxel grid. It then runs ray tracing from the camera's location into the scene. Any cell a ray passes through before striking a primitive is labeled free, all cells with primitives in them are labeled with the primitive's ID, and any cells that no ray passes through are labeled occluded.
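The outer loop of Algorithm 2 might be organized roughly as follows. This is a sketch only: the per-primitive RANSAC fitters and the connected-component step are passed in as stand-in functions rather than reproduced from the thesis appendix.

```python
import numpy as np

def segment_geometric_primitives(cloud, fitters, largest_connected_component,
                                 min_primitive_size):
    """Peel geometric primitives off a filtered point cloud, largest support first.

    `fitters` maps a primitive type ("box", "cylinder", "sphere") to a function
    returning (parameters, pose, inlier_indices) for the given cloud.
    """
    g_list = []
    remaining = np.asarray(cloud)
    while len(remaining) > min_primitive_size:
        best = None
        for primitive_type, fit in fitters.items():           # fit with replacement
            params, pose, inliers = fit(remaining)
            inliers = largest_connected_component(remaining, inliers)
            if best is None or len(inliers) > len(best[3]):
                best = (primitive_type, params, pose, inliers)
        if len(best[3]) < min_primitive_size:                  # nothing big enough remains
            break
        g_list.append(best[:3])                                # keep (type, params, pose)
        remaining = np.delete(remaining, best[3], axis=0)      # remove explained points
    return g_list
```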

3.4 Action Selection

Action selection is performed in three steps. First, all possible actions are generated. This is done by assuming that all objects in the scene are graspable and calculating all possible placement locations for each object. For each candidate action, the next state x̃′ is computed, and the task's reward function is evaluated on x̃ and x̃′ to calculate that action's reward. The robot then executes the action with the highest reward.

This is inherently a greedy procedure. We settle for it due to the occlusion in the environment. While we are not required to reason about the different possible objects in the occluded regions, we have to plan as though those regions are completely full of obstructions. This puts more weight on the reward function to capture long-term aspects of the problem. We formulate example reward functions in Section 3.7 and Section 3.8.
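A minimal sketch of this greedy, one-step-lookahead selection is shown below; `candidate_actions` and `predict_next_state` are assumed helpers corresponding to the action generation and next-state computation described above.

```python
def select_action(x_tilde, reward_fn, candidate_actions, predict_next_state):
    """Greedy one-step lookahead: simulate each candidate action and keep the best."""
    best_action, best_reward = None, float("-inf")
    for action in candidate_actions:
        x_next = predict_next_state(x_tilde, action)   # deterministic forward model
        r = reward_fn(x_tilde, x_next)                 # task-specific reward (Secs. 3.7, 3.8)
        if r > best_reward:
            best_action, best_reward = action, r
    return best_action
```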

3.5 Actions

Actions represent things the robot can do to the environment. In this work we consider a single type of action: pick-and-place. Pick-and-place consists of two steps: first the object is grasped, and second the object is placed on the table. The action a(O, x′) is parameterized by the object to be grasped, O, and the final object pose, x′. We consider all objects graspable, though this might not always be the case. As described in Section 3.5.1, if no plan can be found to grasp an object, it is considered not graspable at that time step, and no pick-and-place operation can be run on it. We describe how a grasp is generated and performed in Section 3.5.1.

After grasping, the robot must lift the object and transport it via a collision-free trajectory to x′. Planning this trajectory is done using RRT-Connect [12], but determining locations where the object can be placed is more difficult. Leveraging the discretization provided by the voxel grid, we can use a convolution between the object's footprint and the scene's footprint to determine collision-free placement locations. This is described in more detail in Section 3.5.2.

3.5.1 Grasping

Grasping is a difficult and open-ended problem. When grasping from a single view, we cannot always be sure of the extent of the object. The geometric primitive provides an estimate of this extent from a single view. To grasp an object, we first calculate a set of grasp candidate poses. This is a set of poses in the workspace such that if the robot's gripper closed while at one of these poses, the robot would grasp the object. Next, we try to plan and execute a collision-free path to each grasp candidate pose until we are successful or run out of grasp candidate poses. If we run out of grasp candidate poses, we determine that the object cannot be grasped at this time step. If we are able to find and execute a collision-free path to a grasp candidate pose, we then close the gripper. In this work, we use a parallel-jaw gripper, so the close operation is simply moving the jaws together. For more complex grippers, grasp candidate poses and the close operation would have to be defined differently. We now describe how grasp candidates are generated for each type of geometric primitive. An example of the grasp candidate poses for each primitive type is shown in Figure 3.5.


Figure 3.5. An example of generated grasp candidates for (a) box, (b) sphere, and (c) cylinder.

Sphere: We first consider spherical grasp candidate poses. In order to generate candidates we have to define two parameters: the number of grasp rotations, n_gr, and the number of rotational divisions, n_r. Candidates are generated iteratively, starting from an initial grasp candidate pose created by pointing the z-axis of the candidate toward the center of the sphere. The origin of the candidate is set at a predefined offset from the center of the sphere.

The initial candidate is then rotated around the object's z-axis n_r times, at an increment of 2π/n_r rad. For each rotation about the z-axis, the candidate is rotated n_r/2 times around the object's y-axis, at an increment of 2π/n_r rad. Finally, for each z-axis and y-axis rotation, the candidate is rotated about its current z-axis n_gr times, at an increment of 2π/n_gr rad.

Box: Next, we consider box grasp candidate poses. For grasping boxes we also define two parameters: a boolean value flip grasp, and the number of plane divisions, n_p. Grasps are generated along each face of the box. First, at the base of each face, a candidate is generated with its z-axis pointing in the opposite direction of the face's surface normal. The origin of the candidate is a predefined offset from the center of the box. We translate this candidate along the face in increments of l_f/n_p, where l_f is the length of the face. At each pose, if flip grasp is true, the candidate is also rotated π rad around its z-axis to create a second grasp. In the case of a parallel gripper, this second grasp represents switching which face each jaw presses into.

Cylinder: Finally, we consider the cylinder grasp candidate poses. We use three of the parameters already defined for the box and the sphere: the number of grasp rotations n_gr, the number of plane divisions n_p, and the flip grasp boolean. There are two types of cylinder grasps possible: overhead grasps and side grasps. For overhead grasps, we select a candidate with its z-axis pointed along the cylinder's main axis, with positive z pointed toward the cylinder. We then rotate this candidate around its z-axis n_gr times. For side grasps, we select a candidate with its z-axis perpendicular to the cylinder's main axis, with positive z-values in the direction of the cylinder. We then translate this candidate in the direction of the cylinder's main axis n_p times, at an interval of h/n_p, where h is the height of the cylinder. At each translation we rotate the candidate around the cylinder's main axis n_gr times, at rotation increments of 2π/n_gr. For each of these candidates, if flip grasp is true, the grasp candidate pose is also rotated π rad around its z-axis. In the case of a parallel gripper this second candidate represents switching which side of the cylinder the jaws press into.
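As an illustration of the sphere case, the sketch below enumerates candidate poses with the nested rotations just described, using SciPy's rotation utilities. The initial orientation, the composition order, and the standoff handling are assumptions made for the sketch, not details taken from the thesis.

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

def sphere_grasp_candidates(center, standoff, n_r, n_gr):
    """Enumerate grasp candidate poses for a sphere as (rotation, translation) pairs."""
    center = np.asarray(center, dtype=float)
    base = R.identity()   # assume the base frame already points its z-axis at the sphere
    candidates = []
    for i in range(n_r):                                    # about the object z-axis
        rz = R.from_euler("z", 2.0 * np.pi * i / n_r)
        for j in range(n_r // 2):                           # about the object y-axis
            ry = R.from_euler("y", 2.0 * np.pi * j / n_r)
            for k in range(n_gr):                           # spin about the approach axis
                spin = R.from_euler("z", 2.0 * np.pi * k / n_gr)
                rot = rz * ry * base * spin
                # back the gripper off along its own approach (z) axis
                translation = center - rot.apply([0.0, 0.0, standoff])
                candidates.append((rot, translation))
    return candidates

# e.g. a sphere at the origin, 10 cm standoff, n_r = 8, n_gr = 4 -> 8 * 4 * 4 = 128 poses
poses = sphere_grasp_candidates([0.0, 0.0, 0.0], 0.10, n_r=8, n_gr=4)
```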

3.5.2 Object Placement Locations

For a given object O and state estimate x̃, we want to find a set of placement locations for O that are collision-free with respect to the other objects and occluded regions in the scene. If there is uncertainty in placing the object, we could accidentally collide with other objects or occluded regions. While in most POMDPs this would be reasoned about explicitly, we can take advantage of the voxel grid structure in x̃ to make a conservative approximation. The first step is to project the voxel grid onto a bitmap. Each pixel in the bitmap represents a column in the voxel grid, where columns are parallel to the z-axis of the voxel grid's coordinate frame. This makes the bitmap's plane the supporting surface of the scene. If all voxels in a column are labeled free, then the corresponding bitmap pixel is set to 1; otherwise it is set to 0. An example of this is shown in Figure 3.6 (B).



Figure 3.6. Example of how to generate placement locations. (A) Possible state estimation the robot could encounter. The table is colored green, the objects purple, and occluded regions are black. If the robot tries to find the placement locations for the cylinder in the middle, (B) is the footprint of the scene without the cylinder, (C) is the footprint of the cylinder (scaled to be easily visible), and (D) is the convolution, where green indicates placement locations without collision and red placement locations with collision.

It is important that the object being pick-and-placed is removed from the voxel grid before the projection. In order to ensure that the placement location is safe, we then run a dilation operation on the bitmap. This artificially increases the size of the occluded regions and other objects. The amount of dilation can be set to reflect the expected error in placement.

The object itself is then voxelized. While it is important that the scale and frame orientation of the object's voxel grid are the same as the scene's voxel grid, the origin can be the object's own. A bounding box is fit around the object; this bounding box is then voxelized and the object fit into it. Again, a bitmap representing the footprint of the object is generated, where a bit in the bitmap represents a column of the object's voxel grid. The columns in the object's voxel grid are again aligned to the z-axis of the voxel grid, which again makes the bitmap's plane the supporting surface under the object. A value of 1 is used when at least one voxel in the column contains the object and zero otherwise. The bounding box ensures that the projected footprint is tightly fit around the object. The tighter the fit on the object, the more possible placement locations can be considered, especially on the borders of the scene's bitmap. An example of this for the object shown in Figure 3.6 (A) is shown in Figure 3.6 (C).

Finally, a convolution of the two bitmaps is computed. The result is not a bitmap but a matrix of integers the size of the scene's footprint. A safe placement location is then any location in the resulting matrix with a value equal to the sum of the object's bitmap. An example of this is shown in Figure 3.6 (D). This can be converted back into x and y coordinates for the object's center; O's current orientation and z value are kept. This fully specifies O's new pose.
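A small sketch of the footprint check follows, using SciPy. `scene_free` and `object_footprint` correspond to the two bitmaps described above; `correlate2d` is used so that output indices map directly to placement offsets (the thesis describes the operation as a convolution), and the toy arrays at the end are purely illustrative.

```python
import numpy as np
from scipy.signal import correlate2d

def placement_locations(scene_free, object_footprint):
    """Return a boolean map of collision-free placements of the object footprint.

    scene_free:       2D array, 1 where the entire voxel column is free (after dilation).
    object_footprint: 2D array, 1 where the object occupies the column.
    A placement is safe exactly where every object cell lands on a free cell, i.e.
    where the sliding sum equals the total number of object cells.
    """
    overlap = correlate2d(scene_free, object_footprint, mode="valid")
    return overlap == object_footprint.sum()

# toy example: a 6x6 region that is free except for one occupied column, and a 2x2 object
scene = np.ones((6, 6), dtype=int)
scene[2, 3] = 0
obj = np.ones((2, 2), dtype=int)
print(placement_locations(scene, obj))   # False wherever the 2x2 footprint covers (2, 3)
```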

3.6 Update State Estimation

The update belief algorithm is shown in Algorithm 3 (see Appendix). An observation in the form of a point cloud P, the current state estimate x̃, the previous observation P_{t-1}, and the current geometric primitives G_list are given to it as input. From these, it relocates all previously known geometric primitives and segments new geometric primitives. Finally, the occluded, free, and object voxels in x̃ are updated. This is done by first filtering P using the same function described in Section 3.3, shown on line 1. In line 2 the function subtract clouds is called on P and P_{t-1}. This function

iterates over all points that have the same location in the image plane and determines whether each point has moved. This is done with the following equation:

$$d_{pc} = \| p - p_{t-1} \|^2 \qquad (3.13)$$

where p ∈ P and p_{t-1} ∈ P_{t-1}. If d_pc > ε_pc, then the point is determined to have moved. A bitmap of the camera frame is generated where 1 is assigned to points that moved and 0 to points that did not move. In order to smooth the image and eliminate any local errors, a dilation and an erosion are performed.

There are two major sources of error that cause Equation 3.13 to give false positives or negatives. The first is sensor error, which can be handled by tuning the value of ε_pc. The second happens when an object moves but some of its points still cover the same image region at distances similar to the original; typically the edges will be labeled as moved but the center of the object will not. This is fixed by the dilation and erosion, which fill in regions that are surrounded by moved points. Depending on the parameters of the dilation and erosion, this can give a more conservative estimate of the scene's motion. Note also that this procedure finds changes, so if an object moved from one spot to another, both the initial spot and the final spot will be labeled as moved in the bitmap.
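The per-pixel test of Equation 3.13 followed by the dilation and erosion can be sketched as below using SciPy's morphology routines; the threshold ε_pc and the structuring-element size are tuning parameters, and the organized (H, W, 3) cloud layout is an assumption.

```python
import numpy as np
from scipy.ndimage import binary_dilation, binary_erosion

def moved_mask(cloud, prev_cloud, eps_pc, closing_size=5):
    """Change-detection bitmap between two organized point clouds (Eq. 3.13).

    cloud, prev_cloud: arrays of shape (H, W, 3) taken from the same camera pose.
    Returns a boolean (H, W) mask that is True where the scene appears to have moved.
    """
    dist_sq = np.sum((cloud - prev_cloud) ** 2, axis=-1)
    moved = dist_sq > eps_pc
    # dilation then erosion fills object interiors whose depth barely changed,
    # while keeping the overall extent of the moved region roughly the same
    structure = np.ones((closing_size, closing_size), dtype=bool)
    moved = binary_dilation(moved, structure=structure)
    moved = binary_erosion(moved, structure=structure)
    return moved
```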

Then, in lines 3-11, we compare each primitive in G_list to the moved cloud, ∆P. If the points the primitive would cover are in ∆P, then the primitive is added to a new list, G_moved. Otherwise the primitive is added to the output list of primitives, G′_list. If the primitive does not project onto the moved bitmap, we can conclude that it has not moved, which is why we add it to the output list. In lines 12-29, we run code similar to Algorithm 2, except that instead of adding the geometric primitives to the output list, we add them to another list, G_new. This list contains the new primitives that have been uncovered and the primitives that moved from the previous scene. Lines 30-42 separate these two cases.

For any geometric primitive in the list G_moved, there are three possibilities: it has been moved, it has become occluded, or it has fallen out of the scene. To determine if a primitive has moved and is still visible, each primitive in G_moved is compared to each primitive in G_new. If their parameters are within ε_φ, we conclude that they are the same primitive, relocated. If a primitive in G_moved is not matched to a new primitive, we check whether it is possible that the primitive has been occluded, using the is occluded function on line 39. If it is possible, we then add that primitive to G′_list. Finally, on line 43, we add the

remaining elements of G_new to G′_list. In line 44 we compute a new state estimate. This is done by adding all new objects and updating all moved objects in the voxel grid. Ray tracing is then used again, except this time no new occluded voxels are added to the grid; voxels can only change from occluded to free.

3.7 Object Singulation Task Formulation

In this section, we formulate a reward function for the task of object singulation. Singulation is the process of taking all objects in a scene and separating them by a minimum distance, ε_s. This can be useful to facilitate other tasks such as modeling, or for organizing a scene. In this problem the reward function is a function of the current state estimate x̃ and the resulting state estimate x̃′ from taking action a. We do not directly reason about what could be in occluded regions when estimating x̃′. First, we define a few useful metrics. The first is the revealed volume:

$$R_v = \sum_{u,v,w \in \tilde{x}} is\_occluded(\tilde{x}, u, v, w) \;-\; \sum_{u,v,w \in \tilde{x}'} is\_occluded(\tilde{x}', u, v, w) \qquad (3.14)$$

where R_v is the revealed volume, and is_occluded(x̃, u, v, w) is a function that returns 1 if the voxel at (u, v, w) is occluded and zero otherwise. The revealed volume is the number of previously unviewed voxels that can now be viewed. This is only an estimate, as fewer voxels will be revealed if a new object is discovered behind the moved object. This metric encourages exploring the scene; the objects that, when moved, reveal the most volume are also the most likely to have other objects behind them. Note that voxels can never be given the label occluded after the first observation, meaning this quantity is always nonnegative, and once all voxels in the scene have been viewed it will always have a value of zero.

We next define the effective object distance. For each object, the distance to the closest object is found. The sum over all objects of the minimum between ε_s and this distance is the effective object distance:

$$d_{eff} = \sum_{O_i \in O_{scene}} \min\!\left( \min_{O_j \in O_{scene} \setminus O_i} d(O_i, O_j),\ \epsilon_s \right) \qquad (3.15)$$

The function d(O_i, O_j) is the surface-to-surface distance between objects O_i and O_j. We then take the minimum of this and the goal distance ε_s. We do not want to reward moving objects ever farther apart, as we only care about d being above ε_s. This means that if all objects are touching, then d_eff = 0, and if all objects are singulated, d_eff = ε_s ||O_scene||. We then define the reward function to be:

$$R(\tilde{x}, \tilde{x}') = d_{eff} + \beta R_v \qquad (3.16)$$

This reward function pushes the objects apart with the first term and encourages revealing volume with the second term. Recall that no unsafe actions are considered, so setting the weighting term β to a large value will encourage revealing occluded regions quickly; as the number of unviewed voxels decreases, this term disappears, and the agent begins to prioritize singulating the objects.

The second part of defining any task is to define a task-complete function. Naively, we could define the task as complete when d_eff = ε_s ||O_scene||, but this does not account for significant parts of the scene the robot does not know about. To handle this, we say that the task is complete when d_eff = ε_s ||O_scene|| and ∑_{u,v,w ∈ x̃} is_occluded(x̃, u, v, w) < V_max unknown. This second condition enforces that the scene state estimate is accurate.
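Putting Equations 3.14-3.16 together, the singulation reward for a candidate next state might be computed as in the sketch below. The state objects are assumed to expose an `occluded_voxel_count()` helper (as in the voxel-grid sketch of Section 3.2), and `pairwise_distance` is a placeholder for the surface-to-surface distance d(O_i, O_j).

```python
def singulation_reward(x_tilde, x_next, objects, eps_s, beta, pairwise_distance):
    """Reward of Eq. 3.16: effective object distance plus weighted revealed volume."""
    # Eq. 3.14: occluded voxels in the current estimate that the action would reveal
    revealed = x_tilde.occluded_voxel_count() - x_next.occluded_voxel_count()

    # Eq. 3.15: per-object nearest-neighbor distance in the next state, capped at eps_s
    d_eff = 0.0
    for i, obj in enumerate(objects):
        others = [pairwise_distance(obj, other)
                  for j, other in enumerate(objects) if j != i]
        nearest = min(others) if others else eps_s   # a lone object counts as singulated
        d_eff += min(nearest, eps_s)

    return d_eff + beta * revealed
```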

3.8 Put Away Groceries Task Formulation

In this task, the robot has been asked to put away groceries that are sitting on the table. Each type of object has a shelf that it must go on. Also, the residents of the home are not all aware of what is being put on the shelves, and want the robot to make as many of the groceries visible as possible so they can quickly find their favorite foods. An example of a scene with groceries on a table and the shelves for the groceries is shown in Figure 3.7.


Figure 3.7. An example scene for the Put Away Groceries task. (A) Counter full of groceries for the robot to put away. (B) Cabinet the robot must place the groceries into.

In order to accomplish this, we first note that there are three types of food items: fruits, food boxes, and food cans. Each shelf is a flat rectangular support surface, and there are three shelves, each set aside for one of the food types. There is a second table, with a rectangular support surface, that starts with all the groceries sitting on it. The robot is not able to see all of this table at the beginning. This reward function will be defined only by the resulting state estimate $\tilde{x}'$ and the current state estimate $\tilde{x}$. First, we define a reward for placing the objects on the correct shelf:

$$R_{shelf} = \sum_{O_i \in O_{scene}} \text{on\_correct\_shelf}(O_i) \qquad (3.17)$$

where $O_{scene}$ is the set of all objects in the scene the robot knows about, and $\text{on\_correct\_shelf}$ returns a value from zero to one indicating what fraction of the object is on the correct shelf. We now need to encourage the robot to find all the objects sitting on the table. We do this by giving a negative reward for all voxels on the table that are occluded:

$$C_{table} = -\sum_{(u,v,w) \in \tilde{x}'} \text{is\_occluded}(\tilde{x}', u, v, w) \qquad (3.18)$$

where $\text{is\_occluded}(\tilde{x}', u, v, w)$ returns 1 if the voxel at $(u, v, w)$ is occluded and zero otherwise. Next, we do not want the residents to struggle to find their tasty treats, so we reward the robot for all free cells over the shelf. We cannot use our current state estimate for this, as it contains information that residents of the home will not have when viewing the scene for the first time. Instead, a new state estimate, $\tilde{x}'_{new}$, is created from only the geometric primitives that are currently visible in the scene:

$$R_{visibility} = -\sum_{(u,v,w)} \text{is\_occluded}(\tilde{x}'_{new}, u, v, w) \qquad (3.19)$$

where $\tilde{x}'_{new}$ is a new belief state. This new belief state is created by generating a new voxel grid and running ray tracing over the shelf. Once a ray has hit an object, all voxels the ray hits afterward that are not part of that object are marked as occluded. This captures objects that are hidden behind other objects. We define the final reward function as follows:

$$R(\tilde{x}, \tilde{x}') = R_{shelf} + \alpha C_{table} + \beta R_{visibility} \qquad (3.20)$$

where $\alpha$ and $\beta$ are weighting terms.
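The visibility term is the least obvious part of this reward to compute; below is a minimal sketch of the single-axis ray trace it describes, assuming the shelf volume is stored as a 3D integer array of object IDs with 0 meaning empty and axis 0 pointing away from the viewer. The grid layout, the viewing direction, and the names are assumptions for illustration only.

```python
import numpy as np

EMPTY = 0  # hypothetical ID for an empty voxel

def count_hidden_voxels(shelf_ids):
    """Count shelf voxels hidden behind the first object hit on each viewing ray.

    shelf_ids is a 3D integer array of object IDs over the shelf volume, with
    axis 0 pointing away from the viewer. Once a ray enters an object, every
    later voxel on that ray that does not belong to that same object is
    counted as occluded from the residents' view.
    """
    shelf_ids = np.asarray(shelf_ids)
    depth, height, width = shelf_ids.shape
    hidden = 0
    for v in range(height):
        for w in range(width):
            first_hit = EMPTY
            for u in range(depth):
                obj = shelf_ids[u, v, w]
                if first_hit == EMPTY:
                    if obj != EMPTY:
                        first_hit = obj
                elif obj != first_hit:
                    hidden += 1
    return hidden

# R_visibility in Eq. (3.19) is then the negative of this count.
```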

Finally, we define the task-complete function. The task is complete when $R_{shelf} = \|O_{scene}\|$ and $\sum_{(u,v,w) \in \tilde{x}} \text{is\_occluded}(\tilde{x}, u, v, w) < V_{max\ unknown}$.

CHAPTER 4

EXPERIMENTS AND RESULTS

This chapter is divided into two sections. Section 4.1 presents real-world results for the geometric primitive segmentation algorithm presented in Section 3.3. Then simulated results are shown in Section 4.2 for the action selection, execution, and planning pipeline presented in Section 3.2.

4.1 Geometric Primitive Segmentation

In this section we aim to determine the effectiveness of the geometric primitive segmentation algorithm shown in Algorithm 2. All data were gathered with an ASUS Xtion Pro camera, which provided RGB-D images that were converted into point clouds. The point clouds were then used as input to Algorithm 2. Using objects from the YCB dataset, several random scenes were generated. Three of these scenes are shown in the first column of Figure 4.1. For each scene, a mask showing the resulting segmentation and the type of geometric primitive fit to each segment are shown in columns 2 and 3 of Figure 4.1.

The results give qualitatively good segmentations; however, there are important issues to highlight. For example, in the second and third scenes, the wooden block has a cylinder fit to at least one of its sides. Similarly, in the first scene, the side of the drill box also has a cylinder fit to it. This error is a recurring theme in many scenes, and it causes irregularities similar to the problems with defining an observation function (Section 3.1). The error in the sensor readings is a function of both the distance to the camera and the surface normal of that point. The greater the angle between the point's surface normal and the camera's principal axis, the greater the error in the position estimate for that point. This phenomenon is referred to as edge distortion, and it is responsible for misrepresenting the side of the drill box in scene 1. The distorted points are noisy enough that a wide-radius cylinder fits them better than a plane.


Figure 4.1. Three examples of segmentations generated using Algorithm 2. The first column is an image of the scene. The second column is the segmentation, where each color is a unique segment in the scene. The third column indicates the type of geometric primitive that was assigned to each segment. Green is a cylinder, red a sphere, and blue a box.

Different from other segmentation methods, geometric primitives provide a 3D model of each segment. Figure 4.2 compares two results of fitting a cylinder to a coffee can. The red cylinder indicates the fit, and it is clear that the fit in (A) is more accurate than the fit in (B). Getting a stable estimate of the parameters for many objects proved to be difficult. Edge distortion makes fitting cylinders and boxes difficult, since the fluctuation from frame to frame changes which primitive type is assigned to the segment. Other methods that fit geometric primitives to scenes often use simpler shapes, such as only a plane [18], or use multiple views [19]. These methods are able to almost completely eliminate the effects of edge distortion before fitting is attempted. The inability to consistently fit primitive parameters makes the methods presented in Section 3.6 impossible to test.
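To make the fitting step concrete, here is a minimal, generic RANSAC sketch for one primitive type (a sphere); it is an illustration under stated assumptions rather than the thesis' implementation, and the iteration count and inlier tolerance are arbitrary placeholders. Cylinders and boxes need analogous minimal-sample solvers and, as discussed above, are the cases most affected by edge distortion.

```python
import numpy as np

def fit_sphere_4pts(pts):
    # Solve |p_i - c|^2 = r^2 for the center c by subtracting the equation
    # for p_0 from the equations for p_1..p_3, which is linear in c.
    A = 2.0 * (pts[1:] - pts[0])
    b = (pts[1:] ** 2).sum(axis=1) - (pts[0] ** 2).sum()
    center = np.linalg.solve(A, b)
    radius = np.linalg.norm(pts[0] - center)
    return center, radius

def ransac_sphere(points, n_iters=500, inlier_tol=0.005, seed=None):
    """Return (center, radius, inlier_mask) of the best sphere hypothesis.

    Assumes at least one sampled minimal set is non-degenerate.
    """
    rng = np.random.default_rng(seed)
    best_mask, best_model = None, None
    for _ in range(n_iters):
        sample = points[rng.choice(len(points), 4, replace=False)]
        try:
            center, radius = fit_sphere_4pts(sample)
        except np.linalg.LinAlgError:  # degenerate (near-coplanar) sample
            continue
        residuals = np.abs(np.linalg.norm(points - center, axis=1) - radius)
        mask = residuals < inlier_tol
        if best_mask is None or mask.sum() > best_mask.sum():
            best_mask, best_model = mask, (center, radius)
    return best_model[0], best_model[1], best_mask
```

Reformulating the residual or constraining the sampled models, as suggested in the future work chapter, would slot into this loop at the residual computation and the minimal-sample solver, respectively.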

4.2 Geometric Primitive Manipulation Planner

In order to test the Geometric Primitive Manipulation Planner (GPMP) without using the state estimation and update methods, a simulation environment was used.


Figure 4.2. Two different fits for the same data: (A) is an ideal fit and (B) is an error.

It was assumed that all objects were composed of a single geometric primitive and that the parameters could be estimated correctly. During action selection, the reward function presented in Section 3.7 was used. The task parameters were set to $\beta = 1$, $V_{max\ unknown} = 5$, and $e_s = 0.05$ m. A voxel grid with a voxel side length of 2 cm was used in the state estimation.

A full-factorial repeated-measures design is used. Three factors are considered: the scene type S, the number of objects in the scene N, and the method used to solve the task M. The scene type is defined by the size of the task-space, and there are two types: large and small. Large scenes have a task-space with dimensions 1 m × 0.5 m × 0.5 m, and small scenes have a task-space with dimensions 0.5 m × 0.5 m × 0.5 m. The two scene types were selected to expose different problems. In large scenes, the robot will have many actions to choose from, due to more placement locations, and there will be less clutter, making it easier to place objects such that they do not occlude each other. Small scenes will have fewer actions to choose from, and revealing volume will be more difficult, as occluded regions containing multiple objects are more likely. Examples of each scene type, with N = 4, are shown in Figure 4.3. The N values considered are 4, 5, and 6. All scenes with N objects contain the same objects; the only difference between scenes with the same N and S values is the poses of the objects. For each combination of N and S, five scenes with random object poses were generated, totaling 30 scenes.


Figure 4.3. Examples of each scene type: (A) is a large scene and (B) is a small scene. The supporting surface is shown in green and the objects in red.

The two method types used here are GPMP and random. GPMP is the method described in Section 3.4. Random is an agent that, given the set of possible actions, uniformly selects one from the set. For each scene, both methods were allowed to solve it 5 times, totaling 150 samples from each method and 300 total samples. Allowing GPMP to run 5 times was purely for the purpose of the statistical study, since with no uncertainty in the state transition function, GPMP is deterministic.

We use a 3-way fixed-effects analysis of variance (ANOVA) model to determine statistical significance, with the response variable being the number of actions to solve the scene, and M, S, and N treated as fixed-effect variables. Significance for the entire analysis was determined at α = 0.05, two-tailed. The mean with 95% confidence interval (CI) for a given value is determined by a one-sample t-test. All analysis was done with MATLAB R2017b.

We designed our experiment to investigate whether GPMP can perform fewer actions than a random agent in a variety of scenes. We do this by increasing the complexity of the scene. A more complex scene will require more actions to solve. We hypothesize that we can make the scene more complex by making the task-space smaller or by increasing the number of objects the robot must deal with, i.e., changing N or S changes the scene complexity.

First, from Table 4.1, we can validate our second hypothesis: by changing N or S, the complexity of the scene changes. This is validated by the 3-way ANOVA test results.

Table 4.1. The 3-way ANOVA test results.

Source    Sum Sq.    D.F.    Mean Sq.    F         Prob>F
M         604.352    1       604.352     189.94    0.0052
N         148.21     2       74.105      23.29     0.0412
S         85.12      1       85.12       26.75     0.0354
M*N       60.157     2       30.079      9.45      0.0957
M*S       74.8       1       74.8        23.51     0.04
N*S       9.931      2       4.966       1.56      0.3905
Error     6.363      2       3.182
Total     988.935    11

We can say that N has statistical significance (p = 0.0412) on the number of actions to solve the scene. This is because more objects inherently mean that more objects must be moved to fully explore the scene. Figure 4.4 shows that as the number of objects in the scene increases, the number of actions required to solve the scene increases for both GPMP and random. Similarly, S has statistical significance (p = 0.0354) on the number of actions to solve the scene. Figure 4.5 shows that large scenes require fewer actions to solve than small scenes. With more space, it is easier to place large objects; these objects typically block large parts of the scene but can only be moved into a small subset of the state space. We next aim to show that GPMP is able to solve scenes with fewer actions than random.
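As a side note on reproducing the analysis: the model above was fit in MATLAB, but an equivalent three-way fixed-effects ANOVA can be set up in Python with pandas and statsmodels roughly as follows; the file name and column names are placeholders, not artifacts of the thesis.

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# One row per trial: number of actions taken plus the factor levels.
# "singulation_trials.csv" and the column names are placeholders.
df = pd.read_csv("singulation_trials.csv")  # columns: actions, M, N, S

# Full-factorial fixed-effects model: main effects and all interactions.
model = ols("actions ~ C(M) * C(N) * C(S)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))
```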

Figure 4.4. The mean and CI for the data as a function of the number of objects in the scene.

Figure 4.5. The mean and CI for the data as a function of the size of the scene.

We can say that M has statistical significance (p = 0.0052) on the number of actions to solve the scene. Figure 4.6 shows that GPMP is able to solve the scene with fewer actions. As the scene becomes more complex, GPMP consistently outperforms random.

The 2-way interactions help us understand how both methods perform under different N and S values. We can say that the 2-way interaction between M and S has statistical significance (p = 0.04) on the number of actions to solve the scene. Figure 4.7 shows that, across scene sizes, GPMP uses fewer actions. Random uses more actions because, in smaller scenes, it has a low probability of moving large objects: there are fewer actions that pick-and-place larger objects. This prevents random from quickly exploring the scene and then singulating the objects. In contrast, GPMP prioritizes moving objects that occlude large regions, making it able to quickly explore the scene and singulate the objects.

If we relax the confidence level to 90% (α = 0.10), we can also say that the 2-way interaction between M and N has statistical significance (p = 0.0957). Figure 4.8 shows that GPMP again performs fewer actions for all values of N. This is due to the increase in the number of possible actions.

Figure 4.6. The mean and CI for the data as a function of the Method.

Figure 4.7. The mean and CI for the data as a function of the Method and the scene size.

Figure 4.8. The mean and CI for the data as a function of the Method and number of objects in the scene.

If only a small subset of actions progress the robot toward completing the singulation, random is less likely to find them as the number of actions in a scene increases. GPMP searches directly for these actions. The CI for GPMP is much smaller than that of random for similar reasons: GPMP is able to estimate the quality of a given action using the reward function.

Finally, we consider the amount of time required to select actions for both GPMP and random. We present the average time per action for each combination of N, S, and M in Table 4.2. The action selection time is significantly larger for GPMP in all cases. This is because GPMP creates a list of all actions, calculates the occlusion in each possible next state, and calculates the effective distance for the objects in those states; in contrast, random only generates the list of actions. The dominant computation for GPMP is the ray tracing required for the occlusion checking. Parallelization would dramatically reduce the run time of the ray tracing and help the algorithm scale better. While the computation time is large for GPMP, the cost of running this computation can be much lower than the cost of running a robot. It is for the designer to decide whether performing more actions is worth the computational savings; for many applications, reducing the number of actions may be superior.

Table 4.2. Simulation results for the time required to select actions. Times are the average time per action.

S       N    Avg. Time/Action Random (s)    Avg. Time/Action GPMP (s)
Large   4    5.1409 ± 1.105                 801.0 ± 40.4066
        5    6.2858 ± 0.2774                921.7 ± 44.2821
        6    11.2992 ± 0.3261               1121.7 ± 97.6934
Small   4    3.5804 ± 0.1543                155.7098 ± 26.9824
        5    4.3417 ± 0.2867                163.1682 ± 19.5738
        6    6.9666 ± 0.2056                166.6454 ± 26.5371

On smaller scenes, Table 4.2 shows that the time for GPMP to select a single action is almost an order of magnitude lower than on large scenes. This is due entirely to the number of possible actions: GPMP computation time grows with the number of possible actions. The number of pick-and-place actions grows with the size of the workspace, so changing only this factor greatly affects the computation time. This indicates that the GPMP algorithm, in its current form, does not scale well to scenes with more actions than those tested.

CHAPTER 5

CONCLUSION

In summary, we presented a framework for general object-based manipulation. Central to this work is the geometric primitive. The geometric primitive contains important information that is useful at many different stages of task-based planning. These underlying geometries are common in many scenes, making segmentation and localization easier. Getting an approximation of the full extent of objects proves useful during motion planning. Finally, the intuitive shapes allow the designer to build actions for the robot that have a high likelihood of success.

While the geometric primitive provides many benefits, it also presents many challenges. A robust method for fitting these shapes is difficult to define. In this work, we showed how RANSAC could be used. While it produced positive segmentation results, the parameters of the primitives were inconsistent.

We also described the Geometric Primitive Manipulation Planner (GPMP), which solves a generic manipulation task using a novel approach to reasoning about unknown parts of the scene. We realized that guessing what could be in the occluded regions of the scene did not change how we interact with the scene. Not tracking or estimating what could be in these portions of the scene reduced the computational complexity of the problem and made solving it significantly easier. We showed positive simulation results for using this type of reasoning on the task of singulation.

Developing algorithms and frameworks that can handle all of the complex problems that come with general manipulation is important. In this work, we made progress on unifying perception and action, agnostic of the specific task.

CHAPTER 6

FUTURE WORK

In future work, it is a high priority to make geometric primitive segmentation more robust. This can be achieved through filtering and smoothing techniques applied to the point cloud. While care must be taken not to disrupt the underlying structure of the scene, filtering approaches could eliminate part or all of the edge distortion. Also, a closer look at RANSAC itself could be useful. Reformulating the inlier functions to better account for the expected error in the scene, a more intelligent sampling strategy, or constraining the geometric primitives could each provide a more consistent fit. Finally, looking at alternative robust regression algorithms could prove useful. These methods would need to provide similar models while still being robust to a high percentage of outliers.

In regard to GPMP, there are many extensions to consider. The first extension would be to develop multistep planning. This would allow GPMP to solve scenes with fewer actions and reduce the likelihood of getting stuck in local minima. Looking into other action types is interesting as well. Adding push, tapping, or stroking actions would provide both new perceptual information and new capabilities for the robot. These types of actions could provide insight into the weight of an object, its material type, or better manipulation strategies.

APPENDIX

ALGORITHMS

Algorithm 1: Geometric Primitive Manipulation Planner

Data: Initial Observation z0, task T
1  G_scene, x̃ ← geometric_primitive_segmentation(P0);
2  while not T.task_complete(b) do
3      z_{t−1} = z;
4      a ← select_action(b, T.reward_function);
5      z ← execute_action(a);
6      x̃, G_scene ← update_state_estimate(G_scene, x̃, z, z_{t−1});
7  end

Algorithm 2: Geometric Primitive Segmentation

Data: Point Cloud P
Result: List of Visible Geometric Primitives G_list, belief state b
1   P ← filter_cloud(P);
2   while ||P|| > minimum_primitive_size do
3       max_inliers ← {};
4       φ_max ← {};
5       for primitive_type ∈ {cylinder, sphere, boxes} do
6           φ, x_g, inliers ← RANSAC(g_i, S);
7           φ, x_g, inliers ← largestConnectedComponent(inliers);
8           if ||inliers|| > ||max_inliers|| then
9               φ_max, max_inliers, x_g_max ← φ, inliers, x_g;
10          end
11      end
12      if ||max_inliers|| > minimum_primitive_size then
13          G_list.push_back((φ_max, x_g_max));
14          P.remove(max_inliers);
15      end
16  end
17  b ← init_grid(G_list);
18  return G_list, b;

Algorithm 3: Update Belief

Data: Known Primitives G_list, Belief State b, Point Cloud P, Previous Point Cloud P_{t−1}
Result: Updated list of Geometric Primitives G'_list, belief state b
1   P ← filter_cloud(P);
2   ΔP ← subtract_clouds(P_{t−1}, P);
3   G'_list ← {}; G_moved ← {}; G_new ← {};
4   for g ∈ G_list do
5       if is_visible(g, ΔP) then
6           G_moved.push_back(g);
7       else
8           G'_list.push_back(g);
9       end
10  end
11  while ||ΔP|| > minimum_primitive_size do
12      max_inliers ← {}; φ_max ← {};
13      for primitive_type ∈ {cylinder, sphere, boxes} do
14          φ, x_g, inliers ← RANSAC(g_i, S);
15          φ, x_g, inliers ← largestConnectedComponent(inliers);
16          if ||inliers|| > ||max_inliers|| then
17              φ_max, max_inliers, x_g_max ← φ, inliers, x_g;
18          end
19      end
20      if ||max_inliers|| > minimum_primitive_size then
21          G_new.push_back((φ_max, x_g_max));
22          P.remove(max_inliers);
23      end
24  end
25  for g_moved ∈ G_moved do
26      found ← False;
27      for g_new ∈ G_new do
28          if g_moved.φ − g_new.φ < e_φ then
29              G'_list.push_back(g_new);
30              found ← True;
31              G_new.remove(g_new);
32          end
33      end
34      if not found and is_occluded(g_moved, b) then
35          G'_list.push_back(g_moved);
36      end
37  end
38  G'_list ← G_new + G'_list;
39  b ← get_new_belief(G'_list);
40  return G'_list, b;

REFERENCES

[1] Y. Chen and G. Medioni, Object modelling by registration of multiple range images, Image and Vision Computing, 10 (1992), pp. 145–155.

[2] C. Choi and H. I. Christensen, RGB-D object pose estimation in unstructured environments, Robotics and Autonomous Systems, 75 (2016), pp. 595–613.

[3] M. R. Dogar, M. C. Koval, A. Tallavajhula, and S. S. Srinivasa, Object search by manipulation, Autonomous Robots, 36 (2014), pp. 153–167.

[4] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell, DeCAF: A deep convolutional activation feature for generic visual recognition, in ICML, vol. 32, 2014, pp. 647–655.

[5] P. F. Felzenszwalb and D. P. Huttenlocher, Efficient graph-based image segmentation, International Journal of Computer Vision, 59 (2004), pp. 167–181.

[6] M. A. Fischler and R. C. Bolles, Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography, in Readings in Computer Vision, Elsevier, 1987, pp. 726–740.

[7] S. Hinterstoisser, S. Holzer, C. Cagniart, S. Ilic, K. Konolige, N. Navab, and V. Lepetit, Multimodal templates for real-time detection of texture-less objects in heavily cluttered scenes, in Computer Vision (ICCV), 2011 IEEE International Conference on, IEEE, 2011, pp. 858–865.

[8] L. P. Kaelbling, M. L. Littman, and A. R. Cassandra, Planning and acting in partially observable stochastic domains, Artificial Intelligence, 101 (1998), pp. 99–134.

[9] D. Katz, A. Orthey, and O. Brock, Interactive perception of articulated objects, in Experimental Robotics, Springer, 2014, pp. 301–315.

[10] D. Katz, A. Venkatraman, M. Kazemi, J. A. Bagnell, and A. Stentz, Perceiving, learning, and exploiting object affordances for autonomous pile manipulation, Autonomous Robots, 37 (2014), pp. 369–382.

[11] M. Krainin, B. Curless, and D. Fox, Autonomous generation of complete 3d object models using next best view manipulation planning, in Robotics and Automation (ICRA), 2011 IEEE International Conference on, IEEE, 2011, pp. 5031–5037.

[12] J. J. Kuffner and S. M. LaValle, RRT-Connect: An efficient approach to single-query path planning, in Robotics and Automation (ICRA), 2000 IEEE International Conference on, vol. 2, IEEE, 2000, pp. 995–1001.

[13] D. G. Lowe, Object recognition from local scale-invariant features, in Computer Vision, 1999. The Proceedings of the Seventh IEEE International Conference on, vol. 2, IEEE, 1999, pp. 1150–1157.

[14] L. Ma, M. Ghafarianzadeh, D. Coleman, N. Correll, and G. Sibley, Simultaneous localization, mapping, and manipulation for unsupervised object discovery, in Robotics and Automation (ICRA), 2015 IEEE International Conference on, IEEE, 2015, pp. 1344–1351.

[15] R. M. Martin and O. Brock, Online interactive perception of articulated objects with multi-level recursive estimation based on task-specific priors, in Intelligent Robots and Systems (IROS 2014), 2014 IEEE/RSJ International Conference on, IEEE, 2014, pp. 2494–2501.

[16] T. Rabbani, F. Van Den Heuvel, and G. Vosselmann, Segmentation of point clouds using smoothness constraint, International Archives of Photogrammetry, Remote Sensing and Spatial Information Sciences, 36 (2006), pp. 248–253.

[17] R. B. Rusu, N. Blodow, and M. Beetz, Fast point feature histograms (FPFH) for 3D registration, in Robotics and Automation, 2009. ICRA'09. IEEE International Conference on, IEEE, 2009, pp. 3212–3217.

[18] R. B. Rusu, Z. C. Marton, N. Blodow, M. Dolha, and M. Beetz, Towards 3d point cloud based object maps for household environments, Robotics and Autonomous Systems, 56 (2008), pp. 927–941.

[19] R. Schnabel, R. Wahl, and R. Klein, Efficient RANSAC for point-cloud shape detection, in Computer Graphics Forum, vol. 26, Wiley Online Library, 2007, pp. 214–226.

[20] H. Van Hoof, O. Kroemer, and J. Peters, Probabilistic segmentation and targeted exploration of objects in cluttered environments, IEEE Transactions on Robotics, 30 (2014), pp. 1198–1209.