A Category-Level 3-D Object Dataset: Putting the Kinect to Work

Allison Janoch, Sergey Karayev, Yangqing Jia, Jonathan T. Barron, Mario Fritz, Kate Saenko, Trevor Darrell
UC Berkeley and Max Planck Institute for Informatics
{allie, sergeyk, jiayq, barron, saenko, trevor}@eecs.berkeley.edu, [email protected]

Abstract

The recent proliferation of a cheap but high-quality depth sensor, the Kinect, has brought the need for a challenging category-level 3D object detection dataset to the fore. We review current 3D datasets and find them lacking in variation of scenes, categories, instances, and viewpoints. Here we present our dataset of color and depth image pairs, gathered in real domestic and office environments. It currently includes over 50 classes, with more images added continuously by a crowd-sourced collection effort. We establish baseline performance in a PASCAL VOC-style detection task, and suggest two ways that the inferred world size of an object may be used to improve detection. The dataset and annotations can be downloaded at http://www.kinectdata.com.

Figure 1. Two scenes typical of our dataset.

1. Introduction

Recently, there has been a resurgence of interest in available 3-D sensing techniques due to advances in active depth sensing, including techniques based on LIDAR, time-of-flight (Canesta), and projected texture stereo (PR2). The Primesense sensor used in the Microsoft Kinect gaming interface offers a particularly attractive set of capabilities, and is quite likely the most common depth sensor available worldwide due to its rapid market acceptance (8 million Kinects were sold in just the first two months).

While there is a large literature on instance recognition using 3-D scans in the computer vision and robotics literatures, there are surprisingly few existing datasets for category-level 3-D recognition, or for recognition in cluttered indoor scenes, despite the obvious importance of this application to both communities. As reviewed below, published 3-D datasets have been limited to instance tasks, or to a very small number of categories. We have collected, and describe here, the initial bulk of the Berkeley 3-D Object dataset (B3DO), an ongoing collection effort using the Kinect sensor in domestic environments. The dataset already has an order of magnitude more variation than previously published datasets. The latest version of the dataset is available at http://www.kinectdata.com.

As with existing 2-D challenge datasets, our dataset has considerable variation in pose and object size. An important observation our dataset enables is that the actual world size distribution of objects has less variance than the image-projected, apparent size distribution. We report the statistics of these and other quantities for categories in our dataset.

A key question is what value depth data offers for category-level recognition. It is conventional wisdom that ideal 3-D observations provide strong shape cues for recognition, but in practice even the cleanest 3-D scans may reveal less about an object than the available 2-D intensity data. Numerous schemes for defining 3-D features analogous to popular 2-D features for category-level recognition have been proposed and can perform well in uncluttered domains. We evaluate the application of HOG descriptors to 3D data and assess the benefit of such a scheme on our dataset. We also use our observation about the world size distribution to place a size prior on detections, and find that it improves detections as evaluated by average precision and offers a potential benefit for detection efficiency.

2. Related Work

There have been numerous previous efforts in collecting datasets with aligned 2D and 3D observations for object recognition and localization. We review the most pertinent ones, and briefly highlight how our dataset is different. We also give a brief overview of previous work targeting the integration of the 2D appearance and depth modalities.

Figure 2. Illustration of our depth smoothing method.

2.1. 3D Datasets for Detection

We present an overview of previously published datasets that combine 2D and 3D observations, and contrast our dataset with those previous efforts:

RGBD-dataset of [21]: This dataset from Intel Research and UW features 300 objects in 51 categories. The category count refers to nodes in a hierarchy, with, for example, coffee mug having mug as its parent. Each category is represented by 4-6 instances, which are densely photographed on a turntable. For object detection, only 8 short video clips are available, which lend themselves to evaluation of just 4 categories (bowl, cap, coffee mug, and soda can) and 20 instances. There does not appear to be significant viewpoint variation in the detection test set.

UBC Visual Robot Survey [3, 19]: This dataset from UBC provides training data for 4 categories (mug, bottle, bowl, and shoe) and 30 cluttered scenes for testing. Each scene is photographed in a controlled setting from multiple viewpoints.

3D table top object dataset [24]: This dataset from the University of Michigan covers 3 categories (mouse, mug, stapler) and provides 200 test images with cluttered backgrounds. There is no significant viewpoint variation in the test set.

Solutions in Perception Challenge [2]: This dataset from Willow Garage forms the challenge held in conjunction with the International Conference on Robotics and Automation 2011, and is instance-only. It consists of 35 distinct objects, such as branded boxes and household cleaner bottles, that are presented in isolation for training and in 27 scenes for test.

Other datasets: Beyond these, other datasets have been made available which do include simultaneous capture of image and depth, but serve more specialized purposes such as autonomous driving [1], pedestrian detection [10], and driver assistance [25]. Their specialized nature means that they cannot be leveraged for the multi-object category localization task that is our goal.

In contrast to all of these datasets, our dataset contains both a large number of categories and many different instances per category, is photographed “in the wild” instead of in a controlled turntable setting, and has significant variation in lighting and viewpoint throughout the set. For an illustration, consider Figure 4, which presents a sample of examples of the “chair” category in our dataset. These qualities make our dataset more representative of the kind of data that can actually be seen in people’s homes; data that a domestic service robot would be required to deal with and do online training on.

2.2. 3D and 2D/3D Recognition

A comprehensive review of all 3-D features proposed for recognition is beyond the scope of our work. Briefly, notable techniques include spin images [20], 3-D shape context [15], and the recent VFH model [23], but this list is not exhaustive.

A number of 2D/3D hybrid approaches have recently been proposed, and our dataset should be a relevant testbed for these methods. A multi-modal object detector in which 2D and 3D cues are traded off in a logistic classifier is proposed by [16]. Their method leverages additional handcrafted features derived from the 3D observation, such as “height above ground” and “surface normal”, which provide contextual information. [24] shows how to benefit from 3D training data in a voting-based method. Fritz et al. [14] extend branch & bound efficient detection to 3D and add size and support surface constraints derived from the 3D observation.

Most prominently, a set of methods has been proposed for fusing 2D and 3D information for the task of pedestrian detection. The popular HOG detector [8] is extended to disparity-based features by [18]. A late integration approach for combining detectors on the appearance as well as the depth image is proposed by [22] for pedestrian detection. Instead of directly learning on the depth map, [25] uses a depth statistic that learns to enforce height constraints on pedestrians. Finally, [10] explores pedestrian detection using stereo and temporal information in a Hough voting framework that also incorporates scene constraints.

3. Our Dataset

We have compiled a large-scale dataset of images taken in domestic and office settings with the commonly available Kinect sensor. The sensor provides a color and depth image pair, which we process for alignment and inpainting. The data was collected by many members of our research community, as well as Amazon Mechanical Turk (AMT) workers, enabling us to have impressive variety in scene and object appearance.


Figure 3. Object frequency for 39 classes with 20 or more examples.

As such, the dataset is intended for evaluating approaches to category-level object recognition and localization.

The size of the dataset is not fixed and will continue growing with crowd-sourced submissions. The first release of the dataset contains 849 images taken in 75 different scenes. Over 50 different object classes are represented in the crowd-sourced labels. The annotation is done by Amazon Mechanical Turk workers in the form of bounding boxes on the color image, which are automatically transferred to the depth image.

3.1. Data Collection

We use crowd-sourcing on AMT both to label the data we collect and to gather data in addition to what is collected in-house. AMT is a well-known service for “Human Intelligence Tasks” (HITs), which are typically small tasks that are too difficult for current machine intelligence. Our labeling HIT gives workers a list of eight objects to draw bounding boxes around in a color image. Each image is labeled by five workers for each set of labels, in order to provide sufficient evidence to determine the validity of a bounding box. A proposed bounding box is only deemed valid if at least one similarly overlapping bounding box is drawn by another worker. The criterion for similarity of bounding boxes is the PASCAL VOC [11] overlap criterion (described in more detail in section 4.2), with the acceptance threshold set to 0.3. If only two bounding boxes are found to be similar, the larger one is chosen. If more than two are deemed similar, we keep the bounding box with the most overlap with the others, and discard the rest.

After the initial intensive in-house data collection, we are now in a sustained effort to collect Kinect data from AMT workers as well as the local community. This AMT collection task is obviously more difficult than simple labeling, since workers must own a Kinect and be able and willing to set it up properly. Despite this, we were able to collect quality images from AMT workers. With the growing popularity of the Kinect, the pool of potential data contributors keeps getting bigger.
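To make the consensus rule concrete, the following Python sketch shows one possible implementation; the overlap and consensus helpers are our own illustrative names, not part of the released annotation pipeline.

```python
def overlap(a, b):
    """PASCAL VOC-style overlap: intersection over union of boxes (x1, y1, x2, y2)."""
    iw = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def consensus(worker_boxes, thresh=0.3):
    """Keep a worker's box only if another worker drew a similar box (overlap >= thresh).

    With exactly two agreeing boxes the larger one is returned; with more than two,
    the box with the greatest total overlap with the others is returned.
    """
    agreeing = [b for b in worker_boxes
                if any(overlap(b, o) >= thresh for o in worker_boxes if o is not b)]
    if not agreeing:
        return None
    if len(agreeing) == 2:
        return max(agreeing, key=lambda b: (b[2] - b[0]) * (b[3] - b[1]))
    return max(agreeing, key=lambda b: sum(overlap(b, o) for o in agreeing if o is not b))
```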
3.2. The Kinect Sensor

The Microsoft Kinect sensor consists of a horizontal bar with cameras, a structured light projector, an accelerometer, and an array of microphones, mounted on a motorized pivoting foot. Since its release in November 2010, much open source software has been released allowing the use of the Kinect as a depth sensor [7]. Across the horizontal bar are three sensors: two laser depth sensors with a depth range of approximately 0.6 to 6 meters, and one RGB camera (640 x 480 pixels) [4]. Depth reconstruction uses proprietary technology from Primesense, consisting of continuous infrared structured light projection onto the scene.

The Kinect color and IR cameras are a few centimeters apart horizontally, and have different intrinsic and extrinsic camera parameters, necessitating their calibration for proper registration of the depth and color images. We found that the calibration parameters differ significantly from unit to unit, which poses a problem for totally indiscriminate data collection. Fortunately, the calibration procedure is made easy and automatic by the efforts of the open source community [7, 6].

3.3. Smoothing Depth Images

The structured-light method we use for recovering ground-truth depth maps necessarily creates areas of the image that lack an estimate of depth. In particular, glass surfaces and infrared-absorbing surfaces can be missing from the depth data. Tasks such as computing the average depth of a bounding box, or applying a global descriptor to a part of the depth image, therefore benefit from some method for “inpainting” this missing data.

Our view is that proper inpainting of the depth image requires some assumption about the behavior of natural shapes. We assume that objects have second-order smoothness (that curvature is minimized), a classic prior on natural shapes [17, 27]. In short, our algorithm minimizes ||h ∗ Z||_F² + ||hᵀ ∗ Z||_F² subject to the constraints Z_{x,y} = Ẑ_{x,y} for all observed pixels (x, y) in Ẑ, where h = [−1, +2, −1] is an oriented 1D discrete Laplacian filter, ∗ is the convolution operation, and ||·||_F² is the squared Frobenius norm. The solution to this optimization problem is a depth map Z in which all observed pixels of Ẑ are preserved, and all missing pixels have been filled in with values that minimize curvature in a least-squares sense.

Figure 2 illustrates this algorithm operating on a typical input image with missing depth from our dataset to produce the smoothed output.
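A minimal sketch of this optimization, assuming SciPy's sparse least-squares solver, is given below; it illustrates the stated objective but is not the implementation used to produce the released depth maps, and the function name inpaint_depth is ours.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def inpaint_depth(depth, missing):
    """Fill in missing depth by minimizing the second-order (curvature) energy
    ||h * Z||_F^2 + ||h^T * Z||_F^2 while keeping all observed pixels fixed.

    depth:   (H, W) raw depth map; values at `missing` pixels are ignored.
    missing: (H, W) boolean mask, True where no depth was observed.
    """
    H, W = depth.shape
    idx = np.arange(H * W).reshape(H, W)

    rows, cols, vals = [], [], []
    r = 0
    # One [-1, +2, -1] row per horizontal and per vertical pixel triple.
    for di, dj in [(0, 1), (1, 0)]:
        for i in range(di, H - di):
            for j in range(dj, W - dj):
                for k, v in [(-1, -1.0), (0, 2.0), (1, -1.0)]:
                    rows.append(r)
                    cols.append(idx[i + k * di, j + k * dj])
                    vals.append(v)
                r += 1
    A = sp.coo_matrix((vals, (rows, cols)), shape=(r, H * W)).tocsc()

    obs = np.flatnonzero(~missing.ravel())
    mis = np.flatnonzero(missing.ravel())
    # Observed pixels are fixed, so solve the least-squares problem for the rest only.
    b = -A[:, obs].dot(depth.ravel()[obs])
    z_missing = spla.lsqr(A[:, mis], b)[0]

    out = depth.astype(float).ravel()
    out[mis] = z_missing
    return out.reshape(H, W)
```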
3.4. Data Statistics

The dataset described here, which can be used for benchmarking recognition tasks, is the first release of the B3DO dataset. As our collection efforts are ongoing, subsequent releases will include even more variation and larger quantities of data. The distribution of objects in household and office scenes as represented in our dataset is shown in Figure 3. The typical long tail of unconstrained datasets is present, and suggests directions for targeted data collection. At this time, there are 12 classes with more than 70 examples, 27 classes with more than 30 examples, and 39 classes with 20 or more examples.

Unlike other 3D datasets for object recognition, our dataset features large variability in the appearance of object class instances. This can be seen in Figure 4, which presents random examples of the chair class in our dataset; the variation in viewpoint, distance to object, frequent presence of partial occlusion, and diversity of appearance in this sample pose a challenging detection problem.

Figure 4. Instances of the “chair” class in our dataset, demonstrating the diversity of object types, viewpoint, and illumination.

The apparent size of the objects in the image, as measured by the bounding box containing them, can vary significantly across the dataset. Our claim is that the real-world size of objects of the same class varies far less, as can be seen in Figure 5. As a proxy for the real-world object size, we use the product of the diagonal of the bounding box l and the distance to the object from the camera D, which is roughly proportional to the world object size by similar triangles (viewpoint variation slightly scatters this distribution, but less so than it does the bounding box size). We find that the mean smoothed depth is roughly equivalent to the median depth of the raw depth image ignoring missing data, and so use it to measure distance. A Gaussian was found to be a close fit to these size distributions, allowing us to estimate the size likelihood of a bounding box as N(x|µ, σ), where µ and σ are learned on the training data. This will be used in section 4.3.
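For illustration, the size proxy and the per-class Gaussian can be sketched in a few lines of Python; the function names are ours, and the mean-versus-median choice follows the approximation noted above.

```python
import numpy as np

def world_size_proxy(box, smoothed_depth):
    """Proxy for real-world object size: bounding-box diagonal times object distance.

    box: (x1, y1, x2, y2) in integer pixel coordinates;
    smoothed_depth: inpainted depth map in meters (section 3.3).
    """
    x1, y1, x2, y2 = box
    diagonal = np.hypot(x2 - x1, y2 - y1)
    distance = smoothed_depth[y1:y2, x1:x2].mean()  # ~ median of the raw depths
    return diagonal * distance

def fit_size_model(boxes, depth_maps):
    """Fit the per-class Gaussian N(mu, sigma) over the size proxy on training data."""
    sizes = np.array([world_size_proxy(b, d) for b, d in zip(boxes, depth_maps)])
    return sizes.mean(), sizes.std()

def size_log_likelihood(size, mu, sigma):
    """log N(size | mu, sigma), used for pruning and rescoring in section 4.3."""
    return -0.5 * ((size - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2.0 * np.pi))
```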

Figure 5. Statistics of object size. For each object class, the top histogram is inferred world object size, obtained as the product of the bounding box diagonal and the average depth of points in the bounding box. The bottom histogram is the distribution of just the diagonal of the bounding box size.

4. Detection Baselines

The cluttered scenes of our dataset provide a challenging object detection task, where the goal is to localize all objects of interest in an image. We constrain the task to finding eight different object classes: chairs, monitors, cups, bottles, bowls, keyboards, computer mice, and phones. These object classes were among the most well-represented in our dataset.¹

¹We chose not to include a couple of other well-represented classes in this test set because of extreme variation in the annotators’ interpretation of what constitutes an object instance, such as for the classes “table” and “book.”

4.1. Sliding window detector

Our baseline system is based on a standard detection approach of sliding window classifiers operating on a gradient representation of the image [9, 13, 26]. Such detectors are currently the state of the art on cluttered scene datasets with varied viewpoints and instance types, such as the PASCAL VOC challenge [11]. The detector considers windows of a fixed aspect ratio across locations and scales of an image pyramid and evaluates them with a score function, outputting detections that score above some threshold.

Specifically, we follow the implementation of the Deformable Part Model detector [13], which uses the LatentSVM formulation f_β(x) = max_z β · Φ(x, z) for scoring candidate windows, where β is a vector of model parameters and z are latent values (allowing for part deformations). Optimizing the LatentSVM objective function is a semi-convex problem, and so the detector can be trained even though the latent information is absent for negative examples.

Since finding good negative examples to train on is of paramount importance in a large dataset, the system performs rounds of data mining for small samples of hard negatives, providing a provably exact solution to training on the entire dataset.
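As a schematic of the window-scanning step described above (without the parts, latent placements, or hard-negative machinery of [13]), a linear template could be scored over a feature pyramid as follows; this is an illustrative sketch, not the detector we actually use.

```python
import numpy as np

def sliding_window_scores(feature_pyramid, w, b, stride=1):
    """Score every fixed-size window across a feature pyramid with a linear template.

    feature_pyramid: list of (H, W, D) feature maps at successive scales;
    w: (h, w, D) template weights; b: scalar bias.
    Returns (level, row, col, score) tuples; thresholding and NMS happen downstream.
    """
    th, tw = w.shape[:2]
    detections = []
    for level, feats in enumerate(feature_pyramid):
        H, W = feats.shape[:2]
        for i in range(0, H - th + 1, stride):
            for j in range(0, W - tw + 1, stride):
                score = float(np.sum(feats[i:i + th, j:j + tw] * w)) + b
                detections.append((level, i, j, score))
    return detections
```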

To featurize the image, we use a histogram of oriented gradients (HOG) with both contrast-sensitive and contrast-insensitive orientation bins, four different normalization factors, and 8-pixel-wide cells. The descriptor is analytically projected to just 31 dimensions, motivated by the analysis in [13].

We explore two feature channels for the detector. One consists of featurizing the color image, as is standard. For the other, we apply HOG to the depth image (Depth HOG), where the intensity value of a pixel corresponds to the depth of that point in space, measured in meters. This application of a gradient feature to depth images has little theoretical justification, since first-order statistics do not matter as much for depth data (this is why we use second-order smoothing in section 3.3). Yet it is an expected first baseline that also forms the detection approach in some other 3D object detection work, such as [21].
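As a sketch of the Depth HOG channel, one could run a generic HOG implementation (here scikit-image's) directly on the inpainted depth map; this is only illustrative, since it uses a generic descriptor rather than the 31-dimensional features of [13].

```python
import numpy as np
from skimage.feature import hog  # generic HOG, used here purely for illustration

def depth_hog(depth_meters):
    """Compute a HOG descriptor directly on a depth image ("Depth HOG").

    depth_meters: (H, W) depth map in meters; missing values are assumed to be
    NaN or to have already been filled by the inpainting of section 3.3.
    """
    filled = np.nan_to_num(depth_meters)  # crude fallback: missing depth -> 0
    return hog(filled, orientations=9, pixels_per_cell=(8, 8), cells_per_block=(2, 2))
```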

Detections are further pruned by non-maximum suppression, which greedily takes the highest-scoring bounding boxes and rejects boxes that sufficiently overlap with an already selected detection. This procedure reduces the number of detections by roughly an order of magnitude, and is important for our evaluation metric, which penalizes repeat detections.

4.2. Evaluation

Evaluation of detection is done in the widely adopted style of the PASCAL detection challenge: a detection is considered correct if area(B ∩ G) / area(B ∪ G) > 0.5, where B is the bounding box of the detection and G is a ground truth bounding box of the same class. Only one detection can be considered correct for a given ground truth box, with the rest considered false positives. Detection performance is represented by precision-recall (PR) curves, and summarized by the area under the curve, the average precision (AP). We evaluate on six different splits of the dataset, averaging the AP numbers across splits.

Our goal is category-level, not instance-level, recognition. As such, it is important to keep instances of a category confined to either the training or the test set. This makes the recognition task much harder than if we were allowed to train on the same instances of a category as exist in the test set (but not necessarily the same views of them). We enforce this constraint by ensuring that images from the same scene or room are never in both sets. This is a stronger constraint than needed, and is not necessarily perfect (for example, many different offices might contain the same model of laptop), but as there is no realistic way to provide per-instance labeling of a large, crowd-sourced dataset of cluttered scenes, we settle for this method and keep the problem open for further research.

Figure 6 shows the detector performance on the eight classes. We note that Depth HOG is never better than HOG on the 2D image. We attribute this to the inappropriateness of a gradient feature on depth data, as mentioned earlier, and to the fact that, due to the limitations of the infrared structured-light depth reconstruction, some objects tend to be missing depth data.
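For reference, the matching and scoring protocol can be summarized in the sketch below; it follows the rule described above (detections matched in score order, at most one match per ground truth box) but computes the raw, uninterpolated area under the PR curve, which may differ slightly from the official VOC tooling.

```python
import numpy as np

def iou(a, b):
    """Intersection over union of two boxes (x1, y1, x2, y2)."""
    iw = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def average_precision(detections, ground_truth, iou_thresh=0.5):
    """PASCAL-style AP for one class.

    detections:   list of (image_id, box, score) tuples;
    ground_truth: dict image_id -> list of boxes.
    """
    detections = sorted(detections, key=lambda d: -d[2])
    matched = {img: [False] * len(b) for img, b in ground_truth.items()}
    tp = np.zeros(len(detections))
    for i, (img, box, _) in enumerate(detections):
        gts = ground_truth.get(img, [])
        ious = [iou(box, g) for g in gts]
        if ious and max(ious) > iou_thresh:
            j = int(np.argmax(ious))
            if not matched[img][j]:          # repeat detections count as false positives
                tp[i], matched[img][j] = 1, True
    npos = sum(len(b) for b in ground_truth.values())
    recall = np.cumsum(tp) / max(npos, 1)
    precision = np.cumsum(tp) / np.arange(1, len(tp) + 1)
    return float(np.trapz(precision, recall))  # area under the raw PR curve
```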


Figure 6. Performance of the baseline detector on our dataset, as measured by the average precision. Depth HOG fails completely on some categories, for reasons explained in the text.

4.3. Pruning and rescoring by size

In section 3.4, we made the observation that true object size, even as approximated by the product of the object's image projection and the median depth of its bounding box, varies less than bounding box size. We therefore investigate two ways of using approximated object size as an additional source of discriminative signal for the detector.

Our first use of size information consists of pruning candidate detections that are sufficiently unlikely given the size distribution of that object class. The object size distribution is modeled with a Gaussian, which we found to be a close fit to the underlying distribution; the Gaussian parameters are estimated on the training data only. We prune boxes that are more than σ = 3 standard deviations away from the mean of the distribution.

Figure 7 shows that pruning provides a boost in detection performance, while rejecting from 12% to 68% of the suggested detection boxes (on average across the classes, 32% of candidate detections are rejected). This observation can be leveraged as part of an “objectness” filter or as a thresholding step in a cascaded implementation of this detector for a detection speed gain [5, 12]. Chair and mouse are the two classes most helped by size pruning, while monitor and bottle are the least helped.

Using the bounding box size of the detection (as measured by its diagonal) instead of inferred world size results in no improvement to AP performance on average. The two classes most hurt by bounding box size pruning are bowl and plate; the two least hurt are bottle and mouse.

The second way we use size information consists of learning a rescoring function for detections, given their SVM score and size likelihood. We learn a simple combination of the two values:

s(x) = exp(α log(w(x)) + (1 − α) log(N(x|µ, σ)))    (1)

where w(x) = 1/(1 + exp(−2f_β(x))) is the normalized SVM score, N(x|µ, σ) is the likelihood of the inferred world size of the detection under the size distribution of the object class, and α is a parameter learned on the training set. This corresponds to an unnormalized Naive Bayes combination of the SVM model likelihood and the object size likelihood. Since what matters for the precision-recall evaluation is the ordering of confidences, and whether they are normalized is irrelevant, we can evaluate s(x) directly.

As Figure 7 demonstrates, the rescoring method works better than pruning. It is able to boost recall as well as precision by assigning a higher score to likely detections in addition to lowering the score of (in effect, pruning) unlikely detections.

Figure 7. Left: Effect on the performance of our detector of the two uses of object size we consider. Right: Average percentage of past-threshold detections pruned by considering the size of the object. The light gray rectangle reaching to 32% is the average across classes. In both cases, error bars show standard deviation across six different splits of the data.
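Both uses of size information are straightforward to express; the sketch below (with our own function names) shows the σ = 3 pruning rule and the rescoring of Equation (1).

```python
import numpy as np

def prune_by_size(detections, sizes, mu, sigma, k=3.0):
    """Discard detections whose inferred world size is more than k standard
    deviations from the class mean (we use k = 3)."""
    keep = np.abs(np.asarray(sizes) - mu) <= k * sigma
    return [d for d, ok in zip(detections, keep) if ok]

def rescore(svm_score, size, mu, sigma, alpha):
    """Equation (1): combine the normalized SVM score with the size likelihood."""
    w = 1.0 / (1.0 + np.exp(-2.0 * svm_score))                       # w(x)
    log_n = -0.5 * ((size - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2.0 * np.pi))
    return np.exp(alpha * np.log(w) + (1.0 - alpha) * log_n)         # s(x)
```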

5. Discussion

We presented a novel paradigm for crowd-sourced data collection that leverages the success of the Kinect depth sensor. Its popularity has been encouraging, and we think it is time to “put the Kinect to work” for computer vision. The main contribution of this paper is a novel category-level object dataset, which presents a challenging task and is far beyond existing 3-D datasets in terms of the number of object categories, the number of examples per category, and intra-category variation. Importantly, the dataset poses the problem of object detection “in the wild”, in real rooms in people’s homes and offices, and therefore has many practical applications.

References

[1] Ford campus vision and lidar dataset. http://robots.engin.umich.edu/Downloads.
[2] Solutions in Perception Challenge. http://opencv.willowgarage.com/wiki/SolutionsInPerceptionChallenge.
[3] UBC Robot Vision Survey. http://www.cs.ubc.ca/labs/lci/vrs/index.html.
[4] Introducing Kinect for Xbox 360. http://www.xbox.com/en-US/Kinect/, 2011.
[5] B. Alexe, T. Deselaers, and V. Ferrari. What is an object? In CVPR, 2010.
[6] G. Bradski. The OpenCV Library. Dr. Dobb's Journal of Software Tools, 2000.
[7] N. Burrus. Kinect RGB Demo v0.4.0. http://nicolas.burrus.name/index.php/Research/KinectRgbDemoV4?from=Research.KinectRgbDemoV2, Feb. 2011.
[8] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In Proceedings of Computer Vision and Pattern Recognition (CVPR), 2005.
[9] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, 2005.

[10] A. Ess, K. Schindler, B. Leibe, and L. Van Gool. Object detection and tracking for autonomous navigation in dynamic environments. International Journal on Robotics Research, 2010.
[11] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2010 (VOC2010) Results. http://www.pascal-network.org/challenges/VOC/voc2010/workshop/index.html.
[12] P. F. Felzenszwalb, R. B. Girshick, and D. McAllester. Cascade object detection with deformable part models. In CVPR, 2010.
[13] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part based models. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 2009.
[14] M. Fritz, K. Saenko, and T. Darrell. Size matters: Metric visual search constraints from monocular metadata. In Advances in Neural Information Processing Systems 23, 2010.
[15] A. Frome, D. Huber, R. Kolluri, T. Bulow, and J. Malik. Recognizing objects in range data using regional point descriptors. In ECCV, volume III, pages 224–237, 2004.
[16] S. Gould, P. Baumstarck, M. Quigley, A. Y. Ng, and D. Koller. Integrating visual and range data for robotic object detection. In ECCV Workshop on Multi-camera and Multi-modal Sensor Fusion Algorithms and Applications (M2SFA2), 2008.
[17] W. Grimson. From Images to Surfaces: A Computational Study of the Human Early Visual System. MIT Press, 1981.
[18] H. Hattori, A. Seki, M. Nishiyama, and T. Watanabe. Stereo-based pedestrian detection using multiple patterns. In Proceedings of the British Machine Vision Conference, 2009.
[19] S. Helmer, D. Meger, M. Muja, J. J. Little, and D. G. Lowe. Multiple viewpoint recognition and localization. In Proceedings of the Asian Conference on Computer Vision, 2010.
[20] A. Johnson and M. Hebert. Using spin images for efficient object recognition in cluttered 3D scenes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(5):433–449, May 1999.
[21] K. Lai, L. Bo, X. Ren, and D. Fox. A large-scale hierarchical multi-view RGB-D object dataset. In ICRA, 2011.
[22] M. Rohrbach, M. Enzweiler, and D. M. Gavrila. High-level fusion of depth and intensity for pedestrian classification. In Annual Symposium of the German Association for Pattern Recognition (DAGM), 2009.
[23] R. B. Rusu, G. Bradski, R. Thibaux, and J. Hsu. Fast 3D recognition and pose using the viewpoint feature histogram. In Proceedings of the 23rd IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Taipei, Taiwan, October 2010.
[24] M. Sun, G. Bradski, B.-X. Xu, and S. Savarese. Depth-encoded hough voting for joint object detection and shape recovery. In ECCV, 2010.
[25] S. Walk, K. Schindler, and B. Schiele. Disparity statistics for pedestrian detection: Combining appearance, motion and stereo. In ECCV, 2010.
[26] X. Wang, T. X. Han, and S. Yan. An HOG-LBP human detector with partial occlusion handling. In ICCV, 2009.
[27] O. Woodford, P. Torr, I. Reid, and A. Fitzgibbon. Global stereo reconstruction under second-order smoothness priors. PAMI, 2009.