HarvestNet: Mining Valuable Training Data from High-Volume Robot Sensory Streams
Sandeep Chinchali, Evgenya Pergament, Manabu Nakanoya, Eyal Cidon, Edward Zhang, Dinesh Bharadia, Marco Pavone, Sachin Katti

Abstract—Today's robotic fleets are increasingly measuring high-volume video and LIDAR sensory streams, which can be mined for valuable training data, such as rare scenes of road construction sites, to steadily improve robotic perception models. However, re-training perception models on growing volumes of rich sensory data in central compute servers (or the "cloud") places an enormous time and cost burden on network transfer, cloud storage, human annotation, and cloud computing resources. Hence, we introduce HARVESTNET, an intelligent sampling algorithm that resides on-board a robot and reduces system bottlenecks by only storing rare, useful events to steadily improve perception models re-trained in the cloud. HARVESTNET significantly improves the accuracy of machine-learning models on our novel dataset of road construction sites, field testing of self-driving cars, and streaming face recognition, while reducing cloud storage, dataset annotation time, and cloud compute time by between 65.7-81.3%. Further, it is between 1.05-2.58x more accurate than baseline algorithms and scalably runs on embedded deep learning hardware.

I. INTRODUCTION

Learning to identify rare events, such as traffic disruptions due to construction or hazardous weather conditions, can significantly improve the safety of robotic perception and decision-making models. While performing their primary tasks, fleets of future robots, ranging from delivery drones to self-driving cars, can passively collect rare training data they happen to observe from their high-volume video and LIDAR sensory streams. As such, we envision that robotic fleets will act as distributed, passive sensors that collect valuable training data to steadily improve the robotic autonomy stack.

Despite the benefits of continually improving robotic models from field data, they come with severe systems costs that are largely under-appreciated in today's robotics literature. Specifically, these systems bottlenecks (shown in Fig. 1) stem from the time or cost of (1) network data transfer to compute servers, (2) limited robot and cloud storage, (3) dataset annotation with human supervision, and (4) cloud computing for training models on massive datasets. Indeed, it is estimated that a single self-driving car will produce upwards of 4 Terabytes (TB) of LIDAR and video data per day [1, 7], which we contrast with our own calculations in Section II.

Fig. 1: The Robot Sensory Sampling Problem. What is the minimal set of "interesting" training samples robots can send to the cloud to periodically improve perception model accuracy while minimizing the system costs of network transfer, cloud storage, human annotation, and cloud computing?

In this paper, we introduce a novel Robot Sensory Sampling Problem (Fig. 1), which asks how a robot can maximally improve model accuracy via cloud re-training, but with minimal end-to-end systems costs. To address this problem, we introduce HARVESTNET, a compute-efficient sampler that scalably runs on even resource-constrained robots and reduces systems bottlenecks by "harvesting" only task-relevant training examples for cloud processing.

In general, the problem of intelligently sampling useful or anomalous training data is of extremely wide scope, since field robotic data contains both known weakpoints of a model as well as unknown weakpoints that have yet to even be discovered. A concrete example of a known weakpoint is shown in Fig. 2, where a fleet operator identifies that construction sites are common errors of a perception model and wishes to collect more such images, perhaps because they are insufficiently represented in a training dataset. In contrast, unknown weakpoints could be rare visual concepts that are not even anticipated by a roboticist.

To provide a focused contribution, this paper addresses the problem of how robotic fleets can source more training data for known weakpoints to maximize model accuracy, but with minimal end-to-end systems costs. Though of tractable scope, this problem is still extremely challenging, since a robot must harvest rare, task-relevant training examples from growing volumes of robotic sensory data. A principal benefit of addressing known model weakpoints is that the cloud can directly use knowledge of task-relevant image classes, which require more training data, to guide and improve a robot's sampling process. Our key technical insight is to keep robot sampling logic simple, but to leverage the power of the cloud to continually re-train models and annotate uncertain images with a human-in-the-loop, thereby using cloud feedback to direct robotic sampling. Moreover, we discuss how insights from HARVESTNET can be extended to collecting training data for unknown weakpoints in future work.
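To make this robot/cloud division of labor concrete, the sketch below shows one possible (hypothetical) realization of the idea: a cheap on-board scoring function flags frames the current model is uncertain about, a bounded cache stores only those frames, and a periodic cloud round-trip returns an updated model and updated sampling parameters. The threshold-band rule, the parameter names, and the functions `score_fn` and `retrain_fn` are illustrative assumptions, not the exact HARVESTNET sampling policy described later in the paper.

```python
# Minimal, illustrative sketch of a simple robot-side sampler directed by
# cloud feedback. This is NOT the exact HARVESTNET policy; the threshold-band
# rule and parameter values are assumptions for illustration only.
from collections import deque

class RobotSampler:
    def __init__(self, score_fn, low=0.3, high=0.8, cache_size=256):
        self.score_fn = score_fn               # cheap on-board scoring function
        self.low, self.high = low, high        # sampling band, tunable by the cloud
        self.cache = deque(maxlen=cache_size)  # bounded on-robot storage

    def observe(self, frame):
        # One inexpensive forward pass per frame; cache only frames the current
        # model is uncertain about for the task-relevant class.
        score = self.score_fn(frame)
        if self.low <= score <= self.high:
            self.cache.append(frame)

    def sync_with_cloud(self, retrain_fn):
        # Periodically ship the small cache to the cloud, which annotates it
        # (human-in-the-loop), re-trains the model, and returns an updated
        # scoring function plus new sampling parameters for future sampling.
        self.score_fn, (self.low, self.high) = retrain_fn(list(self.cache))
        self.cache.clear()

# Toy usage: a dummy score function stands in for the on-board vision DNN.
sampler = RobotSampler(score_fn=lambda frame: 0.5)
sampler.observe("frame_000.jpg")   # cached, since 0.3 <= 0.5 <= 0.8
```

The point of the sketch is the asymmetry it encodes: the per-frame robot logic is a single forward pass and a comparison, while all expensive steps (annotation, re-training, parameter tuning) are delegated to the cloud.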
Literature Review: HARVESTNET is closely aligned with cloud robotics [6, 16, 15], where robots send video or LIDAR to the cloud to query compute-intensive models or databases when they are uncertain. Robots can face significant network congestion while transmitting high-datarate sensory streams to the cloud, leading to work that selectively offloads only uncertain samples [11] or uses low-latency "fog computing" nodes for object manipulation and tele-operation [25, 26]. As opposed to focusing solely on network latency for real-time control, HARVESTNET addresses overall system cost (from network transfer to human annotation time) and instead focuses on the open problem of continual learning from cleverly-sampled batches of field data.

Our work is inspired by traditional active learning [27, 22, 12, 14], novelty detection [18, 10], and few-shot learning [30, 28, 20], but differs significantly since we use novel compute-efficient algorithms, distributed between the robot and cloud, that are scalable to run on even resource-constrained robots. Many active learning algorithms [27, 22, 12] use compute-intensive acquisition functions to decide which limited samples a learner should label to maximize model accuracy, often involving Monte Carlo sampling to estimate uncertainty over a new image. Indeed, the most similar work to ours [14] uses Bayesian Deep Neural Networks (DNNs), which have a distribution over model parameters, to take several stochastic forward passes of a neural network to empirically quantify uncertainty and decide whether an image should be labeled. Such methods are infeasible for resource-constrained robots, which might not have the time to take several forward passes of a DNN for a single image or the capability to efficiently run a distribution over model parameters. Indeed, we show in Section V how our sampler uses a hardware DNN accelerator, specifically the Google Edge Tensor Processing Unit (TPU), to evaluate an on-board vision DNN, with one fixed set of pre-compiled model parameters, in < 16 ms, and decides whether to sample an image in only about 0.3 ms. More broadly, our work differs since we (1) address sensory streams (as opposed to a pool of unlabeled data), (2) use a compute-efficient model to guide sampling, (3) direct sampling with task-relevant cloud feedback, and (4) also address storage, network, and computing costs.

Fig. 2: HARVESTNET Re-Trained Vision Models. From only a minimal set of training examples, HARVESTNET adapts default vision models trained on the standard COCO dataset [17] (top images) to better, domain-specific models (bottom images). We flexibly fine-tune mobile-optimized models, such as MobileNetV2, as well as compute-intensive, but more accurate, models such as Faster R-CNN [19].
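As a rough illustration of the cloud-side re-training step mentioned in the caption above, the snippet below fine-tunes an ImageNet-pre-trained MobileNetV2 backbone (a stand-in for the paper's COCO-pre-trained models) on a small folder of harvested, human-annotated images using TensorFlow/Keras. The directory layout, class count, and hyperparameters are assumptions for illustration, not the paper's actual training recipe.

```python
# Hypothetical sketch of cloud-side fine-tuning on harvested images with
# TensorFlow/Keras. Directory layout, class count, and hyperparameters are
# illustrative assumptions, not the paper's exact recipe.
import tensorflow as tf

IMG_SIZE = (224, 224)
NUM_CLASSES = 2  # e.g., "construction site" vs. "other"

# Harvested, human-annotated images, assumed stored one sub-folder per class.
train_ds = tf.keras.utils.image_dataset_from_directory(
    "harvested_images/", image_size=IMG_SIZE, batch_size=32)

# Start from a mobile-optimized backbone and re-train only a small head.
backbone = tf.keras.applications.MobileNetV2(
    input_shape=IMG_SIZE + (3,), include_top=False, weights="imagenet")
backbone.trainable = False  # keep pre-trained features frozen for a cheap update

model = tf.keras.Sequential([
    tf.keras.Input(shape=IMG_SIZE + (3,)),
    tf.keras.layers.Rescaling(1.0 / 127.5, offset=-1),  # MobileNetV2 preprocessing
    backbone,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(train_ds, epochs=5)          # quick, domain-specific specialization
model.save("specialized_model.keras")  # the updated model is pushed back to the fleet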
Statement of Contributions: In light of prior work, our contributions are four-fold. First, we collect months of novel video footage to quantify the systems bottlenecks of processing growing sensory data. Second, we empirically show that large improvements in perception model accuracy can be realized by automatically sampling task-relevant training examples from vast sensory streams. Third, we design a sampler that intelligently delegates compute-intensive tasks to the cloud, such as adaptively tuning key parameters of robot sampling policies via cloud feedback. Finally, we show that our perception models run efficiently on the state-of-the-art Google Edge TPU [5] for resource-constrained robots.

In the remainder of the paper, Sections IV and V quantify HARVESTNET's accuracy improvements for perception models, key reductions in system bottlenecks, and performance on the Edge TPU [5], and Section VI concludes with future applications.

II. MOTIVATION

We now describe the costs and benefits of continual learning in the cloud by collecting and processing three novel perception datasets.

A. Costs and Benefits of Continual Learning in the Cloud

Benefits: A prime benefit of continual learning in the cloud is to correct weak points of existing models or to specialize them to new domains. As shown in Fig. 2 (right, top), standard, publicly-available vision deep neural networks (DNNs) pre-trained on benchmark datasets like MS-COCO [17] can misclassify traffic cones as humans and also miss the large excavator vehicle, which looks like neither a standard truck nor a car. However, as shown in Fig. 2 (right, bottom), a specialized model re-trained in the cloud using automatically sampled, task-relevant field data can correct such errors. More generally, field robots that "crowd-source" interesting or anomalous training data could automatically update high-definition (HD) road maps.

Costs: The above benefits of cloud re-training, however, often come with understated costs, which we now quantify. Table I estimates monthly systems costs for a small fleet of 10 self-driving cars, each of which is assumed to capture 4 hours of driving data per day. In the first scenario, called "Dashcam-ours", we assume a single car has only one dashcam capturing H.264 compressed video, as in our collected dataset.
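To give a sense of the raw data volumes behind these cost estimates, the back-of-the-envelope calculation below computes monthly dashcam storage for the fleet scenario just described. The assumed H.264 bitrate (about 4 Mbit/s for a single 1080p dashcam) is our own illustrative figure; Table I reports the measured numbers from our dataset.

```python
# Back-of-the-envelope storage estimate for the "Dashcam-ours" fleet scenario.
# The H.264 bitrate is an assumed, illustrative value, not taken from Table I.
CARS = 10
HOURS_PER_DAY = 4
DAYS_PER_MONTH = 30
BITRATE_MBPS = 4  # assumed ~1080p H.264 dashcam bitrate, in megabits per second

seconds = CARS * HOURS_PER_DAY * DAYS_PER_MONTH * 3600
total_gb = seconds * BITRATE_MBPS / 8 / 1000  # megabits -> megabytes -> gigabytes
print(f"~{total_gb:,.0f} GB of raw dashcam video per month for the fleet")
```

Under these assumptions the fleet already produces roughly 2,160 GB (about 2.2 TB) of compressed dashcam video per month, before adding LIDAR or additional cameras, which is what pushes a full sensor suite toward the multi-terabyte-per-day figures cited in the introduction.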