
Block Annotation: Better Image Annotation with Sub-Image Decomposition

Hubert Lin, Paul Upchurch, Kavita Bala
Cornell University

Abstract

Image datasets with high-quality pixel-level annotations are valuable for semantic segmentation: labelling every pixel in an image ensures that rare classes and small objects are annotated. However, full-image annotations are expensive, with experts spending up to 90 minutes per image. We propose block sub-image annotation as a replacement for full-image annotation. Despite the attention cost of frequent task switching, we find that block annotations can be crowdsourced at higher quality compared to full-image annotation with equal monetary cost using existing annotation tools developed for full-image annotation. Surprisingly, we find that 50% of pixels annotated with blocks allows semantic segmentation to achieve performance equivalent to 100% of pixels annotated. Furthermore, as little as 12% of pixels annotated allows performance as high as 98% of the performance with dense annotation. In weakly-supervised settings, block annotation outperforms existing methods by 3-4% (absolute) given equivalent annotation time. To recover the global structure necessary for applications such as characterizing spatial context and affordance relationships, we propose an effective method to inpaint block-annotated images with high-quality labels without additional human effort. As such, fewer annotations can also be used for these applications compared to full-image annotation.

1. Introduction

Recent large-scale datasets place a heavy emphasis on high-quality fully dense annotations (in which over 90% of the pixels are labelled) for hundreds of thousands of images. Dense annotations are valuable for both semantic segmentation and applications beyond segmentation such as characterizing spatial context and affordance relationships [11, 23]. The long-tail distribution of classes means it is difficult to gather annotations for rare classes, especially if these classes are difficult to segment. Annotating every pixel in an image ensures that pixels corresponding to rare classes or small objects are labelled. Dense annotations also capture pixels that form the boundary between classes. For applications such as understanding spatial context between classes or affordance relationships, dense annotations are required for principled conclusions to be drawn. In the past, polygon annotation tools have enabled partially dense annotations (in which small semantic regions are densely annotated) to be crowdsourced at scale with public crowd workers. These tools paved the way for the cost-effective creation of large-scale partially dense datasets such as [8, 37]. Despite the success of these annotation tools, fully dense datasets have relied extensively on expensive expert annotators [60, 14, 41, 64, 42] and private crowdworkers [11].

We propose annotation of small blocks of pixels as a stand-in replacement for full-image annotation (figure 1). We find that these annotations can be effectively gathered by crowdworkers, and that annotation of a sparse number of blocks per image can train a high-performance segmentation network. We further show these sparsely annotated images can be extended automatically to full-image annotations. We show block annotation has:

• Wide applicability. (Section 3) Block annotations can be effectively crowdsourced at higher quality compared to full annotation. It is easy to implement and works with existing advances in image annotation.

• Cost-efficient Design. (Section 3) Block annotation reflects a cost-efficient design paradigm (while current research focuses on reducing annotation time). This is reminiscent of gamification and citizen science where enjoyable tasks lead to low-cost high-engagement work.

• Complex Region Annotation. (Section 3) Block annotation shifts focus from categorical regions to spatial regions. When annotating categorical regions, workers segment simple objects before complex objects. With spatial regions, informative complex regions are forced to be annotated.

• Weakly-Supervised Performance. (Section 4) Block annotation is competitive in weakly-supervised settings, outperforming existing methods by 3-4% (absolute) given equivalent annotation time.

• Scalable Performance. (Section 4) Full-supervision performance is achieved by annotating 50% of blocks per image. Thus, blocks can be annotated until desired performance is achieved, in contrast to methods such as scribbles.

• Scalable Structure. (Section 5) Block-annotated images can be effectively inpainted with high-quality labels without additional human effort.

Figure 1: (a) Sub-image block annotations are more effective to gather than full-image annotations. (b) Training on sparse block annotations enables semantic segmentation performance equivalent to full-image annotations. (c) Block labels can be inpainted with high-quality labels.

2. Related Work

In this section we review recent works on pixel-level annotation in three areas: human annotation, human-machine annotation, and dense segmentation with weak supervision.

Human Annotation. Manual labeling of every pixel is impractical for large-scale datasets. A successful method is to have crowdsource workers segment polygonal regions by clicking on boundaries. Employing crowdsource workers offers its own set of challenges with quality control and task design [56, 8, 55]. Although large-scale public crowdsourcing can be successful [37], recent benchmark datasets have resorted to in-house expert annotators [14, 43]. Annotation time can be reduced through improvements such as autopan, zoom [8] and shared polygon boundaries [60]. Polygon segmentation can be augmented by painted labels on superpixel groups [11] and Bezier curves [60]. Pixel-level labels for images can also be obtained by (1) constructing a 3D scene from an image collection, (2) grouping and labeling 3D shapes and (3) propagating shape labels to image pixels [39]. In our work, we investigate sub-image polygon annotation, which can be further combined with other methods (sec. 3).

Human-Machine Annotation. Complex boundaries are time-consuming to trace manually. In these cases the cost of pixel-level annotation can be reduced by automating a portion of the task. Matting and object selection [50, 33, 34, 6, 58, 57, 10, 30, 59] generate tight boundaries from loosely annotated boundaries or few inside/outside clicks and scribbles. [44, 38] introduced a predictive method which automatically infers a foreground mask from 4 boundary clicks, and was extended to full-image segmentation in [2]. The number of boundary clicks was further reduced to as few as one by [1]. Predictive methods require an additional human verification step since the machine mutates the original human annotation. The additional step can be avoided with an online method. However, online methods (e.g., [1, 30, 2]) have higher requirements since the algorithm must be translated into the web browser setting and the worker's machine must be powerful enough to run the algorithm.¹ Alternatively, automatic proposals can be generated for humans to manipulate: [5] generates segments, [4] generates a set of matting layers, [61] generates superpixel labels, and [47] generates boundary fragments. In our work, we show that human-annotated blocks can be extended automatically into dense annotations (sec. 5), and we discuss how other human-machine methods can be used with blocks (sec. 3.6).

¹ Offloading online methods onto a cloud service offers a different landscape of higher costs (upfront development and ongoing operation costs).

Weakly-Supervised Dense Segmentation. There are alternatives to training with high-quality densely annotated images which substitute quantity for label quality and/or richness. Previous works have used low-quality pixel-level annotations [65], bounding boxes [45, 28, 49], point-clicks [7], scribbles [7, 36], image-level class labels [45, 53, 3], image-level text descriptions [24] and unlabeled related web videos [24] to train semantic segmentation networks. Combining weak annotations with small amounts of high-quality dense annotation is another strategy for reducing cost [9, 26]. [52] proposes a two-stage approach where image-level class labels are automatically converted into pixel-level masks which are used to train a semantic segmentation network. We find that a small number of sub-image block annotations is a competitive form of weak supervision (sec. 4.3).

3. Block Annotation

Sub-image block annotation is composed of three stages: (1) given an image I, select a small spatial region I′; (2) annotate I′ with pixel-level labels; (3) repeat (with a different I′) until I is sufficiently annotated. In this paper, we explore the case where I′ is rectangular, and focus on the use of existing pixel-level annotation tools.

Can block annotations be gathered as effectively as full-image annotations with existing tools? In section 3.1, we show our annotation interface. In section 3.2, we explore the quality of block annotation versus full-image annotation. In section 3.3, we examine block annotation for a real-world dataset. In section 3.4, we discuss the cost of block annotation and show worker feedback. In section 3.5, we discuss how blocks for annotation can be selected in practice. Finally, in section 3.6 we discuss the compatibility of block annotation with existing annotation methods.

3.1. Annotation Interface

Our block annotation interface is shown in figure 2 and implemented with existing tools [8]. For full-image annotation, the highlighted block covers the entire image. Studies are deployed on Amazon Mechanical Turk.

Figure 2: Block Annotation UI. Annotators are given one highlighted block to annotate with the remainder of the image as context. (a) Highlighted block. (b) Finished block annotation.

Table 1: Block vs Full Annotation. Average statistics per image.

                          Block         Full
  Error                   0.253         0.286
  Error (small regions)   0.636         0.677
  $ / hr                  $1.40 / hr    $3.12 / hr
  Total cost              $2.00         $2.05
  Total cost (median)     $1.99         $2.23
  # segments              95.68         38.95
  $ / segment             $0.0215       $0.0595
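To make stage (1) of the block annotation pipeline concrete, the sketch below splits an image into a uniform grid of rectangular blocks and emits one annotation task per block. This is an illustrative sketch, not the released annotation backend: the 10-by-10 grid default mirrors the grid used later in section 4.1, and the AnnotationTask container and function names are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class AnnotationTask:
    """One crowdsourcing task: annotate a single rectangular block.
    (Hypothetical container; the real tasks are HITs built on the tools of [8].)"""
    image_id: str
    block_index: int
    box: Tuple[int, int, int, int]  # (row0, col0, row1, col1) in pixel coordinates

def grid_blocks(height: int, width: int, rows: int = 10, cols: int = 10):
    """Split an image into a rows x cols grid of rectangular blocks."""
    for r in range(rows):
        for c in range(cols):
            r0, r1 = r * height // rows, (r + 1) * height // rows
            c0, c1 = c * width // cols, (c + 1) * width // cols
            yield r * cols + c, (r0, c0, r1, c1)

def make_tasks(image_id: str, height: int, width: int) -> List[AnnotationTask]:
    """Stage (1) of block annotation: one task per block of a 10x10 grid."""
    return [AnnotationTask(image_id, idx, box)
            for idx, box in grid_blocks(height, width)]

# Example: a 1024x2048 Cityscapes-sized image yields 100 block tasks.
tasks = make_tasks("frankfurt_000000", height=1024, width=2048)
assert len(tasks) == 100
```

Only a subset of these tasks needs to be issued per image; block selection strategies are discussed in sections 3.5 and 4.1.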

Figure 3: Annotation error rate for block and full annotation. Each point represents one image. The same set of images are both block annotated and full-image annotated. The stars represent the centroid (median). Cost/time include estimated cost/time to assign labels for each segment [8]. Lower-left is better. With block annotation, workers (a) choose to work for lower wages and (b) segment more regions for less pay per region. The overall quality is higher for block annotation.

Figure 4: SUNCG/CGIntrinsics annotation. (a) Ground truth. (b) Block annotation (zoomed-in). (c) Full annotation (zoomed-in). White dotted box highlights an example where block annotation qualitatively outperforms full annotation. More in supplemental.

3.2. Quality of Block Annotation

We explore the quality of block annotations compared to full-image annotations on a synthetic dataset. How do the quality and cost compare between block and full annotations? We find that the average quality for block-annotated images is higher while the total monetary cost is about the same. The average quality of block annotations is consistently higher, including for small regions (e.g., figure 4). The overall block annotation error is 12% lower than full annotation. For regions smaller than 0.5% of the image, the block annotation error is 6% lower. In figure 3, the cost and quality of block versus full-image annotation is shown. Remarkably, we find that workers are willing to work on block annotation tasks for a significantly lower hourly wage. This indicates that block annotation is more intrinsically palatable for crowdworkers, in line with [27] which shows task design can influence quality of work. Moreover, workers are more likely to over-segment objects with respect to ground truth (e.g., individual cushions on a couch, handles on cabinets) with block annotation tasks. We find that block boundaries may also divide semantic regions. Table 1 contains additional statistics. Despite similar costs to annotate an image in blocks or in full, we show in section 4 that competitive performance is achieved with less than half of the blocks annotated per image.

Study Details. For these experiments, we chose to use a synthetic dataset. While human annotations may contain mistakes, synthetic datasets are generated with known ground truth labels with which annotation error can be computed. The CGIntrinsics dataset [35] contains physically-based renderings of indoor scenes from the SUNCG dataset [54, 63]. We use the more realistic CGIntrinsics renderings and the known semantic labels from SUNCG. The labels are categorized according to the NYU40 [20] semantic categories. Due to the nature of indoor scenes, the depth and field of view of each image is smaller than in outdoor datasets. The reduced complexity means that crowdworkers are able to produce good full-image annotations for this dataset.

We select MTurk workers who are skilled at both full-image annotation and block annotation in a pilot study (a standard quality control practice [8]). The final pool consists of 10 workers. Image difficulty is estimated by counting the maximum number of ground truth segments in a fixed-size sliding window. Windows, mirrors, and void regions are masked out in the images so that workers do not expend effort on visible content for which ground truth labels do not exist (such as objects seen through a window or mirror). We manually cull images that include transparent glass tables which are not visible in the renderings, or doorways through which visible content can be seen but no ground truth labels exist. After filtering, twenty of the one hundred most difficult images are selected. We choose a block size so that an average of 3.5 segments are in each block. This results in 16 blocks per image. For each task, a highlighted rectangle outlines the block to be annotated. Note that workers will annotate up to the inner edge of the highlighted boundary. Therefore, we ensure the edges of the rectangle do not overlap with the region to be annotated.²

² "$" refers to USD throughout this paper.

Workers are paid $0.06 per block annotation task and $0.96 per full-image task. Bonuses up to 1.5 times the base pay are awarded to attempt to raise the effective hourly wage for difficult tasks to $4 / hr. Our results show that workers are willing to work on block annotation tasks beyond the time threshold for bonuses, effectively producing work for an hourly wage significantly lower than the intended $4 / hr. On the other hand, workers do not often exhibit this behavior with full annotation tasks. Different workers may work on different blocks belonging to the same image. We use two forms of quality control: (1) annotations must contain a number of segments greater than 25% of the known number of ground-truth segments for that task, and (2) annotations cannot be submitted until at least 10 seconds / 3 minutes (block / full) have passed. All submissions satisfying these conditions are accepted during the user study. For an overview of QA methods, please refer to [48]. Labels are assigned by majority ground-truth voting, with cost estimated from [8].

To evaluate the quality of annotations in an image with K classes, we measure the class-balanced error rate (class-balanced Jaccard distance):

    error rate = \frac{1}{K} \sum_{c=1}^{K} \frac{FP_c + FN_c}{TP_c + FP_c + FN_c} = 1 - \mathrm{mIOU}    (1)
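The sketch below is a direct reading of equation (1), assuming integer label maps and a void label of 255 for pixels excluded from evaluation (both the void convention and the NumPy dependency are assumptions of this example, not details given above):

```python
import numpy as np

def class_balanced_error_rate(pred: np.ndarray, gt: np.ndarray,
                              num_classes: int, void_label: int = 255) -> float:
    """Equation (1): mean per-class Jaccard distance, i.e. 1 - mIOU.

    pred, gt: integer label maps of identical shape; pixels equal to
    void_label in gt are excluded from the evaluation.
    """
    valid = gt != void_label
    pred, gt = pred[valid], gt[valid]
    distances = []
    for c in range(num_classes):
        tp = np.sum((pred == c) & (gt == c))
        fp = np.sum((pred == c) & (gt != c))
        fn = np.sum((pred != c) & (gt == c))
        denom = tp + fp + fn
        if denom == 0:          # class absent from both maps: skip it
            continue
        distances.append((fp + fn) / denom)
    return float(np.mean(distances))  # 1 - mIOU over the classes present
```

Averaging only over classes present in the image (rather than all K) is one common convention; the appropriate choice depends on how absent classes should be scored.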
3.3. Viability of Real-World Block Annotation

How does block annotation fare with a real-world, non-synthetic dataset? To study the viability of block annotating real-world datasets with scalable crowdsourcing, we ask crowdworkers to annotate blocks from images in Cityscapes [14]. We choose Cityscapes for the annotation complexity of its scenes – 1.5 hours of expert annotation effort is required per image. In contrast, other datasets such as [41, 11] require less than 20 minutes of annotation effort per image. We expect crowd work to be worse than expert work, so it is a surprisingly positive result that the quality of the crowdsourced segments is visually comparable to the expert Cityscapes segments (figure 5). Some crowdsourced segments are very high quality. We find that 47% of blocks have more crowd segments than expert segments (20% have fewer segments, and 33% have the same number of segments). A summary of the cost is given in table 2, which compares public crowdworkers to trained experts. It is feasible that block annotation time will decrease with expert training. Given 100 uniformly sized blocks per image, we ask an expert to create equal-quality block and full annotations; we find one block is 1.56% of the effort of a full image.

Table 2: Real-world cost of annotation. Cost evaluated on Cityscapes. Each block is annotated by MTurk workers. Full-image is annotated by experts in [14]. Note: [14] annotates instance segments. See table 1 for a crowd-to-crowd comparison.

                Block (Crowd)   Full (Expert [14, 42])
  $ / Task      $0.13           -
  Time / Task   2 min           1.5 hr

Figure 5: Crowdsourced vs expert segments. Crowdsourced block-annotated segments are compared to expert Cityscapes segments. Crowdsourced segments are colored for easier comparison. Top-left is a high-quality example. See supplemental for more.

Study Details. We searched for workers who produce high-quality work in a pilot study and found a set of 7 workers. These workers were found within a hundred pilot HITs (for a total cost of $4). We approved all of their submissions during the user study. We do not restrict workers from annotating outside of the block, and we do not force workers to densely annotate the block. We do not include the use of sentinels or tutorials as in [8]. Thirteen randomly selected validation images from Cityscapes are annotated by crowdworkers. Each image is divided into 100 uniformly shaped blocks. A total of 650 blocks (50 per image) are annotated in random order. Workers are paid $0.06 per task. Workers are automatically awarded bonuses so that the effective hourly wage is at least $5 for each block, with bonuses capped at $0.24 to prevent abuse. For one block, the total base payout is $0.06 with an average of $0.0636 in bonuses over 93 seconds of active work. On average, each annotated block contains 3.5 segments. Assigning class labels will cost an additional $0.01 and 26 seconds [8]. To be consistent with Cityscapes, we instruct workers to not segment windows, powerlines, or small regions of sky between leaves. However, workers will occasionally choose to do so and submit higher quality segments than required.

3.4. Annotation Cost and Worker Feedback

Our costs (tables 1, 2) are aligned with existing large-scale studies. Large-scale datasets [8, 37] show that cultivating good workers produces high-quality data at low cost. Table 2 of [21] reports a median wage of $1.77/hr to $2.11/hr; the median MTurk wage in India is $1.43/hr [22]. For "image transcription", the median wage is $1.13/hr over 150K tasks. Workers gave overwhelmingly positive feedback for block annotation (table 3), and we found that some workers would reserve hundreds of block annotation tasks at once. Only 3 out of the 57 workers who successfully completed at least one pilot or user study task requested higher pay. In contrast, our pilot studies showed that workers are unwilling to accept full-image annotation tasks if the payment is reduced to match the wage of block annotation. We conjecture that task enjoyment leads to long-term high-quality output (c.f. [27]).

Table 3: Block annotation worker feedback. Free-form responses are aggregated over SUNCG and Cityscapes experiments, and collected at most once per worker. All 24 sentiments across all 19 worker responses are summarized.

  "Nice" / "Good" / "Great"   8
  "Fun" / "Easy"              5
  "Okay"                      4
  "Happy"                     2
  Release More HITs           2
  Increase Pay                3
3.5. Block Selection

Our experiments show that workers are comfortable annotating between 3 to 6 segments per block. Therefore, block size can be selected by picking a size such that the average number of segments per block falls in this range. For a novel dataset, this can be done by fully labelling several samples and producing an estimate from the fully labelled samples. Without priors on the spatial distribution of rare classes or difficult samples within an image, a checkerboard or pseudo-checkerboard pattern of blocks focuses attention (across different tasks) uniformly across the image. Far-apart pixels within an image are less correlated than neighboring pixels. Therefore, it is good to sample blocks that are spread out to encourage pixel diversity within images.
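One possible way to carry out this estimate, sketched below under the assumption that the fully labelled samples are available as 2D integer label maps; the connected-component counting via SciPy and the list of candidate grid sizes are choices made for illustration, not part of the method above.

```python
import numpy as np
from scipy import ndimage

def mean_segments_per_block(label_maps, rows, cols):
    """Average number of distinct segments overlapping a block, estimated
    from a few fully labelled images (label_maps: list of 2D int arrays)."""
    counts = []
    for labels in label_maps:
        h, w = labels.shape
        for r in range(rows):
            for c in range(cols):
                block = labels[r*h//rows:(r+1)*h//rows, c*w//cols:(c+1)*w//cols]
                n = 0
                for cls in np.unique(block):
                    # count connected components of each class inside the block
                    n += ndimage.label(block == cls)[1]
                counts.append(n)
    return float(np.mean(counts))

def pick_grid(label_maps, candidates=(6, 8, 10, 12, 16), target=(3.0, 6.0)):
    """Return the finest candidate grid whose average segments-per-block
    falls in the comfortable 3-6 range reported above."""
    for g in sorted(candidates, reverse=True):
        avg = mean_segments_per_block(label_maps, g, g)
        if target[0] <= avg <= target[1]:
            return g, avg
    return None
```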
3.6. Compatibility with Existing Annotation Methods

Block annotation is compatible with many annotation tools and innovations besides polygon boundary annotation.

Point-clicks and Scribbles. Annotations such as point clicks or scribbles are faster to acquire than polygons, which leads to a larger and more varied dataset at the same cost. Combining this with blocks will further increase annotation variety due to the diversity that comes from annotating a few blocks in many images over annotating a smaller number of images fully. Additionally, [7, 9] show that the most cost-effective method for semantic segmentation is a combination of densely annotated images and a large number of point clicks. The densely annotated images can be replaced by polygon block annotations since they also contain class boundary supervision for the segmentation network.

Superpixels. Superpixel annotations enable workers to mark a group of visually-related pixels at once [11]. This can reduce the annotation time for background regions and objects with complex boundaries. Superpixel annotation can be easily deployed in our block annotation setting.

Polygon Boundary Sharing. Boundary sharing reuses existing boundaries so that workers do not need to trace each boundary twice [60]. This approach can be easily deployed in our block annotation setting.

Curves. Bezier tools allow workers to quickly annotate curves [60]. They can be easily deployed in our block annotation setting but may be less effective on long curves since each part of the curve must be fit separately.

Interactive Segmentation. Recent advances in interactive segmentation (e.g., [1, 38, 2]) utilize neural networks to convert sparse human inputs into high-quality segments. For novel domains without large-scale training data, block-annotated images can act as cost-efficient seed data to train these models. Once trained, these methods can be applied directly to each block, although further analysis should be conducted to explore the efficiency of such an approach due to block boundaries splitting semantic regions.

4. Segmentation Performance

How well do block annotations serve as training data for semantic segmentation? In section 4.1, the experimental setup is summarized. In section 4.2, we evaluate the effectiveness of block annotations for semantic segmentation. In section 4.3, we compare block annotation with existing weakly supervised segmentation methods.

4.1. Experimental Setup

Pixel Budget. We vary the "pixel budget" in our experiments to explore segmentation performance across a range of available annotated pixels. "Pixel budget" refers to the % of pixels annotated across the training dataset, which can be controlled by varying the number of annotated images, the number of annotated blocks per image, and the size of blocks per image. Our block sizes are fixed in our experiments.

Block Size. We divide images into a 10-by-10 grid for our experiments.

Block Selection. We experiment with two block selection strategies: (a) checkerboard annotation and (b) pseudo-checkerboard annotation. Checkerboard annotation means that every other block in a variable number of images is annotated. Pseudo-checkerboard annotation means that every N-th block is annotated in every image, where N = (# pixels in dataset) / (pixel budget). For example, with a pixel budget equivalent to 25% of the dataset, every fourth block is annotated for the entire dataset. At pixel budget 50%, checkerboard and pseudo-checkerboard are identical. For the remainder of the paper, "Block-X%" refers to pseudo-checkerboard annotation in which X% of the blocks per image are annotated.

Segmentation Model. We use DeepLabv3+ [13] initialized with the official pretrained checkpoint (pretrained on ImageNet [16] + MSCOCO [37] + Pascal VOC [17]). The network is trained for a fixed number of epochs. See supplemental for additional details.

Datasets. Cityscapes is a dataset with ground truth annotations for 19 classes with 2975 training images and 500 validation images. ADE20K contains ground truth annotations for 150 classes with 20210 training images and 2000 validation images. These datasets are chosen for their high-quality dense ground truth annotations and for their differences in number of images / classes and types of scenes represented. The block annotations are synthetically generated from the existing annotations.
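A minimal sketch of the two block selection strategies of section 4.1 over a 10-by-10 grid, assuming the pixel budget is expressed as a fraction of the dataset's pixels and blocks are indexed in row-major order (both assumptions of this example, and the stride rounding is an illustrative choice):

```python
def pseudo_checkerboard(num_blocks: int, budget_fraction: float):
    """Annotate every N-th block in *every* image, N ~ 1 / budget_fraction.
    E.g. budget 0.25 over a 10x10 grid -> blocks 0, 4, 8, ... (25 per image)."""
    stride = max(1, round(1.0 / budget_fraction))
    return [b for b in range(num_blocks) if b % stride == 0]

def checkerboard(rows: int, cols: int):
    """Annotate every other block (50% of blocks) in a subset of images;
    the budget is then controlled by how many images are annotated."""
    return [r * cols + c for r in range(rows) for c in range(cols)
            if (r + c) % 2 == 0]

# Roughly Block-12%: approximately every 8th block of the 10x10 grid.
selected = pseudo_checkerboard(num_blocks=100, budget_fraction=0.12)
```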
4.2. Evaluation

Blocks vs Full Image. How does block annotation compare to full-image annotation for semantic segmentation? We plot the mIOU achieved when trained on a set of annotations against pixel budget in figure 6. For both Cityscapes and ADE20K, block annotation significantly outperforms full-image annotation. The performance gap widens as the pixel budget is decreased – at pixel budget 12%, the reduction in error from full annotation to block annotation is 13% (10%) on Cityscapes (ADE20K). Our results indicate that the quantity of annotated images is more valuable than the quantity of annotations per image. The pseudo-checkerboard block selection pattern consistently outperforms the checkerboard block selection pattern and full annotation. For any pixel budget, pseudo-checkerboard block annotation annotates fewer pixels per image, which means more images are annotated.

Figure 6: Semantic segmentation performance. Training images are annotated with different pixel budgets. Pseudo-checkerboard block annotation outperforms checkerboard and full annotation.

Blocks vs Optimal Performance. How many blocks need to be annotated for segmentation performance to approach the performance achieved by training on full-image annotations for the entire dataset? In table 4, we show results when the network is trained on the full dataset compared to pseudo-checkerboard blocks. Remarkably, we find that checkerboard blocks with 50% pixel budget allow the network to achieve similar performance to the full dataset with 100% pixel budget, indicating that at least 50% of the pixels in Cityscapes and ADE20K are redundant for learning semantic segmentation. Furthermore, with only 12% of the pixels in the dataset annotated, relative error in segmentation performance is within 12%/2% of the optimal for Cityscapes/ADE20K. These results suggest that fewer than 50% of the blocks in an image need to be annotated for training semantic segmentation, reducing the cost of annotation reported in section 3.

Table 4: Semantic segmentation performance (mIOU) when trained on all images. Training with block annotations uses fewer annotated pixels than full annotation but achieves equivalent performance.

               Optimal (Full)   Block-50%   Block-12%
  Cityscapes   77.7             77.7        74.6
  ADE20K       37.4             37.2        36.1

4.3. Weakly Supervised Segmentation Comparison

Block annotation can be considered a form of weakly supervised annotation where a small number of pixels in an image are labelled. Representative works in this area include [36, 7, 46, 45, 15]. Table 3 of [36] is replicated here (table 5) for reference, and extended with our results. All existing results show performance with a VGG-16 based model. We train a MobileNet based model which has been shown to achieve similar performance to VGG-16 (71.8% vs 71.5% Top-1 accuracy on ImageNet) while requiring fewer computational resources [25, 51]. Our fully-supervised implementation pretrained on ImageNet achieves 69.6% mIOU on Pascal VOC 2012 [17]; in comparison, the reference DeepLab-VGG16 model achieves 68.7% mIOU [12] and the re-implementation in [36] achieves 68.5% mIOU.

Table 5: Weakly-supervised segmentation performance. Evaluated on the Pascal VOC 2012 validation set. Original table from [36]. Block-N% indicates N% of image pixels (N pseudo-checkerboard blocks) are labelled.

  Method             Annotations         mIOU (%)
  MIL-FCN [46]       Image-level         25.1
  WSSL [45]          Image-level         38.2
  Point sup. [7]     Point               46.1
  ScribbleSup [36]   Point               51.6
  WSSL [45]          Box                 60.6
  BoxSup [15]        Box                 62.0
  ScribbleSup [36]   Scribble            63.1
  Ours: Block-1%     Pixel-level Block   61.2
  Ours: Block-5%     Pixel-level Block   67.6
  Ours: Block-12%    Pixel-level Block   68.4
  Full Supervision   Pixel-level Image   69.6

Performance Comparison. With only 1% of the pixels annotated, block annotation achieves comparable performance to existing weak supervision methods. Based on our results in section 3.2, the cost of annotation for 1% of pixels with blocks will be 100× less than the cost of full-image annotation. Increasing the budget to 5%-12% significantly increases performance. With 12% of pixels annotated with blocks, the segmentation performance (error) is within 98% (4%) of the segmentation performance (error) with 100% of pixels annotated.

Note that block annotations can be directly transformed into gold-standard fully dense annotations by simply gathering more block annotations within an image. This is not feasible with other annotations such as point clicks, scribbles, and bounding boxes. Furthermore, in section 5, we demonstrate a method to transform block annotations into dense annotations without any additional human effort.
Equal Annotation Time Comparison. Given equal annotation time, block annotation significantly outperforms coarse and scribble annotations by ∼3-4% mIOU (table 6). On Pascal, 97% of full-supervision mIOU is achieved with 1/10 of the annotation time. We convert annotation time to a number of annotated blocks as follows. Block annotation may use up to 2.2× the time of full-image annotation. Given an image divided into 100 blocks, an annotation time of T leads to \frac{T}{0.022\,F} blocks annotated (eq. 5), where F is the full-image annotation time.

Table 6: Weakly-supervised segmentation performance given equal annotation time. For a time comparison of scribbles against other methods, please refer to [36].

  Cityscapes   Ours: Block (7 min)    Coarse (7 min [14])       Full Supervision (90 min [14])
  mIOU (%)     72.1                   68.8                      77.7

  Pascal       Ours: Block (25 sec)   Scribbles (25 sec [36])   Full Supervision (4 min [41])
  mIOU (%)     67.2                   63.1 [36]                 69.6

5. Block-Inpainting Annotations

Although block annotations are useful for learning semantic segmentation, the full structure of images is required for many applications. Understanding the spatial context or affordance relationships [11, 23] between classes relies on understanding the role of each pixel in an image. Shape-based retrieval, object counting [32], or co-occurrence relationships [40] also depend on a global understanding of the image. The naive approach to recover pixel-level labels is to use automatic segmentation to predict labels. However, this does not leverage existing annotations to improve the quality of predicted labels. In section 5.1, we propose a method to inpaint block-annotated images by using annotated blocks as context. In section 5.2, we examine the quality of these inpainted annotations.

5.1. Block-Inpainting Model

The goal of the block-inpainting model is to inpaint labels for unannotated blocks given the labels for annotated blocks in an image. For full implementation details and ablation studies, please refer to the supplemental.

Architecture. The block-inpainting model is based on DeepLabv3+. The input layer is modified so that the RGB image, I ∈ R^{h×w×3}, is concatenated with a multichannel "hint" (à la [62]) of 1-0 class labels W ∈ R^{h×w×K}, where K is the number of classes. At inference time, the hint contains known labels for the annotated blocks of an image which serve as context for the inpainting task. Hidden layers are augmented with dropout, which will be used to control quality by estimating epistemic uncertainty [18, 19].

Estimating Uncertainty. Inpainting fills all missing regions without considering the trade-off between quantity and quality. Existing datasets have high-quality annotations for 92-94% of pixels [64, 11]. Therefore, we modify our network to produce uncertainty estimates which allow us to explicitly control this trade-off. The uncertainty of predictions is correlated with incorrect predictions [31, 29]. Uncertainty is computed by activating dropout at inference time. The predictions are averaged over g trials, giving us U ∈ R^{h×w}, a matrix of uncertainty estimates per image. We take the sample standard deviation corresponding to the predicted class for each pixel to be the uncertainty. For each pixel (i, j), the mean softmax vector over g trials is:

    \mu^{(i,j)} = \frac{1}{g} \sum_{t=1}^{g} p_t^{(i,j)}(y \mid I, W)    (2)

where p(y | I, W) ∈ R^K is the softmax output of the network. The corresponding uncertainty vector is:

    U'^{(i,j)} = \sqrt{ \frac{1}{g-1} \sum_{t=1}^{g} \left( p_t^{(i,j)}(y \mid I, W) - \mu^{(i,j)} \right)^2 }    (3)

Thus, the uncertainty for each pixel (i, j) is:

    U^{(i,j)} = U'^{(i,j)}_m, \quad \text{where } m = \arg\max_k \mu_k^{(i,j)}    (4)

Training. Block annotations serve both as hints and targets. This means that no additional data (or human annotation effort) is required to train the block-inpainting model. For our experiments, we use (synthetically generated) Block-50% annotations. For each image, half of the annotated blocks are randomly selected online at training time to be hints. All of the annotated blocks are used as targets. This encourages the network to "copy-paste" hints in the final output while leveraging the hints as context to inpaint labels for regions where hints are not provided.
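A compact sketch of equations (2)-(4) using Monte Carlo dropout, written in PyTorch-style code under the assumption that the model accepts the concatenated image-plus-hint tensor and keeps its dropout layers active in train mode; the framework and tensor layout are illustrative choices, not the paper's released implementation:

```python
import torch

@torch.no_grad()
def mc_dropout_uncertainty(model, image, hint, g: int = 10):
    """Eqs. (2)-(4): per-pixel uncertainty from g stochastic forward passes.

    image: float tensor (1, 3, h, w); hint: one-hot labels (1, K, h, w) for
    annotated blocks, zeros elsewhere. Returns (pred, uncertainty), each (h, w).
    """
    model.train()  # keep dropout layers active at inference time
    x = torch.cat([image, hint], dim=1)            # (1, 3+K, h, w)
    probs = torch.stack([torch.softmax(model(x), dim=1) for _ in range(g)])
    mu = probs.mean(dim=0)                         # eq. (2): (1, K, h, w)
    std = probs.std(dim=0, unbiased=True)          # eq. (3): sample std dev
    pred = mu.argmax(dim=1)                        # predicted class per pixel
    # eq. (4): std of the predicted class' probability at each pixel
    uncertainty = std.gather(1, pred.unsqueeze(1)).squeeze(1)
    return pred.squeeze(0), uncertainty.squeeze(0)
```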
“Every other pixel” annotations are infeasible in practice. Relative perfor- (a) Full human labels (b) Original image mance of hints with respect to “every other pixel” hints is shown. Checkerboard blocks outperform no hints, random blocks (only boundaries within blocks), and random blocks (full blocks).

Block Inpainting vs Automatic Segmentation. Consider a scenario in which a small number of pixels in a dataset are annotated, and the remainder are automatically labelled to produce dense annotations. Why should block inpainting be used instead of automatic segmentation? Full pixel-level labels produced by block inpainting are superior to automatic segmentation. On Cityscapes, automatic segmentation achieves 78% validation mIOU while block inpainting Block-50% annotations achieves 92% validation mIOU. With Block-12% annotations, automatic segmentation achieves 75% validation mIOU while block inpainting achieves 82% validation mIOU.

Block Selection vs Block-Inpainting Quality. How does the checkerboard pattern compare to other block selection strategies as hints to the block-inpainting model? Intuitively, it is easier to infer labels for pixels that are close to pixels with known labels than for pixels that are further away. Consider a scenario in which every other pixel in an image is annotated. Reasonably good labels for the unannotated pixels can be inferred with a simple nearest-neighbors algorithm. In practice, it is impossible to precisely annotate single pixels in an image. However, we can approximate the same properties of labelling every other pixel by labelling every other block instead (i.e., a checkerboard pattern).

In table 7, we show the block-inpainting model mIOU when different types of hints are given. The rightmost column ("every other pixel") is not feasible to collect in practice. Checkerboard annotations outperform random block annotations even though the network is trained to expect random block hints. Providing only boundary annotations within each block (i.e., annotating pixels within 10 pixels of each boundary in each block) allows the network to achieve nearly the same performance as full block hints. This suggests that the most informative pixels for the block-inpainting model are those near a boundary.

Table 7: Block-inpainting with different types of hints. "Every other pixel" annotations are infeasible in practice. Relative performance of hints with respect to "every other pixel" hints is shown. Checkerboard blocks outperform no hints, random blocks (only boundaries within blocks), and random blocks (full blocks).

              None   Random (Bndy)   Random (Full)   Checker (10x10)   Every oth. pixel
  Rel. mIOU   0.77   0.90            0.92            0.95              1.0
6. Conclusion

In this paper we have introduced block annotation as a replacement for traditional full-image annotation with public crowdworkers. For semantic segmentation, Block-12% offers strong performance at 1/8th of the monetary cost. Block-5% offers competitive weakly-supervised performance at equal annotation time to existing methods. For optimal semantic segmentation performance, or to recover global structure with inpainting, Block-50% should be utilized.

There are many directions for future work. Our crowdworker tasks are similar to full-image annotation tasks, so it may be possible to improve the gains with more exploration and development of boundary marking algorithms. We have explored some block patterns and further exploration may reveal even better trade-offs between annotation quality, cost and image variety. Another interesting direction is acquiring instance-level annotations by merging segments across block boundaries. Active learning can be used to select blocks of rare classes, and workers can be assigned blocks so that annotation difficulty matches worker skill.

Acknowledgements. We acknowledge support from , NSF (CHS-1617861 and CHS-1513967), the PERISCOPE MURI Contract #N00014-17-1-2699, and NSERC (PGS-D). We thank the reviewers for their constructive comments. We appreciate the efforts of MTurk workers who participated in our user studies.

References

[1] David Acuna, Huan Ling, Amlan Kar, and Sanja Fidler. Efficient interactive annotation of segmentation datasets with Polygon-RNN++. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 859–868, 2018.
[2] Eirikur Agustsson, Jasper RR Uijlings, and Vittorio Ferrari. Interactive full image segmentation by considering all regions jointly. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 11622–11631, 2019.
[3] Jiwoon Ahn and Suha Kwak. Learning pixel-level semantic affinity with image-level supervision for weakly supervised semantic segmentation. arXiv preprint arXiv:1803.10464, 2018.
[4] Yağız Aksoy, Tae-Hyun Oh, Sylvain Paris, Marc Pollefeys, and Wojciech Matusik. Semantic soft segmentation. ACM Trans. Graph. (Proc. SIGGRAPH), 37(4):72:1–72:13, 2018.
[5] Mykhaylo Andriluka, Jasper RR Uijlings, and Vittorio Ferrari. Fluid annotation: a human-machine collaboration interface for full image annotation. arXiv preprint arXiv:1806.07527, 2018.
[6] Xue Bai and Guillermo Sapiro. Geodesic matting: A framework for fast interactive image and video segmentation and matting. International Journal of Computer Vision, 82(2):113–132, 2009.
[7] Amy Bearman, Olga Russakovsky, Vittorio Ferrari, and Li Fei-Fei. What's the point: Semantic segmentation with point supervision. In European Conference on Computer Vision (ECCV), pages 549–565. Springer, 2016.
[8] Sean Bell, Paul Upchurch, Noah Snavely, and Kavita Bala. OpenSurfaces: A richly annotated catalog of surface appearance. ACM Transactions on Graphics (TOG), 32(4), 2013.
[9] Sean Bell, Paul Upchurch, Noah Snavely, and Kavita Bala. Material recognition in the wild with the Materials in Context database. In Computer Vision and Pattern Recognition (CVPR), 2015.
[10] Ali Sharifi Boroujerdi, Maryam Khanian, and Michael Breuß. Deep interactive region segmentation and captioning. In Signal-Image Technology & Internet-Based Systems (SITIS), 2017 13th International Conference on, pages 103–110. IEEE, 2017.
[11] Holger Caesar, Jasper Uijlings, and Vittorio Ferrari. COCO-Stuff: Thing and stuff classes in context. In Computer Vision and Pattern Recognition (CVPR). IEEE, 2018.
[12] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4):834–848, 2018.
[13] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In ECCV, 2018.
[14] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The Cityscapes dataset for semantic urban scene understanding. In Computer Vision and Pattern Recognition (CVPR), pages 3213–3223, 2016.
[15] Jifeng Dai, Kaiming He, and Jian Sun. BoxSup: Exploiting bounding boxes to supervise convolutional networks for semantic segmentation. In Proceedings of the IEEE International Conference on Computer Vision, pages 1635–1643, 2015.
[16] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition (CVPR), pages 248–255. IEEE, 2009.
[17] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes (VOC) Challenge. International Journal of Computer Vision (IJCV), 88(2):303–338, June 2010.
[18] Yarin Gal and Zoubin Ghahramani. Bayesian convolutional neural networks with Bernoulli approximate variational inference. arXiv preprint arXiv:1506.02158, 2015.
[19] Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, pages 1050–1059, 2016.
[20] Saurabh Gupta, Pablo Arbelaez, and Jitendra Malik. Perceptual organization and recognition of indoor scenes from RGB-D images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 564–571, 2013.
[21] Kotaro Hara, Abigail Adams, Kristy Milland, Saiph Savage, Chris Callison-Burch, and Jeffrey P Bigham. A data-driven analysis of workers' earnings on Amazon Mechanical Turk. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, page 449. ACM, 2018.
[22] Kotaro Hara, Abigail Adams, Kristy Milland, Saiph Savage, Benjamin V Hanrahan, Jeffrey P Bigham, and Chris Callison-Burch. Worker demographics and earnings on Amazon Mechanical Turk: An exploratory analysis. In Extended Abstracts of the 2019 CHI Conference on Human Factors in Computing Systems, page LBW1217. ACM, 2019.
[23] Mohammed Hassanin, Salman Khan, and Murat Tahtali. Visual affordance and function understanding: A survey. arXiv preprint arXiv:1807.06775, 2018.
[24] Seunghoon Hong, Donghun Yeo, Suha Kwak, Honglak Lee, and Bohyung Han. Weakly supervised semantic segmentation using web-crawled videos. arXiv preprint arXiv:1701.00352, 2017.
[25] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
[26] Ronghang Hu, Piotr Dollár, Kaiming He, Trevor Darrell, and Ross Girshick. Learning to segment every thing. Cornell University arXiv, Ithaca, NY, USA, 2017.
[27] Eric Huang, Haoqi Zhang, David C Parkes, Krzysztof Z Gajos, and Yiling Chen. Toward automatic task design: a progress report. In Proceedings of the ACM SIGKDD Workshop on Human Computation, pages 77–85. ACM, 2010.
[28] Anna Khoreva, Rodrigo Benenson, Jan Hendrik Hosang, Matthias Hein, and Bernt Schiele. Simple does it: Weakly supervised instance and semantic segmentation. In CVPR, volume 1, page 3, 2017.
[29] Nikolay Laptev, Jason Yosinski, Li Erran Li, and Slawek Smyl. Time-series extreme event forecasting with neural networks at Uber. In International Conference on Machine Learning, number 34, pages 1–5, 2017.
[30] Hoang Le, Long Mai, Brian Price, Scott Cohen, Hailin Jin, and Feng Liu. Interactive boundary prediction for object selection. In Proceedings of the European Conference on Computer Vision (ECCV), pages 18–33, 2018.
[31] Christian Leibig, Vaneeda Allken, Murat Seçkin Ayhan, Philipp Berens, and Siegfried Wahl. Leveraging uncertainty information from deep neural networks for disease detection. Scientific Reports, 7(1):17816, 2017.
[32] Victor Lempitsky and Andrew Zisserman. Learning to count objects in images. In Advances in Neural Information Processing Systems, pages 1324–1332, 2010.
[33] Anat Levin, Dani Lischinski, and Yair Weiss. A closed-form solution to natural image matting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(2):228–242, 2008.
[34] Anat Levin, Alex Rav-Acha, and Dani Lischinski. Spectral matting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(10):1699–1712, 2008.
[35] Zhengqi Li and Noah Snavely. CGIntrinsics: Better intrinsic image decomposition through physically-based rendering. In Proceedings of the European Conference on Computer Vision (ECCV), pages 371–387, 2018.
[36] Di Lin, Jifeng Dai, Jiaya Jia, Kaiming He, and Jian Sun. ScribbleSup: Scribble-supervised convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3159–3167, 2016.
[37] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision (ECCV), 2014.
[38] Kevis-Kokitsi Maninis, Sergi Caelles, Jordi Pont-Tuset, and Luc Van Gool. Deep extreme cut: From extreme points to object segmentation. arXiv preprint arXiv:1711.09081, 2017.
[39] Pat Marion, Peter R Florence, Lucas Manuelli, and Russ Tedrake. A pipeline for generating ground truth labels for real RGBD data of cluttered scenes. arXiv preprint arXiv:1707.04796, 2017.
[40] Branislav Mičušík and Jana Košecká. Semantic segmentation of street scenes by superpixel co-occurrence and 3D geometry. In 2009 IEEE 12th International Conference on Computer Vision Workshops, ICCV Workshops, pages 625–632. IEEE, 2009.
[41] Roozbeh Mottaghi, Xianjie Chen, Xiaobai Liu, Nam-Gyu Cho, Seong-Whan Lee, Sanja Fidler, Raquel Urtasun, and Alan Yuille. The role of context for object detection and semantic segmentation in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 891–898, 2014.
[42] Gerhard Neuhold, Tobias Ollmann, Samuel Rota Bulò, and Peter Kontschieder. The Mapillary Vistas dataset for semantic understanding of street scenes. In ICCV, pages 5000–5009, 2017.
ACM, 2004.2 [37] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, [51] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zh- Pietro Perona, Deva Ramanan, Piotr Dollar,´ and C Lawrence moginov, and Liang-Chieh Chen. Mobilenetv2: Inverted Zitnick. Microsoft COCO: Common objects in context. In residuals and linear bottlenecks. In Proceedings of the IEEE European Conference on Computer Vision (ECCV), 2014.1, Conference on Computer Vision and Pattern Recognition, 2,4,5 pages 4510–4520, 2018.6 [38] Kevis-Kokitsi Maninis, Sergi Caelles, Jordi Pont-Tuset, and [52] Tong Shen, Guosheng Lin, Chunhua Shen, and Ian Reid. Luc Van Gool. Deep extreme cut: From extreme points to Bootstrapping the performance of webly supervised semantic object segmentation. arXiv preprint arXiv:1711.09081, 2017. segmentation. In Proceedings of the IEEE Conference on 2,5 Computer Vision and Pattern Recognition, pages 1363–1371, [39] Pat Marion, Peter R Florence, Lucas Manuelli, and Russ 2018.2 Tedrake. A pipeline for generating ground truth labels [53] Zhiyuan Shi, Yongxin Yang, Timothy M Hospedales, and Tao for real rgbd data of cluttered scenes. arXiv preprint Xiang. Weakly-supervised image annotation and segmenta- arXiv:1707.04796, 2017.2 tion with objects and attributes. IEEE transactions on pattern [40] Branislav Micuˇ sˇ´l´ık and Jana Koseckˇ a.´ Semantic segmentation analysis and machine intelligence, 39(12):2525–2538, 2017. of street scenes by superpixel co-occurrence and 3d geometry. 2 In 2009 IEEE 12th International Conference on Computer [54] Shuran Song, Fisher Yu, Andy Zeng, Angel X Chang, Mano- Vision Workshops, ICCV Workshops, pages 625–632. IEEE, lis Savva, and Thomas Funkhouser. Semantic scene comple- 2009.7 tion from a single depth image. Proceedings of 30th IEEE [41] Roozbeh Mottaghi, Xianjie Chen, Xiaobai Liu, Nam-Gyu Conference on Computer Vision and Pattern Recognition, Cho, Seong-Whan Lee, Sanja Fidler, Raquel Urtasun, and 2017.3 Alan Yuille. The role of context for object detection and se- [55] Paul Upchurch, Daniel Sedra, Andrew Mullen, Haym Hirsh, mantic segmentation in the wild. In Proceedings of the IEEE and Kavita Bala. Interactive consensus agreement games for Conference on Computer Vision and Pattern Recognition, labeling images. In AAAI Conference on Human Computation pages 891–898, 2014.1,4,7 and Crowdsourcing (HCOMP), October 2016.2 [56] Peter Welinder, Steve Branson, Pietro Perona, and Serge J Be- longie. The multidimensional wisdom of crowds. In Advances in neural information processing systems, pages 2424–2432, 2010.2 [57] Ning Xu, Brian Price, Scott Cohen, Jimei Yang, and Thomas Huang. Deep grabcut for object selection. arXiv preprint arXiv:1707.00243, 2017.2 [58] Ning Xu, Brian Price, Scott Cohen, Jimei Yang, and Thomas S Huang. Deep interactive object selection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 373–381, 2016.2 [59] Ning Xu, Brian L Price, Scott Cohen, and Thomas S Huang. Deep image matting. In CVPR, volume 2, page 4, 2017.2 [60] Fisher Yu, Wenqi Xian, Yingying Chen, Fangchen Liu, Mike Liao, Vashisht Madhavan, and Trevor Darrell. BDD100K: A diverse driving video database with scalable annotation tooling. arXiv preprint arXiv:1805.04687, 2018.1,2,5 [61] Lishi Zhang, Chenghan Fu, and Jia Li. Collaborative anno- tation of semantic objects in images with multi-granularity supervisions. In 2018 ACM Multimedia Conference on Multi- media Conference, pages 474–482. 
[56] Peter Welinder, Steve Branson, Pietro Perona, and Serge J Belongie. The multidimensional wisdom of crowds. In Advances in Neural Information Processing Systems, pages 2424–2432, 2010.
[57] Ning Xu, Brian Price, Scott Cohen, Jimei Yang, and Thomas Huang. Deep GrabCut for object selection. arXiv preprint arXiv:1707.00243, 2017.
[58] Ning Xu, Brian Price, Scott Cohen, Jimei Yang, and Thomas S Huang. Deep interactive object selection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 373–381, 2016.
[59] Ning Xu, Brian L Price, Scott Cohen, and Thomas S Huang. Deep image matting. In CVPR, volume 2, page 4, 2017.
[60] Fisher Yu, Wenqi Xian, Yingying Chen, Fangchen Liu, Mike Liao, Vashisht Madhavan, and Trevor Darrell. BDD100K: A diverse driving video database with scalable annotation tooling. arXiv preprint arXiv:1805.04687, 2018.
[61] Lishi Zhang, Chenghan Fu, and Jia Li. Collaborative annotation of semantic objects in images with multi-granularity supervisions. In 2018 ACM Multimedia Conference on Multimedia Conference, pages 474–482. ACM, 2018.
[62] Richard Zhang, Jun-Yan Zhu, Phillip Isola, Xinyang Geng, Angela S Lin, Tianhe Yu, and Alexei A Efros. Real-time user-guided image colorization with learned deep priors. ACM Transactions on Graphics (TOG), 9(4), 2017.
[63] Yinda Zhang, Shuran Song, Ersin Yumer, Manolis Savva, Joon-Young Lee, Hailin Jin, and Thomas Funkhouser. Physically-based rendering for indoor scene understanding using convolutional neural networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[64] Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Scene parsing through ADE20K dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, volume 1, page 4. IEEE, 2017.
[65] Aleksandar Zlateski, Ronnachai Jaroensri, Prafull Sharma, and Frédo Durand. On the importance of label quality for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1479–1487, 2018.