
Block Annotation: Better Image Annotation with Sub-Image Decomposition

Hubert Lin, Paul Upchurch, Kavita Bala
Cornell University

Abstract

Image datasets with high-quality pixel-level annotations are valuable for semantic segmentation: labelling every pixel in an image ensures that rare classes and small objects are annotated. However, full-image annotations are expensive, with experts spending up to 90 minutes per image. We propose block sub-image annotation as a replacement for full-image annotation. Despite the attention cost of frequent task switching, we find that block annotations can be crowdsourced at higher quality compared to full-image annotation with equal monetary cost using existing annotation tools developed for full-image annotation. Surprisingly, we find that 50% of pixels annotated with blocks allows semantic segmentation to achieve performance equivalent to 100% of pixels annotated. Furthermore, as little as 12% of pixels annotated allows performance as high as 98% of the performance with dense annotation. In weakly-supervised settings, block annotation outperforms existing methods by 3-4% (absolute) given equivalent annotation time. To recover the global structure necessary for applications such as characterizing spatial context and affordance relationships, we propose an effective method to inpaint block-annotated images with high-quality labels without additional human effort. As such, fewer annotations can also be used for these applications compared to full-image annotation.

1. Introduction

Recent large-scale datasets place a heavy emphasis on high-quality fully dense annotations (in which over 90% of the pixels are labelled) for hundreds of thousands of images. Dense annotations are valuable for both semantic segmentation and applications beyond segmentation such as characterizing spatial context and affordance relationships [11, 23]. The long-tail distribution of classes means it is difficult to gather annotations for rare classes, especially if these classes are difficult to segment. Annotating every pixel in an image ensures that pixels corresponding to rare classes or small objects are labelled. Dense annotations also capture pixels that form the boundary between classes. For applications such as understanding spatial context between classes or affordance relationships, dense annotations are required for principled conclusions to be drawn. In the past, polygon annotation tools have enabled partially dense annotations (in which small semantic regions are densely annotated) to be crowdsourced at scale with public crowd workers. These tools paved the way for the cost-effective creation of large-scale partially dense datasets such as [8, 37]. Despite the success of these annotation tools, fully dense datasets have relied extensively on expensive expert annotators [60, 14, 41, 64, 42] and private crowdworkers [11].

We propose annotation of small blocks of pixels as a stand-in replacement for full-image annotation (figure 1). We find that these annotations can be effectively gathered by crowdworkers, and that annotation of a sparse number of blocks per image can train a high-performance segmentation network. We further show these sparsely annotated images can be extended automatically to full-image annotations. We show block annotation has:

• Wide applicability. (Section 3) Block annotations can be effectively crowdsourced at higher quality compared to full annotation. It is easy to implement and works with existing advances in image annotation.

• Cost-efficient Design. (Section 3) Block annotation reflects a cost-efficient design paradigm (while current research focuses on reducing annotation time). This is reminiscent of gamification and citizen science where enjoyable tasks lead to low-cost high-engagement work.

• Complex Region Annotation. (Section 3) Block annotation shifts focus from categorical regions to spatial regions. When annotating categorical regions, workers segment simple objects before complex objects. With spatial regions, informative complex regions are forced to be annotated.

• Weakly-Supervised Performance. (Section 4) Block annotation is competitive in weakly-supervised settings, outperforming existing methods by 3-4% (absolute) given equivalent annotation time.

• Scalable Performance. (Section 4) Full-supervision performance is achieved by annotating 50% of blocks per image. Thus, blocks can be annotated until desired performance is achieved, in contrast to methods such as scribbles.

• Scalable Structure. (Section 5) Block-annotated images can be effectively inpainted with high-quality labels without additional human effort.

Figure 1: (a) Sub-image block annotations are more effective to gather than full-image annotations. (b) Training on sparse block annotations enables semantic segmentation performance equivalent to full-image annotations. (c) Block labels can be inpainted with high-quality labels.

2. Related Work

In this section we review recent works on pixel-level annotation in three areas: human annotation, human-machine annotation, and dense segmentation with weak supervision.

Human Annotation. Manual labeling of every pixel is impractical for large-scale datasets. A successful method is to have crowdsource workers segment polygonal regions by clicking on boundaries. Employing crowdsource workers offers its own set of challenges with quality control and task design [56, 8, 55]. Although large-scale public crowdsourcing can be successful [37], recent benchmark datasets have resorted to in-house expert annotators [14, 43]. Annotation time can be reduced through improvements such as autopan, zoom [8] and shared polygon boundaries [60]. Polygon segmentation can be augmented by painted labels on superpixel groups [11] and Bezier curves [60]. Pixel-level labels for images can also be obtained by (1) constructing a 3D scene from an image collection, (2) grouping and labeling 3D shapes and (3) propagating shape labels to image pixels [39]. In our work, we investigate sub-image polygon annotation, which can be further combined with other methods (sec. 3).

Human-Machine Annotation. Complex boundaries are time-consuming to trace manually. In these cases the cost of pixel-level annotation can be reduced by automating a portion of the task. Matting and object selection [50, 33, 34, 6, 58, 57, 10, 30, 59] generate tight boundaries from loosely annotated boundaries or few inside/outside clicks and scribbles. [44, 38] introduced a predictive method which automatically infers a foreground mask from 4 boundary clicks, and was extended to full-image segmentation in [2]. The number of boundary clicks was further reduced to as few as one by [1]. Predictive methods require an additional human verification step since the machine mutates the original human annotation. The additional step can be avoided with an online method. However, online methods (e.g., [1, 30, 2]) have higher requirements since the algorithm must be translated into the web browser setting and the worker's machine must be powerful enough to run the algorithm.¹ Alternatively, automatic proposals can be generated for humans to manipulate: [5] generates segments, [4] generates a set of matting layers, [61] generates superpixel labels, and [47] generates boundary fragments. In our work, we show that human-annotated blocks can be extended automatically into dense annotations (sec. 5), and we discuss how other human-machine methods can be used with blocks (sec. 3.6).

¹ Offloading online methods onto a cloud service offers a different landscape of higher costs (upfront development and ongoing operation costs).

Weakly-Supervised Dense Segmentation. There are alternatives to training with high-quality densely annotated images which substitute quantity for label quality and/or richness. Previous works have used low-quality pixel-level annotations [65], bounding boxes [45, 28, 49], point-clicks [7], scribbles [7, 36], image-level class labels [45, 53, 3], image-level text descriptions [24] and unlabeled related web videos [24] to train semantic segmentation networks. Combining weak annotations with small amounts of high-quality dense annotation is another strategy for reducing cost [9, 26]. [52] proposes a two-stage approach where image-level class labels are automatically converted into pixel-level masks which are used to train a semantic segmentation network. We find that a small number of sub-image block annotations is a competitive form of weak supervision (sec. 4.3).

3. Block Annotation

Sub-image block annotation is composed of three stages: (1) given an image I, select a small spatial region I′; (2) annotate I′ with pixel-level labels; (3) repeat (with a different I′) until I is sufficiently annotated. In this paper, we explore the case where I′ is rectangular, and focus on the use of existing pixel-level annotation tools.

Can block annotations be gathered as effectively as full-image annotations with existing tools? In section 3.1, we show our annotation interface. In section 3.2, we explore the quality of block annotation versus full-image annotation. In section 3.3, we examine block annotation for a real-world dataset. In section 3.4, we discuss the cost of block annotation and show worker feedback. In section 3.5, we discuss how blocks for annotation can be selected in practice. Finally, in section 3.6 we discuss the compatibility of block annotation with existing annotation methods.

3.1. Annotation Interface

Our block annotation interface is shown in figure 2 and implemented with existing tools [8]. For full-image annotation, the highlighted block covers the entire image. Studies are deployed on Amazon Mechanical Turk.

Figure 2: Block Annotation UI. Annotators are given one highlighted block to annotate with the remainder of the image as context. (a) Highlighted block. (b) Finished block annotation.

Table 1: Block vs Full Annotation. Average statistics per image.

                          Block         Full
  Error                   0.253         0.286
  Error (small regions)   0.636         0.677
  $ / hr                  $1.40 / hr    $3.12 / hr
  Total cost              $2.00         $2.05
  Total cost (median)     $1.99         $2.23
  # segments              95.68         38.95
  $ / segment             $0.0215       $0.0595
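To make stage (1) of the block annotation pipeline concrete, the sketch below splits an image into a uniform grid of rectangular blocks and emits one annotation task per block. This is an illustrative sketch, not the released annotation backend: the 10-by-10 grid default mirrors the grid used later in section 4.1, and the AnnotationTask container and function names are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class AnnotationTask:
    """One crowdsourcing task: annotate a single rectangular block.
    (Hypothetical container; the real tasks are HITs built on the tools of [8].)"""
    image_id: str
    block_index: int
    box: Tuple[int, int, int, int]  # (row0, col0, row1, col1) in pixel coordinates

def grid_blocks(height: int, width: int, rows: int = 10, cols: int = 10):
    """Split an image into a rows x cols grid of rectangular blocks."""
    for r in range(rows):
        for c in range(cols):
            r0, r1 = r * height // rows, (r + 1) * height // rows
            c0, c1 = c * width // cols, (c + 1) * width // cols
            yield r * cols + c, (r0, c0, r1, c1)

def make_tasks(image_id: str, height: int, width: int) -> List[AnnotationTask]:
    """Stage (1) of block annotation: one task per block of a 10x10 grid."""
    return [AnnotationTask(image_id, idx, box)
            for idx, box in grid_blocks(height, width)]

# Example: a 1024x2048 Cityscapes-sized image yields 100 block tasks.
tasks = make_tasks("frankfurt_000000", height=1024, width=2048)
assert len(tasks) == 100
```

Only a subset of these tasks needs to be issued per image; block selection strategies are discussed in sections 3.5 and 4.1.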

Figure 3: Annotation error rate for block and full annotation. Each point represents one image. The same set of images are both block annotated and full-image annotated. The stars represent the centroid (median). Cost/time include estimated cost/time to assign labels for each segment [8]. Lower-left is better. With block annotation, workers (a) choose to work for lower wages and (b) segment more regions for less pay per region. The overall quality is higher for block annotation.

Figure 4: SUNCG/CGIntrinsics annotation. (a) Ground truth. (b) Block annotation (zoomed-in). (c) Full annotation (zoomed-in). White dotted box highlights an example where block annotation qualitatively outperforms full annotation. More in supplemental.

3.2. Quality of Block Annotation

We explore the quality of block annotations compared to full-image annotations on a synthetic dataset. How do the quality and cost compare between block and full annotations? We find that the average quality for block-annotated images is higher while the total monetary cost is about the same. The average quality of block annotations is consistently higher, including for small regions (e.g., figure 4). The overall block annotation error is 12% lower than full annotation. For regions smaller than 0.5% of the image, the block annotation error is 6% lower. In figure 3, the cost and quality of block versus full-image annotation is shown. Remarkably, we find that workers are willing to work on block annotation tasks for a significantly lower hourly wage. This indicates that block annotation is more intrinsically palatable for crowdworkers, in line with [27] which shows task design can influence quality of work. Moreover, workers are more likely to over-segment objects with respect to ground truth (e.g., individual cushions on a couch, handles on cabinets) with block annotation tasks. We find that block boundaries may also divide semantic regions. Table 1 contains additional statistics. Despite similar costs to annotate an image in blocks or in full, we show in section 4 that competitive performance is achieved with less than half of the blocks annotated per image.

Study Details. For these experiments, we chose to use a synthetic dataset. While human annotations may contain mistakes, synthetic datasets are generated with known ground truth labels with which annotation error can be computed. The CGIntrinsics dataset [35] contains physically-based renderings of indoor scenes from the SUNCG dataset [54, 63]. We use the more realistic CGIntrinsics renderings and the known semantic labels from SUNCG. The labels are categorized according to the NYU40 [20] semantic categories. Due to the nature of indoor scenes, the depth and field of view of each image is smaller than in outdoor datasets. The reduced complexity means that crowdworkers are able to produce good full-image annotations for this dataset.

We select MTurk workers who are skilled at both full-image annotation and block annotation in a pilot study (a standard quality control practice [8]). The final pool consists of 10 workers. Image difficulty is estimated by counting the maximum number of ground truth segments in a fixed-size sliding window. Windows, mirrors, and void regions are masked out in the images so that workers do not expend effort on visible content for which ground truth labels do not exist (such as objects seen through a window or mirror). We manually cull images that include transparent glass tables which are not visible in the renderings, or doorways through which visible content can be seen but no ground truth labels exist. After filtering, twenty of the one hundred most difficult images are selected. We choose a block size so that an average of 3.5 segments are in each block. This results in 16 blocks per image. For each task, a highlighted rectangle outlines the block to be annotated. Note that workers will annotate up to the inner edge of the highlighted boundary. Therefore, we ensure the edges of the rectangle do not overlap with the region to be annotated.²

² "$" refers to USD throughout this paper.

Workers are paid $0.06 per block annotation task and $0.96 per full-image task. Bonuses up to 1.5 times the base pay are awarded to attempt to raise the effective hourly wage for difficult tasks to $4 / hr. Our results show that workers are willing to work on block annotation tasks beyond the time threshold for bonuses, effectively producing work for an hourly wage significantly lower than the intended $4 / hr. On the other hand, workers do not often exhibit this behavior with full annotation tasks. Different workers may work on different blocks belonging to the same image. We use two forms of quality control: (1) annotations must contain a number of segments greater than 25% of the known number of ground-truth segments for that task, and (2) annotations cannot be submitted until at least 10 seconds / 3 minutes (block / full) have passed. All submissions satisfying these conditions are accepted during the user study. For an overview of QA methods, please refer to [48]. Labels are assigned by majority ground-truth voting, with cost estimated from [8].

To evaluate the quality of annotations in an image with K classes, we measure the class-balanced error rate (class-balanced Jaccard distance):

    error rate = \frac{1}{K} \sum_{c=1}^{K} \frac{FP_c + FN_c}{TP_c + FP_c + FN_c} = 1 - \mathrm{mIOU}    (1)
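The sketch below is a direct reading of equation (1), assuming integer label maps and a void label of 255 for pixels excluded from evaluation (both the void convention and the NumPy dependency are assumptions of this example, not details given above):

```python
import numpy as np

def class_balanced_error_rate(pred: np.ndarray, gt: np.ndarray,
                              num_classes: int, void_label: int = 255) -> float:
    """Equation (1): mean per-class Jaccard distance, i.e. 1 - mIOU.

    pred, gt: integer label maps of identical shape; pixels equal to
    void_label in gt are excluded from the evaluation.
    """
    valid = gt != void_label
    pred, gt = pred[valid], gt[valid]
    distances = []
    for c in range(num_classes):
        tp = np.sum((pred == c) & (gt == c))
        fp = np.sum((pred == c) & (gt != c))
        fn = np.sum((pred != c) & (gt == c))
        denom = tp + fp + fn
        if denom == 0:          # class absent from both maps: skip it
            continue
        distances.append((fp + fn) / denom)
    return float(np.mean(distances))  # 1 - mIOU over the classes present
```

Averaging only over classes present in the image (rather than all K) is one common convention; the appropriate choice depends on how absent classes should be scored.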
3.3. Viability of Real-World Block Annotation

How does block annotation fare with a real-world, non-synthetic dataset? To study the viability of block annotating real-world datasets with scalable crowdsourcing, we ask crowdworkers to annotate blocks from images in Cityscapes [14]. We choose Cityscapes for the annotation complexity of its scenes – 1.5 hours of expert annotation effort is required per image. In contrast, other datasets such as [41, 11] require less than 20 minutes of annotation effort per image. We expect crowd work to be worse than expert work, so it is a surprisingly positive result that the quality of the crowdsourced segments is visually comparable to the expert Cityscapes segments (figure 5). Some crowdsourced segments are very high quality. We find that 47% of blocks have more crowd segments than expert segments (20% have fewer segments, and 33% have the same number of segments). A summary of the cost is given in table 2, which compares public crowdworkers to trained experts. It is feasible that block annotation time will decrease with expert training. Given 100 uniformly sized blocks per image, we ask an expert to create equal-quality block and full annotations; we find one block is 1.56% of the effort of a full image.

Table 2: Real-world cost of annotation. Cost evaluated on Cityscapes. Each block is annotated by MTurk workers. Full-image is annotated by experts in [14]. Note: [14] annotates instance segments. See table 1 for a crowd-to-crowd comparison.

                Block (Crowd)   Full (Expert [14, 42])
  $ / Task      $0.13           -
  Time / Task   2 min           1.5 hr

Figure 5: Crowdsourced vs expert segments. Crowdsourced block-annotated segments are compared to expert Cityscapes segments. Crowdsourced segments are colored for easier comparison. Top-left is a high-quality example. See supplemental for more.

Study Details. We searched for workers who produce high-quality work in a pilot study and found a set of 7 workers. These workers were found within a hundred pilot HITs (for a total cost of $4). We approved all of their submissions during the user study. We do not restrict workers from annotating outside of the block, and we do not force workers to densely annotate the block. We do not include the use of sentinels or tutorials as in [8]. Thirteen randomly selected validation images from Cityscapes are annotated by crowdworkers. Each image is divided into 100 uniformly shaped blocks. A total of 650 blocks (50 per image) are annotated in random order. Workers are paid $0.06 per task. Workers are automatically awarded bonuses so that the effective hourly wage is at least $5 for each block, with bonuses capped at $0.24 to prevent abuse. For one block, the total base payout is $0.06 with an average of $0.0636 in bonuses over 93 seconds of active work. On average, each annotated block contains 3.5 segments. Assigning class labels will cost an additional $0.01 and 26 seconds [8]. To be consistent with Cityscapes, we instruct workers to not segment windows, powerlines, or small regions of sky between leaves. However, workers will occasionally choose to do so and submit higher quality segments than required.

3.4. Annotation Cost and Worker Feedback

Our costs (tables 1, 2) are aligned with existing large-scale studies. Large-scale datasets [8, 37] show that cultivating good workers produces high-quality data at low cost. Table 2 of [21] reports a median wage of $1.77/hr to $2.11/hr; the median MTurk wage in India is $1.43/hr [22]. For "image transcription", the median wage is $1.13/hr over 150K tasks. Workers gave overwhelmingly positive feedback for block annotation (table 3), and we found that some workers would reserve hundreds of block annotation tasks at once. Only 3 out of the 57 workers who successfully completed at least one pilot or user study task requested higher pay. In contrast, our pilot studies showed that workers are unwilling to accept full-image annotation tasks if the payment is reduced to match the wage of block annotation. We conjecture that task enjoyment leads to long-term high-quality output (c.f. [27]).

Table 3: Block annotation worker feedback. Free-form responses are aggregated over SUNCG and Cityscapes experiments, and collected at most once per worker. All 24 sentiments across all 19 worker responses are summarized.

  "Nice" / "Good" / "Great"   8
  "Fun" / "Easy"              5
  "Okay"                      4
  "Happy"                     2
  Release More HITs           2
  Increase Pay                3
3.5. Block Selection

Our experiments show that workers are comfortable annotating between 3 to 6 segments per block. Therefore, block size can be selected by picking a size such that the average number of segments per block falls in this range. For a novel dataset, this can be done by fully labelling several samples and producing an estimate from the fully labelled samples. Without priors on the spatial distribution of rare classes or difficult samples within an image, a checkerboard or pseudo-checkerboard pattern of blocks focuses attention (across different tasks) uniformly across the image. Far-apart pixels within an image are less correlated than neighboring pixels. Therefore, it is good to sample blocks that are spread out to encourage pixel diversity within images.
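One possible way to carry out this estimate, sketched below under the assumption that the fully labelled samples are available as 2D integer label maps; the connected-component counting via SciPy and the list of candidate grid sizes are choices made for illustration, not part of the method above.

```python
import numpy as np
from scipy import ndimage

def mean_segments_per_block(label_maps, rows, cols):
    """Average number of distinct segments overlapping a block, estimated
    from a few fully labelled images (label_maps: list of 2D int arrays)."""
    counts = []
    for labels in label_maps:
        h, w = labels.shape
        for r in range(rows):
            for c in range(cols):
                block = labels[r*h//rows:(r+1)*h//rows, c*w//cols:(c+1)*w//cols]
                n = 0
                for cls in np.unique(block):
                    # count connected components of each class inside the block
                    n += ndimage.label(block == cls)[1]
                counts.append(n)
    return float(np.mean(counts))

def pick_grid(label_maps, candidates=(6, 8, 10, 12, 16), target=(3.0, 6.0)):
    """Return the finest candidate grid whose average segments-per-block
    falls in the comfortable 3-6 range reported above."""
    for g in sorted(candidates, reverse=True):
        avg = mean_segments_per_block(label_maps, g, g)
        if target[0] <= avg <= target[1]:
            return g, avg
    return None
```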
3.6. Compatibility with Existing Annotation Methods

Block annotation is compatible with many annotation tools and innovations besides polygon boundary annotation.

Point-clicks and Scribbles. Annotations such as point clicks or scribbles are faster to acquire than polygons, which leads to a larger and more varied dataset at the same cost. Combining this with blocks will further increase annotation variety due to the diversity that comes from annotating a few blocks in many images over annotating a smaller number of images fully. Additionally, [7, 9] show that the most cost-effective method for semantic segmentation is a combination of densely annotated images and a large number of point clicks. The densely annotated images can be replaced by polygon block annotations since they also contain class boundary supervision for the segmentation network.

Superpixels. Superpixel annotations enable workers to mark a group of visually-related pixels at once [11]. This can reduce the annotation time for background regions and objects with complex boundaries. Superpixel annotation can be easily deployed in our block annotation setting.

Polygon Boundary Sharing. Boundary sharing reuses existing boundaries so that workers do not need to trace each boundary twice [60]. This approach can be easily deployed in our block annotation setting.

Curves. Bezier tools allow workers to quickly annotate curves [60]. They can be easily deployed in our block annotation setting but may be less effective on long curves since each part of the curve must be fit separately.

Interactive Segmentation. Recent advances in interactive segmentation (e.g., [1, 38, 2]) utilize neural networks to convert sparse human inputs into high-quality segments. For novel domains without large-scale training data, block-annotated images can act as cost-efficient seed data to train these models. Once trained, these methods can be applied directly to each block, although further analysis should be conducted to explore the efficiency of such an approach due to block boundaries splitting semantic regions.

4. Segmentation Performance

How well do block annotations serve as training data for semantic segmentation? In section 4.1, the experimental setup is summarized. In section 4.2, we evaluate the effectiveness of block annotations for semantic segmentation. In section 4.3, we compare block annotation with existing weakly supervised segmentation methods.

4.1. Experimental Setup

Pixel Budget. We vary the "pixel budget" in our experiments to explore segmentation performance across a range of available annotated pixels. "Pixel budget" refers to the % of pixels annotated across the training dataset, which can be controlled by varying the number of annotated images, the number of annotated blocks per image, and the size of blocks per image. Our block sizes are fixed in our experiments.

Block Size. We divide images into a 10-by-10 grid for our experiments.

Block Selection. We experiment with two block selection strategies: (a) checkerboard annotation and (b) pseudo-checkerboard annotation. Checkerboard annotation means that every other block in a variable number of images is annotated. Pseudo-checkerboard annotation means that every N-th block is annotated in every image, where N = (# pixels in dataset) / (pixel budget). For example, with a pixel budget equivalent to 25% of the dataset, every fourth block is annotated for the entire dataset. At pixel budget 50%, checkerboard and pseudo-checkerboard are identical. For the remainder of the paper, "Block-X%" refers to pseudo-checkerboard annotation in which X% of the blocks per image are annotated.

Segmentation Model. We use DeepLabv3+ [13] initialized with the official pretrained checkpoint (pretrained on ImageNet [16] + MSCOCO [37] + Pascal VOC [17]). The network is trained for a fixed number of epochs. See supplemental for additional details.

Datasets. Cityscapes is a dataset with ground truth annotations for 19 classes with 2975 training images and 500 validation images. ADE20K contains ground truth annotations for 150 classes with 20210 training images and 2000 validation images. These datasets are chosen for their high-quality dense ground truth annotations and for their differences in number of images / classes and types of scenes represented. The block annotations are synthetically generated from the existing annotations.
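A minimal sketch of the two block selection strategies of section 4.1 over a 10-by-10 grid, assuming the pixel budget is expressed as a fraction of the dataset's pixels and blocks are indexed in row-major order (both assumptions of this example, and the stride rounding is an illustrative choice):

```python
def pseudo_checkerboard(num_blocks: int, budget_fraction: float):
    """Annotate every N-th block in *every* image, N ~ 1 / budget_fraction.
    E.g. budget 0.25 over a 10x10 grid -> blocks 0, 4, 8, ... (25 per image)."""
    stride = max(1, round(1.0 / budget_fraction))
    return [b for b in range(num_blocks) if b % stride == 0]

def checkerboard(rows: int, cols: int):
    """Annotate every other block (50% of blocks) in a subset of images;
    the budget is then controlled by how many images are annotated."""
    return [r * cols + c for r in range(rows) for c in range(cols)
            if (r + c) % 2 == 0]

# Roughly Block-12%: approximately every 8th block of the 10x10 grid.
selected = pseudo_checkerboard(num_blocks=100, budget_fraction=0.12)
```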
4.2. Evaluation

Blocks vs Full Image. How does block annotation compare to full-image annotation for semantic segmentation? We plot the mIOU achieved when trained on a set of annotations against pixel budget in figure 6. For both Cityscapes and ADE20K, block annotation significantly outperforms full-image annotation. The performance gap widens as the pixel budget is decreased – at pixel budget 12%, the reduction in error from full annotation to block annotation is 13% (10%) on Cityscapes (ADE20K). Our results indicate that the quantity of annotated images is more valuable than the quantity of annotations per image. The pseudo-checkerboard block selection pattern consistently outperforms the checkerboard block selection pattern and full annotation. For any pixel budget, pseudo-checkerboard block annotation annotates fewer pixels per image, which means more images are annotated.

Figure 6: Semantic segmentation performance. Training images are annotated with different pixel budgets. Pseudo-checkerboard block annotation outperforms checkerboard and full annotation.

Blocks vs Optimal Performance. How many blocks need to be annotated for segmentation performance to approach the performance achieved by training on full-image annotations for the entire dataset? In table 4, we show results when the network is trained on the full dataset compared to pseudo-checkerboard blocks. Remarkably, we find that checkerboard blocks with 50% pixel budget allow the network to achieve similar performance to the full dataset with 100% pixel budget, indicating that at least 50% of the pixels in Cityscapes and ADE20K are redundant for learning semantic segmentation. Furthermore, with only 12% of the pixels in the dataset annotated, relative error in segmentation performance is within 12%/2% of the optimal for Cityscapes/ADE20K. These results suggest that fewer than 50% of the blocks in an image need to be annotated for training semantic segmentation, reducing the cost of annotation reported in section 3.

Table 4: Semantic segmentation performance (mIOU) when trained on all images. Training with block annotations uses fewer annotated pixels than full annotation but achieves equivalent performance.

               Optimal (Full)   Block-50%   Block-12%
  Cityscapes   77.7             77.7        74.6
  ADE20K       37.4             37.2        36.1

4.3. Weakly Supervised Segmentation Comparison

Block annotation can be considered a form of weakly supervised annotation where a small number of pixels in an image are labelled. Representative works in this area include [36, 7, 46, 45, 15]. Table 3 of [36] is replicated here (table 5) for reference, and extended with our results. All existing results show performance with a VGG-16 based model. We train a MobileNet based model which has been shown to achieve similar performance to VGG-16 (71.8% vs 71.5% Top-1 accuracy on ImageNet) while requiring fewer computational resources [25, 51]. Our fully-supervised implementation pretrained on ImageNet achieves 69.6% mIOU on Pascal VOC 2012 [17]; in comparison, the reference DeepLab-VGG16 model achieves 68.7% mIOU [12] and the re-implementation in [36] achieves 68.5% mIOU.

Table 5: Weakly-supervised segmentation performance. Evaluated on the Pascal VOC 2012 validation set. Original table from [36]. Block-N% indicates N% of image pixels (N pseudo-checkerboard blocks) are labelled.

  Method             Annotations         mIOU (%)
  MIL-FCN [46]       Image-level         25.1
  WSSL [45]          Image-level         38.2
  Point sup. [7]     Point               46.1
  ScribbleSup [36]   Point               51.6
  WSSL [45]          Box                 60.6
  BoxSup [15]        Box                 62.0
  ScribbleSup [36]   Scribble            63.1
  Ours: Block-1%     Pixel-level Block   61.2
  Ours: Block-5%     Pixel-level Block   67.6
  Ours: Block-12%    Pixel-level Block   68.4
  Full Supervision   Pixel-level Image   69.6

Performance Comparison. With only 1% of the pixels annotated, block annotation achieves comparable performance to existing weak supervision methods. Based on our results in section 3.2, the cost of annotation for 1% of pixels with blocks will be 100× less than the cost of full-image annotation. Increasing the budget to 5%-12% significantly increases performance. With 12% of pixels annotated with blocks, the segmentation performance (error) is within 98% (4%) of the segmentation performance (error) with 100% of pixels annotated.

Note that block annotations can be directly transformed into gold-standard fully dense annotations by simply gathering more block annotations within an image. This is not feasible with other annotations such as point clicks, scribbles, and bounding boxes. Furthermore, in section 5, we demonstrate a method to transform block annotations into dense annotations without any additional human effort.
Equal Annotation Time Comparison. Given equal annotation time, block annotation significantly outperforms coarse and scribble annotations by ∼3-4% mIOU (table 6). On Pascal, 97% of full-supervision mIOU is achieved with 1/10 of the annotation time. We convert annotation time to a number of annotated blocks as follows. Block annotation may use up to 2.2× the time of full-image annotation. Given an image divided into 100 blocks, an annotation time of T leads to \frac{T}{0.022\,F} blocks annotated (eq. 5), where F is the full-image annotation time.

Table 6: Weakly-supervised segmentation performance given equal annotation time. For a time comparison of scribbles against other methods, please refer to [36].

  Cityscapes   Ours: Block (7 min)    Coarse (7 min [14])       Full Supervision (90 min [14])
  mIOU (%)     72.1                   68.8                      77.7

  Pascal       Ours: Block (25 sec)   Scribbles (25 sec [36])   Full Supervision (4 min [41])
  mIOU (%)     67.2                   63.1 [36]                 69.6

5. Block-Inpainting Annotations

Although block annotations are useful for learning semantic segmentation, the full structure of images is required for many applications. Understanding the spatial context or affordance relationships [11, 23] between classes relies on understanding the role of each pixel in an image. Shape-based retrieval, object counting [32], or co-occurrence relationships [40] also depend on a global understanding of the image. The naive approach to recover pixel-level labels is to use automatic segmentation to predict labels. However, this does not leverage existing annotations to improve the quality of predicted labels. In section 5.1, we propose a method to inpaint block-annotated images by using annotated blocks as context. In section 5.2, we examine the quality of these inpainted annotations.

5.1. Block-Inpainting Model

The goal of the block-inpainting model is to inpaint labels for unannotated blocks given the labels for annotated blocks in an image. For full implementation details and ablation studies, please refer to the supplemental.

Architecture. The block-inpainting model is based on DeepLabv3+. The input layer is modified so that the RGB image, I ∈ R^{h×w×3}, is concatenated with a multichannel "hint" (à la [62]) of 1-0 class labels W ∈ R^{h×w×K}, where K is the number of classes. At inference time, the hint contains known labels for the annotated blocks of an image which serve as context for the inpainting task. Hidden layers are augmented with dropout, which will be used to control quality by estimating epistemic uncertainty [18, 19].

Estimating Uncertainty. Inpainting fills all missing regions without considering the trade-off between quantity and quality. Existing datasets have high-quality annotations for 92-94% of pixels [64, 11]. Therefore, we modify our network to produce uncertainty estimates which allow us to explicitly control this trade-off. The uncertainty of predictions is correlated with incorrect predictions [31, 29]. Uncertainty is computed by activating dropout at inference time. The predictions are averaged over g trials, giving us U ∈ R^{h×w}, a matrix of uncertainty estimates per image. We take the sample standard deviation corresponding to the predicted class for each pixel to be the uncertainty. For each pixel (i, j), the mean softmax vector over g trials is:

    \mu^{(i,j)} = \frac{1}{g} \sum_{t=1}^{g} p_t^{(i,j)}(y \mid I, W)    (2)

where p(y | I, W) ∈ R^K is the softmax output of the network. The corresponding uncertainty vector is:

    U'^{(i,j)} = \sqrt{ \frac{1}{g-1} \sum_{t=1}^{g} \left( p_t^{(i,j)}(y \mid I, W) - \mu^{(i,j)} \right)^2 }    (3)

Thus, the uncertainty for each pixel (i, j) is:

    U^{(i,j)} = U'^{(i,j)}_m, \quad \text{where } m = \arg\max_k \mu_k^{(i,j)}    (4)

Training. Block annotations serve both as hints and targets. This means that no additional data (or human annotation effort) is required to train the block-inpainting model. For our experiments, we use (synthetically generated) Block-50% annotations. For each image, half of the annotated blocks are randomly selected online at training time to be hints. All of the annotated blocks are used as targets. This encourages the network to "copy-paste" hints in the final output while leveraging the hints as context to inpaint labels for regions where hints are not provided.
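A compact sketch of equations (2)-(4) using Monte Carlo dropout, written in PyTorch-style code under the assumption that the model accepts the concatenated image-plus-hint tensor and keeps its dropout layers active in train mode; the framework and tensor layout are illustrative choices, not the paper's released implementation:

```python
import torch

@torch.no_grad()
def mc_dropout_uncertainty(model, image, hint, g: int = 10):
    """Eqs. (2)-(4): per-pixel uncertainty from g stochastic forward passes.

    image: float tensor (1, 3, h, w); hint: one-hot labels (1, K, h, w) for
    annotated blocks, zeros elsewhere. Returns (pred, uncertainty), each (h, w).
    """
    model.train()  # keep dropout layers active at inference time
    x = torch.cat([image, hint], dim=1)            # (1, 3+K, h, w)
    probs = torch.stack([torch.softmax(model(x), dim=1) for _ in range(g)])
    mu = probs.mean(dim=0)                         # eq. (2): (1, K, h, w)
    std = probs.std(dim=0, unbiased=True)          # eq. (3): sample std dev
    pred = mu.argmax(dim=1)                        # predicted class per pixel
    # eq. (4): std of the predicted class' probability at each pixel
    uncertainty = std.gather(1, pred.unsqueeze(1)).squeeze(1)
    return pred.squeeze(0), uncertainty.squeeze(0)
```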
“Every other pixel” annotations are infeasible in practice. Relative perfor- (a) Full human labels (b) Original image mance of hints with respect to “every other pixel” hints is shown. Checkerboard blocks outperform no hints, random blocks (only boundaries within blocks), and random blocks (full blocks).

Block Inpainting vs Automatic Segmentation. Consider a scenario in which a small number of pixels in a dataset are annotated, and the remainder are automatically labelled to produce dense annotations. Why should block inpainting be used instead of automatic segmentation? Full pixel-level labels produced by block inpainting are superior to automatic segmentation. On Cityscapes, automatic segmentation achieves 78% validation mIOU while block inpainting Block-50% annotations achieves 92% validation mIOU. With Block-12% annotations, automatic segmentation achieves 75% validation mIOU while block inpainting achieves 82% validation mIOU.

Block Selection vs Block-Inpainting Quality. How does the checkerboard pattern compare to other block selection strategies as hints to the block-inpainting model? Intuitively, it is easier to infer labels for pixels that are close to pixels with known labels than for pixels that are further away. Consider a scenario in which every other pixel in an image is annotated. Reasonably good labels for the unannotated pixels can be inferred with a simple nearest-neighbors algorithm. In practice, it is impossible to precisely annotate single pixels in an image. However, we can approximate the same properties of labelling every other pixel by labelling every other block instead (i.e., a checkerboard pattern).

In table 7, we show the block-inpainting model mIOU when different types of hints are given. The rightmost column ("every other pixel") is not feasible to collect in practice. Checkerboard annotations outperform random block annotations even though the network is trained to expect random block hints. Providing only boundary annotations within each block (i.e., annotating pixels within 10 pixels of each boundary in each block) allows the network to achieve nearly the same performance as full block hints. This suggests that the most informative pixels for the block-inpainting model are those near a boundary.

Table 7: Block-inpainting with different types of hints. "Every other pixel" annotations are infeasible in practice. Relative performance of hints with respect to "every other pixel" hints is shown. Checkerboard blocks outperform no hints, random blocks (only boundaries within blocks), and random blocks (full blocks).

              None   Random (Bndy)   Random (Full)   Checker (10x10)   Every oth. pixel
  Rel. mIOU   0.77   0.90            0.92            0.95              1.0
6. Conclusion

In this paper we have introduced block annotation as a replacement for traditional full-image annotation with public crowdworkers. For semantic segmentation, Block-12% offers strong performance at 1/8th of the monetary cost. Block-5% offers competitive weakly-supervised performance at equal annotation time to existing methods. For optimal semantic segmentation performance, or to recover global structure with inpainting, Block-50% should be utilized.

There are many directions for future work. Our crowdworker tasks are similar to full-image annotation tasks, so it may be possible to improve the gains with more exploration and development of boundary marking algorithms. We have explored some block patterns and further exploration may reveal even better trade-offs between annotation quality, cost and image variety. Another interesting direction is acquiring instance-level annotations by merging segments across block boundaries. Active learning can be used to select blocks of rare classes, and workers can be assigned blocks so that annotation difficulty matches worker skill.

Acknowledgements. We acknowledge support from , NSF (CHS-1617861 and CHS-1513967), the PERISCOPE MURI Contract #N00014-17-1-2699, and NSERC (PGS-D). We thank the reviewers for their constructive comments. We appreciate the efforts of MTurk workers who participated in our user studies.

References

[1] David Acuna, Huan Ling, Amlan Kar, and Sanja Fidler. Efficient interactive annotation of segmentation datasets with Polygon-RNN++. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 859–868, 2018.
[2] Eirikur Agustsson, Jasper RR Uijlings, and Vittorio Ferrari. Interactive full image segmentation by considering all regions jointly. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 11622–11631, 2019.
[3] Jiwoon Ahn and Suha Kwak. Learning pixel-level semantic affinity with image-level supervision for weakly supervised semantic segmentation. arXiv preprint arXiv:1803.10464, 2018.
[4] Yağız Aksoy, Tae-Hyun Oh, Sylvain Paris, Marc Pollefeys, and Wojciech Matusik. Semantic soft segmentation. ACM Trans. Graph. (Proc. SIGGRAPH), 37(4):72:1–72:13, 2018.
[5] Mykhaylo Andriluka, Jasper RR Uijlings, and Vittorio Ferrari. Fluid annotation: a human-machine collaboration interface for full image annotation. arXiv preprint arXiv:1806.07527, 2018.
[6] Xue Bai and Guillermo Sapiro. Geodesic matting: A framework for fast interactive image and video segmentation and matting. International Journal of Computer Vision, 82(2):113–132, 2009.
[7] Amy Bearman, Olga Russakovsky, Vittorio Ferrari, and Li Fei-Fei. What's the point: Semantic segmentation with point supervision. In European Conference on Computer Vision (ECCV), pages 549–565. Springer, 2016.
[8] Sean Bell, Paul Upchurch, Noah Snavely, and Kavita Bala. OpenSurfaces: A richly annotated catalog of surface appearance. ACM Transactions on Graphics (TOG), 32(4), 2013.
[9] Sean Bell, Paul Upchurch, Noah Snavely, and Kavita Bala. Material recognition in the wild with the Materials in Context database. In Computer Vision and Pattern Recognition (CVPR), 2015.
[10] Ali Sharifi Boroujerdi, Maryam Khanian, and Michael Breuß. Deep interactive region segmentation and captioning. In Signal-Image Technology & Internet-Based Systems (SITIS), 2017 13th International Conference on, pages 103–110. IEEE, 2017.
[11] Holger Caesar, Jasper Uijlings, and Vittorio Ferrari. COCO-Stuff: Thing and stuff classes in context. In Computer Vision and Pattern Recognition (CVPR). IEEE, 2018.
[12] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4):834–848, 2018.
[13] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In ECCV, 2018.
[14] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The Cityscapes dataset for semantic urban scene understanding. In Computer Vision and Pattern Recognition (CVPR), pages 3213–3223, 2016.
[15] Jifeng Dai, Kaiming He, and Jian Sun. BoxSup: Exploiting bounding boxes to supervise convolutional networks for semantic segmentation. In Proceedings of the IEEE International Conference on Computer Vision, pages 1635–1643, 2015.
[16] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition (CVPR), pages 248–255. IEEE, 2009.
[17] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes (VOC) Challenge. International Journal of Computer Vision (IJCV), 88(2):303–338, June 2010.
[18] Yarin Gal and Zoubin Ghahramani. Bayesian convolutional neural networks with Bernoulli approximate variational inference. arXiv preprint arXiv:1506.02158, 2015.
[19] Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, pages 1050–1059, 2016.
[20] Saurabh Gupta, Pablo Arbelaez, and Jitendra Malik. Perceptual organization and recognition of indoor scenes from RGB-D images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 564–571, 2013.
[21] Kotaro Hara, Abigail Adams, Kristy Milland, Saiph Savage, Chris Callison-Burch, and Jeffrey P Bigham. A data-driven analysis of workers' earnings on Amazon Mechanical Turk. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, page 449. ACM, 2018.
[22] Kotaro Hara, Abigail Adams, Kristy Milland, Saiph Savage, Benjamin V Hanrahan, Jeffrey P Bigham, and Chris Callison-Burch. Worker demographics and earnings on Amazon Mechanical Turk: An exploratory analysis. In Extended Abstracts of the 2019 CHI Conference on Human Factors in Computing Systems, page LBW1217. ACM, 2019.
[23] Mohammed Hassanin, Salman Khan, and Murat Tahtali. Visual affordance and function understanding: A survey. arXiv preprint arXiv:1807.06775, 2018.
[24] Seunghoon Hong, Donghun Yeo, Suha Kwak, Honglak Lee, and Bohyung Han. Weakly supervised semantic segmentation using web-crawled videos. arXiv preprint arXiv:1701.00352, 2017.
[25] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
[26] Ronghang Hu, Piotr Dollár, Kaiming He, Trevor Darrell, and Ross Girshick. Learning to segment every thing. Cornell University arXiv, Ithaca, NY, USA, 2017.
[27] Eric Huang, Haoqi Zhang, David C Parkes, Krzysztof Z Gajos, and Yiling Chen. Toward automatic task design: a progress report. In Proceedings of the ACM SIGKDD Workshop on Human Computation, pages 77–85. ACM, 2010.
[28] Anna Khoreva, Rodrigo Benenson, Jan Hendrik Hosang, Matthias Hein, and Bernt Schiele. Simple does it: Weakly supervised instance and semantic segmentation. In CVPR, volume 1, page 3, 2017.
[29] Nikolay Laptev, Jason Yosinski, Li Erran Li, and Slawek Smyl. Time-series extreme event forecasting with neural networks at Uber. In International Conference on Machine Learning, number 34, pages 1–5, 2017.
[30] Hoang Le, Long Mai, Brian Price, Scott Cohen, Hailin Jin, and Feng Liu. Interactive boundary prediction for object selection. In Proceedings of the European Conference on Computer Vision (ECCV), pages 18–33, 2018.
[31] Christian Leibig, Vaneeda Allken, Murat Seçkin Ayhan, Philipp Berens, and Siegfried Wahl. Leveraging uncertainty information from deep neural networks for disease detection. Scientific Reports, 7(1):17816, 2017.
[32] Victor Lempitsky and Andrew Zisserman. Learning to count objects in images. In Advances in Neural Information Processing Systems, pages 1324–1332, 2010.
[33] Anat Levin, Dani Lischinski, and Yair Weiss. A closed-form solution to natural image matting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(2):228–242, 2008.
[34] Anat Levin, Alex Rav-Acha, and Dani Lischinski. Spectral matting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(10):1699–1712, 2008.
[35] Zhengqi Li and Noah Snavely. CGIntrinsics: Better intrinsic image decomposition through physically-based rendering. In Proceedings of the European Conference on Computer Vision (ECCV), pages 371–387, 2018.
[36] Di Lin, Jifeng Dai, Jiaya Jia, Kaiming He, and Jian Sun. ScribbleSup: Scribble-supervised convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3159–3167, 2016.
[37] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision (ECCV), 2014.
[38] Kevis-Kokitsi Maninis, Sergi Caelles, Jordi Pont-Tuset, and Luc Van Gool. Deep extreme cut: From extreme points to object segmentation. arXiv preprint arXiv:1711.09081, 2017.
[39] Pat Marion, Peter R Florence, Lucas Manuelli, and Russ Tedrake. A pipeline for generating ground truth labels for real RGBD data of cluttered scenes. arXiv preprint arXiv:1707.04796, 2017.
[40] Branislav Mičušík and Jana Košecká. Semantic segmentation of street scenes by superpixel co-occurrence and 3D geometry. In 2009 IEEE 12th International Conference on Computer Vision Workshops, ICCV Workshops, pages 625–632. IEEE, 2009.
[41] Roozbeh Mottaghi, Xianjie Chen, Xiaobai Liu, Nam-Gyu Cho, Seong-Whan Lee, Sanja Fidler, Raquel Urtasun, and Alan Yuille. The role of context for object detection and semantic segmentation in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 891–898, 2014.
[42] Gerhard Neuhold, Tobias Ollmann, Samuel Rota Bulò, and Peter Kontschieder. The Mapillary Vistas dataset for semantic understanding of street scenes. In ICCV, pages 5000–5009, 2017.
ACM, 2004.2 [37] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, [51] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zh- Pietro Perona, Deva Ramanan, Piotr Dollar,´ and C Lawrence moginov, and Liang-Chieh Chen. Mobilenetv2: Inverted Zitnick. Microsoft COCO: Common objects in context. In residuals and linear bottlenecks. In Proceedings of the IEEE European Conference on Computer Vision (ECCV), 2014.1, Conference on Computer Vision and Pattern Recognition, 2,4,5 pages 4510–4520, 2018.6 [38] Kevis-Kokitsi Maninis, Sergi Caelles, Jordi Pont-Tuset, and [52] Tong Shen, Guosheng Lin, Chunhua Shen, and Ian Reid. Luc Van Gool. Deep extreme cut: From extreme points to Bootstrapping the performance of webly supervised semantic object segmentation. arXiv preprint arXiv:1711.09081, 2017. segmentation. In Proceedings of the IEEE Conference on 2,5 Computer Vision and Pattern Recognition, pages 1363–1371, [39] Pat Marion, Peter R Florence, Lucas Manuelli, and Russ 2018.2 Tedrake. A pipeline for generating ground truth labels [53] Zhiyuan Shi, Yongxin Yang, Timothy M Hospedales, and Tao for real rgbd data of cluttered scenes. arXiv preprint Xiang. Weakly-supervised image annotation and segmenta- arXiv:1707.04796, 2017.2 tion with objects and attributes. IEEE transactions on pattern [40] Branislav Micuˇ sˇ´l´ık and Jana Koseckˇ a.´ Semantic segmentation analysis and machine intelligence, 39(12):2525–2538, 2017. of street scenes by superpixel co-occurrence and 3d geometry. 2 In 2009 IEEE 12th International Conference on Computer [54] Shuran Song, Fisher Yu, Andy Zeng, Angel X Chang, Mano- Vision Workshops, ICCV Workshops, pages 625–632. IEEE, lis Savva, and Thomas Funkhouser. Semantic scene comple- 2009.7 tion from a single depth image. Proceedings of 30th IEEE [41] Roozbeh Mottaghi, Xianjie Chen, Xiaobai Liu, Nam-Gyu Conference on Computer Vision and Pattern Recognition, Cho, Seong-Whan Lee, Sanja Fidler, Raquel Urtasun, and 2017.3 Alan Yuille. The role of context for object detection and se- [55] Paul Upchurch, Daniel Sedra, Andrew Mullen, Haym Hirsh, mantic segmentation in the wild. In Proceedings of the IEEE and Kavita Bala. Interactive consensus agreement games for Conference on Computer Vision and Pattern Recognition, labeling images. In AAAI Conference on Human Computation pages 891–898, 2014.1,4,7 and Crowdsourcing (HCOMP), October 2016.2 [56] Peter Welinder, Steve Branson, Pietro Perona, and Serge J Be- longie. The multidimensional wisdom of crowds. In Advances in neural information processing systems, pages 2424–2432, 2010.2 [57] Ning Xu, Brian Price, Scott Cohen, Jimei Yang, and Thomas Huang. Deep grabcut for object selection. arXiv preprint arXiv:1707.00243, 2017.2 [58] Ning Xu, Brian Price, Scott Cohen, Jimei Yang, and Thomas S Huang. Deep interactive object selection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 373–381, 2016.2 [59] Ning Xu, Brian L Price, Scott Cohen, and Thomas S Huang. Deep image matting. In CVPR, volume 2, page 4, 2017.2 [60] Fisher Yu, Wenqi Xian, Yingying Chen, Fangchen Liu, Mike Liao, Vashisht Madhavan, and Trevor Darrell. BDD100K: A diverse driving video database with scalable annotation tooling. arXiv preprint arXiv:1805.04687, 2018.1,2,5 [61] Lishi Zhang, Chenghan Fu, and Jia Li. Collaborative anno- tation of semantic objects in images with multi-granularity supervisions. In 2018 ACM Multimedia Conference on Multi- media Conference, pages 474–482. 
[56] Peter Welinder, Steve Branson, Pietro Perona, and Serge J Belongie. The multidimensional wisdom of crowds. In Advances in Neural Information Processing Systems, pages 2424–2432, 2010.
[57] Ning Xu, Brian Price, Scott Cohen, Jimei Yang, and Thomas Huang. Deep GrabCut for object selection. arXiv preprint arXiv:1707.00243, 2017.
[58] Ning Xu, Brian Price, Scott Cohen, Jimei Yang, and Thomas S Huang. Deep interactive object selection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 373–381, 2016.
[59] Ning Xu, Brian L Price, Scott Cohen, and Thomas S Huang. Deep image matting. In CVPR, volume 2, page 4, 2017.
[60] Fisher Yu, Wenqi Xian, Yingying Chen, Fangchen Liu, Mike Liao, Vashisht Madhavan, and Trevor Darrell. BDD100K: A diverse driving video database with scalable annotation tooling. arXiv preprint arXiv:1805.04687, 2018.
[61] Lishi Zhang, Chenghan Fu, and Jia Li. Collaborative annotation of semantic objects in images with multi-granularity supervisions. In 2018 ACM Multimedia Conference on Multimedia Conference, pages 474–482. ACM, 2018.
[62] Richard Zhang, Jun-Yan Zhu, Phillip Isola, Xinyang Geng, Angela S Lin, Tianhe Yu, and Alexei A Efros. Real-time user-guided image colorization with learned deep priors. ACM Transactions on Graphics (TOG), 9(4), 2017.
[63] Yinda Zhang, Shuran Song, Ersin Yumer, Manolis Savva, Joon-Young Lee, Hailin Jin, and Thomas Funkhouser. Physically-based rendering for indoor scene understanding using convolutional neural networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[64] Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Scene parsing through ADE20K dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, volume 1, page 4. IEEE, 2017.
[65] Aleksandar Zlateski, Ronnachai Jaroensri, Prafull Sharma, and Frédo Durand. On the importance of label quality for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1479–1487, 2018.