Deep Object Co-Segmentation Via Spatial-Semantic Network Modulation
The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20)

Kaihua Zhang,1 Jin Chen,1 Bo Liu,2* Qingshan Liu1
1B-DAT and CICAEET, Nanjing University of Information Science and Technology, Nanjing, China
2JD Finance America Corporation
{zhkhua, kfliubo}@gmail.com
*Corresponding author

Abstract

Object co-segmentation is to segment the shared objects in multiple relevant images, which has numerous applications in computer vision. This paper presents a spatial- and semantic-modulated deep network framework for object co-segmentation. A backbone network is adopted to extract multi-resolution image features. With the multi-resolution features of the relevant images as input, we design a spatial modulator to learn a mask for each image. The spatial modulator captures the correlations of image feature descriptors via unsupervised learning. The learned mask can roughly localize the shared foreground object while suppressing the background. For the semantic modulator, we model it as a supervised image classification task. We propose a hierarchical second-order pooling module to transform the image features for classification use. The outputs of the two modulators manipulate the multi-resolution features by a shift-and-scale operation so that the features focus on segmenting co-object regions. The proposed model is trained end-to-end without any intricate post-processing. Extensive experiments on four image co-segmentation benchmark datasets demonstrate the superior accuracy of the proposed method compared to state-of-the-art methods. The code is available at http://kaihuazhang.net/.

Figure 1: Object co-segmentation examples by our approach. (a) Horse group; (b) Horse group co-segmentation maps.

1 Introduction

As a special case of image object segmentation, object co-segmentation refers to the task of jointly discovering and segmenting the objects shared in a group of images. It has been widely used to support various computer vision applications, such as interactive image segmentation (Kamranian et al. 2018), 3D reconstruction (Mustafa and Hilton 2017) and object co-localization (Wei et al. 2019; Han et al. 2018), to name a few.

Image features that characterize the co-objects in the image group are vital for a co-segmentation task. Conventional approaches use hand-crafted cues such as color histograms, Gabor filter outputs and SIFT descriptors as feature representations (Yuan, Lu, and Shivakumara 2014; Dai et al. 2013; Lee et al. 2015). Such hand-crafted features cannot handle well the challenging cases in co-segmentation, such as background clutter and large-scale appearance variations of the co-objects across images. In recent years, deep-learning-based co-segmentation methods have attracted much attention. For example, (Li et al. 2018; Chen et al. 2018) leverage a Siamese network architecture for object co-segmentation, with an attention mechanism used to enhance the co-object feature representations. These methods have shown superior performance compared to the traditional methods (Yuan, Lu, and Shivakumara 2014; Dai et al. 2013; Lee et al. 2015), which inspires us to explore a deep-learning-based solution to object co-segmentation.

One critical property of object co-segmentation is that the co-objects in the images belong to the same semantic category, and they usually occupy only part of each image. One illustrative example is shown in Figure 1. It is therefore desirable that the deep convolutional layers used as a feature extractor are targeted at modelling the co-objects. To this end, we propose a spatial-semantic modulated network structure to model this property. The two modulators are achieved by the designed group-wise mask learning branch and the co-category classification branch, respectively.
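To make concrete how the two modulator outputs can manipulate features through a shift-and-scale operation (as described in the abstract), the following is a minimal PyTorch-style sketch of feature-wise modulation. The tensor shapes, module names, and the particular split of roles between the spatial mask and the semantic vector are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ShiftScaleModulation(nn.Module):
    """Feature-wise shift-and-scale modulation (a minimal sketch).

    In this sketch the semantic modulator supplies a channel-wise scale and the
    spatial modulator supplies a pixel-wise shift; how the two modulator outputs
    are actually combined in the paper is not reproduced here, so this split is
    only an illustrative assumption.
    """

    def __init__(self, embed_dim: int, num_channels: int):
        super().__init__()
        self.to_scale = nn.Linear(embed_dim, num_channels)          # semantic vector -> per-channel scale
        self.to_shift = nn.Conv2d(1, num_channels, kernel_size=1)   # coarse mask -> per-pixel shift

    def forward(self, feats, semantic_vec, spatial_mask):
        # feats:        (N, C, H, W) multi-resolution backbone features
        # semantic_vec: (N, D) output of the semantic (co-category) branch
        # spatial_mask: (N, 1, H, W) coarse co-object mask from the spatial branch
        scale = self.to_scale(semantic_vec).unsqueeze(-1).unsqueeze(-1)  # (N, C, 1, 1)
        shift = self.to_shift(spatial_mask)                              # (N, C, H, W)
        return feats * (1.0 + scale) + shift   # scale-and-shift the features toward the co-objects
```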
We summarize the technical contributions of this work as follows:

• We propose a spatial-semantic modulated deep network for object co-segmentation. Image features extracted by a backbone network are used to learn a spatial modulator and a semantic modulator. The outputs of the modulators guide the up-sampling of the image features to generate the co-segmentation results. The network parameter learning is formulated as a multi-task learning problem, and the whole network is trained in an end-to-end manner.

• For the spatial modulation branch, an unsupervised learning method is proposed to learn a mask for each image. With the fused multi-resolution image features as input, we formulate the mask learning as an integer programming problem whose continuous relaxation has a closed-form solution. The learned parameter indicates whether the corresponding image pixel belongs to the foreground or the background.

• In the semantic modulation branch, we design a hierarchical second-order pooling (HSP) operator to transform the convolutional features for object classification. Second-order pooling (SP) has been shown to capture high-order feature statistical dependencies (Gao et al. 2019). The proposed HSP module stacks two SP layers, which are dedicated to capturing the long-range channel-wise dependencies of the holistic feature representation. The output of the HSP module is fed into a fully-connected layer for object classification and used as the semantic modulator (a rough sketch of such a pooling block is given after this list).
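As a rough illustration of second-order pooling and of stacking two such layers, here is a minimal PyTorch-style sketch in the spirit of global second-order pooling (Gao et al. 2019). The reduction width, the gating head, and the way the two SP blocks are chained are assumptions made for illustration; the paper's exact HSP design is not reproduced here.

```python
import torch
import torch.nn as nn

class SecondOrderPooling(nn.Module):
    """Generic second-order (covariance) pooling block, hedged sketch.

    A channel-wise covariance matrix summarizes pairwise channel statistics and
    is mapped to a per-channel gating vector that re-weights the input features.
    """

    def __init__(self, in_channels: int, reduced: int = 128):
        super().__init__()
        self.reduce = nn.Conv2d(in_channels, reduced, kernel_size=1)
        self.gate = nn.Sequential(
            nn.Linear(reduced * reduced, in_channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        # x: (N, C, H, W)
        n, c, h, w = x.shape
        z = self.reduce(x).flatten(2)                       # (N, C', H*W)
        z = z - z.mean(dim=2, keepdim=True)                 # zero-mean per channel
        cov = torch.bmm(z, z.transpose(1, 2)) / (h * w)     # (N, C', C') channel covariance
        gates = self.gate(cov.flatten(1))                   # (N, C) channel-wise dependencies
        return x * gates.view(n, c, 1, 1)                   # re-weight the input features


class HierarchicalSecondOrderPooling(nn.Module):
    """Two stacked SP blocks, loosely following the paper's HSP idea."""

    def __init__(self, in_channels: int):
        super().__init__()
        self.sp1 = SecondOrderPooling(in_channels)
        self.sp2 = SecondOrderPooling(in_channels)

    def forward(self, x):
        return self.sp2(self.sp1(x))
```

In the paper, the representation produced by the HSP module is then passed to a fully-connected layer whose classification output serves as the semantic modulator; only the pooling part is sketched above.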
We conduct extensive evaluations on four object co-segmentation benchmark datasets (Faktor and Irani 2013; Rubinstein et al. 2013), including the sub-set of MSRC, Internet, the sub-set of iCoseg and the PASCAL-VOC dataset. The proposed model achieves significantly higher accuracy than state-of-the-art methods. In particular, on the most challenging PASCAL-VOC dataset, our method outperforms the second best-performing state-of-the-art approach (Li et al. 2018) by 6% in terms of the average Jaccard index J.

The rest of this work is organized as follows: in § 2, we introduce the works related to our study; § 3 describes the proposed framework and its main components; afterwards, comprehensive experimental evaluations are presented in § 4; finally, we conclude this work in § 5.

2 Related Work

2.1 Object Co-segmentation

A more comprehensive literature review on image co-segmentation can be found in (Zhu et al. 2016). Existing object co-segmentation methods can be roughly grouped into four categories: graph-based models, saliency-based models, joint processing models and deep learning models. Conventional approaches such as (Yuan, Lu, and Shivakumara 2014; Collins et al. 2012; Lee et al. 2015) assume the pixels or superpixels of the co-objects can be grouped together, and they formulate co-segmentation as a clustering task to search for the co-objects. Saliency-detection-based methods assume the regions of interest in the images are usually the co-objects to be segmented, and they conduct image co-segmentation through detecting the re- […]. Joint processing models solve co-segmentation together with a related task: the two tasks are facilitated by each other, and benefit each other synergistically. The idea of joint processing can exploit the inherent interdependencies of the two tasks to achieve better results jointly. Recently, (Li et al. 2018; Chen et al. 2018) respectively propose end-to-end deep-learning-based methods for object co-segmentation using a Siamese encoder-decoder architecture and a semantic-aware attention mechanism.

2.2 Network Modulation

Network modulation has proved to be an effective way to manipulate network parameter learning. The modulator can be modelled as parameters or outputs of an auxiliary branch that are used to guide the parameter learning of the main branch. In the segmentation method of (Dai, He, and Sun 2015), an image mask is used as a modulator for background removal. In the Mask R-CNN model (He et al. 2017), a classification branch is used to guide the learning of the segmentation branch. Feature-wise linear modulation is a widely-used scheme, which has been applied to object detection (Lin et al. 2017) and graph neural network learning (Brockschmidt 2019). In visual reasoning problems, network modulation is used to encode language information (De Vries et al. 2017; Perez et al. 2018). The attention module in the image captioning model of (Chen et al. 2017) can be viewed as a modulator. (Yang et al. 2018) proposes to model the visual and spatial information by a modulator for video object segmentation. In (Flores et al. 2019), a saliency detection branch is added to an existing CNN architecture as a modulator for fine-grained object recognition. A cross-modulation mechanism is proposed in (Prol, Dumoulin, and Herranz 2018) for few-shot learning.

3 Proposed Approach

3.1 Problem Formulation

Figure 2 presents an overview of our model. Given a group of N images $\mathcal{I} = \{I^n\}_{n=1}^{N}$ containing co-objects of a specific category, our objective is to learn a feed-forward network f that produces a set of object co-segmentation masks $\mathcal{M} = \{M^n\}_{n=1}^{N}$:

$$\mathcal{M} = f(\mathcal{I}; \theta), \qquad (1)$$

where θ denotes the network parameters to be optimized. The network f is composed of three sub-networks: the spatial modulation sub-net $f_{spa}$, the semantic modulation sub-net $f_{sem}$ and the segmentation sub-net $f_{seg}$. The renowned SPP-Net (He et al. 2015) has shown that the convolutional feature maps (CFMs) for object recognition encode both the spatial layouts of objects (by their positions) and their semantics (by the strengths of their activations). Inspired by this model, we design $f_{spa}$ and $f_{sem}$ to encode the spatial and semantic information of the co-objects in $\mathcal{I}$, respectively. The two modulators guide the learning of the convolution layers in $f_{seg}$ to focus on the co-objects in the images. Specifically, the sub-net $f_{spa}$ learns a mask for each image to coarsely localize the co-object in it.
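A skeleton of how Eq. (1) composes the three sub-nets may help fix ideas. The sub-net interfaces below (what each one takes and returns, and the returned classification logits for the multi-task loss) are assumptions used only to show the composition, not the paper's implementation.

```python
import torch
import torch.nn as nn

class CoSegmentationNet(nn.Module):
    """Skeleton of Eq. (1), M = f(I; theta), assuming plug-in sub-networks."""

    def __init__(self, backbone, f_spa, f_sem, f_seg):
        super().__init__()
        self.backbone = backbone   # multi-resolution feature extractor
        self.f_spa = f_spa         # spatial modulation sub-net -> coarse masks
        self.f_sem = f_sem         # semantic modulation sub-net -> semantic modulator + class scores
        self.f_seg = f_seg         # segmentation sub-net (modulated decoder)

    def forward(self, images):
        # images: (N, 3, H, W), the image group I = {I^n}_{n=1..N}
        feats = self.backbone(images)                 # multi-resolution CFMs
        spatial_mask = self.f_spa(feats)              # coarse localization of the co-objects
        semantic_vec, class_logits = self.f_sem(feats)  # semantic modulator + co-category scores
        masks = self.f_seg(feats, spatial_mask, semantic_vec)  # co-segmentation masks M = {M^n}
        return masks, class_logits  # masks feed the segmentation loss, logits the classification loss
```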