Unifying Global-Local Representations in Salient Object Detection with Transformer
Sucheng Ren1, Qiang Wen1, Nanxuan Zhao2, Guoqiang Han1, Shengfeng He1*
1 School of Computer Science and Engineering, South China University of Technology
2 The Chinese University of Hong Kong
*Corresponding author ([email protected]).

Abstract

The fully convolutional network (FCN) has dominated salient object detection for a long period. However, the locality of CNNs requires the model to be deep enough to obtain a global receptive field, and such a deep model always leads to the loss of local details. In this paper, we introduce a new attention-based encoder, the vision transformer, into salient object detection to ensure the globalization of the representations from shallow to deep layers. With a global view already available in very shallow layers, the transformer encoder preserves more local representations to recover the spatial details in the final saliency maps. Besides, as each layer can capture a global view of its previous layer, adjacent layers implicitly maximize the representation differences and minimize the redundant features, so that every output feature of the transformer layers contributes uniquely to the final prediction. To decode features from the transformer, we propose a simple yet effective deeply-transformed decoder. The decoder densely decodes and upsamples the transformer features, generating the final saliency map with less noise injection. Experimental results demonstrate that our method significantly outperforms other FCN-based and transformer-based methods on five benchmarks by a large margin, with an average of 12.17% improvement in terms of Mean Absolute Error (MAE). Code will be available at https://github.com/OliverRensu/GLSTR.

Figure 1: Examples of our method (columns: input, GT, ours, GateNet). We propose the Global-Local Saliency TRansformer (GLSTR) to unify global and local features in each layer. We compare our method with a SOTA method, GateNet [59], which is based on the FCN architecture. Our method localizes salient regions precisely with accurate boundaries.

1. Introduction

Salient object detection (SOD) aims at segmenting the most visually attractive objects in an image, aligning with human perception. It has gained broad attention in recent years due to its fundamental role in many vision tasks, such as image manipulation [27], semantic segmentation [25], autonomous navigation [7], and person re-identification [49, 58].

Traditional methods mainly rely on different priors and hand-crafted features, such as color contrast [6], brightness [11], and foreground-background distinctness [15]. However, these methods have limited representation power and do not consider semantic information, so they often fail to deal with complex scenes. With the development of deep convolutional neural networks (CNNs), the fully convolutional network (FCN) [24] has become an essential building block for SOD [20, 39, 40, 51]. These methods can encode high-level semantic features and better localize the salient regions.

However, due to the nature of locality, methods relying on FCN usually face a trade-off between capturing global and local features. To encode high-level global representations, the model needs to stack many convolutional layers to obtain a larger receptive field, but doing so erases local details. Conversely, the shallow-layer features that retain local saliency cues fail to incorporate high-level semantics.
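As a back-of-the-envelope illustration of this trade-off (our own arithmetic, not a figure from the paper): a stack of L 3×3 convolutions with stride 1 has a receptive field of only

    r_L = 1 + L(k - 1) = 2L + 1,   with k = 3,

so covering, say, a 384×384 input without any downsampling would take roughly 190 layers. In practice, FCNs enlarge the receptive field through pooling and striding instead, which is precisely what dilutes the local details discussed above.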
This discrepancy may make the fusion of global and local features less efficient. Is it possible to unify global and local features in each layer to reduce the ambiguity and enhance the accuracy of SOD?

In this paper, we aim to jointly learn global and local features in a layer-wise manner for solving the salient object detection task. Rather than using a pure CNN architecture, we turn to the transformer [35] for help. We are inspired by the recent success of transformers on various vision tasks [9], owing to their superiority in exploring long-range dependencies. The transformer applies a self-attention mechanism in each layer to learn a global representation. Therefore, it is possible to inject globality into the shallow layers while maintaining local features. As illustrated in Fig. 2, the attention map of the red block has a global view of the chicken and the egg even in the first layer, and the network still attends to small-scale details even in the final (twelfth) layer. More importantly, with the self-attention mechanism, the transformer is capable of modeling the "contrast", which has been demonstrated to be crucial for saliency perception [8, 28, 32].

Figure 2: The visual attention map of the red block in the input image on the first and twelfth layers of the transformer (panels: input, GT, layer 1, layer 12). The attention maps show that features from shallow layers can also carry global information, and features from deep layers can also retain local information.

With the merits mentioned above, we propose a method called the Global-Local Saliency TRansformer (GLSTR). The core of our method is a pure transformer-based encoder and a mixed decoder to aggregate the features generated by the transformers. To encode features through transformers, we first split the input image into a grid of fixed-size patches. We use a linear projection layer to map each image patch to a feature vector representing its local details. By passing the features through the multi-head self-attention module in each transformer layer, the model further encodes global features without diluting the local ones. To decode the features that carry global-local information from the inputs and the previous layers of the transformer encoder, we propose to densely decode the features of each transformer layer. With this dense connection, the representations during decoding still preserve rich local and global features.
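To make this encoding pipeline concrete, below is a minimal PyTorch sketch of a ViT-style patch-embedding encoder that keeps every layer's tokens so a decoder can reuse them. The 16×16 patch size and 12 layers follow Fig. 3; the 768-dim, 12-head setting and all module names are our ViT-Base-style assumptions, not details given in this section.

```python
import torch
import torch.nn as nn

class PatchTransformerEncoder(nn.Module):
    """Illustrative ViT-style encoder: every layer's tokens are kept
    so that a decoder can densely reuse global-local features."""
    def __init__(self, img_size=384, patch_size=16, dim=768,
                 depth=12, heads=12):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # Linear projection of flattened 16x16 patches, done as a strided conv.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size,
                                     stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))
        self.layers = nn.ModuleList([
            nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                       dim_feedforward=4 * dim,
                                       batch_first=True, norm_first=True)
            for _ in range(depth)
        ])

    def forward(self, x):                           # x: (B, 3, H, W)
        tokens = self.patch_embed(x)                # (B, dim, H/16, W/16)
        tokens = tokens.flatten(2).transpose(1, 2)  # (B, N, dim)
        tokens = tokens + self.pos_embed
        features = []
        for layer in self.layers:
            tokens = layer(tokens)                  # global self-attention in every layer
            features.append(tokens)                 # keep all 12 outputs for decoding
        return features
```

Feeding a 384×384 image through this sketch yields twelve (B, 576, 768) token maps, each of which can be reshaped back to a 24×24 spatial grid for decoding.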
To the best of our knowledge, this is the first work to apply a pure transformer-based encoder to the SOD task. To sum up, our contributions are threefold:

• We unify global-local representations with a novel transformer-based architecture, which models long-range dependencies within each layer.

• To take full advantage of the global information of previous layers, we propose a new decoder, the deeply-transformed decoder, to densely decode the features of each layer.

• We conduct extensive evaluations on five widely-used benchmark datasets, showing that our method outperforms state-of-the-art methods by a large margin, with an average of 12.17% improvement in terms of MAE.

2. Related Work

Salient Object Detection. FCN has become the mainstream architecture for saliency detection methods, predicting pixel-wise saliency maps directly [14, 20, 23, 31, 36, 39, 40, 51, 55]. In particular, the skip-connection architecture has demonstrated its effectiveness in combining global and local context information [16, 22, 26, 34, 41, 52, 59]. Hou et al. [16] observed that high-level features are capable of localizing the salient regions while shallower ones are better at preserving low-level details; they therefore introduced short connections between deeper and shallower feature pairs. Based on the U-shape architecture, Zhao et al. [59] introduced a transition layer and a gate into every skip connection to suppress the misleading convolutional information in the original encoder features. To better localize the salient objects, they also built a Fold-ASPP module on top of the encoder. Similarly, Liu et al. [22] inserted a pyramid pooling module on top of the encoder to capture high-level semantic information. To recover the information diluted in the convolutional encoder and to alleviate the coarse-level feature aggregation problem during decoding, they introduced a feature aggregation module after every skip connection. In [57], Zhao et al. adopted the VGG network without fully connected layers as their backbone. They fused the high-level feature into the shallower one at each scale to recover the diluted location information, and after three convolution layers, each fused feature was enhanced by the saliency map. Specifically, they introduced the edge map as a guidance for the shallowest feature and fused this feature into deeper ones to produce the side outputs.

The aforementioned methods usually suffer from the limited extent of the convolutional operation. Although the deep convolution layers have large receptive fields, the low-level information is increasingly diluted as the number of layers grows.

Transformer. Since the local information covered by the convolutional operation is deficient, numerous methods adopt attention mechanisms to capture long-range correlations [1, 17, 37, 43, 56].

Figure 3: The pipeline of our proposed method. We first divide the image into non-overlapping patches and map each patch into a token before feeding it to the transformer layers. After encoding features through 12 transformer layers, we decode each output feature in three successive stages with 8×, 4×, and 2× upsampling, respectively. Each decoding stage contains four layers, and the input of each layer comes from the features of its previous layer together with those of the corresponding transformer layer. (Diagram: patch + position embedding → linear projection → 12 transformer layers, each with layer norm, multi-head attention, and an MLP → deeply-transformed decoder with stages at 1/8, 1/4, and 1/2 resolution built from reshape, convolution, pixel shuffle, and bilinear interpolation, supervised by the saliency GT.)
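The following is a much-simplified PyTorch sketch (ours, not the released implementation) of the dense decoding idea in Fig. 3: the twelve transformer outputs are reshaped into 2D maps and consumed four per stage, and every decoder layer fuses the running feature with its paired transformer feature before each stage doubles the spatial resolution. For brevity it uses plain bilinear upsampling everywhere, whereas the figure also uses pixel shuffle; the channel widths and the deep-to-shallow pairing order are our assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenselyTransformedDecoder(nn.Module):
    """Simplified sketch of dense decoding: 12 transformer features are
    consumed in 3 stages (4 per stage); each decoder layer fuses the running
    feature with one transformer feature reshaped to a 2D map and resized
    to the stage's resolution."""
    def __init__(self, dim=768, width=64, grid=24):
        super().__init__()
        self.grid = grid                          # token grid, e.g. 384 / 16 = 24
        self.reduce = nn.ModuleList(              # 12 projections: dim -> width
            [nn.Conv2d(dim, width, 1) for _ in range(12)])
        self.fuse = nn.ModuleList(                # 12 fusion convs
            [nn.Conv2d(2 * width, width, 3, padding=1) for _ in range(12)])
        self.stem = nn.Conv2d(dim, width, 1)      # initializes the decoder stream
        self.head = nn.Conv2d(width, 1, 1)        # 1-channel saliency prediction

    def forward(self, features):                  # list of 12 (B, N, dim) tensors
        B, N, dim = features[0].shape
        maps = [f.transpose(1, 2).reshape(B, dim, self.grid, self.grid)
                for f in features]
        x = self.stem(maps[-1])                   # start from the deepest feature
        for stage, scale in enumerate((2, 4, 8)): # stages at 1/8, 1/4, 1/2 resolution
            x = F.interpolate(x, scale_factor=2, mode='bilinear',
                              align_corners=False)
            for i in range(4):                    # four decoder layers per stage
                idx = 11 - (stage * 4 + i)        # pair with transformer layers, deep to shallow
                t = self.reduce[idx](maps[idx])
                t = F.interpolate(t, scale_factor=scale, mode='bilinear',
                                  align_corners=False)
                x = self.fuse[idx](torch.cat([x, t], dim=1))
        x = F.interpolate(x, scale_factor=2, mode='bilinear',
                          align_corners=False)    # back to full resolution
        return torch.sigmoid(self.head(x))
```

Given the twelve (B, 576, 768) token maps produced by the encoder sketch above for a 384×384 input, this module returns a full-resolution, single-channel saliency map; the real decoder differs in its exact fusion and upsampling operators, but the layer-to-layer pairing is the point being illustrated.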