Unifying Global-Local Representations in Salient Object Detection with Transformer

Sucheng Ren1, Qiang Wen1, Nanxuan Zhao2, Guoqiang Han1, Shengfeng He1*
1 School of Computer Science and Engineering, South China University of Technology
2 The Chinese University of Hong Kong
*Corresponding author ([email protected]).

Abstract

The fully convolutional network (FCN) has dominated salient object detection for a long period. However, the locality of CNN requires the model to be deep enough to obtain a global receptive field, and such a deep model always leads to the loss of local details. In this paper, we introduce a new attention-based encoder, the vision transformer, into salient object detection to ensure the globalization of the representations from shallow to deep layers. With a global view even in very shallow layers, the transformer encoder preserves more local representations to recover the spatial details in the final saliency maps. Besides, as each layer can capture a global view of its previous layer, adjacent layers implicitly maximize the representation differences and minimize the redundant features, so that every output feature of the transformer layers contributes uniquely to the final prediction. To decode features from the transformer, we propose a simple yet effective deeply-transformed decoder. The decoder densely decodes and upsamples the transformer features, generating the final saliency map with less noise injection. Experimental results demonstrate that our method significantly outperforms other FCN-based and transformer-based methods on five benchmarks by a large margin, with an average of 12.17% improvement in terms of Mean Absolute Error (MAE). Code will be available at https://github.com/OliverRensu/GLSTR.

Figure 1: Examples of our method. We propose the Global-Local Saliency TRansformer (GLSTR) to unify global and local features in each layer. We compare our method with a SOTA method, GateNet [59], based on the FCN architecture. Our method can localize the salient region precisely with an accurate boundary. (Panels: Input, GT, Ours, GateNet.)

1. Introduction

Salient object detection (SOD) aims at segmenting the most visually attractive objects of an image, aligning with human perception. It has gained broad attention in recent years due to its fundamental role in many vision tasks, such as image manipulation [27], semantic segmentation [25], autonomous navigation [7], and person re-identification [49, 58].

Traditional methods mainly rely on different priors and hand-crafted features like color contrast [6], brightness [11], and foreground-background distinctness [15]. However, these methods have limited representation power and do not consider semantic information, so they often fail in complex scenes. With the development of deep convolutional neural networks (CNNs), the fully convolutional network (FCN) [24] has become an essential building block for SOD [20, 39, 40, 51]. These methods can encode high-level semantic features and better localize the salient regions.

However, due to the nature of locality, methods relying on FCN usually face a trade-off between capturing global and local features. To encode high-level global representations, the model needs a stack of convolutional layers to obtain larger receptive fields, but this erases local details. To retain a local understanding of saliency, features from shallow layers can be used, yet they fail to incorporate high-level semantics. This discrepancy may make the fusion of global and local features less efficient. Is it possible to unify global and local features in each layer to reduce the ambiguity and enhance the accuracy of SOD?

In this paper, we aim to jointly learn global and local features in a layer-wise manner for solving the salient object detection task. Rather than using a pure CNN architecture, we turn to the transformer [35] for help. We are inspired by the recent success of the transformer on various vision tasks [9] for its superiority in exploring long-range dependency. The transformer applies a self-attention mechanism in each layer to learn a global representation. Therefore, it is possible to inject globality into the shallow layers while maintaining local features. As illustrated in Fig. 2, the attention map of the red block has a global view of the chicken and the egg even in the first layer, and the network still attends to small-scale details even in the final (twelfth) layer. More importantly, with the self-attention mechanism, the transformer is capable of modeling the "contrast", which has been demonstrated to be crucial for saliency perception [8, 28, 32].

Figure 2: The visual attention map of the red block in the input image on the first and twelfth layers of the transformer. The attention maps show that features from a shallow layer can also carry global information, and features from a deep layer can also carry local information. (Panels: Input, GT, Layer 1, Layer 12.)

With the merits mentioned above, we propose a method called Global-Local Saliency TRansformer (GLSTR). The core of our method is a pure transformer-based encoder and a mixed decoder to aggregate the features generated by the transformer. To encode features through the transformer, we first split the input image into a grid of fixed-size patches, and use a linear projection layer to map each image patch to a feature vector representing local details. By passing the features through the multi-head self-attention module in each transformer layer, the model further encodes global features without diluting the local ones. To decode the features, which carry global-local information about the input and the previous layer of the transformer encoder, we propose to densely decode the features of every transformer layer. With this dense connection, the representations during decoding still preserve rich local and global features. To the best of our knowledge, we are the first to apply a pure transformer-based encoder to the SOD task.

To sum up, our contributions are threefold:

• We unify global-local representations with a novel transformer-based architecture, which models long-range dependency within each layer.

• To take full advantage of the global information of previous layers, we propose a new decoder, the deeply-transformed decoder, to densely decode the features of each layer.

• We conduct extensive evaluations on five widely-used benchmark datasets, showing that our method outperforms state-of-the-art methods by a large margin, with an average of 12.17% improvement in terms of MAE.

2. Related Work

Salient Object Detection. FCN has become the mainstream architecture for saliency detection methods, predicting pixel-wise saliency maps directly [14, 20, 23, 31, 36, 39, 40, 51, 55]. In particular, the skip-connection architecture has demonstrated its effectiveness in combining global and local context information [16, 22, 26, 34, 41, 52, 59]. Hou et al. [16] observed that high-level features are capable of localizing the salient regions while shallower ones are better at preserving low-level details; they therefore introduced short connections between deeper and shallower feature pairs. Based on the U-shape architecture, Zhao et al. [59] introduced a transition layer and a gate into every skip connection to suppress the misleading convolutional information in the original encoder features. To better localize the salient objects, they also built a Fold-ASPP module on top of the encoder. Similarly, Liu et al. [22] inserted a pyramid pooling module on top of the encoder to capture high-level semantic information. To recover the information diluted in the convolutional encoder and to alleviate the coarse-level feature aggregation problem during decoding, they introduced a feature aggregation module after every skip connection.
In [57], Zhao et al. adopted the VGG network without fully connected layers as their backbone. They fused the high-level features into the shallower ones at each scale to recover the diluted location information, and after three convolution layers each fused feature was enhanced by the saliency map. Specifically, they introduced an edge map as guidance for the shallowest feature and fused this feature into deeper ones to produce the side outputs. The aforementioned methods usually suffer from the limited coverage of the convolutional operation: although the receptive fields become large in the deep convolution layers, the low-level information is increasingly diluted as the number of layers grows.


Figure 3: The pipeline of our proposed method. We first divide the image into non-overlapping patches and map each patch to a token before feeding it into the transformer layers. After encoding features through 12 transformer layers, we decode each output feature in three successive stages with 8x, 4x, and 2x upsampling, respectively. Each decoding stage contains four layers, and the input of each layer comes from the features of its previous layer together with the corresponding transformer layer.

Transformer. Since the local information covered by the convolutional operation is deficient, numerous methods adopt attention mechanisms to capture long-range correlations [1, 17, 37, 43, 56]. The transformer, proposed by Vaswani et al. [35], demonstrated the power of global attention on the machine translation task. Recently, more and more works have explored the ability of the transformer in computer vision. Dosovitskiy et al. [9] proposed to apply a pure transformer directly to sequences of image patches for image classification. Carion et al. [2] proposed a transformer encoder-decoder architecture for object detection: by collapsing the CNN features into a sequence of learned object queries, the transformer can dig out the correlations between objects and the global image context. Zeng et al. [50] proposed a spatial-temporal transformer network to tackle video inpainting along the spatial and temporal channels jointly.

The most related work to ours is [60], which replaces the convolution layers with a pure transformer during encoding for the semantic segmentation task. Unlike this work, we propose a transformer-based U-shape architecture that is more suitable for dense prediction tasks.
With our dense skip-connections and progressive upsampling strategy, the proposed method can sufficiently associate the decoding features with the global-local ones from the transformer.

3. Method

In this paper, we propose the Global-Local Saliency TRansformer (GLSTR), which models local and global representations jointly in each encoder layer and decodes all transformer features with gradual spatial resolution recovery. In the following subsections, we first review how fully convolutional network-based salient object detection methods extract local and global features in Sec. 3.1. We then detail how the transformer extracts global and local features in each layer in Sec. 3.2. Finally, we specify different decoders, including the naive decoder, the stage-by-stage decoder, the multi-level feature aggregation decoder [60], and our proposed decoder. The framework of our proposed method is shown in Fig. 3.

3.1. FCN-based Salient Object Detection

We first revisit how traditional FCN-based salient object detection methods extract global-local representations. Normally, a CNN-based encoder (e.g., VGG [34] or ResNet [13]) takes an image as input. Due to the locality of CNN, a stack of convolution layers is applied to the input image, and the spatial resolution of the feature maps is gradually reduced at the end of each stage. Such an operation extends the receptive field of the encoder and reaches a global view in the deep layers. This kind of encoder faces three serious problems: 1) features from deep layers have a global view but diluted local details, due to too many convolution operations and resolution reductions; 2) features from shallow layers contain more local details but lack the semantic understanding needed to identify salient objects, due to the locality of the convolution layers; 3) features from adjacent layers have limited variance and lead to redundancy when they are densely decoded.

Recent work focuses on strengthening the ability to extract either local details or global semantics. To learn local details, previous works [42, 46, 57] focus on features from shallow layers by adding edge supervision; such low-level features lack a global view and may misguide the whole network. To learn global semantics, attention mechanisms are used to enlarge the receptive field. However, previous works [4, 55] only apply the attention mechanism to deep layers (e.g., the last layer of the encoder). In the following subsections, we demonstrate how to preferably unify local and global features using the transformer.

3.2. Transformer Encoder

3.2.1 Image Serialization

Since the transformer is designed for 1D sequences in natural language processing tasks, we first map the input 2D image into a 1D sequence. Specifically, given an image x in R^{H x W x c} with height H, width W, and channel c, a 1D sequence x' in R^{L x C} is encoded from the image x. A simple idea is to directly reshape the image into a sequence, namely L = H x W. However, due to the quadratic spatial complexity of self-attention, such an idea leads to enormous GPU memory consumption and computation cost. Inspired by ViT [9], we divide the image x into HW/256 non-overlapping image patches with a resolution of 16x16, and then project these patches into tokens with a linear layer, so the sequence length L is HW/256. Every token in x' represents a non-overlapping 16x16 patch. Such an operation trades off the spatial information against the sequence length.
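To make the serialization step concrete, the sketch below implements a ViT-style patch embedding in PyTorch. The patch size of 16 follows the paper; the embedding dimension of 768 and all module and variable names are assumptions of this sketch rather than the authors' released code.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Map an H x W x 3 image to a sequence of HW/256 tokens (Sec. 3.2.1)."""
    def __init__(self, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        # A stride-16 convolution is equivalent to cutting non-overlapping
        # 16x16 patches and applying a shared linear projection to each patch.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                           # x: (B, 3, H, W)
        tokens = self.proj(x)                       # (B, C, H/16, W/16)
        tokens = tokens.flatten(2).transpose(1, 2)  # (B, L, C) with L = HW/256
        return tokens

x = torch.randn(1, 3, 384, 384)
print(PatchEmbedding()(x).shape)                    # torch.Size([1, 576, 768])
```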
3.2.2 Transformer

Taking the 1D sequence x' as input, we use a pure transformer as the encoder instead of the CNNs used in previous works [29, 30, 44]. Rather than having the limited receptive field of a CNN-based encoder, the transformer models global representations in each layer and thus requires far fewer layers to obtain a global receptive field. Besides, the transformer encoder has connections across each pair of image patches. Based on these two characteristics, more local detailed information is preserved in our transformer encoder. Specifically, the transformer encoder consists of a positional encoding and transformer layers with multi-head self-attention and a multi-layer perceptron.

Positional Encoding. Since the attention mechanism cannot distinguish positional differences, the first step is to feed positional information into the sequence x' to obtain the position-enhanced feature F_0:

    F_0 = [x'_0, x'_1, ..., x'_L] + E_pos,  (1)

where + is element-wise addition and E_pos is the positional code, which is randomly initialized under a truncated Gaussian distribution and is trainable in our method.

Transformer Layer. The transformer encoder contains 12 transformer layers, and each layer contains multi-head self-attention (MSA) and a multi-layer perceptron (MLP). Multi-head self-attention is the extension of self-attention (SA):

    Q = F W_q,  K = F W_k,  V = F W_v,
    SA(F) = φ(Q K^T / √d) V,  (2)

where F denotes the input features of the self-attention and W_q, W_k, W_v are trainable weights. d is the dimension of Q, K, V, and φ(·) is the softmax function. To apply multiple attentions in parallel, multi-head self-attention uses m independent self-attentions:

    MSA(F) = SA_1(F) ⊕ SA_2(F) ⊕ ... ⊕ SA_m(F),  (3)

where ⊕ denotes concatenation. To sum up, in the i-th transformer layer, the output feature F_i is calculated by:

    F̂_i = MSA(LN(F_{i-1})) + F_{i-1},
    F_i = MLP(LN(F̂_i)) + F̂_i,  (4)

where LN(·) denotes layer normalization and F_i is the output feature of the i-th transformer layer.
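A minimal pre-norm transformer layer matching Eq. (4) can be sketched as follows. The number of heads, the MLP ratio, and the use of torch.nn.MultiheadAttention are assumptions of this sketch; Eq. (1) is indicated only as a comment.

```python
import torch
import torch.nn as nn

class TransformerLayer(nn.Module):
    """One encoder layer of Eq. (4): pre-norm multi-head self-attention and MLP,
    both with residual connections."""
    def __init__(self, dim=768, num_heads=12, mlp_ratio=4.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)  # Eq. (2)-(3)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, int(dim * mlp_ratio)),
            nn.GELU(),
            nn.Linear(int(dim * mlp_ratio), dim),
        )

    def forward(self, x):                                   # x: (B, L, C), i.e., F_{i-1}
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]   # F_hat_i
        x = x + self.mlp(self.norm2(x))                     # F_i
        return x

# Eq. (1): a trainable positional code added to the token sequence before layer 1, e.g.
# pos_embed = nn.Parameter(torch.zeros(1, 576, 768)); F0 = tokens + pos_embed
tokens = torch.randn(1, 576, 768)
print(TransformerLayer()(tokens).shape)                     # torch.Size([1, 576, 768])
```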

3.3. Various Decoders

In this section, we design a global-local decoder to decode the features F_i in R^{HW/256 x C} from the transformer encoder and generate the final saliency map S' in R^{H x W} with the same resolution as the input. Before exploring the ability of our global-local decoder, we briefly introduce the naive decoder, the stage-by-stage decoder, and the multi-level feature aggregation decoder used in [60] (see Fig. 4). To begin with, we reshape the features F_i in R^{HW/256 x C} from the transformer into F'_i in R^{H/16 x W/16 x C}.

Naive Decoder. As illustrated in Fig. 4(a), a naive way of decoding the transformer features is to directly upsample the output of the last layer to the input resolution and generate the saliency map. Specifically, three conv-batch norm-ReLU layers are applied to F'_12, the feature maps are bilinearly upsampled 16x, and a simple classification layer follows:

    Z = Up(CBR(F'_12; θ_z); 16),
    S' = σ(Conv(Z; θ_s)),  (5)

where CBR(·; θ) denotes a convolution-batch norm-ReLU layer with parameters θ, Up(·; s) is bilinear upsampling by a factor of s, and σ is the sigmoid function. S' is the predicted saliency map.
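As a reference point, the naive decoder of Eq. (5) is only a few lines. The sketch below is an illustrative implementation in which the intermediate channel width (64) is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def cbr(in_ch, out_ch):
    """Conv-BatchNorm-ReLU block, the CBR(*) operator used in Eq. (5)-(7)."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class NaiveDecoder(nn.Module):
    """Eq. (5): three CBR layers on the reshaped last transformer feature F'_12,
    a single 16x bilinear upsampling, and a 1-channel prediction head."""
    def __init__(self, in_dim=768, mid_dim=64):
        super().__init__()
        self.cbr = nn.Sequential(cbr(in_dim, mid_dim), cbr(mid_dim, mid_dim), cbr(mid_dim, mid_dim))
        self.head = nn.Conv2d(mid_dim, 1, kernel_size=1)

    def forward(self, f12):                    # f12: (B, C, H/16, W/16)
        z = F.interpolate(self.cbr(f12), scale_factor=16,
                          mode='bilinear', align_corners=False)
        return torch.sigmoid(self.head(z))     # S': (B, 1, H, W)
```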

Figure 4: Different types of decoders: (a) the naive decoder, (b) the stage-by-stage decoder, (c) the multi-level feature aggregation decoder, and (d) our decoder, which densely decodes all transformer features and gradually upsamples them to the resolution of the input.

Stage-by-Stage Decoder. Directly upsampling 16x in one step, as in the naive decoder, injects too much noise and leads to coarse spatial details. Inspired by previous FCN-based methods [30] in salient object detection, a stage-by-stage decoder that upsamples the resolution 2x in each stage mitigates the loss of spatial details, as shown in Fig. 4(b). There are four stages, and in each stage a decoder block contains three conv-batch norm-ReLU layers:

    Z_0 = CBR(F'_12; θ_0),
    Z_i = CBR(Up(Z_{i-1}; 2); θ_i),  (6)
    S' = σ(Conv(Z_4; θ_s)).

Multi-level Feature Aggregation. As illustrated in Fig. 4(c), Zheng et al. [60] propose multi-level feature aggregation to sparsely fuse multi-level features, following a setting similar to a feature pyramid network. However, unlike a pyramid, the spatial resolutions of the features stay the same in the transformer encoder. Specifically, the decoder takes the features F_3, F_6, F_9, F_12 from the 3rd, 6th, 9th, and 12th layers and upsamples them 4x after several convolutional layers. The features are then aggregated top-down via element-wise addition, and finally the features from all streams are fused to generate the final saliency map.

Deeply-transformed Decoder. As illustrated in Fig. 4(d), to combine the benefits of the features of every transformer layer while injecting less noise during upsampling, we gradually integrate all transformer layer features and upsample them to the spatial resolution of the input image. Specifically, we divide the transformer features into three stages and upsample them 8x, 4x, and 2x from the first to the third stage, respectively. Note that the upsampling operation consists of pixel shuffle (2^{3-n}x) and bilinear upsampling (2x), where n indicates the n-th stage. In the j-th layer of the i-th stage, the salient feature Z_{i,j} comes from the concatenation of the transformer feature F'_{i·4+j} of the corresponding layer and the salient feature from the previous layer. A convolution block is applied after upsampling F'_12 to obtain Z_{3,4}:

    Z_{i,j} = CBR(Z_{i,j-1} ⊕ Up(F'_{i·4+j}; 2^i); θ_{i,j}),           j = 1, 2, 3,
    Z_{i,j} = CBR(Up(Z_{i-1,j}; 2) ⊕ Up(F'_{i·4+j}; 2^i); θ_{i,j}),    j = 4.  (7)

Then we simply apply a convolutional operation on the salient features Z_{i,j} to generate the saliency maps S'_{i,j}. Therefore, our method produces twelve output saliency maps:

    S'_{i,j} = σ(Conv(Up(Z_{i,j}; 2); θ_{s_{i,j}})).  (8)
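The sketch below illustrates the dense decoding of Eq. (7)-(8) in a simplified form: three stages of four decoder layers, each layer fusing its transformer feature with the running salient feature and emitting a supervised side output. It uses bilinear upsampling only (the paper additionally uses pixel shuffle), pairs transformer features with decoder layers from deep to shallow, and picks channel widths arbitrarily, so it should be read as an approximation of the idea rather than the released architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def cbr(cin, cout):
    # Conv-BatchNorm-ReLU block, the CBR(*) operator of Eq. (5)-(7).
    return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1),
                         nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

class DeeplyTransformedDecoder(nn.Module):
    def __init__(self, in_dim=768, mid_dim=64):
        super().__init__()
        self.entry = cbr(in_dim, mid_dim)    # builds the first salient feature from F'_12 alone
        self.blocks = nn.ModuleList(cbr(in_dim + mid_dim, mid_dim) for _ in range(12))
        self.heads = nn.ModuleList(nn.Conv2d(mid_dim, 1, 1) for _ in range(12))

    def forward(self, feats, img_size):
        # feats: 12 reshaped transformer features (B, C, H/16, W/16), ordered shallow -> deep.
        H, W = img_size
        z = self.entry(F.interpolate(feats[-1], size=(H // 8, W // 8),
                                     mode='bilinear', align_corners=False))
        side_maps = []
        for s, scale in enumerate((8, 4, 2)):                # stages at 1/8, 1/4, 1/2 resolution
            size = (H // scale, W // scale)
            z = F.interpolate(z, size=size, mode='bilinear', align_corners=False)
            for j in range(4):                               # four decoder layers per stage
                k = s * 4 + j
                f = F.interpolate(feats[11 - k], size=size,  # pair features deep -> shallow
                                  mode='bilinear', align_corners=False)
                z = self.blocks[k](torch.cat([f, z], dim=1))
                side = F.interpolate(self.heads[k](z), scale_factor=2,  # Eq. (8): per-layer output
                                     mode='bilinear', align_corners=False)
                side_maps.append(torch.sigmoid(side))
        return side_maps            # 12 supervised maps; the last one is at full resolution

feats = [torch.randn(1, 768, 24, 24) for _ in range(12)]
maps = DeeplyTransformedDecoder()(feats, (384, 384))
print(len(maps), maps[-1].shape)    # 12 torch.Size([1, 1, 384, 384])
```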

4. Experiment

4.1. Implementation Details

We train our model on DUTS following the same setting as [57]. The transformer encoder is pretrained on ImageNet [9], and the remaining layers are randomly initialized following the default settings in PyTorch. We use SGD with momentum as the optimizer, with momentum = 0.9 and weight decay = 0.0005. The learning rate starts from 0.001 and gradually decays to 1e-5. We train our model for 40 epochs with a batch size of 8. We adopt vertical and horizontal flipping as data augmentation. The input images are resized to 384x384. During inference, we take the output of the last decoder layer as the final prediction.

4.2. Datasets and Loss Function

We evaluate our method on five widely used benchmark datasets: DUT-OMRON [48], HKU-IS [19], ECSSD [33], DUTS [38], and PASCAL-S [21]. DUT-OMRON contains 5,169 challenging images, which usually have complex backgrounds and more than one salient object. HKU-IS has 4,447 high-quality images with multiple disconnected salient objects in each image. ECSSD has 1,000 meaningful images with various scenes. DUTS is the largest salient object detection dataset, including 10,533 training images and 5,019 testing images. PASCAL-S consists of 850 images chosen from PASCAL VOC 2009.

We use the standard binary cross-entropy loss as our loss function. Given the twelve outputs, the final loss is:

    l_bce(S', S) = −(S log S' + (1 − S) log(1 − S')),
    L = Σ_{i=1}^{3} Σ_{j=1}^{4} l_bce(S'_{i,j}, S),  (9)

where S'_{i,j} are our predicted saliency maps and S is the ground truth.
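The deep supervision of Eq. (9) simply sums a binary cross-entropy term over the twelve side outputs. The sketch below shows one way to compute it (resizing each prediction to the ground-truth resolution is an assumption of this sketch), together with the optimizer settings reported above.

```python
import torch
import torch.nn.functional as F

def deep_supervision_loss(side_maps, gt):
    """Eq. (9): BCE summed over the twelve predicted maps S'_{i,j}.
    side_maps: list of predictions in [0, 1] with shape (B, 1, h, w); gt: (B, 1, H, W)."""
    loss = 0.0
    for s in side_maps:
        s = F.interpolate(s, size=gt.shape[-2:], mode='bilinear', align_corners=False)
        loss = loss + F.binary_cross_entropy(s, gt)
    return loss

# Optimizer settings from Sec. 4.1 (the exact decay schedule is not specified here):
# optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9, weight_decay=0.0005)
```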

Method  | Type | DUTS-TE (MAE↓ / Fβ↑ / S↑) | DUT-OMRON (MAE↓ / Fβ↑ / S↑) | HKU-IS (MAE↓ / Fβ↑ / S↑) | ECSSD (MAE↓ / Fβ↑ / S↑) | PASCAL-S (MAE↓ / Fβ↑ / S↑)
DSS     | C | 0.056 / 0.801 / 0.824 | 0.063 / 0.737 / 0.790 | 0.040 / 0.889 / 0.790 | 0.052 / 0.906 / 0.882 | 0.095 / 0.809 / 0.797
UCF     | C | 0.112 / 0.742 / 0.782 | 0.120 / 0.698 / 0.760 | 0.062 / 0.874 / 0.875 | 0.069 / 0.890 / 0.883 | 0.116 / 0.791 / 0.806
Amulet  | C | 0.085 / 0.751 / 0.804 | 0.098 / 0.715 / 0.781 | 0.051 / 0.887 / 0.886 | 0.059 / 0.905 / 0.891 | 0.100 / 0.810 / 0.819
BMPM    | C | 0.049 / 0.828 / 0.862 | 0.064 / 0.734 / 0.809 | 0.039 / 0.910 / 0.907 | 0.045 / 0.917 / 0.911 | 0.074 / 0.836 / 0.846
RAS     | C | 0.059 / 0.807 / 0.838 | 0.062 / 0.753 / 0.814 | -     / -     / -     | 0.056 / 0.908 / 0.893 | 0.104 / 0.803 / 0.796
PSAM    | C | 0.041 / 0.854 / 0.879 | 0.057 / 0.765 / 0.830 | 0.034 / 0.918 / 0.914 | 0.040 / 0.931 / 0.920 | 0.075 / 0.847 / 0.851
CPD     | C | 0.043 / 0.841 / 0.869 | 0.056 / 0.754 / 0.825 | 0.034 / 0.911 / 0.906 | 0.037 / 0.927 / 0.918 | 0.072 / 0.837 / 0.847
SCRN    | C | 0.040 / 0.864 / 0.885 | 0.056 / 0.772 / 0.837 | 0.034 / 0.921 / 0.916 | 0.038 / 0.937 / 0.927 | 0.064 / 0.858 / 0.868
BASNet  | C | 0.048 / 0.838 / 0.866 | 0.056 / 0.779 / 0.836 | 0.032 / 0.919 / 0.909 | 0.037 / 0.932 / 0.916 | 0.078 / 0.836 / 0.837
EGNet   | C | 0.039 / 0.866 / 0.887 | 0.053 / 0.777 / 0.841 | 0.031 / 0.923 / 0.918 | 0.037 / 0.936 / 0.925 | 0.075 / 0.846 / 0.853
MINet   | C | 0.037 / 0.865 / 0.884 | 0.056 / 0.769 / 0.833 | 0.029 / 0.926 / 0.919 | 0.034 / 0.938 / 0.925 | 0.064 / 0.852 / 0.856
LDF     | C | 0.034 / 0.877 / 0.892 | 0.052 / 0.782 / 0.839 | 0.028 / 0.929 / 0.920 | 0.034 / 0.938 / 0.924 | 0.061 / 0.859 / 0.862
GateNet | C | 0.040 / 0.869 / 0.885 | 0.055 / 0.781 / 0.838 | 0.033 / 0.920 / 0.915 | 0.040 / 0.933 / 0.920 | 0.069 / 0.852 / 0.858
SAC     | C | 0.034 / 0.882 / 0.895 | 0.052 / 0.804 / 0.849 | 0.026 / 0.935 / 0.925 | 0.031 / 0.945 / 0.931 | 0.063 / 0.868 / 0.866
Naive   | T | 0.043 / 0.855 / 0.878 | 0.059 / 0.776 / 0.835 | 0.042 / 0.905 / 0.903 | 0.042 / 0.927 / 0.919 | 0.068 / 0.854 / 0.862
SETR    | T | 0.039 / 0.869 / 0.888 | 0.056 / 0.782 / 0.838 | 0.037 / 0.917 / 0.912 | 0.041 / 0.930 / 0.921 | 0.062 / 0.867 / 0.859
VST     | T | 0.037 / 0.877 / 0.896 | 0.058 / 0.800 / 0.850 | 0.030 / 0.937 / 0.928 | 0.034 / 0.944 / 0.932 | 0.067 / 0.850 / 0.873
Ours    | T | 0.029 / 0.901 / 0.912 | 0.045 / 0.819 / 0.865 | 0.026 / 0.941 / 0.932 | 0.028 / 0.953 / 0.940 | 0.054 / 0.882 / 0.881

Table 1: Quantitative comparisons with FCN-based (denoted as C) and transformer-based (denoted as T) salient object detection methods under three evaluation metrics. The top three performances are marked in red, green, and blue, respectively. Simply utilizing the transformer with the naive decoder ("Naive") already achieves comparable performance. Our method significantly outperforms all competitors regardless of their basic component (either FCN- or transformer-based). The results of VST come from its paper due to the lack of released code or saliency maps.

4.3. Evaluation Metrics

We evaluate our method with three widely used metrics: mean absolute error (MAE), weighted F-measure (Fβ), and S-measure (S) [12].

MAE directly measures the average pixel-wise difference between the saliency map and the label:

    MAE(S', S) = (1 / (H x W)) Σ_{x=1}^{H} Σ_{y=1}^{W} |S'(x, y) − S(x, y)|.  (10)

Fβ evaluates precision and recall at the same time and uses β² to weight precision:

    Fβ = ((1 + β²) x Precision x Recall) / (β² x Precision + Recall),  (11)

where β² is set to 0.3 as in previous work [57].

S-measure emphasizes the structural information of both foreground and background. Specifically, the region-aware similarity S_r and the object-aware similarity S_o are combined with a weight α:

    S = α · S_o + (1 − α) · S_r,  (12)

where α is set to 0.5 [57].
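For reference, MAE (Eq. 10) and the F-measure (Eq. 11) can be computed as below. The adaptive threshold (twice the mean saliency value) used to binarize the prediction is a common convention and an assumption of this sketch; the official evaluation code may differ.

```python
import numpy as np

def mae(pred, gt):
    """Eq. (10): mean absolute error between a saliency map and its label, both in [0, 1]."""
    return np.abs(pred.astype(np.float64) - gt.astype(np.float64)).mean()

def f_beta(pred, gt, beta2=0.3, threshold=None):
    """Eq. (11): F-measure with beta^2 = 0.3, computed on a binarized prediction."""
    if threshold is None:
        threshold = min(2 * pred.mean(), 1.0)        # adaptive threshold (assumption)
    binary = pred >= threshold
    tp = np.logical_and(binary, gt > 0.5).sum()
    precision = tp / (binary.sum() + 1e-8)
    recall = tp / ((gt > 0.5).sum() + 1e-8)
    return (1 + beta2) * precision * recall / (beta2 * precision + recall + 1e-8)
```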

Figure 5: Qualitative comparisons with state-of-the-art methods: (a) Image, (b) GT, (c) Ours (T), (d) SETR (T), (e) GateNet (C), (f) LDF (C), (g) EGNet (C), (h) BMPM (C), (i) DSS (C). Our method provides more visually reasonable saliency maps by accurately locating salient objects and generating sharper boundaries than other transformer-based (denoted as T) and FCN-based (denoted as C) methods.

4.4. Comparison with state-of-the-art methods

We compare our method with state-of-the-art salient object detection methods: DSS [16], UCF [54], Amulet [53], BMPM [51], RAS [5], PSAM [3], PoolNet [22], CPD [45], SCRN [47], BASNet [30], EGNet [57], MINet [29], LDF [44], GateNet [59], SAC [18], SETR [60], VST [10], and Naive (our transformer encoder with the naive decoder). The saliency maps are provided by the authors or computed with the released code, except for SETR and Naive, which are implemented by ourselves. All results are evaluated with the same evaluation code.

Quantitative Comparison. We report the quantitative results in Tab. 1. As can be seen, without any post-processing, our method outperforms all compared methods, whether FCN-based or transformer-based, by a large margin on all evaluation metrics. In particular, the performance is improved by 12.17% on average over the second best method, LDF, in terms of MAE. The superior performance on both easy and difficult datasets shows that our method handles both simple and complex scenes well. Another interesting finding is that the transformer is powerful enough to generate rather good saliency maps even with the "Naive" decoder, although it cannot outperform the state-of-the-art FCN-based method (i.e., SAC). Further investigating the potential of the transformer may keep enhancing the performance on the salient object detection task.

Qualitative Comparison. The qualitative results are shown in Fig. 5. With the power of modeling long-range dependency over the whole image, transformer-based methods are able to capture the salient object more accurately. For example, in the sixth row of Fig. 5, our method accurately captures the salient object without being deceived by the ice hill, while the FCN-based methods all fail in this case. Besides, the saliency maps predicted by our method are more complete and better aligned with the ground truths.

Upsample Strategy | DUTS-TE (MAE↓ / Fβ↑ / S↑) | DUT-OMRON (MAE↓ / Fβ↑ / S↑)

Density set to 0
16×        | 0.043 / 0.855 / 0.878 | 0.059 / 0.776 / 0.835
Four 2×    | 0.041 / 0.861 / 0.883 | 0.059 / 0.775 / 0.836

Density set to 1
16×        | 0.036 / 0.875 / 0.896 | 0.052 / 0.794 / 0.850
4× and 4×  | 0.034 / 0.886 / 0.900 | 0.053 / 0.801 / 0.850
Ours       | 0.033 / 0.889 / 0.902 | 0.052 / 0.802 / 0.852

Density set to 4
16×        | 0.034 / 0.882 / 0.902 | 0.051 / 0.799 / 0.846
4× and 4×  | 0.030 / 0.900 / 0.911 | 0.047 / 0.814 / 0.862
Ours       | 0.029 / 0.901 / 0.912 | 0.045 / 0.819 / 0.865

Table 3: Quantitative evaluation of the upsampling strategies under different density settings. Gradual upsampling leads to better performance than the naive single-step one, and our upsampling strategy outperforms the others under all density settings.

Figure 6: The effect of dense connections during decoding (panels: Input, GT, Dens. 4 down to Dens. 0). We gradually increase the density of the connections between the transformer output features and the decoding convolution layers in all stages. The naive decoder (i.e., density 0) predicts saliency maps with a lot of noise and is severely influenced by the background. As the density increases, the saliency maps become more accurate and sharp, especially near the boundaries.

Density | DUTS-TE (MAE↓ / Fβ↑ / S↑) | DUT-OMRON (MAE↓ / Fβ↑ / S↑)
0       | 0.043 / 0.855 / 0.878 | 0.059 / 0.776 / 0.835
1       | 0.033 / 0.889 / 0.902 | 0.052 / 0.802 / 0.852
2       | 0.031 / 0.895 / 0.907 | 0.050 / 0.807 / 0.857
3       | 0.030 / 0.899 / 0.910 | 0.047 / 0.813 / 0.862
4       | 0.029 / 0.901 / 0.912 | 0.045 / 0.819 / 0.865

Table 2: Quantitative evaluation of dense connections during decoding. Adding connections brings a significant performance gain compared to the naive decoder (i.e., density = 0). As the density increases, the performance gradually improves on all three evaluation metrics. The best performances are marked in red.

4.5. Ablation Study

In this subsection, we evaluate the effectiveness of the dense connections and of different upsampling strategies in our decoder design on two challenging datasets: DUTS-TE and DUT-OMRON.

Effectiveness of Dense Connections. We first conduct an experiment to examine the effect of densely decoding the features from each transformer encoding layer. Starting from simply decoding the features of the last transformer layer (i.e., the naive decoder), we gradually add fuse connections in each stage to increase the density until reaching our final decoder, denoted as density 0 to 4. For example, density 3 means that three connections (i.e., from Z_{i,4}, Z_{i,3}, Z_{i,2}) are added in each stage, and density 1 means only one (i.e., Z_{i,4}).

The qualitative results are illustrated in Fig. 6, and the quantitative results are shown in Tab. 2. When we use the naive decoder without any connection, the results are much worse, with lots of noise. The saliency maps become sharper and more accurate as the decoding density increases, and the performance also gradually improves on all three evaluation metrics. As each layer captures a global view of its previous layer, the representation differences are maximized among different transformer features. Thus, each transformer feature can contribute individually to the predicted saliency maps.

Effectiveness of Upsampling Strategy. To evaluate the effect of the gradual upsampling strategy used in our decoder, we compare different upsampling strategies under different density settings. 1) The naive decoder (denoted as 16×) simply upsamples the last transformer features 16× for prediction. 2) When the density is set to 0, we test the upsampling strategy of the stage-by-stage decoder (denoted as Four 2×), where the output features of each stage are upsampled 2× before being sent to the next stage, for a total of four 2× upsampling operations. 3) The sampling strategy used in SETR [60] (denoted as 4× and 4×) upsamples each transformer feature 4×, and another 4× upsampling is applied to obtain the final predicted saliency map. 4) Our decoder (Ours) upsamples the transformer features 8×, 4×, and 2× for the layers connected to stage 1, stage 2, and stage 3, respectively.

The quantitative results are reported in Tab. 3, with some interesting findings. Gradual upsampling always performs better than the naive single 16× upsampling operation. With the help of our upsampling strategy, the dense decoding manifests its advantage, with less noise injected during decoding. Directly decoding the output feature of the last transformer layer in the stage-by-stage mode only leads to limited performance, further indicating the importance of our dense connections.
4.6. The Statistics of Timing

Although the computational cost of the transformer encoder may be high, thanks to the parallelism of the attention architecture the inference speed is still comparable to recent methods, as shown in Tab. 4. Besides, as a pioneering work, we want to highlight the power of the transformer on the salient object detection (SOD) task and leave the reduction of time complexity as future work.

Method      | DSS | R3Net | BASNet | SCRN | EGNet | Ours
Speed (FPS) | 24  | 19    | 28     | 16   | 7     | 15

Table 4: Running time of different methods. All methods are tested on a single RTX 2080Ti.

5. Conclusion

In this paper, we explore the unified learning of global-local representations in salient object detection with an attention-based model using transformers. With the power of modeling long-range dependency in the transformer, we overcome the limitations caused by the locality of CNNs in previous salient object detection methods. We propose an effective decoder that densely decodes the transformer features and gradually upsamples them to predict saliency maps. It makes full use of all transformer features, owing to the high representation differences and low redundancy among different transformer features. We adopt a gradual upsampling strategy to inject less noise during decoding and to strengthen the ability to locate accurate salient regions. Experiments on five widely used datasets with three evaluation metrics demonstrate that our method significantly outperforms state-of-the-art methods by a large margin.

References

[1] Irwan Bello, Barret Zoph, Ashish Vaswani, Jonathon Shlens, and Quoc V Le. Attention augmented convolutional networks. In ICCV, pages 3286–3295, 2019.
[2] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In ECCV, pages 213–229. Springer, 2020.
[3] Jia-Ren Chang and Yong-Sheng Chen. Pyramid stereo matching network. In CVPR, pages 5410–5418, 2018.
[4] Shuhan Chen, Xiuli Tan, Ben Wang, and Xuelong Hu. Reverse attention for salient object detection. In ECCV, pages 234–250, 2018.
[5] Shuhan Chen, Xiuli Tan, Ben Wang, and Xuelong Hu. Reverse attention for salient object detection. In ECCV, 2018.
[6] Ming-Ming Cheng, Niloy J Mitra, Xiaolei Huang, Philip HS Torr, and Shi-Min Hu. Global contrast based salient region detection. IEEE TPAMI, 37(3):569–582, 2014.
[7] Celine Craye, David Filliat, and Jean-François Goudou. Environment exploration for object-based visual saliency learning. In ICRA, pages 2303–2309. IEEE, 2016.
[8] Robert Desimone and John Duncan. Neural mechanisms of selective visual attention. Annual Review of Neuroscience, 18(1):193–222, 1995.
[9] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
[10] Yuming Du, Yang Xiao, and Vincent Lepetit. Learning to better segment objects from unseen classes with unlabeled videos. arXiv preprint arXiv:2104.12276, 2021.
[11] Wolfgang Einhäuser and Peter König. Does luminance-contrast contribute to a saliency map for overt visual attention? European Journal of Neuroscience, 17(5):1089–1097, 2003.
[12] Deng-Ping Fan, Ming-Ming Cheng, Yun Liu, Tao Li, and Ali Borji. Structure-measure: A new way to evaluate foreground maps. In ICCV, pages 4548–4557, 2017.
[13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
[14] Shengfeng He, Jianbo Jiao, Xiaodan Zhang, Guoqiang Han, and Rynson WH Lau. Delving into salient object subitizing and detection. In ICCV, pages 1059–1067, 2017.
[15] Shengfeng He and Rynson WH Lau. Saliency detection with flash and no-flash image pairs. In ECCV, pages 110–124. Springer, 2014.
[16] Qibin Hou, Ming-Ming Cheng, Xiaowei Hu, Ali Borji, Zhuowen Tu, and Philip HS Torr. Deeply supervised salient object detection with short connections. In CVPR, pages 3203–3212, 2017.
[17] Han Hu, Zheng Zhang, Zhenda Xie, and Stephen Lin. Local relation networks for image recognition. In ICCV, pages 3464–3473, 2019.
[18] Xiaowei Hu, Chi-Wing Fu, Lei Zhu, Tianyu Wang, and Pheng-Ann Heng. Sac-net: Spatial attenuation context for salient object detection. IEEE Transactions on Circuits and Systems for Video Technology, 2020, to appear.
[19] Guanbin Li and Yizhou Yu. Visual saliency based on multiscale deep features. In CVPR, pages 5455–5463, 2015.
[20] Guanbin Li and Yizhou Yu. Deep contrast learning for salient object detection. In CVPR, pages 478–487, 2016.
[21] Yin Li, Xiaodi Hou, Christof Koch, James M Rehg, and Alan L Yuille. The secrets of salient object segmentation. In CVPR, pages 280–287, 2014.
[22] Jiang-Jiang Liu, Qibin Hou, Ming-Ming Cheng, Jiashi Feng, and Jianmin Jiang. A simple pooling-based design for real-time salient object detection. In CVPR, pages 3917–3926, 2019.
[23] Nian Liu, Junwei Han, and Ming-Hsuan Yang. Picanet: Learning pixel-wise contextual attention for saliency detection. In CVPR, pages 3089–3098, 2018.
[24] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In CVPR, pages 3431–3440, 2015.
[25] Wenfeng Luo, Meng Yang, and Weishi Zheng. Weakly-supervised semantic segmentation with saliency and incremental supervision updating. Pattern Recognition, page 107858, 2021.
[26] Zhiming Luo, Akshaya Mishra, Andrew Achkar, Justin Eichel, Shaozi Li, and Pierre-Marc Jodoin. Non-local deep features for salient object detection. In CVPR, pages 6609–6617, 2017.
[27] Roey Mechrez, Eli Shechtman, and Lihi Zelnik-Manor. Saliency driven image manipulation. Machine Vision and Applications, 30(2):189–202, 2019.
[28] Clifford Thomas Morgan. Physiological Psychology. 1943.
[29] Youwei Pang, Xiaoqi Zhao, Lihe Zhang, and Huchuan Lu. Multi-scale interactive network for salient object detection. In CVPR, 2020.
[30] Xuebin Qin, Zichen Zhang, Chenyang Huang, Chao Gao, Masood Dehghan, and Martin Jagersand. Basnet: Boundary-aware salient object detection. In CVPR, 2019.
[31] Sucheng Ren, Chu Han, Xin Yang, Guoqiang Han, and Shengfeng He. Tenet: Triple excitation network for video salient object detection. In ECCV, pages 212–228. Springer, 2020.
[32] John H Reynolds and Robert Desimone. Interacting roles of attention and visual salience in V4. Neuron, 37(5):853–863, 2003.
[33] Jianping Shi, Qiong Yan, Li Xu, and Jiaya Jia. Hierarchical image saliency detection on extended CSSD. IEEE TPAMI, 38(4):717–729, 2015.
[34] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[35] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. arXiv preprint arXiv:1706.03762, 2017.
[36] Bo Wang, Wenxi Liu, Guoqiang Han, and Shengfeng He. Learning long-term structural dependencies for video salient object detection. IEEE TIP, 29:9017–9031, 2020.
[37] Huiyu Wang, Yukun Zhu, Bradley Green, Hartwig Adam, Alan Yuille, and Liang-Chieh Chen. Axial-deeplab: Stand-alone axial-attention for panoptic segmentation. In ECCV, pages 108–126. Springer, 2020.
[38] Lijun Wang, Huchuan Lu, Yifan Wang, Mengyang Feng, Dong Wang, Baocai Yin, and Xiang Ruan. Learning to detect salient objects with image-level supervision. In CVPR, pages 136–145, 2017.
[39] Linzhao Wang, Lijun Wang, Huchuan Lu, Pingping Zhang, and Xiang Ruan. Saliency detection with recurrent fully convolutional networks. In ECCV, pages 825–841. Springer, 2016.
[40] Tiantian Wang, Ali Borji, Lihe Zhang, Pingping Zhang, and Huchuan Lu. A stagewise refinement model for detecting salient objects in images. In ICCV, pages 4019–4028, 2017.
[41] Tiantian Wang, Lihe Zhang, Shuo Wang, Huchuan Lu, Gang Yang, Xiang Ruan, and Ali Borji. Detect globally, refine locally: A novel approach to saliency detection. In CVPR, pages 3127–3135, 2018.
[42] Wenguan Wang, Shuyang Zhao, Jianbing Shen, Steven CH Hoi, and Ali Borji. Salient object detection with pyramid attention and salient edges. In CVPR, pages 1448–1457, 2019.
[43] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In CVPR, pages 7794–7803, 2018.
[44] Jun Wei, Shuhui Wang, Zhe Wu, Chi Su, Qingming Huang, and Qi Tian. Label decoupling framework for salient object detection. In CVPR, 2020.
[45] Zhe Wu, Li Su, and Qingming Huang. Cascaded partial decoder for fast and accurate salient object detection. In CVPR, 2019.
[46] Zhe Wu, Li Su, and Qingming Huang. Stacked cross refinement network for edge-aware salient object detection. In ICCV, pages 7264–7273, 2019.
[47] Zhe Wu, Li Su, and Qingming Huang. Stacked cross refinement network for edge-aware salient object detection. In ICCV, pages 7263–7272, 2019.
[48] Chuan Yang, Lihe Zhang, Huchuan Lu, Xiang Ruan, and Ming-Hsuan Yang. Saliency detection via graph-based manifold ranking. In CVPR, pages 3166–3173, 2013.
[49] Yang Yang, Jimei Yang, Junjie Yan, Shengcai Liao, Dong Yi, and Stan Z Li. Salient color names for person re-identification. In ECCV, pages 536–551. Springer, 2014.
[50] Yanhong Zeng, Jianlong Fu, and Hongyang Chao. Learning joint spatial-temporal transformations for video inpainting. In ECCV, pages 528–543. Springer, 2020.
[51] Lu Zhang, Ju Dai, Huchuan Lu, You He, and Gang Wang. A bi-directional message passing model for salient object detection. In CVPR, pages 1741–1750, 2018.
[52] Pingping Zhang, Dong Wang, Huchuan Lu, Hongyu Wang, and Xiang Ruan. Amulet: Aggregating multi-level convolutional features for salient object detection. In ICCV, pages 202–211, 2017.
[53] Pingping Zhang, Dong Wang, Huchuan Lu, Hongyu Wang, and Xiang Ruan. Amulet: Aggregating multi-level convolutional features for salient object detection. In ICCV, 2017.
[54] Pingping Zhang, Dong Wang, Huchuan Lu, Hongyu Wang, and Baocai Yin. Learning uncertain convolutional features for accurate saliency detection. In ICCV, 2017.
[55] Xiaoning Zhang, Tiantian Wang, Jinqing Qi, Huchuan Lu, and Gang Wang. Progressive attention guided recurrent network for salient object detection. In CVPR, pages 714–722, 2018.
[56] Hengshuang Zhao, Jiaya Jia, and Vladlen Koltun. Exploring self-attention for image recognition. In CVPR, pages 10076–10085, 2020.
[57] Jia-Xing Zhao, Jiang-Jiang Liu, Deng-Ping Fan, Yang Cao, Jufeng Yang, and Ming-Ming Cheng. Egnet: Edge guidance network for salient object detection. In ICCV, pages 8779–8788, 2019.
[58] Rui Zhao, Wanli Ouyang, and Xiaogang Wang. Person re-identification by saliency learning. IEEE TPAMI, 39(2):356–370, 2016.
[59] Xiaoqi Zhao, Youwei Pang, Lihe Zhang, Huchuan Lu, and Lei Zhang. Suppress and balance: A simple gated network for salient object detection. In ECCV, pages 35–51, 2020.
[60] Sixiao Zheng, Jiachen Lu, Hengshuang Zhao, Xiatian Zhu, Zekun Luo, Yabiao Wang, Yanwei Fu, Jianfeng Feng, Tao Xiang, Philip HS Torr, et al. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. arXiv preprint arXiv:2012.15840, 2020.