arXiv:2105.09511v3 [eess.IV] 2 Jun 2021
Medical Image Segmentation Using Squeeze-and-Expansion Transformers

Shaohua Li1∗, Xiuchao Sui1, Xiangde Luo2, Xinxing Xu1, Yong Liu1, Rick Goh1
1Institute of High Performance Computing, A*STAR, Singapore
2University of Electronic Science and Technology of China, Chengdu, China
{shaohua, xiuchao.sui}@gmail.com, [email protected], {xuxinx, liuyong, gohsm}@ihpc.a-star.edu.sg
∗Corresponding Author.

Abstract

Medical image segmentation is important for computer-aided diagnosis. Good segmentation requires the model to see the big picture and fine details simultaneously, i.e., to learn image features that incorporate large context while keeping high spatial resolutions. To approach this goal, the most widely used methods, U-Net and its variants, extract and fuse multi-scale features. However, the fused features still have small effective receptive fields with a focus on local image cues, limiting their performance. In this work, we propose Segtran, an alternative segmentation framework based on transformers, which have unlimited effective receptive fields even at high feature resolutions. The core of Segtran is a novel Squeeze-and-Expansion transformer: a squeezed attention block regularizes the self attention of transformers, and an expansion block learns diversified representations. Additionally, we propose a new positional encoding scheme for transformers, imposing a continuity inductive bias for images. Experiments were performed on 2D and 3D medical image segmentation tasks: optic disc/cup segmentation in fundus images (REFUGE'20 challenge), polyp segmentation in colonoscopy images, and brain tumor segmentation in MRI scans (BraTS'19 challenge). Compared with representative existing methods, Segtran consistently achieved the highest segmentation accuracy, and exhibited good cross-domain generalization capabilities. The source code of Segtran is released at https://github.com/askerlee/segtran.

1 Introduction

Automated medical image segmentation, i.e., automated delineation of anatomical structures and other regions of interest (ROIs), is an important step in computer-aided diagnosis; for example, it is used to quantify tissue volumes, extract key quantitative measurements, and localize pathology [Schlemper et al., 2019; Orlando et al., 2020]. Good segmentation requires the model to see the big picture and fine details at the same time, i.e., to learn image features that incorporate large context while keeping high spatial resolutions to output fine-grained segmentation masks. However, these two demands pose a dilemma for CNNs, as CNNs often incorporate larger context at the cost of reduced feature resolution. A good measure of how large a model "sees" is the effective receptive field (effective RF) [Luo et al., 2016], i.e., the input areas which have non-negligible impacts on the model output.

Since its advent, U-Net [Ronneberger et al., 2015] has shown excellent performance across medical image segmentation tasks. A U-Net consists of an encoder and a decoder, in which the encoder progressively downsamples the features and generates coarse contextual features that focus on contextual patterns, and the decoder progressively upsamples the contextual features and fuses them with fine-grained local visual features. The integration of multiple scale features enlarges the RF of U-Net, accounting for its good performance. However, as the convolutional layers deepen, the impact from far-away pixels decays quickly. As a result, the effective RF of a U-Net is much smaller than its theoretical RF. As shown in Fig. 2, the effective RFs of a standard U-Net and DeepLabV3+ are merely around 90 pixels. This implies that they make decisions mainly based on individual small patches, and have difficulty modeling larger context. However, in many tasks, the heights/widths of the ROIs are greater than 200 pixels, far beyond their effective RFs. Without "seeing the bigger picture", U-Net and other models may be misled by local visual cues and make segmentation errors.

Many improvements of U-Net have been proposed. A few typical variants include: U-Net++ [Zhou et al., 2018] and U-Net 3+ [Huang et al., 2020], in which more complicated skip connections are added to better utilize multi-scale contextual features; attention U-Net [Schlemper et al., 2019], which employs attention gates to focus on foreground regions; 3D U-Net [Çiçek et al., 2016] and V-Net [Milletari et al., 2016], which extend U-Net to 3D images, such as MRI volumes; Eff-UNet [Baheti et al., 2020], which replaces the encoder of U-Net with a pretrained EfficientNet [Tan and Le, 2019].

Transformers [Vaswani et al., 2017] are increasingly popular in computer vision tasks. A transformer calculates the pairwise interactions ("self-attention") between all input units, combines their features and generates contextualized features. The contextualization brought by a transformer is analogous to the upsampling path in a U-Net, except that it has an unlimited effective receptive field and is good at capturing long-range correlations. Thus, it is natural to adopt transformers for image segmentation. In this work, we present Segtran, an alternative segmentation architecture based on transformers. A straightforward incorporation of transformers into segmentation only yields moderate performance. As transformers were originally designed for Natural Language Processing (NLP) tasks, there are several aspects that could be improved to better suit image applications. To this end, we propose a novel transformer design, the Squeeze-and-Expansion Transformer, in which a squeezed attention block helps regularize the huge attention matrix, and an expansion block learns diversified representations. In addition, we propose a learnable sinusoidal positional encoding that imposes a continuity inductive bias for the transformer. Experiments demonstrate that they lead to improved segmentation performance.

We evaluated Segtran on two 2D medical image segmentation tasks: optic disc/cup segmentation in fundus images of the REFUGE'20 challenge, and polyp segmentation in colonoscopy images. Additionally, we evaluated it on a 3D image segmentation task: brain tumor segmentation in MRI scans of the BraTS'19 challenge. Segtran has consistently shown better performance than U-Net and its variants (UNet++, UNet3+, PraNet, and nnU-Net), as well as DeepLabV3+ [Chen et al., 2018].

Figure 1: Segtran architecture. It extracts visual features with a CNN backbone, combines them with positional encodings of the pixel coordinates, and flattens them into a sequence of local feature vectors. The local features are contextualized by a few Squeeze-and-Expansion transformer layers. To increase spatial resolution, an input FPN and an output FPN upsample the features before and after the transformers.

Figure 2: Effective receptive fields of 3 models, indicated by non-negligible gradients in blue blobs and light-colored dots. Gradients are back-propagated from the center of the image. Segtran has non-negligible gradients dispersed across the whole image (light-colored dots). U-Net and DeepLabV3+ have concentrated gradients. Input image: 576 × 576. Panels: (a) Fundus image, (b) Ground truth, (c) U-Net, (d) DeepLabV3+, (e) Segtran.

2 Related Work

Our work is largely inspired by DETR [Carion et al., 2020]. DETR uses transformer layers to generate contextualized features that represent objects, and learns a set of object queries to extract the positions and classes of objects in an image. Although DETR has also been explored for panoptic segmentation [Kirillov et al., 2019], it adopts a two-stage approach which is not applicable to medical image segmentation. A followup work of DETR, Cell-DETR [Prangemeier et al., 2020], also employs transformers for biomedical image segmentation, but its architecture is just a simplified DETR, lacking novel components like our Squeeze-and-Expansion transformer. Most recently, SETR [Zheng et al., 2021] and TransU-Net [Chen et al., 2021] were released concurrently or after our paper submission. Both of them employ a Vision Transformer (ViT) [Dosovitskiy et al., 2021] as the encoder to extract image features, which already contain global contextual information. A few convolutional layers are used as the decoder to generate the segmentation mask. In contrast, in Segtran, the transformer layers build global context on top of the local image features extracted from a CNN backbone, and a Feature Pyramid Network (FPN) generates the segmentation mask.

[Murase et al., 2020] extends CNNs with positional encoding channels, and evaluates them on segmentation tasks. Mixed results were observed. In contrast, we verified through an ablation study that positional encodings indeed help Segtran to do segmentation to a certain degree.

Receptive fields of U-Nets may be enlarged by adding more downsampling layers. However, this increases the number of parameters and adds the risk of overfitting. Another way of increasing receptive fields is using larger stride sizes of the convolutions in the downsampling path. Doing so, however, sacrifices spatial precision of feature maps, which is often disadvantageous for segmentation [Liu and Guo, 2020].

3 Squeeze-and-Expansion Transformer

The core concept in a transformer is Self Attention, which can be understood as computing an affinity matrix between different units, and using it to aggregate features:

$$\text{Att\_weight}(X, X) = f(K(X), Q(X)) \in \mathbb{R}^{N \times N}, \tag{1}$$

$$\text{Attention}(X) = \text{Att\_weight}(X, X) \cdot V(X), \tag{2}$$

Figure 3: (a) Full self-attention (N × N) vs. (b) Squeezed Attention Block (SAB).
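As a rough illustration of Eqs. (1) and (2), the sketch below computes the N × N affinity matrix and uses it to aggregate features. It rests on assumptions that this excerpt does not state: K, Q and V are taken to be linear projections with hypothetical weight matrices W_k, W_q and W_v, and f is taken to be a row-wise softmax over scaled dot products, as in standard transformers. Plain Python lists keep the sketch free of dependencies; it is not the Segtran implementation.

```python
import math

def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def softmax_rows(S):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    result = []
    for row in S:
        row_max = max(row)
        exps = [math.exp(s - row_max) for s in row]
        total = sum(exps)
        result.append([e / total for e in exps])
    return result

def self_attention(X, W_k, W_q, W_v):
    """X holds N input units of dimension d. W_k, W_q, W_v are hypothetical
    projection weights (assumption: K, Q, V are linear maps)."""
    K, Q, V = matmul(X, W_k), matmul(X, W_q), matmul(X, W_v)
    scale = math.sqrt(len(K[0]))
    # Eq. (1): N x N affinity matrix; f is assumed to be softmax(Q K^T / sqrt(d)).
    scores = [[s / scale for s in row] for row in matmul(Q, transpose(K))]
    att_weight = softmax_rows(scores)
    # Eq. (2): aggregate the value features with the attention weights.
    return matmul(att_weight, V), att_weight

# Three input units of dimension 2, with identity projections for simplicity.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
output, att_weight = self_attention(X, identity, identity, identity)
print(len(att_weight), len(att_weight[0]))  # prints "3 3", i.e. N x N
```

Each row of `att_weight` sums to 1, so every output unit is a convex combination of all N value vectors. The attention matrix grows as N × N with the number of input units.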