Medical Image Segmentation Using Squeeze-and-Expansion Transformers

Shaohua Li1∗, Xiuchao Sui1, Xiangde Luo2, Xinxing Xu1, Yong Liu1, Rick Goh1
1 Institute of High Performance Computing, A*STAR, Singapore
2 University of Electronic Science and Technology of China, Chengdu, China
{shaohua, xiuchao.sui}@gmail.com,
[email protected], {xuxinx, liuyong, gohsm}@ihpc.a-star.edu.sg

Abstract

Medical image segmentation is important for computer-aided diagnosis. Good segmentation requires the model to see the big picture and fine details simultaneously, i.e., to learn image features that incorporate large context while keeping high spatial resolutions. To approach this goal, the most widely used methods, U-Net and its variants, extract and fuse multi-scale features. However, the fused features still have small effective receptive fields with a focus on local image cues, limiting their performance. In this work, we propose Segtran, an alternative segmentation framework based on transformers, which have unlimited effective receptive fields even at high feature resolutions.

the same time, i.e., learn image features that incorporate large context while keeping high spatial resolutions to output fine-grained segmentation masks. However, these two demands pose a dilemma for CNNs, as CNNs often incorporate larger context at the cost of reduced feature resolution. A good measure of how much a model “sees” is the effective receptive field (effective RF) [Luo et al., 2016], i.e., the input areas which have non-negligible impact on the model output.

Since the advent of U-Net [Ronneberger et al., 2015], it has shown excellent performance across medical image segmentation tasks. A U-Net consists of an encoder and a decoder, in which the encoder progressively downsamples the features and generates coarse contextual features that focus on contextual patterns, and the decoder progressively upsamples the contextual features and fuses them with fine-grained