Neuroinformatics manuscript No. (will be inserted by the editor)

SegAN: Adversarial Network with Multi-scale L1 Loss for Medical Image Segmentation

Yuan Xue · Tao Xu · Han Zhang · L. Rodney Long · Xiaolei Huang

Received: date / Accepted: date

Abstract Inspired by classic generative adversarial networks (GAN), we propose a novel end-to-end adversarial neural network, called SegAN, for the task of medical image segmentation. Since image segmentation requires dense, pixel-level labeling, the single scalar real/fake output of a classic GAN's discriminator may be ineffective in producing stable and sufficient gradient feedback to the networks. Instead, we use a fully convolutional neural network as the segmentor to generate segmentation label maps, and propose a novel adversarial critic network with a multi-scale L1 loss function to force the critic and segmentor to learn both global and local features that capture long- and short-range spatial relationships between pixels. In our SegAN framework, the segmentor and critic networks are trained in an alternating fashion in a min-max game: the critic takes as input a pair of images, (original image ∗ predicted label map, original image ∗ ground truth label map), and is trained by maximizing a multi-scale loss function; the segmentor is trained with only gradients passed along by the critic, with the aim of minimizing the multi-scale loss function. We show that such a SegAN framework is more effective and stable for the segmentation task, and that it leads to better performance than the state-of-the-art U-net segmentation method. We tested our SegAN method using datasets from the MICCAI BRATS brain tumor segmentation challenge. Extensive experimental results demonstrate the effectiveness of the proposed SegAN with multi-scale loss: on BRATS 2013, SegAN gives performance comparable to the state-of-the-art for whole tumor and tumor core segmentation while achieving better precision and sensitivity for Gd-enhanced tumor core segmentation; on BRATS 2015, SegAN achieves better performance than the state-of-the-art in both Dice score and precision.

Yuan Xue and Tao Xu are co-first authors.

Yuan Xue · Tao Xu · Xiaolei Huang
Department of Computer Science and Engineering, Lehigh University, Bethlehem, PA, USA
E-mail: {yux715, tax313, xih206}@lehigh.edu

Han Zhang
Department of Computer Science, Rutgers University, Piscataway, NJ, USA
E-mail: [email protected]

L. Rodney Long
National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
E-mail: [email protected]

1 Introduction

Advances in a wide range of medical imaging technologies have revolutionized how we view functional and pathological events in the body and define anatomical structures in which these events take place. X-ray, CAT, MRI, ultrasound, and nuclear medicine, among other medical imaging technologies, enable 2D or tomographic 3D images to capture in-vivo structural and functional information inside the body for diagnosis, prognosis, treatment planning, and other purposes.

One fundamental problem in medical image analysis is image segmentation, which identifies the boundaries of objects such as organs or abnormal regions (e.g., tumors) in images. Since manual annotation can be very time-consuming and subjective, an accurate and reliable automatic segmentation method is valuable for both clinical and research purposes. Having the segmentation result makes possible shape analysis, detection of volume change, and precise radiation therapy treatment planning.

In the literature of image processing and computer vision, various theoretical frameworks have been proposed for automatic segmentation. Traditional unsupervised methods such as thresholding [25], region growing [1], edge detection and grouping [3], Markov Random Fields (MRFs) [21], active contour models [14], Mumford-Shah functional based frame partition [23], level sets [20], graph cut [29], mean shift [6], and their extensions and integrations [10, 15, 16] usually utilize constraints on image intensity or object appearance. Supervised methods [22, 5, 8, 30, 27, 11], on the other hand, directly learn from labeled training samples, extracting features and context information in order to perform dense pixel-wise (or voxel-wise) classification.

Convolutional Neural Networks (CNNs) have been widely applied to visual recognition problems in recent years, and they have been shown to be effective in learning a hierarchy of features at multiple scales from data. For pixel-wise semantic segmentation, CNNs have also achieved remarkable success. In [18], Long et al. first proposed fully convolutional networks (FCNs) for semantic segmentation. The authors replaced conventional fully connected layers in CNNs with convolutional layers to obtain a coarse label map, and then upsampled the label map with deconvolutional layers to get per-pixel classification results. Noh et al. [24] used an encoder-decoder structure to recover finer details of segmented objects. With multiple unpooling and deconvolutional layers in their architecture, they avoided the coarse-to-fine stage of [18]. However, they still needed to ensemble with FCNs in their method to capture local dependencies between labels. Lin et al. [17] combined Conditional Random Fields (CRFs) and CNNs to better explore spatial correlations between pixels, but they also needed to implement a dense CRF to refine their CNN output.

In the field of medical image segmentation, deep CNNs have also been applied with promising results. Ronneberger et al. [27] presented an FCN, namely U-net, for segmenting neuronal structures in electron microscopic stacks. With the idea of skip connections from [18], the U-net achieved very good performance and has since been applied to many different tasks such as image translation [12]. In addition, Havaei et al. [11] obtained good performance for medical image segmentation with their InputCascadeCNN. The InputCascadeCNN takes image patches as inputs and uses a cascade of CNNs in which the output probabilities of a first-stage CNN are taken as additional inputs to a second-stage CNN. Pereira et al. [26] applied deep CNNs with small kernels for brain tumor segmentation, proposing different architectures for segmenting high grade and low grade tumors, respectively. Kamnitsas et al. [13] proposed a 3D CNN using two pathways with inputs of different resolutions; 3D CRFs were also needed to refine their results.

Although these previous approaches using CNNs for segmentation have achieved promising results, they still have limitations. All of the above methods utilize a pixel-wise loss, such as softmax, in the last layer of their networks, which is insufficient to learn both local and global contextual relations between pixels. Hence they always need models such as CRFs [4] as a refinement to enforce spatial contiguity in the output label maps. Many previous methods [11, 13, 26] address this issue by training CNNs on image patches and using multi-scale, multi-path CNNs with different input resolutions or different CNN architectures. Using patches and multi-scale inputs can capture spatial context information to some extent. Nevertheless, as described in U-net [27], the computational cost of patch training is very high and there is a trade-off between localization accuracy and patch size. Instead of training on small image patches, current state-of-the-art CNN architectures such as U-net are trained on whole images or large image patches and use skip connections to combine hierarchical features for generating the label map. They have thus shown potential to implicitly learn some local dependencies between pixels. However, these methods are still limited by their pixel-wise loss function, which lacks the ability to enforce the learning of multi-scale spatial constraints directly in the end-to-end training process. Compared with patch training, an issue for CNNs trained on entire images is label or class imbalance. While patch training methods can sample a balanced number of patches from each class, the numbers of pixels belonging to different classes in full-image training are usually imbalanced. To mitigate this problem, U-net uses a weighted cross-entropy loss to balance the class frequencies. However, the choice of weights in this loss function is task-specific and hard to optimize. In contrast to the weighted loss in U-net, a general loss that avoids class imbalance without extra hyper-parameters would be more desirable.

In this paper, we propose a novel end-to-end adversarial network architecture, called SegAN, with a multi-scale L1 loss function, for semantic segmentation. Inspired by the original GAN [9], the training procedure for SegAN is similar to a two-player min-max game in which a segmentor network (S) and a critic network (C) are trained in an alternating fashion to respectively minimize and maximize an objective function. However, there are several major differences between our SegAN and the original GAN that make SegAN significantly better for the task of image segmentation.

– In contrast to the classic GAN with separate losses for generator and discriminator, we propose a novel multi-scale loss function for both segmentor and critic. Our critic is trained to maximize a novel multi-scale L1 objective function that takes into account CNN feature differences between the predicted segmentation and the ground truth segmentation at multiple scales (i.e., at multiple layers).
– We use a fully convolutional neural network (FCN) as the segmentor S, which is trained with only gradients flowing through the critic, and with the objective of minimizing the same loss function as for the critic.
– Our SegAN is an end-to-end architecture trained on whole images, with no requirements for patches, or inputs of multiple resolutions, or further smoothing of the predicted label maps using CRFs.

the multi-scale objective loss function L is defined as:
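The chunk cuts off before the equation itself. Consistent with the framework described in the abstract, the min-max objective plausibly takes the following form (a reconstruction from the surrounding description, not a verbatim quotation; here x_n is the n-th training image, y_n its ground truth label map, S(x_n) the segmentor's predicted label map, ∗ the pixel-wise masking operation used above, and f_C^i the critic's feature map at its i-th layer, out of L layers):

```latex
\min_{\theta_S}\,\max_{\theta_C}\; \mathcal{L}(\theta_S,\theta_C)
  = \frac{1}{N}\sum_{n=1}^{N}
    \ell_{\mathrm{mae}}\bigl(f_C(x_n \ast S(x_n)),\, f_C(x_n \ast y_n)\bigr),
\qquad
\ell_{\mathrm{mae}}\bigl(f_C(x), f_C(x')\bigr)
  = \frac{1}{L}\sum_{i=1}^{L}\bigl\|\, f_C^{\,i}(x) - f_C^{\,i}(x') \,\bigr\|_{1}
```

Under this reading, the critic ascends L while the segmentor descends the same L, which is exactly the shared-loss min-max game described in the bullet points above.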
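As a concrete numerical illustration of such a multi-scale L1 (mean absolute error) comparison, the sketch below uses simple average pooling as a stand-in for the critic's learned feature hierarchy; the function names and the pooled "features" are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def downsample2x(x):
    # 2x average pooling: a stand-in for the spatial downsampling a
    # critic layer would perform (illustration only, not a learned layer).
    h, w = (x.shape[0] // 2) * 2, (x.shape[1] // 2) * 2
    x = x[:h, :w]
    return (x[0::2, 0::2] + x[1::2, 0::2] + x[0::2, 1::2] + x[1::2, 1::2]) / 4.0

def multiscale_l1(image, pred_mask, gt_mask, num_scales=3):
    # Mean absolute error between multi-scale "features" of the image
    # masked by the predicted label map vs. by the ground-truth label map.
    a = image * pred_mask   # original image * predicted label map
    b = image * gt_mask     # original image * ground truth label map
    total = 0.0
    for _ in range(num_scales):
        total += np.abs(a - b).mean()          # L1 term at this scale
        a, b = downsample2x(a), downsample2x(b)  # move to the next scale
    return total / num_scales
```

In SegAN itself the two feature stacks would come from the critic network applied to the masked inputs, with the critic maximizing and the segmentor minimizing this quantity in alternation.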