
Deep Dual Learning for Semantic Image Segmentation

Ping Luo^2* Guangrun Wang^1,2* Liang Lin^1,3 Xiaogang Wang^2
^1 Sun Yat-Sen University  ^2 The Chinese University of Hong Kong  ^3 SenseTime Group (Limited)
[email protected] [email protected] [email protected] [email protected]

* The first two authors share first-authorship.

Abstract

Deep neural networks have advanced many computer vision tasks because of their compelling capacity to learn from large amounts of labeled data. However, their performance is not fully exploited in semantic image segmentation, as the scale of the training set is limited and per-pixel labelmaps are expensive to obtain. To reduce labeling efforts, a natural solution is to collect additional images from the Internet that are associated with image-level tags. Unlike existing works that treated labelmaps and tags as independent supervisions, we present a novel learning setting, namely dual image segmentation (DIS), which consists of two complementary learning problems that are jointly solved. One predicts labelmaps and tags from images, and the other reconstructs the images using the predicted labelmaps. DIS has three appealing properties. 1) Given an image with tags only, its labelmap can be inferred by leveraging the images and tags as constraints. The estimated labelmaps, which capture accurate object classes and boundaries, are used as ground truths in training to boost performance. 2) DIS is able to clean noisy tags. 3) DIS significantly reduces the number of per-pixel annotations required in training, while still achieving state-of-the-art performance. Extensive experiments demonstrate the effectiveness of DIS, which outperforms an existing best-performing baseline by 12.6% on the Pascal VOC 2012 test set, without any post-processing such as CRF/MRF smoothing.

Figure 1: Comparisons of recent semi-supervised learning settings. I, L, and T denote an image, a per-pixel labelmap, and a vector of image-level tags respectively, where the labelmap L can be missing in training. (a) treats L as a missing label in multitask learning, where its gradient is not computed in back-propagation (BP). (b) regards L as a latent variable that can be inferred from the tags and used as ground truth in BP. We propose (c), which infers the missing label L not only by recovering clean tags T̂, but also by reconstructing the image to capture accurate object shape and boundary.

1. Introduction

Deep convolutional networks (CNNs) have improved the performance of many tasks because they have compelling modeling capacity to learn from large amounts of supervised data. However, their capabilities are not fully explored in the task of semantic image segmentation, which is to assign a semantic label such as 'person', 'table', or 'cat' to each pixel in an image. This is because the training set of semantic image segmentation is limited, where the per-pixel labelmaps are difficult and expensive to obtain.

To reduce the effort of data annotation, a usual way is to automatically collect images from the Internet that are associated with image-level tags. As a result, the entire dataset contains two parts: a small number of fully annotated images with per-pixel labelmaps and a large number of weakly annotated images with image-level tags. Learning from this dataset follows a semi-supervised scenario. To disclose the differences among existing works, we introduce notation that will be used throughout this paper. Let I be an image and let L, T represent its labelmap and tags. Two superscripts, w and f, are used to distinguish weakly and fully annotated images respectively. For example, I^w represents a weakly labeled image that only has tags T^w, but whose labelmap L^w is missing. I^f indicates an image that is fully annotated with both labelmap L^f and tags T^f.

Let θ be a set of parameters of a CNN. The objective function of semi-supervised segmentation is often formulated as maximizing a log-likelihood with respect to θ. We have θ* = arg max_θ log p(L^f, L^w, T^f, T^w | I^f, I^w; θ), where the probability measures the similarity between the ground truths and the predicted labelmaps and tags produced by the CNN.

Previous works can be divided into two categories according to different factorizations of the above probability. In the first category, methods [13, 23, 28] employed multitask learning with missing labels, where the labelmap L^w is missing and treated as an unobserved variable. In this case, the likelihood is factorized into two terms, p(L^f, T^f, T^w | I^f, I^w) ∝ p(L^f | I^f) · p(T^f, T^w | I^f, I^w), which correspond to the two tasks shown in Fig. 1(a). One learns to predict labelmaps L^f using the fully annotated images I^f, and the other is trained to classify tags, T^f and T^w, using both fully and weakly annotated images I^f and I^w. These two tasks are jointly optimized in a single CNN. More specifically, back-propagation of the first task is not performed when a weakly labeled image is presented. In general, these approaches learn a shared representation to improve segmentation accuracy by directly combining data with strong and weak supervisions.

Unlike the above approaches, L^w is treated as a latent variable in the second category [21, 4, 15]. In this case, the likelihood can be represented by p(L^f, L^w, T^f, T^w | I^f, I^w) ∝ p(L^f | I^f) · p(L^w | I^w, T^w) · p(T^f, T^w | I^f, I^w), where the first and third terms are identical to those above. The second term estimates the missing labelmap given an image and its tags. In other words, when a weakly labeled image I^w is presented, the CNN produces a labelmap L^w, as its parameters have been learned to do so from the fully annotated images. The predicted labelmap is then refined with respect to the tags T^w, as shown in Fig. 1(b). We take Pascal VOC12 [6] as an example, which has 20 object classes and 1 background class. A labelmap L^w is of size n × m × 21, where n, m denote the width and height of the response map, and each entry in L^w indicates the probability of the presence of a class at a pixel. For instance, when L^w confuses 'cat' and 'dog' by assigning them similar probabilities, and T^w tells that only 'dog' appears in the image but not 'cat', we can refine L^w by decreasing the probability of 'cat'. One implementation of this is to convolve the n × m × 21 labelmap with a 1 × 1 × 21 kernel (a vector of tags). After refinement, L^w is used as ground truth in back-propagation, which significantly boosts performance as demonstrated in [21].

However, these recent approaches still have two weaknesses. (i) The tags T^w help refine the misclassified pixels of L^w, but not the object boundary and shape, because such information is not captured in image-level tags. (ii) T^w may be noisy when the images are automatically downloaded from the Internet. These noisy tags will hamper the learning procedure.

To resolve the above issues, this work presents Dual Image Segmentation (DIS), which is inspired by dual learning in machine translation [9], where there exists a small number of bilingual sentence pairs, such as 'have a good day' in English (En) and 'bonne journée' in French (Fr), but unlimited monolingual sentences in En and Fr that are not labeled as pairs on the Internet. [9] leveraged the unlabeled data to improve the performance of machine translation from En to Fr, denoted as En→Fr. This is achieved by designing two translation models, including a model of En→Fr and a reverse model of Fr→En.

In particular, when a pair of bilingual sentences is presented, both models can be updated in a fully-supervised scenario. However, when a monolingual En sentence is presented, a Fr sentence is estimated by first applying En→Fr and then Fr→En′. If the input sentence En and the reproduced sentence En′ are close, the estimated Fr sentence is likely to be a good translation of En. Thus, they can be used as a bilingual pair in training. By training the models on a large number of such synthesized pairs, performance can be greatly improved, as shown in [9]. In general, the above two models behave as a loop to estimate the missing Fr sentences.

In this work, DIS extends [9] to four tuples, I, L, T, and T̂, where T̂ denotes the clean tags that are recovered from the noisy tags T. As shown in Fig. 1(c), DIS has three "translation" models forming a loop, including 1) I → (L, T̂), 2) T̂ → (L, T), and 3) L → I, where the first one predicts the labelmap L and clean tags T̂ given an image I, the second one refines L according to T̂, and the last one reconstructs I using the refined labelmap L.

Intuitively, DIS treats both L^w and T̂^w as latent variables, rather than only one latent variable L^w as in Fig. 1(b). For example, when a weakly labeled image I^w is presented, we can estimate L^w and T̂^w by performing I^w → (L^w, T̂^w), T̂^w → (L^w, T^w), and L^w → I^w′. As a result, when I^w and I^w′ are close, L^w and T̂^w can be used as a strong and a weak supervision for training, because they not only capture the precise object classes present in the image, but also the accurate object boundary and shape needed to reconstruct the image. Leveraging these synthesized ground truths in training, DIS is able to substantially reduce the number of fully annotated images, while still attaining state-of-the-art performance.

Different from Fig. 1(a,b), DIS iteratively optimizes two learning problems, as visualized in (c). Its objective function contains two parts, log p(L^f, L^w, T^f, T^w, T̂^w | I^f, I^w) + log p(I^f, I^w | L^f, L^w), where the first part is for segmentation and the second part is for image reconstruction. Compared to (b), the first part has one additional variable T̂^w and is factorized as p(L^f, L^w, T^f, T^w, T̂^w | I^f, I^w) ∝ p(L^f | I^f) · p(L^w | I^w, T̂^w) · p(T^f, T^w | I^f, T̂^w) · p(T̂^w | I^w). The first three terms are similar to those in (b), whilst the fourth term estimates the clean tags. The second part of the objective function reconstructs images using the predicted labelmaps.

In general, we propose a novel framework for semantic image segmentation, namely dual image segmentation (DIS), which significantly reduces the number of fully annotated images in training, while still achieving state-of-the-art performance. For example, we demonstrate the effectiveness of DIS on the Pascal VOC 2012 (VOC12) test set, outperforming a best-performing baseline by more than 12%. When adapting to VOC12, DIS reduces the number of fully annotated images by more than 75%.
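As a concrete illustration of the tag-guided refinement of Fig. 1(b) described above, the toy snippet below rescales an n × m × 21 score map by the image-level tag vector, reading the 1 × 1 × 21 "tag kernel" as a per-class scaling. This reading, the random scores, and the class indices (VOC convention, e.g. 8 = 'cat', 12 = 'dog') are assumptions made for illustration only, not the authors' exact implementation.

```python
# Toy sketch of tag-guided labelmap refinement (Fig. 1(b)).
import numpy as np

n, m, num_classes = 4, 4, 21           # small response map; VOC12 has 21 classes
scores = np.random.rand(n, m, num_classes)
scores[..., 8] = scores[..., 12]       # make 'cat' (8) and 'dog' (12) equally likely

tags = np.zeros(num_classes)
tags[0] = 1.0                          # background present
tags[12] = 1.0                         # tags say 'dog' is in the image, 'cat' is not

refined = scores * tags                # 1x1 "convolution" with the tag vector
labelmap = refined.argmax(axis=-1)     # pseudo ground truth for back-propagation
print(np.unique(labelmap))             # 'cat' (8) can no longer win any pixel
```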
Table 1: The numbers of training samples of recent weakly-, semi-, and fully-supervised segmentation methods, compared from top to bottom. We take VOC12 as an example to compare their experimental settings. 'Pixel', 'Tag', 'Bbox', and 'Scribble' indicate different types of supervision, including per-pixel labelmaps, image-level tags, bounding boxes, and scribbles. 'V', 'VA', 'I', 'V7', and 'C' indicate that images come from different data sources, where V=VOC12 [6], VA=VOC12+VOC Augment [8], I=ImageNet [25], V7=VOC07 [5], and C=COCO [16]. All these data are manually labeled, except the data from 'W', which are collected from the Internet.

                      Pixel       | Tag               | Bbox     | Scribble
MIL-FCN [23]          -           | 10k, VA           | -        | -
MIL-sppxl [24]        -           | 760k, I           | -        | -
CCNN [22]             -           | 10k, VA           | -        | -
WSSL (semi) [21]      15k, VA+C   | 118k, C           | -        | -
BoxSup [4]            10k, VA     | -                 | 133k, C  | -
ScribbleSup [13]      11k, VA     | -                 | -        | 10k, V7
DIS (ours)            2.9k, V     | 40k, W + 10k, VA  | -        | -
WSSL (full) [21]      133k, C     | -                 | -        | -
DeepLabv2-CRF [3]     12k, VA     | -                 | -        | -
CentraleSupelec [2]   12k, VA     | -                 | -        | -
DPN [17]              12k, VA     | -                 | -        | -
RNN [30]              12k, VA     | -                 | -        | -

Table 2: Comparisons of different image/labelmap generation models, in terms of 'network input', 'network output', and 'latent representation'. In general, [11] generated realistic images given labelmaps as inputs, while [1, 29] output labelmaps. Different from DIS, all these methods learned a latent feature coding to improve the quality of image/labelmap generation in a fully-supervised scenario, which requires plenty of fully annotated images.

              Network Input    | Network Output     | Latent Representation
GAN [11]      labelmap         | image              | coding
AE [1]        image            | labelmap           | coding
VAE [29]      coding + image   | labelmap           | coding
DIS (ours)    image            | labelmap + image   | labelmap + clean tags

1.1. Related Works

Semantic Image Segmentation. We review recent advanced segmentation methods by dividing them into three groups: weakly-, semi-, and fully-supervised methods, which are listed in Table 1 from top to bottom respectively. We take VOC12 as an example to compare their experimental settings. As an approach may have multiple settings, we choose the best-performing one for each approach. We can see that DIS reduces the number of labelmaps by 76% and 97% compared to the fully-supervised methods [3, 2, 17, 30, 18] and [21] respectively. In comparison with the weakly- and semi-supervised methods, the number of images used by DIS is also smaller than those of many previous works [24, 21, 4]. Unlike existing works that employed manually labeled data, the image-level tags in DIS are mainly collected from the Internet without manual cleaning and refinement.

Image/Labelmap Generation. Transforming between a labelmap and an image has been explored in the literature [11, 1, 29]. Table 2 compares these models with DIS, where [11] aimed at generating realistic images given labelmaps, whilst [1, 29] produced labelmaps using encoder-decoder (AE) and variational auto-encoder (VAE) models. All these approaches treated the feature coding as a latent representation to improve the quality of image or labelmap generation in a fully-supervised scenario, where a large number of fully annotated images is required. In contrast, DIS infers the latent labelmaps and clean tags to improve segmentation accuracy in a semi-supervised scenario.

2. Our Approach

Weak Supervisions. To improve segmentation accuracy using weak supervisions, we employ the Image Description in the Wild (IDW) dataset [26], which has 40 thousand images selected from the Internet. These images are associated with image-level tags, object interactions, and image descriptions. In this work, we only leverage the image-level tags, but not the object interactions. To improve image segmentation using object interactions and descriptions, we refer the readers to [26]. Some images and descriptions of IDW are given in Fig. 2.

Figure 2: Some pairs of images and sentences in IDW, for example: 'Child is playing with sheep in the field.'; 'Man having breakfast and a cat looking into his plate.' (table, chair); 'Birds and a fishing boat in Pier.' (bicycle); 'A cute young puppy licking the face of a pretty young girl as she is laughing.' (cat); 'Young family watching TV together at home.' (bottle, sofa); 'Man waiting for loading a pc application.' (monitor, cat, table). Different tags are highlighted using different colors, where words in 'blue', 'green', 'red', and 'orange' respectively indicate synonyms of VOC12 tags, tags not presented in the images, VOC12 tags, and tags that appear in the images but are missed in the sentences. Best viewed in color.


Figure 3: Pipeline of DIS, which contains four key components, including a ResNet101 for feature extraction, and three subnets denoted as 'subnet-1, -2, -3' for labelmap prediction, image reconstruction, and tag classification respectively. Best viewed in color.

An appealing property of IDW is that each image can be associated with multiple tags by parsing the sentences. A deficiency is that these sentences are not sufficiently accurate: important tags may be missing, or may not be presented in the images, as highlighted in orange and green in Fig. 2. For instance, as shown in the top-right image, the 'birds' are too small to be observed in the image and a 'bicycle' is missing from the sentence. Leveraging them as supervisions may hinder the training procedure. Nevertheless, DIS is able to recover the missing labelmaps and clean tags to boost segmentation performance.

2.1. Network Overview

Fig. 3 illustrates the diagram of DIS, which has four important components, including a building block of ResNet101 for feature extraction, and three subnets marked as '1', '2', and '3' for labelmap prediction (blue), image reconstruction (green), and tag classification (pink), respectively. The convolutional feature maps in these subnets are distinguished by using u, z, and v. For example, u1 indicates the first feature map of subnet-1.

Baseline. To evaluate the effectiveness of DIS, we use the ResNet101 network architecture as the baseline model. As shown in Fig. 3, given an image I, ResNet101 produces a feature map of 2048×45×45 and a feature vector of 2048×1, denoted as u1 and v1 respectively.

Forward Propagation. Fig. 3 illustrates the forward flows of the three subnets using solid arrows in gray, which are explained in detail below. Subnet-1 contains an elementwise-sum layer and a convolutional layer. In particular, the elementwise-sum layer takes u1 and v1 as inputs and produces u2 = u1 ⊕ up(v1), where up(·) represents an upsampling operation that replicates the 2048 × 1 feature vector v1 into a feature map of 2048 × 45 × 45, and ⊕ denotes the entry-wise sum between u1 and up(v1). In this case, the pixel-level features u1 can borrow information from the image-level features v1 to improve segmentation. After that, a convolutional layer applies a 2048×3×3×21 kernel on u2 to produce u3, which represents the response maps of the 21 categories of VOC12. Each entry of u3 indicates the probability that a category appears at a specific location.

Subnet-2 takes u3 as input and reconstructs the image, denoted as z3, by stacking three convolutional layers. Specifically, the sizes of the kernels from u3 to z3 are 21 × 5 × 5 × 21, 21 × 3 × 3 × 16, and 16 × 3 × 3 × 3.

Subnet-3 has an elementwise-sum layer and a convolutional layer, similarly to subnet-1. In particular, a feature vector v2 of length 2048 is produced by fusing v1 and u1, such that v2 = avgpool(u1) ⊕ v1, where avgpool indicates average pooling. In this case, the image-level features are improved by the pixel-level features to facilitate tag classification. This is extremely useful when the tags are noisy. After that, v2 is projected into a response vector v3 of size 21 × 1, where each entry indicates the probability of the presence of a category in the image.

Inference in Test. We introduce the testing procedure of DIS first; training is discussed in Sec. 3. Different from the ordinary ResNet101 [3], DIS enables iterative inference in the testing stage to gradually improve the accuracy of the predicted labelmap. This is an important contribution of DIS. In general, it is achieved by minimizing the image reconstruction loss (in subnet-2) with respect to the pixel- and image-level features u1 and v1, while keeping the learned network parameters fixed.

More specifically, given an image I in test, the objective function of inference can be written as u1*, v1* = arg min_{u1,v1} ||z3 − I||₂². Let t be the iteration of optimization. For example, u1^t denotes the variable at the t-th iteration. As shown in Fig. 3, when t = 0, u1^0 and v1^0 are the features extracted by ResNet101 from the input image I at the beginning of inference. When t > 0, these features are forwarded to z3, which can be considered as a function taking u1^t, v1^t as inputs. Therefore, by freezing all the network parameters, we can refine these features by treating u1^t, v1^t as variables and back-propagating the gradients of the above objective function to them. After t iterations, u1* and v1* represent the refined features, obtained by reconstructing the image so as to capture accurate object boundaries.

Figure 4: Inference in test. Two examples are selected from IDW. From left to right, the three images represent an input image and the labelmaps when t = 0 and t = 30 respectively. 'Pink', 'blue', and 'green' indicate regions of 'person', 'motorbike', and 'bicycle'. Best viewed in color and 200% zoom.

Fig. 4 compares the results when t = 0 and t = 30. When t = 0, the results are the outputs of ResNet101 without inference. These examples show that inference is capable of producing more accurate results as t increases, even though it is difficult to distinguish the appearances of 'person' and 'motorbike' in the first example. Also, the noisy predictions ('motorbike') can be removed from the second one.

Final Prediction. After inference, we propagate u1*, v1* forward to obtain the predicted labelmap u3 and tags v3. The final labelmap prediction is attained by combining them, where v3 is treated as a convolutional kernel and convolved on u3. We have u3 = conv(u3, v3), as shown by the blue dashed arrow in Fig. 3.
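The following PyTorch-style sketch mirrors the pipeline of Fig. 3 and the test-time refinement under the sizes stated above (2048×45×45 backbone features, 21 classes). The module names, padding choices, the SGD refinement settings, and the final per-class weighting are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of the DIS subnets and iterative inference in test.
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_CLASSES = 21  # 20 VOC object classes + background

class DISHead(nn.Module):
    """Subnets 1-3 on top of ResNet101 features u1 (2048x45x45) and v1 (2048)."""
    def __init__(self):
        super().__init__()
        # subnet-1: 2048x3x3x21 kernel producing the 21 response maps u3
        self.seg_conv = nn.Conv2d(2048, NUM_CLASSES, kernel_size=3, padding=1)
        # subnet-2: three conv layers reconstructing the image z3 from u3
        self.rec = nn.Sequential(
            nn.Conv2d(NUM_CLASSES, 21, 5, padding=2), nn.ReLU(),
            nn.Conv2d(21, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 3, 3, padding=1))
        # subnet-3: projects the fused 2048-d vector v2 to 21 tag scores v3
        self.tag_fc = nn.Linear(2048, NUM_CLASSES)

    def forward(self, u1, v1):
        u2 = u1 + v1[:, :, None, None]                       # u2 = u1 (+) up(v1)
        u3 = self.seg_conv(u2)                               # labelmap responses
        z3 = self.rec(u3)                                    # reconstructed image
        v2 = F.adaptive_avg_pool2d(u1, 1).flatten(1) + v1    # v2 = avgpool(u1) (+) v1
        v3 = self.tag_fc(v2)                                 # tag responses
        return u3, z3, v3

def infer_in_test(head, u1_init, v1_init, image, steps=30, lr=0.1):
    """Refine (u1, v1) by minimizing ||z3 - I||^2 with all parameters frozen."""
    for p in head.parameters():
        p.requires_grad_(False)
    u1 = u1_init.clone().requires_grad_(True)
    v1 = v1_init.clone().requires_grad_(True)
    opt = torch.optim.SGD([u1, v1], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        _, z3, _ = head(u1, v1)
        F.mse_loss(z3, image).backward()                     # reconstruction loss only
        opt.step()
    with torch.no_grad():
        u3, _, v3 = head(u1, v1)
        # final prediction: u3 refined by v3, read here as per-class weighting
        return u3.softmax(dim=1) * torch.sigmoid(v3)[:, :, None, None]

# usage with random stand-ins for the ResNet101 features and a 45x45 image target
u1, v1 = torch.randn(1, 2048, 45, 45), torch.randn(1, 2048)
image = torch.rand(1, 3, 45, 45)
pred = infer_in_test(DISHead(), u1, v1, image, steps=5)
print(pred.shape)  # torch.Size([1, 21, 45, 45])
```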

3. Training

DIS is trained in two stages. The first stage pretrains the network using the fully annotated images only. In the second stage, the network is finetuned using both the fully and weakly annotated images.

Fully-supervised Stage. In the first stage, given the strongly labeled data {I^f, L^f, T^f}, DIS is pretrained in a fully-supervised manner with three loss functions, which are indicated by the solid red arrows at the end of the three subnets in Fig. 3: a softmax loss for labelmap prediction, L^map(u3, L^f), a Euclidean loss for image reconstruction, L^img(z3, I^f), and a softmax loss for tag classification, L^tag(v3, T^f). For instance, L^map(u3, L^f) denotes the loss function of labelmap prediction, where u3 and L^f are the predicted and ground-truth labelmaps respectively. More specifically, the four components of DIS are progressively trained in three steps. First, we train three components, including ResNet101, subnet-1, and subnet-3, to predict labelmaps and tags. Second, subnet-2 is learned to reconstruct images while freezing the parameters of the above components. Finally, all four components are jointly updated. The rationale behind this multi-step training scheme is that it encourages a smooth course of multitask minimization; a similar idea has been demonstrated in many previous works [12, 10].

3.1. Semi-supervised Stage

In the second stage, DIS is finetuned on two sources of data, the weakly and fully annotated data, to improve segmentation performance. They are represented by {I, L, T}, where I = {I^f, I^w}, L = {L^f, L^w}, and T = {T^f, T^w, T̂^w}. The missing labelmap L^w and clean tags T̂^w are inferred and used as ground truths for training. This is another main contribution of DIS. We first explain the objective function and then discuss the training steps.

Let θ be the set of parameters of the entire network. Then the learning problem can be formulated as

  arg min_θ { L^map(u3, L) + L^img(z3, I) + L^tag(v3, T) }
  + arg min_{u1,v1} { L^img(z3, I^w) + L^tag(v3, T^w) },     (1)

which contains two parts that are optimized iteratively. The first part is identical to the fully-supervised stage above, where each input image I, regardless of which data source it comes from, is used to update θ by minimizing the losses between its predictions {u3, z3, v3} and the ground truths {L, I, T}. However, for a weakly annotated image I^w, its labelmap L^w and clean tags T̂^w are unobserved. Thus, they are estimated by minimizing the second part of Eqn. (1).

In general, given an image I^f with strong supervisions, DIS can be simply finetuned by optimizing the first part of Eqn. (1). Otherwise, when I^w is presented, DIS is finetuned by iteratively minimizing both parts of Eqn. (1) in two steps: (i) inferring L^w and T̂^w by freezing θ and updating u1 and v1, and (ii) updating the network parameters θ by using the inferred L^w and T̂^w as ground truths.

Inference in Training. We introduce the first step in detail. The inference in training is similar to that in test. The main difference is that u1 and v1 are updated by minimizing the losses of both image reconstruction and tag classification, rather than only the image reconstruction as we did in test. In this case, u1 and v1 receive gradients from two paths, as shown in Fig. 3. One is from v3 and the other is from u3 and z3. As the subnets have been pretrained on strongly annotated data, iteratively updating u1 and v1 makes them capture an accurate spatial representation and clean tags, because the noise in them is removed in order to produce the labelmaps and reconstruct the images.

Let t be the iteration of optimization. Then u1^0 and v1^0 are the outputs of ResNet101 when t = 0 at the beginning of inference. For t > 0, we update u1^t, v1^t by propagating the gradients of the two loss functions back to them, while keeping the network parameters fixed. After t iterations, we obtain u1* and v1*, which are forwarded to u3 and v3. The final prediction of u3 is refined by v3 using convolution, following the inference in test, as shown by the blue dashed arrows in Fig. 3. Then the ground-truth labelmap and clean tags can be obtained by applying the softmax function to u3 and v3.

Implementation Details. For a fair comparison, the parameters of ResNet101 are initialized in the same way as [3] by training on ImageNet and COCO. When adapting to VOC12, our baseline model is trained on 2.9k fully supervised samples, including images and their labelmaps. The weakly-supervised data with image-level tags are chosen from the IDW and VOC augment datasets. The parameters of the three subnets are initialized by sampling from a normal distribution. The entire network is trained using stochastic gradient descent with momentum.

An Interesting Finding. The number of iterations t in the inference of training and test can be different. In general, when more iterations are performed in training, fewer iterations are required in test, and vice versa. In other words, computation time in test can be simply reduced by increasing the inference iterations in training. Conversely, training time can be reduced by increasing the number of iterations in test. We find that both strategies provide remarkable segmentation accuracies. More details are presented in the experiments.
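The sketch below illustrates the two-step semi-supervised update of Eqn. (1) for a weakly labeled image. It assumes a `backbone` returning the ResNet101 features (u1, v1), a head such as the DISHead sketch above, an optimizer `opt_theta` over all network parameters, a 45×45 downsampled reconstruction target `image`, and a multi-hot `noisy_tags` vector. The BCE tag loss and the 0.5 threshold for the cleaned tags are illustrative stand-ins, not the authors' exact choices.

```python
# Minimal sketch of step (i) inference in training and step (ii) parameter update.
import torch
import torch.nn.functional as F

def semi_supervised_step(backbone, head, opt_theta, image, noisy_tags,
                         infer_steps=10, infer_lr=0.1):
    # ---- step (i): refine (u1, v1) with theta frozen ----
    for p in list(backbone.parameters()) + list(head.parameters()):
        p.requires_grad_(False)
    with torch.no_grad():
        u1, v1 = backbone(image)
    u1 = u1.detach().requires_grad_(True)
    v1 = v1.detach().requires_grad_(True)
    opt_feat = torch.optim.SGD([u1, v1], lr=infer_lr)
    for _ in range(infer_steps):
        opt_feat.zero_grad()
        u3, z3, v3 = head(u1, v1)
        # reconstruction + tag losses (second part of Eqn. (1))
        loss = F.mse_loss(z3, image) + \
               F.binary_cross_entropy_with_logits(v3, noisy_tags)
        loss.backward()
        opt_feat.step()

    # derive pseudo ground truths L^w and T_hat^w from the refined features
    with torch.no_grad():
        u3, _, v3 = head(u1, v1)
        tag_probs = torch.sigmoid(v3)
        pixel_probs = u3.softmax(dim=1) * tag_probs[:, :, None, None]
        pseudo_labelmap = pixel_probs.argmax(dim=1)          # (B, H, W) class ids
        clean_tags = (tag_probs > 0.5).float()

    # ---- step (ii): update theta using the inferred labels as ground truth ----
    for p in list(backbone.parameters()) + list(head.parameters()):
        p.requires_grad_(True)
    opt_theta.zero_grad()
    u1, v1 = backbone(image)
    u3, z3, v3 = head(u1, v1)
    loss = F.cross_entropy(u3, pseudo_labelmap) + \
           F.mse_loss(z3, image) + \
           F.binary_cross_entropy_with_logits(v3, clean_tags)
    loss.backward()
    opt_theta.step()
    return loss.item()
```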
4. Experiments

We evaluate the effectiveness of DIS in three aspects. In Sec. 4.1, DIS is compared with existing segmentation methods on the VOC12 test set [6]. When adapting to VOC12, DIS is trained on 2.9k pixel-level annotations and 50k image-level tags. In particular, 40k tags are from IDW and the other 10k tags are from the VOC augment dataset. In contrast, previous works typically combined VOC12 and the VOC augment dataset [8], resulting in 12k pixel-level annotations. For initialization, existing methods initialized their networks by pretraining on ImageNet [25] and COCO [16], which typically brings a 2∼3% performance gain. For a fair comparison, we mainly report and compare to results that were pretrained on the above two datasets.

In Sec. 4.2, we study the impact of the number of iterations in inference, for both the training and testing stages. In Sec. 4.3, we examine the generalization of DIS on a test set of IDW.

4.1. Comparisons with Previous Works

The segmentation accuracy of DIS is compared to those of the state-of-the-art methods on the VOC12 test set. We adopt 11 representative fully-supervised methods, including SegNet [1], FCN [19], Zoom-out [20], WSSL (full) [21], RNN [30], Piecewise [14], DPN [17], DeepLabv2 [3], LRR-4x-Res [7], HP [27], and CentraleSupelec [2]. We also employ two best-performing semi-supervised approaches, WSSL (semi) [21] and BoxSup [4].

Results are reported in the upper three blocks of Table 3, where the superscript † indicates methods whose baseline models are pretrained on both ImageNet and COCO. We can see that the recent fully-supervised methods generally achieved better results than those of the semi-supervised methods, such as WSSL (semi) and BoxSup. However, when training WSSL on additional images from IDW, it obtains a performance of 81.9% that outperforms all previous works. A new record of 86.8% is achieved by DIS, which significantly outperforms the baseline and WSSL+IDW by 12.6% and 4.9% respectively, and reduces the number of fully annotated images by 75% and 97% compared to them.

Properties of several representative works are compared in Table 4, in terms of the number of network parameters, training on manually labeled data, CRF post-processing, multiscale testing, and runtime per image on a Titan-X GPU. We have four main observations. First, fusing multiscale features to improve performance is a common practice, which typically leads to a 1∼2% gain [17, 4]. Second, all approaches except DIS are trained on manually labeled data, whilst the weakly-supervised data for DIS are automatically collected from the Web. Third, methods that have smaller numbers of parameters execute faster, but sometimes sacrifice performance, such as [1]. Fourth, most of the computation time in existing methods such as [3, 17] is occupied by CRF post-processing. Note that the runtime of CRF is not counted in Table 4.

4.2. Ablation Study

We study the effect of using different numbers of iterations in inference, as shown at the bottom of Table 3, where ttr and tts represent the numbers of inference iterations in training and test respectively. We evaluate 10 different setups. In the first setup, when ttr = 0, DIS degenerates to the recent semi-supervised learning scheme without refining labelmaps and tags. We can see that it outperforms the baseline by 5.8%. For the remaining setups, when ttr = 5, 10, and 30, DIS attains more than 5% gain compared to the first one.

We have several important observations. First, when tts = 0, setup ids '2', '5', and '8' outperform '1' and the baseline by more than 5% and 10% respectively, demonstrating the importance of inference in training. Second, when the values of ttr are the same, increasing tts improves performance. For example, the seventh setup outperforms the fifth one by 0.9%, showing the usefulness of inference in test. Third, when the values of tts are the same, a larger ttr obtains better performance. For instance, setup id '9' has a 1.1% improvement over '3'. Finally, the best performance of 86.8% is achieved when ttr = 30 and tts = 30.

Fig. 5 visualizes some segmentation results. As our purpose is not to generate high-quality images, DIS is supervised by downsampled images of 45×45 in order to save computation. These images are able to capture sufficient details of object shapes and boundaries, as shown in the second column. In general, the predicted labelmaps become more accurate when more inference iterations are performed. Note that the original images are put in the first column for better visualization.

Table 3: Comparisons on VOC12 test set. Best performance of each category is highlighted in bold.

                      areo bike bird boat bottle bus car cat chair cow table dog horse mbike person plant sheep sofa train tv mIoU
SegNet [1]            73.6 37.6 62.0 46.8 58.6 79.1 70.1 65.4 23.6 60.4 45.6 61.8 63.5 75.3 74.9 42.6 63.7 42.5 67.8 52.7 59.9
FCN [19]              76.8 34.2 68.9 49.4 60.3 75.3 74.7 77.6 21.4 62.5 46.8 71.8 63.9 76.5 73.9 45.2 72.4 37.4 70.9 55.1 62.2
Zoom-out [20]         85.6 37.3 83.2 62.5 66.0 85.1 80.7 84.9 27.2 73.2 57.5 78.1 79.2 81.1 77.1 53.6 74.0 49.2 71.7 63.3 69.6
WSSL (full)† [21]     89.2 46.7 88.5 63.5 68.4 87.0 81.2 86.3 32.6 80.7 62.4 81.0 81.3 84.3 82.1 56.2 84.6 58.3 76.2 67.2 73.9
RNN† [30]             90.4 55.3 88.7 68.4 69.8 88.3 82.4 85.1 32.6 78.5 64.4 79.6 81.9 86.4 81.8 58.6 82.4 53.5 77.4 70.1 74.7
Piecewise† [14]       92.3 38.8 82.9 66.1 75.1 92.4 83.1 88.6 41.8 85.9 62.8 86.7 88.4 84.0 85.4 67.4 88.8 61.9 81.9 71.7 77.2
DPN† [17]             89.0 61.6 87.7 66.8 74.7 91.2 84.3 87.6 36.5 86.3 66.1 84.4 87.8 85.6 85.4 63.6 87.3 61.3 79.4 66.4 77.5
DeepLabv2† [3]        92.6 60.4 91.6 63.4 76.3 95.0 88.4 92.6 32.7 88.5 67.6 89.6 92.1 87.0 87.4 63.3 88.3 60.0 86.8 74.5 79.7
LRR-4x-Res† [7]       92.4 45.1 94.6 65.2 75.8 95.1 89.1 92.3 39.0 85.7 70.4 88.6 89.4 88.6 86.6 65.8 86.2 57.4 85.7 77.3 79.3
HP† [27]              91.9 48.1 93.4 69.3 75.5 94.2 87.5 92.8 36.7 86.9 65.2 89.1 90.2 86.5 87.2 64.6 90.1 59.7 85.5 72.7 79.1
CentraleSupelec† [2]  92.9 61.2 91.0 66.3 77.7 95.3 88.9 92.4 33.8 88.4 69.1 89.8 92.9 87.7 87.5 62.6 89.9 59.2 87.1 74.2 80.2
WSSL (semi)† [21]     80.4 41.6 84.6 59.0 64.7 84.6 79.6 83.5 26.3 71.2 52.9 78.3 72.3 83.3 79.1 51.7 82.1 42.5 75.0 63.4 69.0
BoxSup† [4]           89.8 38.0 89.2 68.9 68.0 89.6 83.0 87.7 34.4 83.6 67.1 81.5 83.7 85.2 83.5 58.6 84.9 55.8 81.2 70.7 75.2
WSSL†+IDW             94.7 62.3 93.3 65.5 75.8 94.6 89.7 93.9 38.6 93.8 72.2 91.4 95.5 89.0 88.4 66.0 94.5 60.4 91.3 74.1 81.9
ResNet101† [3]        n/a  n/a  n/a  n/a  n/a  n/a  n/a  n/a  n/a  n/a  n/a  n/a  n/a  n/a  n/a  n/a  n/a  n/a  n/a  n/a  74.2
DIS† (ours)           94.4 73.2 93.4 79.5 84.5 95.3 89.4 93.4 54.1 94.6 79.1 93.1 95.4 91.6 89.2 77.6 93.5 79.2 93.9 80.7 86.8

1. ttr = 0             93.3 58.6 90.9 67.9 76.0 94.2 88.6 91.1 35.8 89.7 70.5 87.1 92.2 87.7 86.8 65.3 88.5 60.9 85.9 74.4 80.0
2. ttr = 5, tts = 0    93.1 68.2 92.2 73.9 82.1 94.7 87.9 92.9 47.4 93.2 77.0 90.7 92.2 91.1 87.8 75.5 91.9 75.5 92.6 79.8 84.6
3. ttr = 5, tts = 10   94.0 69.6 93.1 73.3 83.8 95.2 89.1 93.4 48.8 93.8 77.6 92.0 94.6 91.1 88.3 75.7 93.1 75.8 93.3 81.1 85.4
4. ttr = 5, tts = 30   93.9 70.7 93.1 77.6 83.4 95.2 89.2 93.3 53.3 94.1 78.7 92.8 94.8 91.4 88.8 77.0 93.0 78.6 93.7 81.1 86.2
5. ttr = 10, tts = 0   94.2 67.6 93.5 75.9 82.8 95.1 89.5 94.1 48.2 94.5 76.8 93.4 95.2 91.7 88.6 75.3 93.5 76.3 94.1 78.3 85.5
6. ttr = 10, tts = 10  94.2 67.9 93.6 74.8 84.3 95.7 89.1 93.3 52.4 94.9 77.9 92.1 95.2 90.5 88.2 76.9 93.7 78.7 93.1 80.3 85.9
7. ttr = 10, tts = 30  94.7 68.7 93.8 75.7 84.9 95.8 89.7 94.3 51.7 95.2 78.5 93.2 95.6 91.9 88.8 77.9 93.9 78.6 94.3 80.0 86.4
8. ttr = 30, tts = 0   93.6 69.6 93.8 76.1 84.4 95.6 89.6 94.4 49.7 95.0 78.1 93.5 96.0 92.2 89.0 77.5 93.8 78.6 93.9 80.9 86.3
9. ttr = 30, tts = 10  93.3 73.0 93.2 79.1 84.1 95.3 89.5 93.4 53.9 94.3 79.0 92.8 94.9 91.4 89.1 77.5 93.5 79.2 93.6 80.5 86.5
10. ttr = 30, tts = 30 94.4 73.2 93.4 79.5 84.5 95.3 89.4 93.4 54.1 94.6 79.1 93.1 95.4 91.6 89.2 77.6 93.5 79.2 93.9 80.7 86.8

In Table 3, we also see that the results of 'ttr = 30, tts = 0', 'ttr = 5, tts = 30', and 'ttr = 10, tts = 30' are comparable, at 86.3, 86.2, and 86.4 respectively. We find that this is a useful feature of DIS. In particular, the first one indicates that we can reduce runtime in test by increasing the iterations of inference in training, when computational cost is a priority; to the extreme, inference is not performed in test when tts = 0, while still maintaining high performance. The last two tell us that when model deployment is urgent, we can reduce the iterations in training and increase those in test, such as tts = 30.

Table 4: Comparisons of several representative fully- and semi-supervised segmentation methods. '#params', 'manual', 'CRF', 'multi.', and 'speed (ms)' indicate the number of parameters, training on manually labeled data, CRF post-processing, multiscale feature fusion, and runtime per image in milliseconds (excluding CRF).

                      #params | manual | CRF | multi. | speed (ms)
FCN [19]              134M    | ✓      | ×   | ✓      | 160
DPN [17]              134M    | ✓      | ✓   | ✓      | 175
CentraleSupelec [2]   45M     | ✓      | ✓   | ✓      | 140
SegNet [1]            16M     | ✓      | ×   | ×      | 75
DeepLabv2 [3]         45M     | ✓      | ✓   | ✓      | 140
WSSL [21]             134M    | ✓      | ✓   | ✓      | 200
BoxSup [4]            134M    | ✓      | ✓   | ✓      | –
DIS (ours)            45.5M   | ×      | ×   | ✓      | 140

Another interesting finding from Table 3 is that different numbers of inference iterations induce diversity among these models, as disclosed by the top performance of each class in bold. In other words, accuracy can be further boosted by ensembling models with different numbers of inference iterations.

4.3. Weakly-supervised Segmentation

We examine the generalization of DIS on the IDW test set, which consists of one thousand manually labeled images. This test set is challenging, because the number of object categories per image is larger than in existing datasets, i.e. 2.23 compared to 1.48 in VOC12 and 1.83 in COCO. We evaluate 7 setups of DIS, similar to the above, and compare them to the baseline ResNet101 and WSSL+IDW.

The results are reported in Table 5. In general, since only image-level tags are available in training, the performances on the IDW test set are much lower than those on the VOC12 test set. However, similar trends can be observed on these two datasets, demonstrating the effectiveness of DIS. It outperforms the baseline and WSSL+IDW by 9.2% and 7.8% respectively, when ttr = 30 and tts = 30.

Figure 5: Segmentation examples on VOC12 test set. The columns from left to right show the original images, the downsampled images used as supervision, and the results when 'ttr = 0', 'ttr = 5, tts = 0', 'ttr = 10, tts = 10', and 'ttr = 30, tts = 30' respectively. In general, the predicted labelmaps capture object classes and boundaries better when more inference iterations are performed. For example, the regions of 'sofa', 'plant', and 'cat' are correctly identified in the bottom-right labelmap.

Table 5: Comparisons on IDW test set. Best performance of each category is highlighted in bold.

                       areo bike bird boat bottle bus car cat chair cow table dog horse mbike person plant sheep sofa train tv mIoU
1. ttr = 0             48.0 39.7 64.3 23.4 43.2 57.6 55.8 60.8 27.9 61.3 43.6 69.2 74.5 59.2 75.8 31.8 52.4 29.1 30.5 62.6 51.6
2. ttr = 10, tts = 0   58.5 43.2 67.1 22.4 52.3 63.9 59.7 71.0 38.0 74.1 45.6 74.9 74.4 67.1 80.4 36.2 47.9 26.7 38.8 64.8 56.6
3. ttr = 10, tts = 10  63.7 47.0 72.5 23.7 51.5 66.4 58.4 70.3 35.5 80.0 47.9 73.1 71.4 67.7 82.1 33.2 55.1 31.2 33.0 66.8 58.0
4. ttr = 10, tts = 30  66.8 39.6 72.3 21.4 57.1 68.8 62.5 72.2 33.6 75.6 50.7 75.6 78.0 68.4 79.9 32.5 57.3 33.4 39.0 70.7 59.0
5. ttr = 30, tts = 0   62.7 44.4 72.5 26.8 54.3 65.7 60.7 71.8 38.3 62.6 48.2 75.4 74.7 68.4 80.5 34.4 49.2 26.1 39.1 68.9 57.5
6. ttr = 30, tts = 10  61.9 49.6 68.2 29.6 48.8 66.4 63.8 73.7 35.6 72.1 48.8 76.8 76.7 68.9 81.6 44.8 44.8 33.6 33.6 66.4 58.8
7. ttr = 30, tts = 30  61.7 47.0 68.6 31.9 54.2 72.5 66.5 71.3 39.1 66.0 54.1 78.9 80.6 69.2 84.7 43.0 44.9 32.4 40.2 59.1 59.8
ResNet101†             50.9 42.0 67.9 17.4 46.4 65.4 59.6 64.8 32.5 21.1 45.8 69.7 74.3 61.2 79.7 25.2 40.0 23.8 34.6 57.6 50.6
WSSL†+IDW              51.4 42.5 61.6 17.0 48.4 62.4 58.3 65.8 34.2 30.8 47.3 70.5 75.1 60.5 80.4 34.8 43.6 24.6 33.4 65.9 52.0

5. Conclusion

This work presented a novel learning setting for semi-supervised semantic image segmentation, namely dual image segmentation (DIS). DIS is inspired by the dual learning in machine translation [9], but has two unique aspects. First, DIS extends the two tuples of En and Fr translation in [9] to multiple tuples in the task of semantic image segmentation, by modeling a closed loop of generation among images, per-pixel labelmaps, and image-level tags. Second, the different "translation" models of DIS can be plugged into a single baseline network, which is trained end-to-end, unlike [9] where the two translation models are separated.

Different from existing semi-supervised segmentation methods, where a missing labelmap is treated as an unobserved variable, or as a latent variable inferred only from the tags, DIS estimates the missing labelmaps by not only satisfying the tags, but also reconstructing the images. These two problems are iteratively solved, making the inferred labelmaps capture the accurate object classes that are present in the images, as well as the accurate object shapes and boundaries needed to generate the images. Extensive experiments demonstrated the effectiveness of DIS on the VOC12 and IDW test sets, where it establishes new records of 86.8% and 59.8% respectively, outperforming the baseline by 12.6% and 9.2% and reducing the number of fully annotated images by more than 75%.

Acknowledgement. This work is supported in part by SenseTime Group Limited, the Hong Kong Innovation and Technology Support Programme, and the National Natural Science Foundation of China (61503366, 91320101, 61472410, 61622214). Correspondence to: Ping Luo and Liang Lin.

References

[1] V. Badrinarayanan, A. Kendall, and R. Cipolla. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. arXiv:1511.00561, 2016.
[2] S. Chandra and I. Kokkinos. Fast, exact and multi-scale inference for semantic image segmentation with deep Gaussian CRFs. In ECCV, 2016.
[3] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. arXiv:1606.00915, 2016.
[4] J. Dai, K. He, and J. Sun. BoxSup: Exploiting bounding boxes to supervise convolutional networks for semantic segmentation. In ICCV, pages 1635–1643, 2015.
[5] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The PASCAL visual object classes (VOC) challenge. IJCV, 88(2):303–338, 2010.
[6] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results. http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html.
[7] G. Ghiasi and C. C. Fowlkes. Laplacian pyramid reconstruction and refinement for semantic segmentation. In ECCV, pages 519–534, 2016.
[8] B. Hariharan, P. Arbelaez, L. Bourdev, S. Maji, and J. Malik. Semantic contours from inverse detectors. In ICCV, 2011.
[9] D. He, Y. Xia, T. Qin, L. Wang, N. Yu, T.-Y. Liu, and W.-Y. Ma. Dual learning for machine translation. In NIPS, 2016.
[10] G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 2006.
[11] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. arXiv:1611.07004, 2016.
[12] C.-Y. Lee, S. Xie, P. Gallagher, Z. Zhang, and Z. Tu. Deeply-supervised nets. In AISTATS, 2015.
[13] D. Lin, J. Dai, J. Jia, K. He, and J. Sun. ScribbleSup: Scribble-supervised convolutional networks for semantic segmentation. In CVPR, 2016.
[14] G. Lin, C. Shen, I. Reid, et al. Efficient piecewise training of deep structured models for semantic segmentation. arXiv:1504.01013, 2015.
[15] L. Lin, G. Wang, R. Zhang, R. Zhang, X. Liang, and W. Zuo. Deep structured scene parsing by learning with image descriptions. In CVPR, 2016.
[16] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
[17] Z. Liu, X. Li, P. Luo, C.-C. Loy, and X. Tang. Semantic image segmentation via deep parsing network. In ICCV, pages 1377–1385, 2015.
[18] Z. Liu, X. Li, P. Luo, C. C. Loy, and X. Tang. Deep learning Markov random field for semantic segmentation. TPAMI, 2017.
[19] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, pages 3431–3440, 2015.
[20] M. Mostajabi, P. Yadollahpour, and G. Shakhnarovich. Feedforward semantic segmentation with zoom-out features. In CVPR, pages 3376–3385, 2015.
[21] G. Papandreou, L.-C. Chen, K. Murphy, and A. L. Yuille. Weakly- and semi-supervised learning of a DCNN for semantic image segmentation. In ICCV, 2015.
[22] D. Pathak, P. Krahenbuhl, and T. Darrell. Constrained convolutional neural networks for weakly supervised segmentation. In ICCV, pages 1796–1804, 2015.
[23] D. Pathak, E. Shelhamer, J. Long, and T. Darrell. Fully convolutional multi-class multiple instance learning. arXiv:1412.7144, 2014.
[24] P. O. Pinheiro and R. Collobert. From image-level to pixel-level labeling with convolutional networks. In CVPR, pages 1713–1721, 2015.
[25] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet large scale visual recognition challenge. IJCV, 115(3):211–252, 2015.
[26] G. Wang, P. Luo, L. Lin, and X. Wang. Learning object interactions and descriptions for semantic image segmentation. In CVPR, 2017.
[27] Z. Wu, C. Shen, and A. van den Hengel. High-performance semantic segmentation using very deep fully convolutional networks. arXiv:1604.04339, 2016.
[28] Z. Zhang, P. Luo, C. C. Loy, and X. Tang. Learning deep representation for face alignment with auxiliary attributes. TPAMI, 2016.
[29] H. Zheng, F. Wu, L. Fang, Y. Liu, and M. Ji. Learning high-level prior with convolutional neural networks for semantic segmentation. arXiv:1511.06988, 2015.
[30] S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. H. Torr. Conditional random fields as recurrent neural networks. In ICCV, pages 1529–1537, 2015.