Unsupervised Image Segmentation by Backpropagation

Asako Kanezaki
National Institute of Advanced Industrial Science and Technology (AIST)
2-4-7 Aomi, Koto-ku, Tokyo 135-0064, Japan

ABSTRACT

We investigate the use of convolutional neural networks (CNNs) for unsupervised image segmentation. As in the case of supervised image segmentation, the proposed CNN assigns labels to pixels that denote the cluster to which the pixel belongs. In the unsupervised scenario, however, no training images or ground truth labels of pixels are given beforehand. Therefore, once a target image is input, we jointly optimize the pixel labels together with feature representations while their parameters are updated by gradient descent. In the proposed approach, we alternately iterate label prediction and network parameter learning to meet the following criteria: (a) pixels of similar features are desired to be assigned the same label, (b) spatially continuous pixels are desired to be assigned the same label, and (c) the number of unique labels is desired to be large. Although these criteria are incompatible, the proposed approach finds a plausible label assignment that balances them well, and it demonstrates good performance on a benchmark dataset of image segmentation.

Index Terms — Convolutional neural networks, Unsupervised learning, Feature clustering

1. INTRODUCTION

Image segmentation has attracted attention in computer vision research for decades. The applications of image segmentation include object detection, texture recognition, and image compression. In the supervised scenario, in which a set of pairs of images and pixel-level semantic labels (such as "sky" or "bicycle") is used for training, the goal is to train a system that classifies the labels of known categories for image pixels. In the unsupervised scenario, by contrast, image segmentation is used to predict more general labels, such as "foreground" and "background". The latter case is more challenging than the former, and furthermore, it is extremely hard to segment an image into an arbitrary number (≥ 2) of plausible regions. The present study considers a problem in which an image is partitioned into an arbitrary number of salient or meaningful regions without any prior knowledge.

Once a pixel-level feature representation is obtained, image segments can be obtained by clustering the feature vectors. However, the design of the feature representation remains a challenge. The desired feature representation depends strongly on the content of the target image. For instance, if the goal is to detect zebras as foreground, the feature representation should be reactive to black-and-white vertical stripes. The pixel-level features should therefore describe the colors and textures of the local region surrounding each pixel. Recently, convolutional neural networks (CNNs) have been successfully applied to semantic image segmentation (in supervised learning scenarios), for example for autonomous driving or augmented reality games. CNNs are not often used in fully unsupervised scenarios; however, they have great potential for extracting detailed features from image pixels, which is necessary for unsupervised image segmentation. Motivated by the high feature descriptiveness of CNNs, we present a joint learning approach that, for an arbitrary input image, predicts unknown cluster labels and learns optimal CNN parameters for clustering the image pixels. We then extract the group of image pixels in each cluster as a segment.

We now describe the problem formulation that we solve for image segmentation. Let $\{x_n \in \mathbb{R}^p\}_{n=1}^{N}$ be a set of $p$-dimensional feature vectors of image pixels, where $N$ denotes the number of pixels in an input image. We assign cluster labels $\{c_n \in \mathbb{Z}\}_{n=1}^{N}$ to all of the pixels by $c_n = f(x_n)$, where $f : \mathbb{R}^p \rightarrow \mathbb{Z}$ denotes a mapping function. Here, $f$ can, for instance, be the assignment function that returns the ID of the cluster centroid closest to $x_n$ among $k$ centroids obtained by, e.g., $k$-means clustering. For the case in which $f$ and the feature representation $\{x_n\}$ are fixed, $\{c_n\}$ are obtained by the above equation. On the other hand, if $f$ and $\{x_n\}$ are trainable whereas $\{c_n\}$ are given (fixed), then the above equation can be regarded as a standard supervised classification problem. The parameters of $f$ and $\{x_n\}$ in this case can be optimized by gradient descent, provided that $f$ and the feature extraction functions for $\{x_n\}$ are differentiable. In the present study, however, we predict the unknown $\{c_n\}$ while training the parameters of $f$ and $\{x_n\}$ in a fully unsupervised manner. To put this into practice, we alternately solve the following two sub-problems: prediction of the optimal $\{c_n\}$ with $f$ and $\{x_n\}$ fixed, and training of the parameters of $f$ and $\{x_n\}$ with $\{c_n\}$ fixed.
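To make the alternation concrete, the following is a minimal PyTorch sketch of the two sub-problems under stated assumptions, not a definitive implementation: `net` is assumed to be a differentiable mapping from an input image to a per-pixel $q$-dimensional response (one such network is sketched in Sec. 2.1), the predicted labels are taken as the argmax of the responses, and the parameter update uses a softmax cross-entropy loss against those labels, a choice this excerpt does not specify and which is assumed here.

```python
import torch
import torch.nn as nn


def alternate_segmentation(net, image, n_iters=100, lr=0.1):
    """Alternately (1) predict cluster labels {c_n} with the network fixed and
    (2) update the network parameters with the labels fixed.
    net:   maps a 1 x 3 x H x W image to 1 x q x H x W per-pixel responses
    image: 1 x 3 x H x W tensor with values in [0, 1]
    """
    optimizer = torch.optim.SGD(net.parameters(), lr=lr, momentum=0.9)
    loss_fn = nn.CrossEntropyLoss()   # assumed loss; not specified in this excerpt
    labels = None
    for _ in range(n_iters):
        response = net(image)                      # response map {y_n}
        # Sub-problem 1: predict {c_n} with f and {x_n} fixed.
        labels = response.argmax(dim=1)            # 1 x H x W, values in {0, ..., q-1}
        # (The refinements of Secs. 2.2 and 2.3 would be applied to `labels` here.)
        # Sub-problem 2: train the parameters of f and {x_n} with {c_n} fixed.
        optimizer.zero_grad()
        loss = loss_fn(response, labels)
        loss.backward()
        optimizer.step()
        if labels.unique().numel() == 1:           # illustrative stop: one label left
            break
    return labels
```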
Let us now discuss the characteristics of the cluster labels $\{c_n\}$ necessary for good image segmentation. Similar to previous studies on unsupervised image segmentation [1, 2], we assume that a good image segmentation solution matches well the solution a human would provide. When a human is asked to segment an image, he/she would most likely create segments each of which corresponds to the whole, or a (salient) part, of a single object instance. An object instance tends to contain large regions of similar colors or texture patterns. Therefore, grouping spatially continuous pixels that have similar colors or texture patterns into the same cluster is a reasonable strategy for image segmentation. On the other hand, in order to separate segments of different object instances, it is better to assign different cluster labels to neighboring pixels of dissimilar patterns. To facilitate this cluster separation, we also consider a strategy in which a large number of unique cluster labels is desired. In conclusion, we introduce the following three criteria for the prediction of $\{c_n\}$:

(a) Pixels of similar features are desired to be assigned the same label.
(b) Spatially continuous pixels are desired to be assigned the same label.
(c) The number of unique cluster labels is desired to be large.

Note that these criteria are incompatible, so they are never satisfied perfectly. However, through a gradual optimization that considers all three criteria simultaneously, the proposed system finds a plausible solution of $\{c_n\}$ that balances these criteria well. In Section 2, we describe the proposed iterative approach for predicting $\{c_n\}$ that satisfy the above criteria.

2. METHOD

2.1. Constraint on feature similarity

Let us consider the first criterion, which assigns the same label to pixels of similar features. The proposed solution is to apply a linear classifier that classifies the features of each pixel into $q$ classes. In the present paper, we assume the input $I = \{v_n \in \mathbb{R}^3\}_{n=1}^{N}$ to be an RGB image, where each pixel value is normalized to $[0, 1]$. We compute a $p$-dimensional feature map $\{x_n\}$ from $\{v_n\}$ through $M$ convolutional components, each of which consists of a 2D convolution, a ReLU activation function, and a batch normalization function, where a batch corresponds to the $N$ pixels of a single input image. Here, we set $p$ filters of region size $3 \times 3$ for all of the $M$ components. Note that these components for feature extraction can be replaced by alternatives. The cluster label of each pixel is then obtained by argmax classification of the response map $\{y_n\}$ produced by the linear classifier, i.e., $c_n = \arg\max_i y_{n,i}$, where $y_{n,i}$ denotes the $i$th element of $y_n$. This is equivalent to assigning each pixel to the closest point among $q$ representative points, which are placed at infinite distance on the respective axes of the $q$-dimensional space. Note that the $i$th cluster $C_i$ can be $\emptyset$, and therefore the number of unique cluster labels is arbitrary, from $1$ to $q$.
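As a hedged illustration of the feature extractor just described, the sketch below stacks $M$ components of (3×3 convolution, ReLU, batch normalization) and realizes the per-pixel linear classifier into $q$ classes as a 1×1 convolution; the values of $p$, $q$, and $M$ are placeholders, since the excerpt does not fix them.

```python
import torch.nn as nn


class PixelFeatureNet(nn.Module):
    """M components of (3x3 conv -> ReLU -> batch norm), followed by a linear
    classifier into q classes implemented here as a 1x1 convolution.
    p, q, and M are placeholder values; the excerpt does not fix them."""

    def __init__(self, in_channels=3, p=100, q=100, M=3):
        super().__init__()
        layers = []
        ch = in_channels
        for _ in range(M):
            layers += [nn.Conv2d(ch, p, kernel_size=3, padding=1),
                       nn.ReLU(inplace=True),
                       nn.BatchNorm2d(p)]   # batch = the N pixels of one image
            ch = p
        self.features = nn.Sequential(*layers)
        self.classifier = nn.Conv2d(p, q, kernel_size=1)  # per-pixel linear classifier

    def forward(self, image):     # image: 1 x 3 x H x W, values in [0, 1]
        x = self.features(image)  # per-pixel features x_n (1 x p x H x W)
        y = self.classifier(x)    # response map y_n (1 x q x H x W)
        return y
```

Cluster labels then follow by argmax over the response map, e.g. `labels = PixelFeatureNet()(image).argmax(dim=1)`.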
2.2. Constraint on spatial continuity

The basic concept of image pixel clustering is to group similar pixels into clusters (as described in Sec. 2.1). In image segmentation, however, it is preferable for the clusters of image pixels to be spatially continuous. We therefore add a constraint that favors cluster labels that are the same as those of neighboring pixels. We first extract $K$ fine superpixels $\{\mathcal{S}_k\}_{k=1}^{K}$ (with a large $K$) from the input image $I = \{v_n\}_{n=1}^{N}$, where $\mathcal{S}_k$ denotes the set of indices of pixels that belong to the $k$th superpixel. Then, we force all of the pixels in each superpixel to have the same cluster label. More specifically, letting $|c_n|_{n \in \mathcal{S}_k}$ be the number of pixels in $\mathcal{S}_k$ that belong to the $c_n$th cluster, we select the most frequent cluster label $c_{\max}$, where $|c_{\max}|_{n \in \mathcal{S}_k} \geq |c_n|_{n \in \mathcal{S}_k}$ for all $c_n \in \{1, \dots, q\}$. The cluster labels are then replaced by $c_{\max}$ for $n \in \mathcal{S}_k$. In the present paper, we use SLIC [4] with $K = 10{,}000$ for the superpixel extraction.

2.3. Constraint on the number of unique cluster labels

In unsupervised image segmentation, there is no clue as to how many segments should be generated in an image. Therefore, the number of unique cluster labels should be adaptive to the image content. As described in Sec. 2.1, the proposed strategy is to classify pixels into an arbitrary number $q'$ ($1 \leq q' \leq q$) of clusters, where $q$ is the maximum possible value of $q'$. A large $q'$ indicates oversegmentation, whereas a small $q'$ indicates undersegmentation. The aforementioned criteria (a) and (b) only facilitate the grouping of pixels, which could lead to the naive solution $q' = 1$. To prevent this kind of undersegmentation failure, we introduce the third criterion (c), which is the preference for a large $q'$. Our solution is to insert an intra-axis normalization process for the response map $\{y_n\}$ before assigning cluster labels via argmax classification.
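For the spatial-continuity constraint of Sec. 2.2, the following is a small sketch using SLIC from scikit-image: every pixel in a superpixel is reassigned the most frequent cluster label within that superpixel. The function and variable names are illustrative, not from the paper.

```python
import numpy as np
from skimage.segmentation import slic


def refine_with_superpixels(image, labels, n_segments=10000):
    """Replace each pixel's cluster label by the most frequent label within
    its SLIC superpixel (Sec. 2.2).
    image:  H x W x 3 float array with values in [0, 1]
    labels: H x W integer array of cluster labels c_n
    """
    superpixels = slic(image, n_segments=n_segments)    # K fine superpixels S_k
    refined = labels.copy()
    for k in np.unique(superpixels):
        mask = (superpixels == k)
        values, counts = np.unique(labels[mask], return_counts=True)
        refined[mask] = values[np.argmax(counts)]        # most frequent label c_max
    return refined
```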

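For the constraint of Sec. 2.3, the excerpt names an intra-axis normalization of the response map $\{y_n\}$ before the argmax but does not detail it; one plausible realization, assumed here, is to standardize each of the $q$ axes to zero mean and unit variance over the image's pixels, so that no single axis dominates the argmax for every pixel.

```python
import torch


def intra_axis_normalize(response, eps=1e-5):
    """Normalize each axis (channel) of the response map {y_n} to zero mean and
    unit variance over the N pixels of the image, before argmax assignment of
    cluster labels (one possible reading of Sec. 2.3).
    response: 1 x q x H x W tensor
    """
    mean = response.mean(dim=(0, 2, 3), keepdim=True)   # per-axis mean over pixels
    std = response.std(dim=(0, 2, 3), keepdim=True)     # per-axis std over pixels
    return (response - mean) / (std + eps)


# Usage sketch: labels = intra_axis_normalize(net(image)).argmax(dim=1)
```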