Scene Classification via a Gradient Boosting Random Convolutional Network Framework
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING

Scene Classification via a Gradient Boosting Random Convolutional Network Framework

Fan Zhang, Bo Du, Senior Member, IEEE, and Liangpei Zhang, Senior Member, IEEE

Abstract—Due to the recent advances in satellite sensors, a large amount of high-resolution remote sensing images is now being obtained each day. How to automatically recognize and analyze scenes from these satellite images effectively and efficiently has become a big challenge in the remote sensing field. Recently, a lot of work in scene classification has been proposed, focusing on deep neural networks, which learn hierarchical internal feature representations from image data sets and produce state-of-the-art performance. However, most methods, including the traditional shallow methods and deep neural networks, only concentrate on training a single model. Meanwhile, neural network ensembles have proved to be a powerful and practical tool for a number of different predictive tasks. Can we find a way to combine different deep neural networks effectively and efficiently for scene classification? In this paper, we propose a gradient boosting random convolutional network (GBRCN) framework for scene classification, which can effectively combine many deep neural networks. As far as we know, this is the first time that a deep ensemble framework has been proposed for scene classification. Moreover, in the experiments, the proposed method was applied to two challenging high-resolution data sets: 1) the UC Merced data set, containing 21 different aerial scene categories with a submeter resolution, and 2) a Sydney data set, containing eight land-use categories with a 1.0-m spatial resolution. The proposed GBRCN framework outperformed the state-of-the-art methods on the UC Merced data set, including the traditional single convolutional network approach. For the Sydney data set, the proposed method again obtained the best accuracy, demonstrating that the proposed framework can provide more accurate classification results than the state-of-the-art methods.

Index Terms—Convolutional networks (CNets), gradient boosting machine (GBM), scene classification.

Manuscript received May 28, 2015; revised August 30, 2015; accepted September 30, 2015. This work was supported in part by the National Basic Research Program of China (973 Program) under Grants 2011CB707105 and 2012CB719905, by the National Natural Science Foundation of China under Grants 61471274 and 41431175, and by the Natural Science Foundation of Hubei Province under Grant 2014CFB193.

F. Zhang and L. Zhang are with the State Key Laboratory of Information Engineering in Surveying, Mapping, and Remote Sensing, Wuhan University, Wuhan 430079, China (e-mail: [email protected]; [email protected]). B. Du is with the School of Computer Science, Wuhan University, Wuhan 430072, China (e-mail: [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TGRS.2015.2488681

0196-2892 © 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

I. INTRODUCTION

Recent years have witnessed a boom in the number of different satellite sensors, which now provide us with a large volume of very high resolution (VHR) remote sensing images. How to automatically recognize and analyze scenes from these VHR images effectively and efficiently has drawn great interest in the remote sensing field [1]. However, due to the complex composition and the large number of land-cover types, it is still a difficult task.

There have been many attempts over the years to recognize and analyze scenes from VHR images effectively and efficiently. The majority of the early approaches, such as the bag-of-visual-words (BoVW) based scene classification method [2], simply count the occurrences of the low-level features [e.g., color histogram, local binary pattern, and scale-invariant feature transform (SIFT)] in an image and assign a semantic label to the image according to its content. Although the BoVW approach is highly efficient, a large semantic gap between the low-level features and the high-level semantic meanings still exists, which severely limits the ability to handle complex image scenes. The traditional BoVW-based methods are mainly based on low-level features [2]–[5], and low-level features cannot precisely represent the semantics of the images because of this so-called semantic gap. To bridge the gap, some authors have applied latent Dirichlet allocation [6]–[8], a hierarchical probabilistic model, for the unsupervised learning of the topic allocation, based on the visual word distribution. Under these models, each image is represented as a finite mixture of an intermediate set of topics, which are expected to summarize the scene information [9]. These approaches extract the low-level feature representations from the image and build mid-level features to overcome the semantic gap, but they still ignore the high-level features and the hierarchical feature representations of the image.

Recently, a lot of work in scene classification has focused on the learning of hierarchical internal feature representations from image data sets [10], [11]. These methods are called "deep learning" methods. Good internal feature representations are hierarchical. In an image, pixels are assembled into edgelets, edgelets are assembled into motifs, motifs are assembled into parts, parts are assembled into objects, and finally, objects are assembled into scenes [10]. This suggests that recognizing and analyzing scenes from VHR images should involve multiple trainable feature extraction stages stacked on top of each other, and we should learn the hierarchical internal feature representations from the image. Convolutional networks (CNets) are a popular deep learning method and have recently enjoyed great success in large-scale visual recognition [11].

Training a neural network model is a nonconvex optimization problem with many local minima. The randomness from different initial points tends to differentiate the errors of the networks. Ensembles of networks have been proved to be able to alleviate this problem [12]. An earlier overview of different ways of combining neural networks was reported in [13]. We can also consider an ensemble as building a set of "strong" models, which can be further combined to produce a better prediction. The most common ensemble techniques, such as random forests, rely on simple averaging of the models in the ensemble. However, how to combine the classifiers effectively and efficiently is a critical problem. Recently, a boosting framework called the gradient boosting machine (GBM) has shown considerable promise in a wide range of practical applications. GBM is a useful practical tool for different predictive tasks, and it can consistently provide more accurate results than conventional single machine learning models [14]. During the last few years, many empirical studies have been published which use decision trees as the base learners for GBM [15], [16].

Fig. 1. Typical CNet architecture with two feature extraction stages.
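The gradient boosting idea referred to above can be sketched in a few lines: each round fits a weak learner to the negative gradient of the loss (for the squared loss, simply the residuals) and adds it to the running ensemble with a shrinkage factor. The toy sketch below uses depth-1 regression stumps as base learners purely for illustration; the paper's framework instead uses CNets as base models, and all function names here are ours, not the authors':

```python
import numpy as np

def fit_stump(x, residual):
    """Fit a depth-1 regression stump to the residuals (the negative
    gradient of the squared loss): pick the split threshold minimizing SSE."""
    best = None
    for t in np.unique(x):
        left, right = residual[x <= t], residual[x > t]
        if len(left) == 0 or len(right) == 0:
            continue
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    _, t, lv, rv = best
    return lambda q: np.where(q <= t, lv, rv)

def gbm_fit(x, y, n_rounds=50, lr=0.1):
    """Gradient boosting: start from the mean prediction, then repeatedly
    fit a weak learner to the current residuals and add it with shrinkage lr."""
    pred = np.full_like(y, y.mean(), dtype=float)
    learners = [lambda q, m=y.mean(): np.full_like(q, m, dtype=float)]
    for _ in range(n_rounds):
        h = fit_stump(x, y - pred)   # residual = -grad of 0.5 * (y - F(x))^2
        pred = pred + lr * h(x)
        learners.append(lambda q, h=h: lr * h(q))
    # The final model is the additive sum of all (shrunken) base learners.
    return lambda q: sum(f(q) for f in learners)

# Toy 1-D regression: the boosted stump ensemble fits y = x^2 on [0, 1].
x = np.linspace(0, 1, 64)
y = x ** 2
model = gbm_fit(x, y, n_rounds=100, lr=0.2)
print(np.max(np.abs(model(x) - y)))  # small residual after boosting
```

Replacing the stumps with stronger base models such as CNets, and the squared loss with a multiclass soft-max loss, is exactly the kind of substitution the GBM framework's flexibility permits.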
All of these studies have reported impressive improvements in generalization behavior, suggesting that GBM is a powerful machine learning technique.

Although much useful work has been done, most of these studies have focused on decision trees, and no related research into CNets has been reported in the literature. For the first time, this paper optimally incorporates GBM into a CNet framework. Moreover, due to the extreme flexibility of GBM, we introduce a new multiclass soft-max loss function to combine the GBM with the CNet framework. However, because of the limited training data in satellite imagery and the high computational complexity of the ensemble CNet framework, we also develop a novel random CNet (RCNet), which can reuse the weights learned in each CNet model and reduce the computational complexity.

In this paper, we systematically consider the following questions. Are "strong" models like CNets suitable for an ensemble framework? Does the GBM framework work well for CNets? Given the flexibility of GBM and the architecture of CNets, how should we choose the loss function for scene classification? Because of the high computational complexity of CNets, is there a way to reuse the weights learned in each CNet model and reduce the number of parameters?

The CNet [17], [18] is a trainable multilayer architecture composed of multiple feature extraction stages. Each stage consists of three layers: 1) a convolutional layer; 2) a nonlinearity layer; and 3) a pooling layer. The architecture of a CNet is designed to take advantage of the 2-D structure of the input image. A typical CNet is composed of one, two, or three such feature extraction stages, followed by one or more traditional fully connected layers and a final classifier layer, as shown in Fig. 1. Each layer type is described in the following.

Convolutional Layer: The input to the convolutional layer is a 3-D array with r 2-D feature maps of size m × n. Each component is denoted as x^i_{m,n}, and each feature map is denoted as x^i. The output is also a 3-D array of size m1 × n1 × k, composed of k feature maps of size m1 × n1. The convolutional layer has k trainable filters of size l × l × q, also called the filter bank W, which connect the input feature maps to the output feature maps. The convolutional layer computes the output feature map z^s = Σ_{i=1}^{q} W_i^s ∗ x^i + b_s, where ∗ is the 2-D discrete convolution operator and b_s is a trainable bias parameter.

Nonlinearity Layer: In a traditional CNet, this layer simply consists of a pointwise nonlinearity function applied to each component in the feature map.
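Under this notation, the convolutional layer's forward pass, followed by a pointwise nonlinearity (ReLU here, as one common choice), can be sketched with NumPy. A minimal sketch, not the paper's implementation: the inner loop computes a "valid" cross-correlation, as most CNet implementations do; flipping the filters would give the literal 2-D convolution, and all shapes and names are illustrative:

```python
import numpy as np

def conv_layer(x, W, b):
    """Forward pass of the convolutional layer described above:
    z^s = sum_i W_i^s * x^i + b_s, as a 'valid' 2-D cross-correlation
    summed over the q input feature maps."""
    q, m, n = x.shape             # q input feature maps of size m x n
    k, q2, l, _ = W.shape         # filter bank: k filters of size l x l x q
    assert q == q2
    m1, n1 = m - l + 1, n - l + 1  # 'valid' output size
    z = np.zeros((k, m1, n1))
    for s in range(k):             # each output map z^s ...
        for i in range(q):         # ... accumulates over all input maps x^i
            for u in range(m1):
                for v in range(n1):
                    z[s, u, v] += np.sum(W[s, i] * x[i, u:u + l, v:v + l])
        z[s] += b[s]               # trainable bias b_s per output map
    return z

def relu(z):
    """Nonlinearity layer: a pointwise function applied to each component."""
    return np.maximum(z, 0)

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 8, 8))     # 3 input feature maps, 8 x 8
W = rng.standard_normal((4, 3, 3, 3))  # 4 filters of size 3 x 3 x 3
b = np.zeros(4)
out = relu(conv_layer(x, W, b))
print(out.shape)  # (4, 6, 6): k = 4 maps of size m1 x n1 = 6 x 6
```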