
Article

Scene Classification Based on a Deep Random-Scale Stretched Convolutional Neural Network

Yanfei Liu, Yanfei Zhong *, Feng Fei, Qiqi Zhu and Qianqing Qin

State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University, Wuhan 430079, China; [email protected] (Y.L.); [email protected] (F.F.); [email protected] (Q.Z.); [email protected] (Q.Q.)

* Correspondence: [email protected]; Tel./Fax: +86-27-6877-9969

Received: 10 January 2018; Accepted: 19 February 2018; Published: 12 March 2018

Abstract: With the large number of high-resolution images now being acquired, high spatial resolution (HSR) remote sensing imagery scene classification has drawn great attention but is still a challenging task due to the complex arrangements of the ground objects in HSR imagery, which leads to the semantic gap between low-level features and high-level semantic concepts. As a feature representation method for automatically learning essential features from image data, convolutional neural networks (CNNs) have been introduced for HSR remote sensing image scene classification due to their excellent performance in natural image classification. However, some scene classes of remote sensing images are object-centered, i.e., the scene class of an image is decided by the objects it contains. Although previous methods based on CNNs have achieved comparatively high classification accuracies compared with the traditional methods with handcrafted features, they do not consider the scale variation of the objects in the scenes. This makes it difficult to directly utilize CNNs on those remote sensing images belonging to object-centered classes to extract features that are robust to scale variation, leading to wrongly classified scene images. To solve this problem, scene classification based on a deep random-scale stretched convolutional neural network (SRSCNN) for HSR remote sensing imagery is proposed in this paper. In the proposed method, patches with a random scale are cropped from the image and stretched to the specified scale as the input to train the CNN. This forces the CNN to extract features that are robust to the scale variation. Furthermore, to further improve the performance of the CNN, a robust scene classification strategy is adopted, i.e., multi-perspective fusion. The experimental results obtained using three datasets—the UC Merced dataset, the Google dataset of SIRI-WHU, and the Wuhan IKONOS dataset—confirm that the proposed method performs better than the traditional scene classification methods.

Keywords: convolutional neural network; scene classification; deep random-scale stretched convolutional neural network; multi-perspective fusion

1. Introduction

Remote sensing image scene classification, which involves dividing images into different categories without semantic overlap [1], has recently drawn great attention due to the increasing availability of high-resolution remote sensing data. However, it is difficult to achieve satisfactory results in scene classification directly based on low-level features, such as the spectral, textural, and geometrical attributes, because of the so-called "semantic gap" between low-level features and high-level semantic concepts, which is caused by the object category diversity and the distribution complexity in the scene. To overcome the semantic gap, many different scene classification techniques have been proposed. The bag-of-visual-words (BoVW) model, which is derived from document classification in text analysis, is one of the most popular methods in scene classification. In consideration of the spatial information loss, many BoVW extensions have been proposed [2–8]. Spatial pyramid co-occurrence [2] was proposed to compute the co-occurrences of visual words with respect to "spatial predicates" over a hierarchical spatial partitioning of an image, to capture both the absolute and relative spatial arrangement of the words. The topic model [9–16], as used for document modeling, text classification, and collaborative filtering, is another popular model for scene classification. P-LDA and F-LDA [9] were proposed based on the latent Dirichlet allocation (LDA) model for scene classification. Because of the weak representative power of a single feature, multifeature fusion has also been adopted in recent years. The semantic allocation level (SAL) multifeature fusion strategy based on the probabilistic topic model (PTM) (SAL-PTM) [10] was proposed to combine three complementary features, i.e., the spectral, texture, and scale-invariant feature transform (SIFT) features. However, in the traditional methods, feature representation plays a critical role and involves low-level feature selection and middle-level feature design. As a lot of prior information is required, this limits the application of these methods to other datasets or other fields.

To automatically learn features from the data directly, without low-level feature selection and middle-level feature design, a number of methods based on deep learning have recently been proposed. Deep learning, a new branch of machine learning theory, has been widely studied and applied to facial recognition [17–20], scene recognition [21,22], image super-resolution [23], video analysis [24], and drug discovery [25]. Differing from the traditional feature extraction methods, deep learning, as a feature representation learning method, can directly learn features from raw data without prior information. In the remote sensing field, deep learning has also been studied for hyperspectral pixel classification [26–29], object detection [30–32], and scene classification [33–41]. The auto-encoder [37] was the first deep learning method introduced to remote sensing image scene classification, combining a sample selection strategy based on saliency with the auto-encoder to extract features from the raw data, achieving a better classification accuracy than the traditional remote sensing image scene classification methods such as BoVW and LDA. The highly efficient "enforcing lifetime and population sparsity" (EPLS) algorithm, another sparse strategy, was introduced into the auto-encoder to ensure two types of feature sparsity: population and lifetime sparsity [38]. Zou et al. [39] translated the feature selection problem into a feature reconstruction problem based on a deep belief network (DBN) and proposed a deep learning-based method, where the features learned by the DBN are eliminated once their reconstruction errors exceed the threshold. Although the auto-encoder-based algorithms can acquire excellent classification accuracies when compared with the traditional methods, they do need pre-training, and can be regarded as convolutional neural networks (CNNs) in practice. Zhang et al. [40] combined multiple CNN models based on boosting theory, achieving a better accuracy than the auto-encoder methods.
However, most of the previous methods based on deep learning models such as CNNs take images at a single scale for training, by cropping fixed-scale patches from the image or resizing the image to a fixed size, and do not consider the object scale variation in the image caused by changes in the altitude or viewing angle of the sensor. The features extracted by these methods are therefore not robust to the object scale, leading to unsatisfactory remote sensing image scene classification results. An example of object scale variation is given in Figure 1, where the scale of the storage tanks and airplanes changes considerably, which may be caused by the altitude or angle change of the sensor.

In order to solve the object scale variation problem, scene classification based on a deep random-scale stretched convolutional neural network (SRSCNN) is proposed in this paper. In SRSCNN, patches with a random scale are cropped from the image, and patch stretching is applied, which simulates the object scale variation to ensure that the CNN learns robust features from the remote sensing images.

Figure 1. An example of object scale variation that may be caused by sensor altitude or angle variation.

The major contributions of this paper are as follows:

(1) The random spatial scale stretching strategy. In SRSCNN, random spatial scale stretching is proposed to solve the problem of the dramatic scale variation of objects in the scene. Some objects such as airplanes and storage tanks are very important in remote sensing image scene classification. However, the object scale variation can lead to weak feature representation for some scenes, which in turn leads to wrong classification. In order to solve this problem, the assumption that the object scale variation satisfies a uniform or a Gaussian distribution is adopted in this paper, and the strategy of cropping patches with a random scale following a certain distribution is applied to simulate the object variation in the scene, forcing the model to extract features that are robust to scale variation.

(2) The robust scene classification strategy. To fuse the information from different-scale patches with different locations in the image, thereby further improving the performance, SRSCNN applies multiple views of one image and conducts fusion of the multiple views. The image is thus classified multiple times and its label decided by voting. According to VGGNet [42] and GoogLeNet [43], using a large set of crops of multiple scales can improve classification accuracy. However, unlike GoogLeNet, where only four scales are applied in the test phase, more continuous scales are adopted in the proposed method. In this paper, patches with a random position and scale are cropped and then stretched to a fixed scale to be classified by the trained CNN model. Finally, the whole image's label is decided by selecting the label that occurs the most.

The proposed SRSCNN was evaluated and compared with conventional remote sensing image scene classification methods and methods based on deep learning. Three datasets were used in the testing: the 21-class UC Merced (UCM) dataset, the 12-class Google dataset of SIRI-WHU, and the 8-class Wuhan IKONOS dataset. The experimental results show that the proposed method can obtain a better classification accuracy than the other methods, which demonstrates the superiority of the proposed method.

The remainder of this paper is organized as follows. Section 2 gives a brief introduction to CNNs. Section 3 presents the proposed scene classification method based on a deep random-scale stretched convolutional neural network, namely SRSCNN. In Section 4, the experimental results are provided. The discussion is given in Section 5. Finally, the conclusion is given in Section 6.

2. Convolutional Neural Networks

CNNs are a popular type of deep learning model consisting of convolutional layers, pooling layers, fully connected layers, and a softmax layer, as shown in Figure 2, which is described in Section 3. CNNs are appropriate for large-scale image processing because of the sparse interactions and weight sharing resulting from the convolution computation. In CNNs, the sparse interaction means that the units in a higher layer are connected with units within a limited scale, i.e., the receptive field, in the lower layer. Weight sharing means that the units in the same layer share the same connection weights, i.e., the convolution filters, decreasing the number of parameters in the CNN. The convolution computation makes it possible for the response of the units of the deepest layer of the network to be dependent only upon the shape of the stimulus pattern, so that they are not affected by the position where the pattern is presented. Given a convolutional layer, its output can be obtained as follows:

X_t = Conv(X_{t−1}, W_t)   (1)

where X_t denotes the output of layer t, X_{t−1} is the input of layer t, and W_t is the set of convolution filters of layer t.

In recent years, more complex structures and deeper networks have been developed. Lin et al. [44] proposed a "network in network" structure to enforce the abstraction ability of the convolutional layer. Based on the method proposed by Lin et al. [44], Szegedy et al. [43] designed a more complex network structure by applying an inception module. In AlexNet [45], there are eight layers, in addition to the pooling and contrast normalization layers; however, in VGGNet and GoogLeNet, the depths are 19 and 22, respectively. More recently, much deeper networks have been proposed [46,47].
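As a concrete illustration of Equation (1), the following minimal sketch (assuming PyTorch is available; the tensor sizes and number of filters are illustrative choices, not values from the paper) convolves an input feature map X_{t−1} with a bank of filters W_t:

```python
import torch
import torch.nn.functional as F

# Illustrative shapes only: one 3-band image patch as X_{t-1}
# and 32 convolution filters of size 3x3 as W_t (Equation (1)).
x_prev = torch.randn(1, 3, 64, 64)   # X_{t-1}: N x C x H x W
w_t = torch.randn(32, 3, 3, 3)       # W_t: 32 filters over 3 input bands
b_t = torch.zeros(32)                # bias term

# X_t = Conv(X_{t-1}, W_t)
x_t = F.conv2d(x_prev, w_t, bias=b_t, padding=1)
print(x_t.shape)                     # torch.Size([1, 32, 64, 64])
```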

FigureFigureFigure 2.2. Convolutional2.Convolutional Convolutional neuralneural neural networknetwor network forfork for scenescene scene classification.classification. classification.

Due to the excellent performance of CNNs in image classification, a number of methods based on CNNs for remote sensing scene classification have been proposed, significantly improving classification accuracy. However, these methods do not consider the object scale variation problem in the remote sensing data, which means that the extracted features lack robustness to the object scale change, leading to misclassification of the scene image. The objects contained in the scene image play an important role in scene classification [48,49]. However, the same objects in remote sensing images can have different scales, as shown in Figure 3, resulting in them being difficult to capture. It is easy to classify a scene image containing objects with a common scale; however, those images containing objects with a changed scale can easily be wrongly classified. Taking the two images of storage tanks and airplanes in Figure 3 as examples, the right images are easy to classify, but it is almost impossible to recognize the left images as storage tanks/airplanes because of the different scales of the images.

Figure 3. Object scale variation leading to difficulty in remote sensing scene classification.

3. Scene Classification for HSR Imagery Based on a Deep Random-Scale Stretched Convolutional Neural Network


To solve this object scale problem, the main idea in this paper is to generate multiple-scale samples to train the CNN model, forcing the trained CNN model to correctly classify the same image at different scales. To make things easier, the scale variation is modeled as follows in this paper:

S = Lα   (2)

where L is the true scale; α denotes the scale variation factor, obeying a normal or uniform distribution, i.e., α ∼ N(1, σ²) or α ∼ U(inf, sup); and S represents the changed scale. Once S and α are known, L can be obtained by L = S/α, and given an image I with scale S × S, image I′ is obtained by stretching I into L × L and then feeding it into the CNN to extract the features. According to this assumption, the goal of making the features extracted by the CNN robust to scale is transformed into making the features robust to α, i.e., the trained CNN can correctly recognize image I′ with various values of α. To achieve this, for each image, multiple samples of α are drawn, and multiple corresponding stretched images I′ are obtained to force the CNN to be robust to α in the training phase. Meanwhile, in the CNN, in order to increase the number of image samples, the patches cropped from the image are fed into the CNN, instead of the whole image. Assuming that the scale of the patches is R × R, the patches should be cropped from the stretched image with scale L × L. Finally, there are three steps to generating the patches that are fed into the CNN, as follows:

(1) Sample α from α ∼ N(1, σ²) or α ∼ U(inf, sup);
(2) Stretch an image of S × S to L × L, where L = S/α;
(3) Crop a patch with scale R × R from the stretched image obtained in step 2.

However, in practice, due to the linearity of Equation (2), the above steps can be translated into the following steps to save memory (a code sketch of these steps is given below):

(1) Sample α from α ∼ N(1, σ²) or α ∼ U(inf, sup);
(2) Compute the cropping scale PS = Rα and crop a patch with scale PS × PS from image I. In this step, the upper-left corner (w, h) of the cropped patch in image I is determined by randomly selecting an element from D = {(w, h) | 0 ≤ w ≤ S − PS + 1, 0 ≤ h ≤ S − PS + 1}, where S is the width and height (samples and lines) of the whole image I;
(3) Stretch the cropped patch obtained in step 2 into R × R.

For every image in the dataset, we repeat the above three steps and finally obtain multiple-scale samples from the images, as shown in Figure 4. In this paper, the two kinds of distribution for α are tested separately, and the results are discussed in Section 5. The parameter sup is set to 1.2 by experience, and bilinear interpolation is adopted in the random-scale stretching as a trade-off between computational efficiency and classification accuracy.

In the proposed method, there will be many patches cropped from every image. However, the number of patches is not determined explicitly. Instead, we determine the number of patches by determining the number of training iterations. Assuming that the number of training samples is m, the number of training iterations is set as N, where N is 70 K in this paper, and for each iteration, n patches (n is set as 60 in this paper) are cropped from n images and fed into the CNN model. Therefore, for the training, the number of patches required is N × n, which means that the number of patches from every training image is N × n/m. In testing, the number of cropped patches is set by experience. On the one hand, this can be regarded as a data augmentation strategy for remote sensing image scene classification, which could be combined with other data augmentation methods such as occlusion, rotation, mirroring, etc. On the other hand, the random-scale stretching is a process which adds structure noise to the input.
Differing from denoising auto-encoders, which add noise to the input values to force the model to learn robust features, SRSCNN focuses on structure noise, i.e., the scale change of the objects. Similar to the denoising auto-encoders, the input of the model is corrupted with random noise following a definite statistical distribution, i.e., a uniform or normal distribution.
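The random-scale cropping and stretching steps described above can be sketched as follows. This is a minimal illustration assuming NumPy and Pillow are available; the function name and default parameters (other than sup = 1.2, which is quoted in the text) are our own choices, not the paper's implementation.

```python
import numpy as np
from PIL import Image

def random_scale_patch(image, R=181, dist="uniform", inf=0.8, sup=1.2, sigma=0.1):
    """Crop a patch of random scale PS = R * alpha and stretch it back to R x R.

    image : H x W x 3 uint8 array. The distribution bounds and sigma are
    illustrative defaults; the paper only states sup = 1.2.
    """
    # Step 1: sample the scale variation factor alpha (Equation (2)).
    if dist == "uniform":
        alpha = np.random.uniform(inf, sup)
    else:                                   # normal distribution N(1, sigma^2)
        alpha = np.random.normal(1.0, sigma)

    # Step 2: compute the cropping scale PS = R * alpha and crop a PS x PS patch
    # at a random upper-left position (w, h).
    S = min(image.shape[0], image.shape[1])
    PS = int(np.clip(round(R * alpha), 1, S))
    h = np.random.randint(0, image.shape[0] - PS + 1)
    w = np.random.randint(0, image.shape[1] - PS + 1)
    patch = image[h:h + PS, w:w + PS]

    # Step 3: stretch the cropped patch to R x R with bilinear interpolation.
    patch = Image.fromarray(patch).resize((R, R), Image.BILINEAR)
    return np.asarray(patch)
```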

Figure 4. Random-scale stretching of a scene image.

3.1. SRSCNN

The proposed SRSCNN method integrates random-scale stretching and a CNN, and can be divided into three parts: (1) data pre-processing; (2) random-scale stretching; and (3) CNN model training. In this paper, the image I is represented as a tensor with dimensions of H × W × 3, where H and W represent the height and width of the image, respectively.

Data pre-processing: To accelerate the learning convergence of the CNN, normalization is required. This is usually followed by the z-score to centralize the data distribution. However, to keep things simple, in this paper, pixels are normalized to [0, 1] by dividing them by the maximum pixel value:

I_n = I / Max   (3)

where I_n is the normalized image, and Max is the maximum pixel value in the image.

Random-scale stretching: In the proposed method, random-scale stretching is integrated with random rotation, which are common data augmentation methods in CNN training. Given a normalized image I_n, the image output p of the random-scale stretching is obtained as follows:

p = Rss(I_n)   (4)

p = Rot(p)   (5)

where Rss stands for the random-scale stretching, and Rot denotes the random rotation of d degrees, where d is 0, 90, 180, or 270.

CNN model training: After the random-scale stretching, the image p is fed into the CNN model, which consists of convolutional layers, pooling layers, a fully connected layer, and a softmax layer.

Convolutional layers: Differing from the traditional features utilized in scene classification, where spectral features such as the spectral mean and standard deviation, and spatial features such as SIFT, are selected and extracted artificially, the convolutional layers can automatically extract features from the data. There are multiple convolution filters in each layer, and every convolution filter corresponds to a feature detector. For convolutional layer t (t ≥ 1), given the input X_{t−1}, the output X_t of convolutional layer t can be obtained by Equations (6) and (7):

C_t^i = Σ_{k=1}^{K} w_t^{ik} ∗ x_{t−1}^k + b_t^i   (6)

x_t^i = f(C_t^i)   (7)


where w_t^{ik} represents the kth convolution filter corresponding to the ith output feature map, x_{t−1}^k denotes the kth feature map of X_{t−1}, b_t^i is the bias corresponding to the ith output feature map, ∗ is the convolution computation, f is the activation function, and x_t^i is the ith output feature map of X_t. As Equation (6) shows, the convolution is a linear mapping, and is always followed by a non-linear activation function f such as sigmoid, tanh, SoftPlus, ReLU, etc. In this paper, ReLU is adopted as the activation f due to its superiority in convergence and its sparsity compared with the other activation functions [50–52]. ReLU is a piecewise function:

y = { c,  c ≥ 0;  0,  c < 0 }   (8)
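For example, applying Equation (8) element-wise to a vector of convolution responses (a trivial NumPy illustration):

```python
import numpy as np

c = np.array([-1.5, 0.0, 2.3, -0.2, 4.1])  # example convolution responses C_t
y = np.maximum(c, 0.0)                     # Equation (8): keep c when c >= 0, else 0
print(y)                                   # [0.  0.  2.3 0.  4.1]
```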

Pooling layers: Following the convolutional layers, the pooling layers aim to achieve a more compact representation by aggregating the local features, which is an approach that is more robust to noise and clutter and invariant to image transformation [53]. The popular pooling methods include average pooling and max pooling. Max pooling is adopted in this paper:

Y^k = max(X^k_{[1,ker]×[1,ker]})   (9)

where ker is the scale of the pooling kernel.

Fully connected layer: The fully connected layer can be viewed as a special convolutional layer whose output height and width are 1, i.e., the output is a vector. In addition, in this paper, the fully connected layer is followed by dropout [54], which is a regularizer that prevents co-adaptation, since a unit cannot rely on the presence of other particular units because of their random presence in the training phase.

Softmax layer: In this paper, softmax is added to classify the features extracted by the CNN. As a generalization of logistic regression to the multi-class problem, softmax outputs a vector whose every element represents the possibility of a sample belonging to each class, as shown in Equation (10):

h(x) = [ p(y^i = 1 | x, θ), p(y^i = 2 | x, θ), …, p(y^i = C | x, θ) ]^T = (1 / Σ_{j=1}^{C} e^{θ_j^T x}) [ e^{θ_1^T x}, e^{θ_2^T x}, …, e^{θ_C^T x} ]^T   (10)

where p(y^i = t | x, θ) represents the possibility of sample i's label being t. Based on maximum-likelihood theory, the loss function of softmax can be obtained as follows:

J(θ) = −(1/m) [ Σ_{i=1}^{m} Σ_{j=1}^{k} 1{y^i = j} log( e^{θ_j^T x^i} / Σ_{l=1}^{k} e^{θ_l^T x^i} ) ] + (λ/2) Σ_{j=1}^{k} ||θ_j||²   (11)

where m is the number of samples, and the second term is the regularization term, which is always the L2-norm, to avoid overfitting. λ is the coefficient that balances these two terms.

The chain rule is applied in the back-propagation for computing the gradient to update the parameters, i.e., the weights W and biases b in the CNN. Taking layer t as an example, to determine the values of W_t and bias b_t in layer t, the stochastic gradient descent method is performed as follows:

(1) Compute the objective loss function J in Equation (11);
(2) Based on the chain rule, compute the partial derivatives with respect to W_t and b_t:

∇W_t = (∂J/∂X_{t+n}) (∂X_{t+n}/∂C_{t+n}) (∂C_{t+n}/∂X_{t+n−1}) … (∂X_{t+1}/∂C_{t+1}) (∂C_{t+1}/∂X_t) (∂X_t/∂C_t) (∂C_t/∂W_t)   (12)

∇b_t = (∂J/∂X_{t+n}) (∂X_{t+n}/∂C_{t+n}) (∂C_{t+n}/∂X_{t+n−1}) … (∂X_{t+1}/∂C_{t+1}) (∂C_{t+1}/∂X_t) (∂X_t/∂C_t) (∂C_t/∂b_t)   (13)

(3) Update the values of W_t and bias b_t:

W_t = W_t − lr · ∇W_t   (14)

b_t = b_t − lr · ∇b_t   (15)

where lr is the learning rate. In the training phase, the above three steps are repeated until the required number of iterations (set as 70 K in this paper) is reached. The procedure of SRSCNN is presented in Algorithm 1, where random rotation is integrated with random-scale stretching.
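Before turning to Algorithm 1, note that with an automatic-differentiation framework the loss of Equation (11) and the updates of Equations (14) and (15) can be written compactly, since the chain rule of Equations (12) and (13) is handled by back-propagation. The following PyTorch sketch is our own illustration, with an illustrative regularization coefficient:

```python
import torch
import torch.nn as nn

def train_step(net, optimizer, patches, labels, lam=5e-4):
    """One stochastic gradient descent step for Equations (11), (14) and (15)."""
    optimizer.zero_grad()
    logits = net(patches)                       # forward pass through the CNN

    # Softmax cross-entropy term of Equation (11).
    loss = nn.functional.cross_entropy(logits, labels)

    # L2 regularization term (lambda / 2) * sum ||theta_j||^2.
    l2 = sum((p ** 2).sum() for p in net.parameters())
    loss = loss + 0.5 * lam * l2

    loss.backward()                             # chain rule of Equations (12)-(13)
    optimizer.step()                            # updates of Equations (14)-(15)
    return loss.item()
```

In practice the L2 term is usually supplied through the optimizer's weight_decay argument rather than added by hand; it is written out here only to mirror Equation (11).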

Algorithm 1. The SRSCNN procedure

Input:
- input dataset D = {(I, y) | I ∈ R^{H×W×3}, y ∈ {1, 2, 3, . . ., C}}
- scale of patch R
- batch size m
- scale variation factor α obeying distribution d
- defined CNN structure Net
- maximum iteration times L

Output:
- Trained CNN model

Algorithm:
1. randomly initialize Net
2. for l = 1 to L do
3.   generate a small dataset P with |P| = m from D
4.   for i = 1 to m do
5.     normalize I_i ∈ P to [0, 1] by dividing by the maximum pixel value to obtain the normalized image I_in
6.     generate a random-scale stretched patch p_i from I_in
7.     P = P − {I_i}, P = P ∪ {p_i}
8.   end for
9.   feed P to Net, update parameters W and b in Net
10. end for
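A possible Python rendering of Algorithm 1, reusing the random_scale_patch and train_step sketches given earlier; the batch size and iteration count follow the values quoted in the text, while the momentum value and the data-layout choices are our own assumptions:

```python
import numpy as np
import torch

def make_patch(image, R=181):
    """Steps 5-6 of Algorithm 1: random-scale stretch, random rotation, normalization."""
    patch = random_scale_patch(image, R=R)                 # sketch given in Section 3
    patch = np.rot90(patch, k=np.random.randint(0, 4)).copy()
    return patch.astype(np.float32) / 255.0                # divide by the maximum pixel value

def train_srscnn(net, dataset, iterations=70000, batch_size=60, lr=0.001):
    """Algorithm 1: train the CNN on random-scale stretched patches.

    dataset : list of (image, label) pairs, image as H x W x 3 uint8 array.
    """
    optimizer = torch.optim.SGD(net.parameters(), lr=lr, momentum=0.9)
    for _ in range(iterations):
        # Step 3: draw a mini-batch of m images from D.
        idx = np.random.choice(len(dataset), size=batch_size, replace=False)

        # Steps 5-7: replace each image by a normalized random-scale patch.
        patches = np.stack([make_patch(dataset[i][0]) for i in idx])
        labels = np.array([dataset[i][1] for i in idx])

        # Step 9: feed the batch to the network and update W and b.
        x = torch.from_numpy(patches).permute(0, 3, 1, 2)
        y = torch.from_numpy(labels).long()
        train_step(net, optimizer, x, y)                   # see the sketch above
    return net
```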

According to recent deep learning theory [42,43], representation depth is beneficial for classification accuracy. However, structures like GoogLeNet and VGGNet cannot be directly used because of the small size of remote sensing datasets compared with the ImageNet dataset, which contains millions of images. Therefore, in this paper, a network structure similar to VGGNet is adopted, making a trade-off between the depth and the number of images in the dataset, where every two convolutional layers are stacked. The main structure configuration adopted in this paper is shown in Table 1. In Table 1, a convolutional layer is denoted as "Conv N–K", where N denotes the convolutional kernel size and K denotes the number of convolutional kernels. The fully connected layer is denoted as "FC–L", where L denotes the number of units in the fully connected layer. The effect of L on the scene classification accuracy is analyzed in Section 5.

Table 1. The main CNN structure configuration adopted in SRSCNN.

Input (181 × 181 RGB image)
Conv3–32
Conv3–64
Maxpooling
Conv3–96
Conv2–128
Maxpooling
Conv3–160
Maxpooling
FC-600
Softmax
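Read as a PyTorch module, the Table 1 configuration might be sketched as follows. The ReLU after every convolution, the stride-2 max pooling, the absence of padding, and the dropout rate are our assumptions; nn.LazyLinear (available in recent PyTorch versions) is used only to avoid hard-coding the flattened feature size.

```python
import torch.nn as nn

num_classes = 21   # e.g., the UCM dataset; adjust per dataset

# Sketch of the Table 1 configuration ("Conv N-K" = K filters of size N x N).
srscnn = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3), nn.ReLU(inplace=True),     # Conv3-32
    nn.Conv2d(32, 64, kernel_size=3), nn.ReLU(inplace=True),    # Conv3-64
    nn.MaxPool2d(2),                                             # Maxpooling
    nn.Conv2d(64, 96, kernel_size=3), nn.ReLU(inplace=True),    # Conv3-96
    nn.Conv2d(96, 128, kernel_size=2), nn.ReLU(inplace=True),   # Conv2-128
    nn.MaxPool2d(2),                                             # Maxpooling
    nn.Conv2d(128, 160, kernel_size=3), nn.ReLU(inplace=True),  # Conv3-160
    nn.MaxPool2d(2),                                             # Maxpooling
    nn.Flatten(),
    nn.LazyLinear(600), nn.ReLU(inplace=True), nn.Dropout(0.5),  # FC-600 + dropout
    nn.Linear(600, num_classes),                                 # softmax layer
)
# The softmax of Equation (10) is applied implicitly when this network is
# trained with the cross-entropy loss of Equation (11).
```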

3.2. Remote Sensing Scene Classification

In the proposed method, the different patches from the same image contain different information, so every feature extracted from a patch can be viewed as a perspective of the scene label of the image the patch is cropped from. To further improve the scene classification performance, a robust scene classification strategy is adopted, i.e., multi-perspective fusion. SRSCNN randomly crops a patch with a random scale from the image and predicts its label using the trained CNN model, under the assumption that the patch's label is identical to the image's label. For simplicity, in this paper, the perspectives have identical weights, and voting is finally adopted, which means that SRSCNN classifies a sample multiple times and decides its label by selecting the class for which it is classified the most times. In Figure 5, multiple patches with different scales and different locations are cropped from the image, and these patches are then stretched to a specific scale as the input of the trained model to predict their labels. Finally, the labels belonging to these patches are counted, and the label occurring the most times is selected as the label of the whole image (Class 2 is selected in the example in Figure 5). To demonstrate the effectiveness of the multi-perspective fusion, three tests were performed on three datasets, which are described in Section 4.
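The multi-perspective fusion can be sketched as a voting loop over randomly cropped and stretched views; the make_patch helper from the Algorithm 1 sketch is reused, and the number of views is an illustrative choice rather than a value from the paper:

```python
import numpy as np
import torch

def classify_with_voting(net, image, num_views=20, R=181):
    """Predict a scene label by majority vote over random-scale patch views."""
    net.eval()
    votes = []
    with torch.no_grad():
        for _ in range(num_views):
            p = make_patch(image, R=R)             # random scale, position, rotation
            x = torch.from_numpy(p).permute(2, 0, 1).unsqueeze(0)
            votes.append(int(net(x).argmax(dim=1)))
    # The label occurring the most times is selected as the image label.
    return int(np.bincount(votes).argmax())
```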

Figure 5. The framework of testing a scene image with the multi-perspective fusion.

4. Experiments and Results

The UCM dataset, the Google dataset of SIRI-WHU, and the Wuhan IKONOS dataset were used to test the proposed method. The traditional methods of BoVW, LDA, the spatial pyramid co-occurrence kernel++ (SPCK++) [2], the efficient spectral-structural bag-of-features scene classifier (SSBFC) [7], locality-constrained linear coding (LLC) [7], SPM+SIFT [7], SAL-PTM [10], the Dirichlet-derived multiple topic model (DMTM) [14], SIFT+SC [55], the local Fisher kernel-linear (LFK-Linear), the Fisher kernel-linear (FK-Linear), and the Fisher kernel incorporating spatial information (FK-S) [56], as well as methods based on deep learning, i.e., saliency-guided unsupervised feature learning (S-UFL) [37], the gradient boosting random convolutional network (GBRCN) [40], the large patch convolutional neural network (LPCNN) [41], and the multiview deep convolutional neural network (M-DCNN) [36], were compared. In the experiments, 80% of the samples were randomly selected from the dataset as the training samples, and the rest were used as the test samples. The experiments were performed on a personal computer equipped with dual Intel Xeon E5-2650 v2 processors, a single Tesla K20m GPU, and 64 GB of RAM, running CentOS 6.6 with the CUDA 6.5 release. For each experiment, the training was stopped after 70 K iterations, taking about 2.5 h. Each experiment on each dataset was repeated five times, and the average classification accuracy was recorded. To keep things simple, CCNN denotes the common CNN without random-scale stretching and multi-perspective fusion, CNNV denotes the common CNN with multi-perspective fusion but without random-scale stretching, and SRSCNN-NV denotes the CNN with random-scale stretching but without multi-perspective fusion.

4.1. Experiment 1: The UCM Dataset

To demonstrate the effectiveness of the proposed SRSCNN, the 21-class UCM land-use dataset collected from large optical images of the U.S. Geological Survey was used. The UCM dataset covers various regions of the United States, and includes 21 scene categories, with 100 scene images per category. Each scene image consists of 256 × 256 pixels, with a spatial resolution of one foot per pixel. Figure 6 shows representative images of each class, i.e., agriculture, airplane, baseball diamond, beach, buildings, chaparral, dense residential, forest, freeway, golf course, harbor, intersection, medium residential, mobile home park, overpass, parking lot, river, runway, sparse residential, storage tanks, and tennis court.

Figure 6. Class representatives of the UCM dataset: (a) agriculture, (b) airplane, (c) baseball diamond, (d) beach, (e) buildings, (f) chaparral, (g) dense residential, (h) forest, (i) freeway, (j) golf course, (k) harbor, (l) intersection, (m) medium residential, (n) mobile home park, (o) overpass, (p) parking lot, (q) river, (r) runway, (s) sparse residential, (t) storage tanks, and (u) tennis court.

From Table 2, it can be seen that SRSCNN performs well and achieves the best classification accuracy. The main reason for the comparatively high accuracy achieved by SRSCNN when compared with the non-deep learning methods such as BoVW, LDA, and LLC is that SRSCNN, as a deep learning method, can extract the essential features from the data directly, without prior information, and is a joint learning method which combines feature extraction and scene classification into a whole, leading to the interaction between feature extraction and scene classification during the training phase. Compared with the methods based on deep learning, i.e., S-UFL, GBRCN, LPCNN, and M-DCNN, SRSCNN again achieves a comparatively high classification accuracy. The reason SRSCNN achieves a better accuracy is that the objects play an important role in scene classification; however, the previous methods based on deep learning do not consider the scale variation of the objects, which poses a challenge for the scene classification. In contrast, in order to solve this problem, random-scale stretching is adopted in SRSCNN during the training phase to ensure that the extracted features are robust to scale variation. In order to further demonstrate the effectiveness of the random-scale stretching, two experiments were conducted based on CCNN and SRSCNN-NV, respectively. From Table 2, it can be seen that SRSCNN-NV outperforms CCNN by 1.02%, which demonstrates the effectiveness of the random-scale stretching for remote sensing images. Meanwhile, compared with SRSCNN-NV, SRSCNN is 2.99% better, which shows the power of the multi-perspective fusion. According to the above comparison and analysis, it can be seen that the random-scale stretching and multi-perspective fusion enable SRSCNN to obtain the best results in remote sensing image scene classification. The Kappa of SRSCNN for the UCM dataset is 0.95.

Table 2. Comparison with previously reported accuracies for the UCM dataset.

Classification Method                Classification Accuracy (%)

Non-deep learning methods
  BoVW                               72.05
  PLSA                               80.71
  LDA                                81.92
  SPCK++ [2]                         76.05
  LLC [7]                            82.85
  SPM+SIFT [7]                       82.30
  SSBFC [7]                          91.67
  SAL-PTM [10]                       88.33
  DMTM [14]                          92.92
  SIFT+SC [55]                       81.67
  FK-S [56]                          91.63

Deep learning methods
  M-DCNN [36]                        93.48
  S-UFL [37]                         82.72
  GBRCN [40]                         94.53
  LPCNN [41]                         89.90
  CCNN                               91.56
  SRSCNN-NV                          92.58
  CNNV                               93.92
  SRSCNN                             95.57

From Figure 7a, it can be seen that the 21 classes can be recognized with an accuracy of at least 85%, and most categories are recognized with an accuracy of 100%, i.e., agriculture, airplane, baseball diamond, beach, etc. From Figure 7b, it can be seen that some representative scene images on the left are recognized correctly by SRSCNN and CNNV. It can also be seen that when the scale of the objects varies, CNNV cannot correctly recognize them, but SRSCNN does, due to its ability to extract features that are robust to scale variation. Taking the airplane scenes as an example, the two images on the left are easy to classify, because the scales of the airplanes in these images are normal; however, the airplanes in the right images are very difficult to recognize, because of their different scale, leading to misclassification by CNNV. However, the proposed SRSCNN can still recognize them correctly due to its ability to extract features that are robust to scale change.


Figure 7. (a) The confusion matrix for the UCM dataset based on SRSCNN. (b) Some of the classification results of CNNV and SRSCNN with the UCM dataset. Left: correctly recognized images for both strategies. Right: images correctly recognized by SRSCNN, but incorrectly classified by CNNV.

4.2. Experiment 2: The Google Dataset of SIRI-WHU

The Google dataset of SIRI-WHU, designed by the Intelligent Data Extraction and Analysis of Remote Sensing (RSIDEA) Group of Wuhan University, contains 12 scene classes, with 200 scene images per category, and each scene image consists of 200 × 200 pixels, with a spatial resolution of 2 m per pixel. Figure 8 shows representative images of each class, i.e., meadow, pond, harbor, industrial, park, river, residential, overpass, agriculture, commercial, water, and idle land.

Figure 8. Class representatives of the Google dataset of SIRI-WHU: (a) meadow, (b) pond, (c) harbor, (d) industrial, (e) park, (f) river, (g) residential, (h) overpass, (i) agriculture, (j) commercial, (k) water, and (l) idle land.

From Table 3, it can be seen that SRSCNN performs well and achieves the best classification accuracy. The main reason for the comparatively high accuracy achieved by SRSCNN when compared with the traditional scene classification methods such as BoVW, LDA, and LLC is that SRSCNN, as a deep learning method, can extract the essential features from the data directly, without prior information, and combines feature extraction and scene classification into a whole, leading to the interaction between feature learning and scene classification. Compared with the deep learning methods S-UFL and LPCNN, SRSCNN performs 19.92% and 4.88% better, respectively. The reason SRSCNN achieves a better accuracy is that S-UFL and LPCNN do not consider the scale variation of objects, resulting in a challenge for the scene classification. In contrast, in order to meet this challenge, random-scale stretching is adopted in SRSCNN during the training phase to ensure that the extracted features are robust to scale variation. Compared with SSBFC, DMTM, and FK-S, with only 50% of the samples randomly selected from the dataset as the training samples, the proposed method still performs the best, as shown in Table 4. In order to further demonstrate the effectiveness of the random-scale stretching, two experiments were conducted based on CCNN and SRSCNN-NV, respectively. From Table 3, it can be seen that SRSCNN-NV outperforms CCNN by 2.80%, demonstrating the effectiveness of the random-scale stretching for remote sensing images. Meanwhile, compared with SRSCNN-NV, SRSCNN is 3.70% better, showing the power of the multi-perspective fusion. According to the above comparison and analysis, it can be seen that the random-scale stretching and multi-perspective fusion enable SRSCNN to obtain the best results in remote sensing image scene classification. The Kappa of SRSCNN for the Google dataset of SIRI-WHU is 0.9364 when 80% of the samples are selected for training.

Table 3. Comparison with previously reported accuracies for the Google dataset of SIRI-WHU, with 80% of the samples selected for training.

Classification Method                Classification Accuracy (%)

Non-deep learning methods
  BoVW                               73.93
  SPM-SIFT                           80.26
  LLC                                70.89
  LDA                                66.85

Deep learning methods
  S-UFL [37]                         74.84
  LPCNN [41]                         89.88
  CCNN                               88.26
  SRSCNN-NV                          91.06
  CNNV                               90.69
  SRSCNN                             94.76

Table 4. Comparison with previously published accuracies for the Google dataset of SIRI-WHU, with 50% of the samples selected for training.

Classification Method                Classification Accuracy (%)
SSBFC [7]                            90.86
DMTM [14]                            91.52
LFK-Linear                           88.42
FK-Linear                            87.53
FK-S [56]                            90.40
SRSCNN                               93.44

From Figure 9a, it can be seen that most of the classes can be recognized with an accuracy of at least 92.5%, except for river, overpass, and water. Meadow, harbor, and idle land are recognized with an accuracy of 100%. From Figure 9b, it can be seen that some representative scene images on the left are recognized correctly by SRSCNN and CNNV, i.e., the CNN with voting but not random-scale stretching. It can also be seen that when the object scale varies, CNNV cannot correctly recognize the objects, but SRSCNN correctly recognizes them due to its ability to extract features that are robust to scale variation.

Figure 9. (a) The confusion matrix for the Google dataset of SIRI-WHU based on SRSCNN. (b) Some of the classification results of CNNV and SRSCNN with the Google dataset of SIRI-WHU. Right: images correctly recognized by both strategies. Left: images correctly recognized by SRSCNN, but incorrectly classified by CNNV.

4.3. Experiment 3: The Wuhan IKONOS Dataset

The HSR images in the Wuhan IKONOS dataset were acquired over the city of Wuhan in China by the IKONOS sensor in June 2009. The spatial resolutions of the panchromatic and multispectral images are 1 m and 4 m, respectively. In this paper, all the images in the Wuhan IKONOS dataset were obtained by Gram–Schmidt pan-sharpening with ENVI 4.7 software. The Wuhan IKONOS dataset contains eight scene classes, with 30 scene images per category, and each scene image consists of 150 × 150 pixels, with a spatial resolution of 1 m per pixel. These images were cropped from a large 6150 × 8250 pixel image acquired over Wuhan by the IKONOS sensor in June 2009. Figure 10 shows representative images of each class, i.e., dense residential, idle, industrial, medium residential, parking lot, commercial, vegetation, and water.

Figure 10. Class representatives of the Wuhan IKONOS dataset: (a) dense residential, (b) idle, (c) industrial, (d) medium residential, (e) parking lot, (f) commercial, (g) vegetation, (h) water.

From Table 5, it can be seen that SRSCNN performs well and achieves a better classification accuracy than BoVW, LDA, P-LDA, FK-Linear, and LFK-Linear. However, compared with BoVW and P-LDA, the CNN-based methods such as CCNN and SRSCNN-NV perform worse, and the accuracy improvement of SRSCNN is limited. The reason is that a large dataset is usually required to train a CNN model, but only about 200 images were available for training. In Table 5, there is no big difference in accuracy between CCNN and SRSCNN-NV. The main reason for this is that this dataset is cropped from the same image, so the effect of the scale change problem is small. However, compared with SRSCNN-NV, SRSCNN performs 10.03% better and still performs the best. The Kappa of SRSCNN for the Wuhan IKONOS dataset is 0.8333.



Table 5. Comparison between the previously reported accuracies with the Wuhan IKONOS dataset.

Classification Method    Classification Accuracy (%)
BoVW                     80.75
LDA                      77.34
P-LDA                    84.69
FK-Linear                78.23
LFK-Linear               79.69
CCNN                     74.45
SRSCNN-NV                74.97
CNNV                     79.60
SRSCNN                   85.00

An annotation experiment on the large IKONOS image was conducted with the trained model. During the annotation of the large image, the large image was split into a set of scene images, where the image size and spacing were set to 150 × 150 pixels and 100 × 100 pixels, respectively. The set of scene images is denoted as Ds. The images in Ds were then classified by the trained model. Finally, the annotation result was obtained by assembling the images in Ds. For the overlapping pixels between adjacent images, the class labels were determined by the majority voting rule. The annotation result is shown in Figure 11.

Figure 11. Image annotation using the Wuhan IKONOS dataset. (a) False-color image of the large image with 6150 × 8250 pixels. (b) Annotated large image.
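The tiling-and-voting annotation procedure described above can be summarized in a short sketch. The following is a minimal NumPy illustration under the stated settings (150 × 150 pixel tiles with 100-pixel spacing and eight classes); it is not the authors' implementation, and the classify_scene callable is a hypothetical stand-in for the trained SRSCNN model.

```python
import numpy as np

TILE, STEP, NUM_CLASSES = 150, 100, 8  # tile size, spacing, and class count from the text


def annotate_large_image(image, classify_scene):
    """Annotate a large image by classifying overlapping tiles and voting per pixel."""
    h, w = image.shape[:2]
    votes = np.zeros((h, w, NUM_CLASSES), dtype=np.int32)
    for top in range(0, h - TILE + 1, STEP):
        for left in range(0, w - TILE + 1, STEP):
            tile = image[top:top + TILE, left:left + TILE]
            label = classify_scene(tile)  # predicted scene class index of this tile
            votes[top:top + TILE, left:left + TILE, label] += 1
    # Overlapping pixels take the label with the most votes (majority voting rule).
    return votes.argmax(axis=-1)
```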

5. Discussion

To avoid overfitting, dropout technology is applied in the proposed method. Four different dropout rates p were tested to analyze how this affects classification accuracy. After feature extraction, the number of units in the last fully connected layer decides the length of the feature, which plays an important role in the classification results. In Section 4, we analyzed the classification results when the length of the feature L (i.e., the number of units in the fully connected layer of the adopted CNN model in Table 1) was equal to 600. In this section, the sensitivity analysis between the length of the feature and classification accuracy is described. In SRSCNN, we add random-scale stretching to the CNN to improve remote sensing image scene classification. To investigate the effect of the distribution of the scale variation factor α, a sensitivity analysis of different distributions is undertaken. In addition, for a scene image with a fixed scale, the crop rate Cr decides the approximate crop scale, so three different crop rates are analyzed. In SRSCNN, random spatial scale stretching (Rss) and random rotation (Rot) are adopted, so the effectiveness of random spatial scale stretching and random rotation is analyzed. To test the influence of the spatial resolution, a sensitivity analysis of different resolutions is undertaken. In the training phase, the number of training samples cropped from each image is decided by the number of iterations, so different numbers of iterations are analyzed.

5.1. Sensitivity Analysis in Relation to the Dropout Rate

According to dropout theory, breaking the co-adaptation helps to make each hidden unit more robust, i.e., it decreases the reliance on other units and forces the units to create useful features. Dropout is adopted in the proposed method, which involves the dropout rate parameter p. In the training process, a unit is deleted with probability p, and in order to make sure that the expectation of the hidden unit is unchanged, the output of the hidden unit is multiplied by (1 − p) in the testing. To study the sensitivity of SRSCNN in relation to the dropout rate, experiments were undertaken with the UCM and Google datasets, respectively, with the other parameters kept the same. From Figure 12, when the dropout rate increases from 0.35 to 0.5, the classification accuracy for UCM improves; however, it begins to decrease when the dropout rate continues to increase, achieving the best classification result with a dropout rate of 0.5. For the Google dataset, although the change trend is not as obvious as for UCM, because of the slight decline at 0.5, it can be seen that when the dropout rate increases from 0.35 to 0.65, the classification accuracy for the Google dataset improves, and it begins to decrease when the dropout rate continues to increase, achieving the best classification result with a dropout rate of 0.65.

Figure 12. The relationship between the dropout rate and classification accuracy.
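As a concrete illustration of the dropout scheme described above (delete units with probability p during training, and multiply the activations by (1 − p) at test time so that the expected output is unchanged), a minimal NumPy sketch is given below. It is not tied to the authors' implementation, and the 600-unit feature vector is used only as an example size.

```python
import numpy as np


def dropout_forward(x, p, training):
    """Classical dropout: drop units with probability p while training, and
    scale the activations by (1 - p) at test time so that the expected
    output of each hidden unit is unchanged."""
    if training:
        mask = (np.random.rand(*x.shape) >= p).astype(x.dtype)
        return x * mask
    return x * (1.0 - p)


# Example with dropout rate p = 0.5 on a 600-unit fully connected feature.
h = np.random.randn(600)
h_train = dropout_forward(h, p=0.5, training=True)
h_test = dropout_forward(h, p=0.5, training=False)
```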

5.2. Sensitivity Analysis in Relation to the Length of the Feature

To investigate the effect of the length of the feature L on classification accuracy, the other parameters were again kept the same. The value of the length of the feature L was varied as follows: L = {200, 600, 1000, 1400}. From Figure 13, as the length increases from 200 to 600, the classification accuracy with the UCM dataset increases, and it then decreases when L > 600. However, the accuracy at L = 1000 is still better than at L = 200. The reason for this may be that a feature with L = 1000 has a more powerful representation ability with the 21-class UCM dataset. The optimal value of L is around 600 for UCM. The Google dataset shows the same trend as UCM, achieving the best accuracy at L = 600.

Figure 13. The relationship between the feature length L and classification accuracy.
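The feature length L is simply the width of the last fully connected layer before the classifier. The sketch below shows one way this could be exposed as a hyperparameter in PyTorch; the small convolutional backbone is a placeholder and does not reproduce the architecture in Table 1.

```python
import torch.nn as nn


def build_scene_classifier(num_classes, feature_length=600, dropout_p=0.5):
    """Toy CNN in which the last fully connected layer has `feature_length`
    units, i.e., the feature length L studied in this section."""
    return nn.Sequential(
        nn.Conv2d(3, 64, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        nn.MaxPool2d(2),
        nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Linear(128, feature_length), nn.ReLU(inplace=True),  # feature of length L
        nn.Dropout(dropout_p),
        nn.Linear(feature_length, num_classes),
    )


# L = {200, 600, 1000, 1400} were compared in this section; for example:
model = build_scene_classifier(num_classes=21, feature_length=600)
```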

5.3. Sensitivity Analysis in Relation to the Distribution of the Scale Variation Factor α

The distribution of α makes a great difference in HSR remote sensing image scene classification. When α obeys a normal distribution, i.e., α ∼ N(1, σ²), the standard deviation σ plays an important role in the classification results, and when α obeys a uniform distribution, i.e., α ∼ U(inf, 1.2), inf greatly affects classification accuracy. We tested the UCM and Google datasets with the two kinds of distribution, with the other parameters kept the same. From Figure 14a, in general, as inf of the uniform distribution increases, the classification accuracy with the UCM dataset decreases, but it then leaps at some points, which is also the case for the Google dataset. Both the UCM and Google datasets obtain their worst classification accuracy at 0.8. However, the UCM dataset obtains its best result at 0.7, and the Google dataset at 0.55. When α satisfies a Gaussian distribution, the broad trend of the classification accuracy with the UCM dataset is slightly upwards in most places, although leaps do exist. Compared with the UCM dataset, the classification accuracy with the Google dataset fluctuates significantly, but the trend in Figure 14b is upwards. In this paper, when α ∼ U(inf, 1.2), the smaller the value of inf, the bigger the stretch range. When α ∼ N(1, σ²), the bigger the value of σ, the bigger the stretch range. Therefore, from Figure 14, the conclusion can be made that a large stretch range helps to improve classification accuracy.

Figure 14. (a) The relationship between inf of the uniform distribution and classification accuracy. (b) The relationship between the standard deviation σ of the Gaussian distribution and classification accuracy.
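The two distributions of the scale variation factor α compared above can be written out explicitly. The sketch below samples α from either α ∼ U(inf, 1.2) or α ∼ N(1, σ²); the default parameter values and the clipping of the Gaussian sample to a positive range are assumptions added to keep the sketch self-contained, not settings taken from the paper.

```python
import numpy as np


def sample_scale_factor(dist="uniform", inf=0.55, sigma=0.06):
    """Sample the scale variation factor alpha.

    dist = "uniform": alpha ~ U(inf, 1.2); a smaller inf gives a larger stretch range.
    dist = "normal":  alpha ~ N(1, sigma^2); a larger sigma gives a larger stretch range.
    """
    if dist == "uniform":
        return float(np.random.uniform(inf, 1.2))
    alpha = np.random.normal(loc=1.0, scale=sigma)
    return float(np.clip(alpha, 0.5, 1.5))  # assumed clipping to keep the crop size valid
```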


5.4. Sensitivity Analysis in Relation to the Crop Rate Cr

In this paper, the ratio between the specified scale R and the original image scale S is defined as the crop rate Cr, i.e., Cr = R/S, where S is 256 for the UCM dataset and 200 for the Google dataset. Different crop rates Cr lead to different basic scales of the patches cropped from an image, and patches with different scales may contain different information. We therefore analyzed how the crop rate affects classification accuracy, with the other parameters again kept the same; the results are shown in Table 6. For the UCM and Google datasets, when Cr ranges between 50% and 70%, the classification accuracy fluctuates within a narrow range and the optimal or suboptimal result is obtained. In addition, when Cr reaches around 90%, the accuracy begins to fall rapidly.

Table 6. Sensitivity analysis in relation to the crop rate Cr.

Dataset    Cr = 50%    Cr = 70%    Cr = 90%
UCM        0.9486      0.9557      0.9398
Google     0.9476      0.9468      0.933
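Under the definition Cr = R/S, the base crop size follows directly from the crop rate. The lines below simply work out R for the three tested rates and the two stated image sizes; they are illustrative arithmetic only.

```python
def base_crop_size(crop_rate, image_size):
    """Base crop size R implied by Cr = R / S."""
    return int(round(crop_rate * image_size))


for cr in (0.5, 0.7, 0.9):
    print(cr, base_crop_size(cr, 256), base_crop_size(cr, 200))
# e.g., Cr = 0.7 gives R = 179 for UCM (S = 256) and R = 140 for the Google dataset (S = 200).
```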

5.5. Sensitivity Analysis in Relation to the Rss and Rot

In this paper, random spatial scale stretching (Rss) and random rotation (Rot) are adopted, and we analyzed how the Rss and Rot affect classification accuracy. First, the CNNs with only random spatial scale stretching or only random rotation are denoted as RssCNN and RotCNN, respectively, and the CNN without random spatial scale stretching and random rotation is denoted as NrrCNN. The result is shown in Figure 15. From Figure 15, it can be seen that there is no difference between RssCNN, RotCNN, and NrrCNN for some scene classes, such as agricultural, chaparral, and overpass, which means that these scene classes are largely unaffected by the random spatial scale stretching and random rotation. However, for scene classes such as airplane and storage tank, the spatial scale stretching and random rotation had a relatively large impact. The main reason is that scene classes such as airplane and storage tank are object-centered, i.e., the scene class of an image is decided by the object it contains, and the random spatial scale stretching or random rotation can help the CNN to capture the objects in the object-centered scenes.

Figure 15. The class accuracy of RssCNN, RotCNN, and NrrCNN on the UCM dataset.
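Putting the pieces together, one plausible reading of the Rss + Rot training augmentation is: sample α, take a crop whose side is α times the base crop size R = Cr · S at a random location, rotate it randomly, and stretch it to the fixed network input size. The sketch below follows that reading; the exact crop-size composition, the restriction of the rotation to multiples of 90°, the nearest-neighbor resize, and the input size of 128 pixels are all assumptions rather than details confirmed by the paper.

```python
import numpy as np


def random_scale_rotate_patch(image, crop_rate=0.7, inf=0.55, out_size=128):
    """Rss + Rot sketch: random-scale crop, random 90-degree rotation, then
    stretch (nearest-neighbor resize) to the fixed input size out_size."""
    h, w = image.shape[:2]
    alpha = np.random.uniform(inf, 1.2)               # scale variation factor
    side = int(round(alpha * crop_rate * min(h, w)))  # assumed crop side = alpha * Cr * S
    side = max(1, min(side, min(h, w)))
    top = np.random.randint(0, h - side + 1)
    left = np.random.randint(0, w - side + 1)
    patch = np.rot90(image[top:top + side, left:left + side], k=np.random.randint(4))
    rows = (np.arange(out_size) * patch.shape[0] / out_size).astype(int)
    cols = (np.arange(out_size) * patch.shape[1] / out_size).astype(int)
    return patch[rows][:, cols]
```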

5.6. Sensitivity Analysis in Relation to the Spatial Resolution

To test the effectiveness of the proposed method, three datasets with different spatial resolutions were tested, i.e., the UCM dataset with a spatial resolution of 0.3 m, the Google dataset of SIRI-WHU with 2 m, and the Wuhan IKONOS dataset with 1 m. From the experiments in Section 4, it can be seen that the proposed method works well on these three datasets of different spatial resolutions. In addition, to test the influence of the spatial resolution on accuracy, four lower spatial resolution datasets were derived from the UCM dataset and the Google dataset of SIRI-WHU by factor-of-4 and factor-of-8 down-sampling. We denote the original data as OD, and the data obtained by factor-of-4 and factor-of-8 down-sampling are denoted as OD-4 and OD-8, respectively. The accuracy comparison is shown in Table 7.

Table 7. Sensitivity analysis in relation to the spatial resolution.

Dataset    OD        OD-4      OD-8
UCM        0.9557    0.9500    0.9071
Google     0.9476    0.9396    0.8958
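The OD-4 and OD-8 variants can be produced by straightforward block-averaging down-sampling; the sketch below is one way to derive them. The paper does not state which resampling method was used, so the block averaging here is an assumption.

```python
import numpy as np


def downsample(image, factor):
    """Down-sample an H x W x B image by `factor` using block averaging
    (one plausible way to derive the OD-4 and OD-8 datasets)."""
    h, w = image.shape[:2]
    h, w = h - h % factor, w - w % factor  # trim so the size divides evenly
    blocks = image[:h, :w].reshape(h // factor, factor, w // factor, factor, -1)
    return blocks.mean(axis=(1, 3)).astype(image.dtype)


od = np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8)  # stand-in for one UCM image
od4, od8 = downsample(od, 4), downsample(od, 8)                # 64 x 64 and 32 x 32 pixels
```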

From Table 7, it can be seen that the accuracy falls when the spatial resolution decreases, and it falls rapidly with factor-of-8 down-sampling. To explore how the scene classes are affected by the spatial resolution, we compare the class accuracies of OD and OD-8 in Figure 16. From Figure 16, it can be seen that the accuracy of most scene classes falls when the spatial resolution is decreased, particularly for the object-centered scene classes such as airplane and storage tank. The main reason is that, when the resolution decreases, it becomes difficult to capture the key objects, such as airplanes and storage tanks, in the object-centered scenes.


Figure 16. The class accuracy on OD and OD-8: (a) UCM dataset; (b) Google dataset of SIRI-WHU.

5.7.5.7. Sensitivity Sensitivity Analysis Analysis in in Relation Relation to to the the Training Training Iteration Iteration InIn SRSCNN, SRSCNN, the the training training samples samples are are cropped cropped from from each each image image and and the the training training samples samples within within eacheach image image may may be be different. different. Given Given the the total total training training sample sample number numbern, n the, the batch batch size sizem ,m the, the number number ofof training training samples samples extracted extracted from from each each image image is determinedis determined by: by: NumS=× N m n (16) where NumS denotes the number of trainingNumS =samplesN × m extracted/n from each image. From Equation(16) (16), it can be seen that the NumS is proportional to the iteration N. To explore how the number of whereiterationNumS affectsdenotes accuracy, the number the accuracy of training comparison samples extractedbetween different from each training image. iteration From Equation times is (16),shown it can in beFigure seen 17. that From the NumSFigure is17, proportional it can be seen to that the accuracy iteration increasesN. To explore with howiteration the numbertimes before of iteration60000 and affects the accuracy, standard the deviation accuracy falls comparison as the iterat betweenion times different get larger. training When iteration the iteration times is shown times is inequal Figure to 17 70,000,. From there Figure will 17 be, it about can be 2200 seen training that accuracy sample increases extracted with from iteration each image times for before training, 60000 and andit can the standardget the optimal deviation or fallssuboptimal as the iteration result and times the get standard larger. Whendeviation the iterationbecome smaller. times is equalThe main to 70,000,reason there the willstandard be about deviation 2200 training falls as samplethe iteration extracted times from get eachlarger image is the for NumS training, will and become it can larger, get thewhich optimal means or suboptimalthere will be resultmore training and the samples standard extracted deviation from become each image. smaller. The main reason the standard deviation falls as the iteration times get larger is the NumS will become larger, which means there will be more training samples extracted from each image.


Figure 17. The overall accuracy of SRSCNN on the UCM dataset and the Google dataset of SIRI-WHU with different numbers of iterations.

6. Conclusions

In this paper, SRSCNN, a novel algorithm based on a deep convolutional neural network, has been proposed and implemented to solve the object scale change problem in remote sensing scene classification. In the proposed algorithm, random-scale stretching is proposed to force the CNN model to learn a feature representation that is robust to object scale variation. During the course of the training, a patch with a random scale and location is cropped and then stretched to a fixed scale to train the CNN model. In the testing stage, multiple patches with a random scale and location are cropped from each image to be the inputs of the trained CNN. Finally, under the assumption that the patches have the same label as the image, multi-perspective fusion is adopted to combine the predictions of every patch to decide the final label of the image.

To test the performance of SRSCNN, three datasets, mainly covering urban regions of the United States and China, i.e., the 21-class UCM dataset, the 12-class Google dataset of SIRI-WHU, and the Wuhan IKONOS dataset, were used. The results of the experiments consistently showed that the proposed method can obtain a high classification accuracy. When compared with traditional methods such as BoVW and LDA, and with other methods based on deep learning, such as S-UFL and GBRCN, SRSCNN obtains the best classification accuracy, demonstrating the effectiveness of random-scale stretching for remote sensing image scene classification.

Nevertheless, a large dataset is required when the CNN model is trained for remote sensing scene classification, and it is still a challenge for the proposed method when only a small dataset is available for training. Semi-supervised learning, as one of the popular techniques for training with labeled and unlabeled data, should be considered to meet this challenge. In practical use, multiple spatial resolution scenes, where the spatial resolution may be 0.3 m for some scene classes such as tennis court and 30 m for a scene class such as island, are often processed together. Hence, in our future work, we plan to explore multiple spatial resolution scene classification based on the semi-supervised CNN model.

Acknowledgments: The authors would like to thank the editor and the anonymous reviewers for their comments and suggestions. This work was supported by the National Natural Science Foundation of China under Grant Nos. 41622107 and 41771385, the National Key Research and Development Program of China under Grant No. 2017YFB0504202, The Application Research of the Remote Sensing Technology on Global Energy Internet under Grant No. JYYKJXM(2017)011, and the Natural Science Foundation of Hubei Province in China under Grant No. 2016CFA029.

Author Contributions: All the authors made significant contributions to the work. Yanfei Liu, Yanfei Zhong, and Fei Feng designed the research and analyzed the results. Qianqing Qin provided advice for the preparation of the paper.

Conflicts of Interest: The authors declare no conflict of interest.


© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).