MARTA GANs: Unsupervised Representation Learning for Remote Sensing Image Classification

Daoyu Lin, Kun Fu, Yang Wang, Guangluan Xu, and Xian Sun

Abstract—With the development of deep learning, supervised learning has frequently been adopted to classify remotely sensed images using convolutional neural networks (CNNs). However, due to the limited amount of labeled data available, supervised learning is often difficult to carry out. Therefore, we proposed an unsupervised model called multiple-layer feature-matching generative adversarial networks (MARTA GANs) to learn a representation using only unlabeled data. MARTA GANs consists of both a generative model G and a discriminative model D. We treat D as a feature extractor. To fit the complex properties of remote sensing data, we use a fusion layer to merge the mid-level and global features. G can produce numerous images that are similar to the training data; therefore, D can learn better representations of remotely sensed images using the training data provided by G. The classification results on two widely used remote sensing image datasets show that the proposed method significantly improves the classification performance compared with other state-of-the-art methods.

Index Terms—Unsupervised representation learning, generative adversarial networks, scene classification.

(This work was supported in part by the National Natural Science Foundation of China under Grants No. 41501485 and No. 61331017. Corresponding author: Kun Fu. The authors are with the Key Laboratory of Technology in Geo-spatial Information Processing and Application System, Chinese Academy of Sciences, Beijing, 100190, China; e-mail: [email protected]; [email protected]; [email protected]; [email protected]; [email protected].)

Figure 1: Overview of the proposed approach. The discriminator D learns to make classifications between real and synthesized images, while the generator G learns to fool the discriminator.

I. INTRODUCTION

As satellite imaging techniques improve, an ever-growing number of high-resolution satellite images provided by special satellite sensors have become available. It is urgent to be able to interpret these massive image repositories in automatic and accurate ways. In recent decades, scene classification has become a hot topic and is now a fundamental method for land-resource management and urban planning applications. Compared with other images, remote sensing images have several special features. For example, even in the same category, the objects of interest usually have different sizes, colors, and angles. Moreover, other materials around the target area cause high intra-class variance and low inter-class variance. Therefore, learning robust and discriminative representations from remotely sensed images is difficult.

Previously, the bag of visual words (BoVW) [1] method was frequently adopted for remote sensing scene classification. BoVW includes the following three steps: feature detection, feature description, and codebook generation. To overcome the problems of the orderless bag-of-features image representation, the spatial pyramid matching (SPM) model [2] was proposed, which works by partitioning the image into increasingly fine sub-regions and computing histograms of local features found inside each sub-region. The above-mentioned methods comprised the state of the art in the remote sensing community for several years [3], but they are based on hand-crafted features, which are difficult, time-consuming, and require domain expertise to produce.

Deep learning can learn high-level semantic features automatically rather than requiring handcrafted features. Some approaches [4], [5] based on convolutional neural networks (CNNs) [6] have achieved success in remote sensing scene classification, but those methods usually require an enormous amount of labeled training data or must be fine-tuned from pre-trained CNNs.

Several unsupervised representation learning algorithms have been based on the autoencoder [7], [8], which receives corrupted data as input and is trained to predict the original, uncorrupted input. Although training an autoencoder requires only unlabeled data, input reconstruction may not be the ideal metric for learning a general-purpose representation. The concept of generative adversarial networks (GANs) [9] is one of the most exciting unsupervised ideas to appear in recent years; its purpose is to learn a generative distribution of data through a two-player minimax game. In subsequent work, a deep convolutional GAN (DCGAN) [10] achieved a high level of performance on image synthesis tasks, showing that its latent representation space captures important variation factors.

GANs are a promising unsupervised learning method, yet thus far they have rarely been applied in the remote sensing field. Due to the tremendous volume of remote sensing images, it would be prohibitively time-consuming and expensive to label all the data. To tackle this issue, GANs are an excellent choice, because they are an unsupervised learning method in which the required quantities of training data are provided by the generator. Therefore, in this paper, we propose a multiple-layer feature-matching generative adversarial networks (MARTA GANs) model to learn the representation of remote sensing images using unlabeled data.

Figure 2: Network architectures of the generator and the discriminator: (a) the MARTA GANs generator used for the UC-Merced Land-use dataset. The input is a 100-dimensional uniform distribution p_z(z), projected and reshaped to a 4×4 tensor and upsampled by six 4×4 deconvolutions; the output is a 256×256-pixel RGB image. (b) The MARTA GANs discriminator used for the UC-Merced Land-use dataset: six 4×4 convolutions (16, 32, 64, 128, 256, and 512 channels), whose last three feature maps are pooled (4×4 max pooling, 2×2 max pooling, and the identity) to a common 4×4 size and concatenated into the multi-feature layer (128 + 256 + 512 = 896 channels, flattened to 896 × 4 × 4 = 14336 features). The discriminator is treated as a feature extractor that extracts features from the multi-feature layer.

Although based on DCGAN, our approach differs in the following aspects. 1) DCGAN can, at most, produce images with a 64×64 resolution, while our approach can produce remote sensing images with a resolution of 256×256 by adding two deconvolutional layers to the generator. 2) To avoid the problem of such deconvolutional layers producing checkerboard artifacts, the kernel sizes of our networks are 4×4, while those in DCGAN are 5×5. 3) We propose a multi-feature layer to aggregate the mid- and high-level information. 4) We combine both the perceptual loss and the feature-matching loss to produce more accurate fake images. Based on these improvements, our method learns better representations of remote sensing images than the prior methods. Fig. 1 shows the overall model.

The contributions of this paper are the following:
1) To our knowledge, this is the first time that GANs have been applied to the unsupervised classification of remote sensing images.
2) The results of experiments on the UC-Merced Land-use and Brazilian Coffee Scenes datasets showed that the proposed algorithm outperforms state-of-the-art unsupervised algorithms in terms of overall classification accuracy.
3) We propose a multi-feature layer, combining the perceptual loss and the feature-matching loss, to learn better image representations.

II. METHOD

A GAN is most straightforward to apply when the involved models are both multilayer perceptrons; however, to apply a GAN to remote sensing images, we used CNNs for both the generator and the discriminator in this work. The generator network directly produces samples x = G(z; \theta_g) with parameters \theta_g, where z obeys a prior noise distribution p_z(z). Its adversary, the discriminator network, attempts to distinguish between samples drawn from the training data and samples created by the generator. The discriminator emits a probability value denoted by D(x; \theta_d) with parameters \theta_d, indicating the probability that x is a real training example rather than a fake sample drawn from the generator. During the classification task, the discriminative model D is regarded as the feature extractor, and additional training data, from which the discriminator can learn a better representation, are provided by the generative model G.

A. Training the discriminator

When training the discriminator, the weights of the generator are fixed. The goals of training the discriminator D(x) are as follows:
1) Maximize D(x) for every image from the real training examples.
2) Minimize D(x) for every image from the fake samples drawn from the generator.
Therefore, the objective of training the discriminator is to maximize

\mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))].   (1)

B. Training the generator

When training the generator, the weights of the discriminator are fixed. The goal of training the generator G(z) is to produce samples that fool D. The output of the generator is an image that can be used as the input for the discriminator. Therefore, the generator wants to maximize D(G(z)) (or, equivalently, minimize 1 - D(G(z))) because D is a probability estimate that ranges only between 0 and 1. We call this concept the perceptual loss; it encourages the reconstructed image to be similar to the samples drawn from the training set by minimizing

\ell_{perceptual} = \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))].   (2)

In summary, the discriminator D is shown an image produced by the generator G and adjusts its parameters to make its output, D(G(z)), smaller, while G trains itself to produce images that fool D into thinking they are real; it does this by using the gradient of D with respect to each sample it produces. In other words, G is trying to minimize the objective while D is trying to maximize it; consequently, it is a minimax game that is defined as follows:

\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))].   (3)
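For concreteness, the following TensorFlow 1.x-style sketch expresses objectives (1) and (2) with sigmoid cross-entropy on the discriminator's logits. This is an illustrative re-implementation, not the authors' TensorLayer code; the function name and the use of the non-saturating surrogate for Eq. (2) are our choices.

```python
import tensorflow as tf  # TensorFlow 1.x-style API assumed


def adversarial_losses(real_logits, fake_logits):
    """Build both sides of the minimax game from the discriminator's
    pre-sigmoid outputs on a real batch and on a generated batch."""
    bce = tf.nn.sigmoid_cross_entropy_with_logits

    # Eq. (1): D maximizes E[log D(x)] + E[log(1 - D(G(z)))], i.e. it
    # minimizes cross-entropy with label 1 on real images and label 0
    # on fake images.
    d_loss = tf.reduce_mean(
        bce(labels=tf.ones_like(real_logits), logits=real_logits)
        + bce(labels=tf.zeros_like(fake_logits), logits=fake_logits))

    # Eq. (2): the perceptual loss for G. Minimizing cross-entropy with
    # label 1 on fake images, i.e. -log D(G(z)), is the standard
    # non-saturating surrogate for minimizing log(1 - D(G(z))).
    perceptual_loss = tf.reduce_mean(
        bce(labels=tf.ones_like(fake_logits), logits=fake_logits))

    return d_loss, perceptual_loss
```

Alternating gradient steps that decrease d_loss (with the generator frozen) and perceptual_loss (with the discriminator frozen) realize the minimax game of Eqn. 3.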

To make the images generated by the generator more similar to the real images, we also train the generator to match the expected values of the features in the multi-feature layer of the discriminator. Letting f(x) denote the activations of the multi-feature layer of the discriminator, the feature-matching loss for the generator is defined as follows:

\ell_{feature\_match} = \| \mathbb{E}_{x \sim p_{data}(x)} f(x) - \mathbb{E}_{z \sim p_z(z)} f(G(z)) \|_2^2.   (4)

Therefore, our final objective for training the generator (the combination of Eqn. 2 and Eqn. 4) is to minimize

\ell_{final} = \ell_{perceptual} + \ell_{feature\_match}.   (5)

C. Network architectures

We set the kernel sizes to 4×4 and the stride to 2 in all the convolutional and deconvolutional layers, because deconvolutional layers avoid uneven overlap when the kernel size is divisible by the stride [11]. In the generator, all layers use the ReLU activation except for the output layer, which uses the tanh function. We use the LeakyReLU activation in the discriminator for all the convolutional layers; the slope of the leak is set to 0.2. We use batch normalization in both the generator and the discriminator, with a decay factor of 0.9.

The generator takes 100 random numbers drawn from a uniform distribution as input. The result is projected and reshaped into a four-dimensional tensor, and six deconvolutional layers then learn the spatial upsampling that takes the 4×4 feature maps up to 256×256 remote sensing images. Fig. 2a shows a visualization of the generator.

For the discriminator, the first layer takes input images, including both real and synthesized images. We use strided convolutions in our discriminator, which allows it to learn its own spatial downsampling. As shown in Fig. 2b, by applying 4×4 max pooling, 2×2 max pooling, and the identity function separately to the outputs of the last three convolutional layers, we can produce feature maps that all have the same spatial size, 4×4. We then concatenate these 4×4 feature maps along the channel dimension in the multi-feature layer. Finally, the multi-feature layer is flattened and fed into a single sigmoid output. The multi-feature layer serves two functions: 1) the features used for classification are extracted from the flattened multi-feature layer; and 2) when training the generator, the feature-matching loss (Eqn. 4) evaluates the similarity between the features of fake and real images in the flattened multi-feature layer.
III. EXPERIMENTS

To verify the effectiveness of the proposed method, we trained MARTA GANs on two datasets: the UC Merced Land Use dataset [12] and the Brazilian Coffee Scenes dataset [4]. We carried out experiments on both datasets using a 5-fold cross-validation protocol and a regularized linear L2-SVM as the classifier. We implemented MARTA GANs in TensorLayer¹, a deep learning and reinforcement learning library extended from Google TensorFlow [13]. We scaled the input images to the range [-1, 1] before training. All the models were trained by SGD with a batch size of 64, using the Adam optimizer with a learning rate of 0.0002 and a momentum term β₁ of 0.5.

¹http://tensorlayer.readthedocs.io/en/latest/

A. UC-Merced dataset

This dataset consists of images of 21 land-use classes (100 256×256-pixel images for each class). Some of the images from this dataset are shown in Fig. 3a. We used moderate data augmentation on this dataset, flipping the images horizontally and vertically and rotating them by 90 degrees, to increase the effective training set size. Training takes approximately 4 hours on a single NVIDIA GTX 1080 GPU.

Figure 3: Part of exemplary images. (a) Ten random images from the UC-Merced dataset. (b) Exemplary images produced by a generator trained on UC-Merced using the \ell_{final} (Eqn. 5) objective.

In recent years, several approaches have been developed for remote sensing scene classification, and most of them have been verified on the UC-Merced dataset, so many published results are available for comparison with ours. In Table II we report the overall accuracies of these comparable methods, as they appear in the original papers, together with the accuracy of our best MARTA GANs configuration.

To evaluate the quality of the representations learned by the multi-feature layer, we trained on the UC-Merced data and extracted features from different multi-feature layers. For clarity of expression, we use f1 to denote the features from the last convolutional layer, f2 to denote the combined features from the last two convolutional layers, and so on. Based on the results shown in Fig. 4, we found that f3 achieved the highest accuracy.
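Each of the fk feature sets was evaluated with the same protocol. A minimal scikit-learn sketch of that protocol is shown below, assuming the flattened multi-feature-layer activations have already been exported for every image (the file names are hypothetical, and LinearSVC with its default L2 penalty stands in for the regularized linear L2-SVM).

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

# Hypothetical dumps: one feature row per image (e.g. 14336-d for f3)
# and the corresponding integer scene labels.
features = np.load('uc_merced_features.npy')
labels = np.load('uc_merced_labels.npy')

clf = LinearSVC(C=1.0)  # linear L2-SVM; C=1.0 is our choice
scores = cross_val_score(clf, features, labels, cv=5)  # 5-fold CV
print('accuracy: %.2f +/- %.2f' % (100 * scores.mean(),
                                   100 * scores.std()))
```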

Table I: Classification accuracy (%), as mean ± standard deviation, of DCGAN and MARTA GAN for every class. The class labels are as follows: 1 = Mobile home park, 2 = Beach, 3 = Tennis courts, 4 = Airplane, 5 = Dense residential, 6 = Harbor, 7 = Buildings, 8 = Forest, 9 = Intersection, 10 = River, 11 = Sparse residential, 12 = Runway, 13 = Parking lot, 14 = Baseball diamond, 15 = Agricultural, 16 = Storage tanks, 17 = Chaparral, 18 = Golf course, 19 = Freeway, 20 = Medium residential, and 21 = Overpass.

Class       1          2          3          4          5          6          7          8          9          10         11
DCGAN       85 ± 5.0   94 ± 2.2   89 ± 4.2   95 ± 3.5   82 ± 2.7   91 ± 2.2   78 ± 2.7   83 ± 2.7   88 ± 2.7   90 ± 0.0   79 ± 2.2
MARTA GAN   95 ± 3.5   100 ± 0.0  96 ± 4.2   100 ± 0.0  89 ± 4.2   99 ± 2.2   86 ± 6.5   97 ± 2.7   98 ± 2.7   94 ± 2.2   89 ± 2.2

Class       12         13         14         15         16         17         18         19         20         21
DCGAN       89 ± 4.2   88 ± 2.7   95 ± 3.5   78 ± 4.5   93 ± 2.7   88 ± 2.7   97 ± 2.7   77 ± 2.7   95 ± 5.0   89 ± 4.2
MARTA GAN   94 ± 4.2   98 ± 2.7   100 ± 0.0  85 ± 5.0   100 ± 0.0  93 ± 2.7   100 ± 0.0  87 ± 5.7   97 ± 5.5   95 ± 5.0

Figure 4: Performance comparison using different features; each panel plots average accuracy against training epoch (0–200). (a) f1; (b) f2; (c) f3; (d) f4. Red curves: training with \ell_{final} and with data augmentation; cyan curves: training with \ell_{perceptual} and with data augmentation; yellow curves: training with \ell_{final} and without data augmentation; green curves: training with \ell_{perceptual} and without data augmentation.

Figure 5: Confusion matrices of (a) DCGAN and (b) MARTA GAN. The class labels are the same as in Table I.

These results can be explained by two reasons. First, f3 has the same high-level information as f1 and f2 but contains more mid-level information than either. However, f4 contains too much low-level information, which leads to the "curse of dimensionality." Therefore, the features extracted from the last three convolutional layers of the discriminator yield the highest accuracy.

As shown in Fig. 4, data augmentation is an effective way to reduce overfitting when training a large deep network; augmentation generates more training samples by rotating and flipping patches from the original images. We also compared the two loss functions, \ell_{perceptual} (Eqn. 2) and \ell_{final} (Eqn. 5), and found that \ell_{final} achieved the best performance. Remote sensing images synthesized using \ell_{final} are shown in Fig. 3b.

Fig. 5 depicts the confusion matrices of the classification results for the two GAN architectures: DCGAN and MARTA GAN reached overall accuracies of 87.76 ± 0.64% and 94.86 ± 0.80%, respectively. MARTA GAN is approximately 7% better because it uses the multi-feature layer to merge the mid-level and global features. For a per-class comparison, the classification accuracies of the two methods on every class are shown in Table I. Compared with DCGAN, MARTA GAN achieves 100.00% accuracy on several scene categories (e.g., Beach and Airplane). Moreover, MARTA GAN also achieves higher accuracy on some very similar classes, such as dense residential, buildings, medium residential, and sparse residential.

In addition, we visualized the global image representations of the UC-Merced dataset encoded via the MARTA GANs features. We computed the features for all the scenes of the dataset and then used the t-SNE algorithm to embed the high-dimensional features in 2-D space. The final results are shown in Fig. 6. This visualization shows that the features extracted from the multi-feature layer contain abstract semantic information, because similar classes remain close together in the 2-D space.

Figure 6: 2-D feature visualization of the global image representations of the UC-Merced dataset. The class labels are the same as in Table I.

Compared with the other tested methods, the method proposed in this work achieves the highest classification accuracy among the unsupervised methods. As shown in Table II, our method outperforms SCMF [14] (a sparse-coding-based multiple-feature fusion method) by 3.82%, and it outperforms UFL-SC [15] (an unsupervised feature learning algorithm based on spectral clustering) by more than 4%. While some of the supervised methods [4], [5] achieve an accuracy above 99%, those methods are fine-tuned from pre-trained models, which are usually trained with a large amount of labeled data (such as ImageNet). Compared with those methods, our unsupervised method requires far fewer parameters.
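The 2-D embedding in Fig. 6 can be reproduced in outline with scikit-learn's t-SNE and matplotlib. This is a generic sketch rather than the authors' script; the perplexity value and the colormap are assumptions, and the feature files are the hypothetical dumps used in the SVM example above.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.manifold import TSNE

features = np.load('uc_merced_features.npy')  # multi-feature-layer rows
labels = np.load('uc_merced_labels.npy')      # integer class labels

# Embed the high-dimensional representations in 2-D.
embedding = TSNE(n_components=2, perplexity=30).fit_transform(features)

plt.scatter(embedding[:, 0], embedding[:, 1], c=labels, cmap='tab20', s=8)
plt.title('t-SNE of multi-feature-layer representations')
plt.show()
```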


Figure 7: Parts of exemplary images: (a) ten random images from the Brazilian Coffee Scenes dataset; (b) exemplary images produced by a generator trained on the Brazilian Coffee Scenes dataset using the \ell_{final} (Eqn. 5) objective.

B. Brazilian Coffee Scenes dataset

To evaluate the generalization power of our model, we also performed experiments on the Brazilian Coffee Scenes dataset [4], a composition of scenes taken by the SPOT sensor in 2005 over four counties in the State of Minas Gerais, Brazil: Arceburgo, Guaranésia, Guaxupé and Monte Santo. The whole image set of each county was partitioned into multiple tiles of 64×64 pixels, and only the green, red, and near-infrared bands were kept, because these are the most useful and representative bands for discriminating vegetation areas. Tiles with at least 85% coffee pixels were assigned to the coffee class, and tiles with less than 10% coffee pixels were assigned to the non-coffee class, yielding 2,876 multispectral high-resolution scenes: 1,438 tiles of coffee and 1,438 tiles of non-coffee. Fig. 7a shows some examples from this dataset, and Fig. 7b shows examples produced by a generator trained on it. We did not use data augmentation on this dataset because it contains sufficient data to train the network.

Table III shows the results obtained with the proposed techniques on this dataset. The most notable difference with respect to the UC-Merced case is that data augmentation does not improve the accuracy conspicuously: this is only a 2-class problem, and there is enough data to train the network without augmentation. In general, the results are significantly worse than those on the UC-Merced dataset, despite the reduction from a 21-class to a 2-class problem. Brazilian Coffee Scenes is a challenging dataset because of the high intra-class variability caused by different crop management techniques, different plant ages, and spectral distortions and shadows. Nevertheless, as Table II shows, our results are better than that of BIC [4].

Table II: Overall classification accuracy (%) of reference and proposed methods on the UC-Merced dataset and the Coffee Scenes dataset. Our result is in bold.

DataSet     Method                        Description    Parameters  Accuracy
UC-Merced   SCMF [14]                     Unsupervised   -           91.03 ± 0.48
            UFL-SC [15]                   Unsupervised   -           90.26 ± 1.51
            OverFeat_L + OverFeat_S [4]   Supervised     205M        99.43 ± 0.27
            GoogLeNet [5]                 Supervised     5M          99.47 ± 0.50
            MARTA GANs (proposed)         Unsupervised   2.8M        94.86 ± 0.80
Coffee      BIC [4]                       Unsupervised   -           87.03 ± 1.07
            OverFeat_L + OverFeat_S [4]   Supervised     289M        83.04 ± 2.00
            CaffeNet [5]                  Supervised     60M         94.45 ± 1.20
            MARTA GANs (proposed)         Unsupervised   0.18M       89.86 ± 0.98

Table III: Classification accuracy (%) of representations extracted by DCGAN and MARTA GANs on the Coffee Scenes dataset. Best result in bold.

Architecture  Design                                           Accuracy
DCGAN         Without data augmentation                        85.36
              With data augmentation                           85.01
MARTA GANs    Without data augmentation                        87.69
              With data augmentation, \ell_{perceptual} only   87.73
              With data augmentation, \ell_{final}             88.36
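Since Table III isolates the contribution of data augmentation, a minimal NumPy sketch of the flip-and-rotate scheme used on UC-Merced (Section III-A) is given below. The paper specifies only horizontal/vertical flips and 90-degree rotations, so the exact set of variants generated here is an assumption.

```python
import numpy as np

def augment(image):
    """Return simple augmented variants of one [H, W, C] image array:
    the original, its horizontal and vertical flips, and its
    90/180/270-degree rotations (the latter two are our extension)."""
    variants = [image, np.fliplr(image), np.flipud(image)]
    variants += [np.rot90(image, k) for k in (1, 2, 3)]
    return variants
```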
IV. CONCLUSION

This paper introduced a representation learning algorithm called MARTA GANs. In contrast to previous approaches that require supervision, MARTA GANs is completely unsupervised and can learn interpretable representations even from challenging remote sensing datasets. In addition, MARTA GANs introduces a new multiple-feature-matching layer that learns multi-scale spatial information for high-resolution remote sensing images. Possible future extensions of the work described in this paper include producing high-quality samples of remote sensing images using the generator and classifying remote sensing images in a semi-supervised manner to further improve classification accuracy.

REFERENCES

[1] J. Sivic and A. Zisserman, "Video Google: A text retrieval approach to object matching in videos," in Proceedings of the IEEE International Conference on Computer Vision, 2003.
[2] S. Lazebnik, C. Schmid, and J. Ponce, "Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories," in 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, 2006, pp. 2169–2178.
[3] M. Lienou, H. Maitre, and M. Datcu, "Semantic annotation of satellite images using latent Dirichlet allocation," IEEE Geoscience and Remote Sensing Letters, vol. 7, no. 1, pp. 28–32, 2010.
[4] O. A. Penatti, K. Nogueira, and J. A. dos Santos, "Do deep features generalize from everyday objects to remote sensing and aerial scenes domains?" in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2015, pp. 44–51.
[5] K. Nogueira, O. A. Penatti, and J. A. dos Santos, "Towards better exploiting convolutional neural networks for remote sensing scene classification," Pattern Recognition, vol. 61, pp. 539–556, 2017.
[6] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[7] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol, "Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion," Journal of Machine Learning Research, vol. 11, pp. 3371–3408, 2010.
[8] A. Makhzani and B. Frey, "K-sparse autoencoders," arXiv preprint arXiv:1312.5663, 2013.
[9] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," in Advances in Neural Information Processing Systems, 2014, pp. 2672–2680.
[10] A. Radford, L. Metz, and S. Chintala, "Unsupervised representation learning with deep convolutional generative adversarial networks," arXiv preprint arXiv:1511.06434, 2015.
[11] A. Odena, V. Dumoulin, and C. Olah, "Deconvolution and checkerboard artifacts," Distill, 2016. [Online]. Available: http://distill.pub/2016/deconv-checkerboard
[12] Y. Yang and S. Newsam, "Bag-of-visual-words and spatial extensions for land-use classification," in Proceedings of the 18th SIGSPATIAL International Conference on Advances in Geographic Information Systems, ACM, 2010, pp. 270–279.
[13] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin et al., "TensorFlow: Large-scale machine learning on heterogeneous distributed systems," arXiv preprint arXiv:1603.04467, 2016.
[14] G. Sheng, W. Yang, T. Xu, and H. Sun, "High-resolution satellite scene classification using a sparse coding based multiple feature combination," International Journal of Remote Sensing, vol. 33, no. 8, pp. 2395–2412, 2012.
[15] F. Hu, G.-S. Xia, Z. Wang, X. Huang, L. Zhang, and H. Sun, "Unsupervised feature learning via spectral clustering of multidimensional patches for remotely sensed scene classification," IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 8, no. 5, 2015.