

Article

Evaluation and Comparison of Random Forest and A-LSTM Networks for Large-scale Winter Wheat Identification

Tianle He 1,2 , Chuanjie Xie 1,*, Qingsheng Liu 1, Shiying Guan 1,3 and Gaohuan Liu 1

1 State Key Laboratory of Resources and Environmental Information System, Institute of Geographic Sciences and Natural Resources Research, Chinese Academy of Sciences, Beijing 100101, China
2 University of Chinese Academy of Sciences, Beijing 100049, China
3 Henan Polytechnic University, Jiaozuo 454000, China
* Correspondence: [email protected]; Tel.: +86-136-8149-8766

Received: 29 May 2019; Accepted: 7 July 2019; Published: 12 July 2019

Abstract: Machine learning comprises a group of powerful state-of-the-art techniques for land cover classification and cropland identification. In this paper, we proposed and evaluated two models based on random forest (RF) and attention-based long short-term memory (A-LSTM) networks that can learn directly from the raw surface reflectance of remote sensing (RS) images for large-scale winter wheat identification in the Huanghuaihai Region (North-Central China). We used a time series of Moderate Resolution Imaging Spectroradiometer (MODIS) images over one growing season and the corresponding winter wheat distribution map for the experiments. Each training sample was derived from the raw surface reflectance of MODIS time-series images. Both models achieved state-of-the-art performance in identifying winter wheat, and the F1 scores of RF and A-LSTM were 0.72 and 0.71, respectively. We also analyzed the impact of the pixel-mixing effect. Training with pure-mixed-pixel samples (the training set consists of pure and mixed cells and thus retains the original distribution of data) was more precise than training with only pure-pixel samples (the entire pixel area belongs to one class). We also analyzed the variable importance along the temporal series, and the data acquired in March or April contributed more than the data acquired at other times. Both models could predict winter wheat coverage in past years or in other regions with similar winter wheat growing seasons. The experiments in this paper showed the effectiveness and significance of our methods.

Keywords: winter wheat identification; random forest; A-LSTM; pixel-mixing effect; variable importance analysis

1. Introduction

For many years, remote sensing (RS) systems have been widely applied for agricultural monitoring and crop identification [1–4], and these systems provide many surface reflectance images that can be utilized to derive hidden patterns of vegetation coverage. In crop identification tasks, the information of growing dynamics or sequential relationships derived from time-series images is used to perform classification. Although high-spatial-resolution datasets such as Landsat have clear advantages for capturing the fine spatial details of the land surface, such datasets typically do not have high temporal coverage frequency over large regions and are often badly affected by extensive cloud cover. However, coarse-resolution sensors such as the Moderate Resolution Imaging Spectroradiometer (MODIS) provide data at a near-daily observational coverage frequency and over large areas [5]. While MODIS data are not a proper option for resolving smaller field sizes, they do provide a valuable balance between high temporal frequency and high spatial resolution [6].

Remote Sens. 2019, 11, 1665; doi:10.3390/rs11141665 www.mdpi.com/journal/remotesensing

Many methods for crop identification or vegetation classification using MODIS time series have been examined and implemented in the academic world [5–7]. A common approach for treating multitemporal data is to retrieve temporal features or phenological metrics from vegetation index series obtained by band calculations. According to phenology and simple statistics, several key phenology metrics, such as the base level, maximum level, amplitude, start date of the season, end date of the season, and length of the season, extracted from time-series RS images are used as classification features that are sufficient for accurate crop identification. For example, Cornelius Senf et al. [5] mapped rubber plantations and natural forests in Xishuangbanna (Southwest China) using multispectral phenological metrics from MODIS time series, which achieved an overall accuracy of 0.735. They showed that the key phenological metrics discriminating rubber plantations and natural forests were the timing of seasonal events in the shortwave infrared (SWIR) reflectance time series and the Enhanced Vegetation Index (EVI) or SWIR reflectance during the dry season. Pittman et al. [6] estimated the global cropland extent and used the normalized difference vegetation index (NDVI) and thermal data to depict cropland phenology over the study period. Subpixel training datasets were used to generate a set of global classification tree models using a bagging methodology, resulting in a global per-pixel cropland probability layer. Tuanmu et al. [7] used phenology metrics generated from MODIS time series to characterize the phenological features of forests with understory bamboo. Using maximum entropy modeling together with these phenology metrics, they successfully mapped the spatial distribution of understory bamboo.
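As a rough illustration of how such phenology metrics are derived, the sketch below computes the base level, maximum level, amplitude, and season start/end/length from a synthetic NDVI series. The series values and the 20%-of-amplitude season cutoff are illustrative assumptions, not settings from the cited studies.

```python
import numpy as np

# Hypothetical 16-day NDVI series over one growing season (22 timesteps).
ndvi = np.array([0.21, 0.22, 0.25, 0.30, 0.38, 0.47, 0.55, 0.62, 0.68,
                 0.71, 0.69, 0.63, 0.55, 0.46, 0.38, 0.31, 0.27, 0.24,
                 0.22, 0.21, 0.20, 0.20])

base = ndvi.min()                 # base level
peak = ndvi.max()                 # maximum level
amplitude = peak - base           # seasonal amplitude

# Define the season as the period where NDVI exceeds the base level plus
# 20% of the amplitude (the 20% cutoff is an illustrative choice).
threshold = base + 0.2 * amplitude
in_season = np.flatnonzero(ndvi > threshold)
start_of_season = in_season[0]                           # first timestep above cutoff
end_of_season = in_season[-1]                            # last timestep above cutoff
length_of_season = end_of_season - start_of_season + 1   # in 16-day steps

print(base, peak, round(amplitude, 2),
      start_of_season, end_of_season, length_of_season)
```

Real pipelines such as TIMESAT derive these metrics from smoothed curves; the raw-threshold version here only shows the bookkeeping.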
To address image noise such as pseudo-lows and pseudo-hikes caused by shadows, clouds, weather or sensors, many studies have used mathematical functions or complex models to smooth the vegetation index time series before feature extraction. Toshihiro Sakamoto et al. [8] adopted wavelet and Fourier transforms for filtering time-series EVI data. Zhang et al. [9] used a series of piecewise logistic functions fit to remotely sensed vegetation index data to represent intra-annual vegetation dynamics. Furthermore, weighted [10], asymmetric Gaussian smoothing [11,12], Whittaker smoothing [13], and Savitzky-Golay filtering [14] have also been widely used for the same reason. Yang Shao et al. [15] compared the Savitzky-Golay, asymmetric Gaussian, double-logistic, Whittaker, and discrete Fourier transformation smoothing algorithms (noise reduction) and applied them to MODIS NDVI time-series data to provide continuous phenology data for land cover classifications across the Laurentian Great Lakes Basin, proving that the application of a smoothing algorithm significantly reduced image noise compared to the raw data. Although temporal feature extraction-based approaches have exhibited good performance in crop identification tasks, they have some weaknesses. First, general temporal features or phenological characteristics may not be appropriate for the specific task. Expert experience and domain knowledge are highly needed to design proper features and a feature extraction pipeline. Second, the features extracted from the time series cannot always fully utilize all the data, and information loss is inevitable. These types of feature extraction processes usually come with limitations in terms of automation and flexibility when considering large-scale classification tasks [16]. Intelligent algorithms such as random forest (RF) and deep neural networks (DNNs) can learn directly from the original values of MODIS data. 
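The Savitzky-Golay smoothing mentioned above can be sketched with SciPy's `savgol_filter`. The synthetic EVI curve, the simulated cloud-induced pseudo-low, and the window/polynomial-order settings are illustrative assumptions.

```python
import numpy as np
from scipy.signal import savgol_filter

# A synthetic EVI season with one cloud-contaminated pseudo-low at timestep 8.
t = np.arange(22)
clean = 0.4 + 0.25 * np.sin(2 * np.pi * (t - 4) / 22)
noisy = clean.copy()
noisy[8] -= 0.3                      # simulated cloud-shadow drop

# Fit a local 2nd-order polynomial in each 5-point window.
smoothed = savgol_filter(noisy, window_length=5, polyorder=2)

# The isolated drop is pulled back toward the seasonal curve while the
# overall shape of the series is preserved.
print(abs(noisy[8] - clean[8]), abs(smoothed[8] - clean[8]))
```

Larger windows or higher orders trade noise suppression against fidelity to real phenological transitions, which is why the cited studies compare several smoothers.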
They can apply all the values of a time series as input and do not need well-designed feature extractors, which could prevent the information loss that often occurs in temporal feature extraction. These algorithms are convenient for large-scale implementation and application, and there are many processing frameworks based on RF or DNNs that are being implemented for cropland identification. Related works will be introduced in the next paragraphs. Recently, RF classifiers have been widely used for RS images due to their explicit and explainable decision-making process, and these classifiers are easily implemented in a parallel structure for computing acceleration. Rodriguez-Galiano et al. [17] explored the performance of an RF classifier in classifying 14 different land categories in the south of Spain. Results show that the RF algorithm yields accurate land cover classifications and is robust to the reduction of the training set size and noise compared to the classification trees. Charlotte Pelletier et al. [18] assessed the robustness of using RF to map land cover and compared the algorithm with a support vector machine (SVM) algorithm.

RF achieved an overall accuracy of 0.833, while SVM achieved an accuracy of 0.771. Works based on RF usually use a set of decision trees trained by different subsets of samples to make predictions collaboratively [19]. However, the splitting nodes still used well-designed features with random selection, and this procedure might be complex and inefficient for the classification of large-scale areas. In this paper, we proposed an RF-based model that directly learns from the original values of the MODIS time series for large-scale crop identification in the Huanghuaihai Region to address the task in an efficient manner. Considering the successful applications of deep learning (DL) in computer vision, deep models have also been evaluated for time-series image classification [16,20,21]. Researchers usually use pretrained baseline architectures of convolutional neural networks (CNNs), such as AlexNet [22], GoogLeNet [23] and ResNet [24], and fine-tuning to automatically obtain advanced representations of data, which are usually followed by a softmax layer or SVM to adapt to specific RS classification tasks. For time-series scenarios, recurrent neural networks (RNNs) and long short-term memory (LSTM) networks are often used to analyze RS images due to their ability to capture long-term dependencies. Marc Rußwurm et al. [20] employed LSTM networks to extract temporal characteristics from a sequence of SENTINEL 2A observations and compared the performance with SVM baseline architectures. Zhong et al. [16] designed two types of DNN models for multitemporal crop classification: one was based on LSTM networks, and the other was based on one-dimensional convolutional (Conv1D) layers. Three widely used classifiers were also tested and compared, including extreme gradient boosting (XGBoost), RF, and SVM classifiers. Although LSTM is widely used for sequential data representation, Zhong et al. [16] revealed that its accuracy was the lowest among all the classifiers.
Considering that the identification of crop types is highly dependent on a few temporal images of key growth stages such as the green-returning and jointing stages of winter wheat, it is important for the model to have the ability to pay attention to the critical images of the time series. In early studies, attention-based LSTM models were used to address sequence-to-sequence language translation tasks [25], which could generate a proper word each time according to the specific input word and the context. Inspired by the machine translation community and crop growth cycle intuition, we proposed an attention-based LSTM model (A-LSTM) to identify winter wheat areas. The LSTM part of the model transforms original values to advanced representations and is then followed by an attention layer that encodes the sequence to one fixed-length vector that is used to decode the output at each timestep. A final softmax layer is then used to make a prediction. In this study, we proposed two models, RF and A-LSTM, that can be efficiently used for large-scale winter wheat identification throughout the Huanghuaihai Region, by building an automatic data preprocessing pipeline that transforms time-series MODIS tiles into training samples that can be directly fed into the models. As this study is the first to apply an attention mechanism-based LSTM model to the classification of time-series images, a comparison with RF and an evaluation of the performance were also conducted. In addition, we analyzed the impacts of the pixel-mixing effect with two different training strategies in this paper. Furthermore, with the intuition that there is some difference in wheat sowing and harvesting time from north to south, we also evaluated the generalizability of the models to different areas. Finally, our models were used to identify the distribution of winter wheat over the past year, and we evaluated the performance via visual interpretation.
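The attention step described above, which weights the LSTM hidden states so that critical timesteps dominate the fixed-length encoding, can be sketched in NumPy. A simple dot-product scoring scheme is assumed here; `H` and `w` are random stand-ins for the LSTM hidden states and a learned scoring vector, and the full A-LSTM architecture is specified in the Methods section.

```python
import numpy as np

rng = np.random.default_rng(0)

T, d = 22, 64                      # 22 timesteps, hidden size 64 (illustrative)
H = rng.normal(size=(T, d))        # stand-in for LSTM hidden states h_1 .. h_T
w = rng.normal(size=d)             # stand-in for a learned scoring vector

scores = H @ w                     # one relevance score per timestep
alpha = np.exp(scores - scores.max())
alpha /= alpha.sum()               # softmax -> attention weights that sum to 1

context = alpha @ H                # weighted sum of hidden states (length d)

print(alpha.shape, context.shape)
```

In training, `w` (and the LSTM producing `H`) would be learned end-to-end, so timesteps such as the green-returning stage can receive large weights.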
We also discussed the advantages and disadvantages of our models.

2. Materials

2.1. Study Area

The Huanghuaihai Region, located in the north-central region of China, surrounds the capital city of Beijing, as shown in Figure 1. It consists of seven provinces or municipality cities (i.e., Beijing, Tianjin, Hebei, Shandong, Henan, Anhui, and Jiangsu) stretching over an area of 778.9 thousand square kilometers. Most of the Huanghuaihai Region lies within the North China Plain, which is formed by deposits from the Yellow River, Huai River, Hai River and their hundreds of branches. This region is bordered to the north by the Yanshan Mountains, to the west by the Taihang Mountains, to the south by the Dabie Mountains and the Yangtze River, and to the east by the East China Sea [26]. This region has a typical continental temperate and monsoonal climate with four distinct seasons, including cold and dry winters and hot and humid summers. The Huanghuaihai Region is one of the most important agricultural granaries in China, and the chief cereal crops include winter wheat, corn, millet, and potatoes. Winter wheat is usually planted from September to October and harvested in late May or June of the following year. Winter wheat is highly dependent on the water conditions, and artificial irrigation is supplied in this area [27].

Figure 1. Location of the study area.

2.2. Materials Description

2.2.1. MODIS Data

In this section, we described the time-series images and reference data used for winter wheat identification in our paper. We downloaded 110 MODIS product images, specifically the MODIS/Terra vegetation indices 16-Day L3 Global 250-m SIN Grid product (MOD13Q1, Collection 6), from the NASA Level-1 and Atmosphere Archive & Distribution System Distributed Active Archive Center (LAADS DAAC). The product has four reflectance bands, i.e., blue, red, near-infrared and middle-infrared, which are centered at 469 nm, 645 nm, 858 nm, and 2130 nm, respectively, and two vegetation index bands, NDVI and EVI, which can be used to maintain sensitivity over dense vegetation conditions [28]. The two vegetation indices, NDVI and EVI, are computed from atmospherically corrected bidirectional surface reflectance data that are masked for water, clouds, heavy aerosols, and cloud shadows. Specifically, the NDVI is computed from the near-infrared and red reflectance, while the EVI is computed from the near-infrared, red, and blue reflectance. The detailed equations are as follows:

NDVI = (ρNIR − ρRed) / (ρNIR + ρRed)    (1)

EVI = 2.5 × (ρNIR − ρRed) / (ρNIR + 6 × ρRed − 7.5 × ρBlue + 1)    (2)

where ρNIR, ρRed and ρBlue represent the near-infrared, red and blue reflectance, respectively. Global MOD13Q1 data are provided every 16 days at a 250-m spatial resolution as a gridded level-3 product in the sinusoidal projection. Cloud-free global coverage is achieved by replacing clouds with the historical MODIS time-series climatology record. Vegetation indices are used for global monitoring of vegetation conditions and in products that exhibit land cover and land cover changes. The 110 images downloaded in this study span the period from October 2017 to July 2018 with a 16-day interval, resulting in 22 timesteps. Each timestep contains five tiles that can entirely cover the Huanghuaihai Region. We selected all six bands of each tile in the experiments, and the models could fully capture the reflectance information, which would be effective for crop identification.

To evaluate the generalizability of the models on historical data, we collected the same MOD13Q1 product data for the 2016–2017 growing season for additional experiments.

2.2.2. Reference Data

It is very challenging to collect reference data. Fortunately, we acquired a crop map of the winter wheat distribution for the 2017–2018 growing season in the Huanghuaihai Region from the Chinese Academy of Agricultural Sciences. The wheat map has a spatial resolution of 16 m and is in the Albers equal-area conic projection. Each pixel in the map is assigned a value of 0 or 1, which represent winter wheat or no winter wheat growth. Since we were not provided any details regarding how the map was completed or the reliability of the data, we simply reassessed the map by manually collecting samples. Specifically, we randomly selected 1000 pixels in the map and interpreted the pixels visually based on images acquired from the Planet Explorer API (https://www.planet.com/explorer/). We acquired two high-resolution images in March 2018 and June 2018 from the Planet Explorer API, which are shown in Figure 2. First, it is easy to distinguish non-vegetation areas such as water, residential areas and bare land using images acquired in the two stages via visual interpretation. Practically, most vegetation in the Huanghuaihai Region is a deciduous forest, which is gray and still not germinated in March, while winter wheat has entered the green-returning stage. The few evergreen forests, such as pines and cypresses, will not turn yellow like mature wheat in June. To the best of our knowledge, the main crop during the period from October 2017 to June 2018 was winter wheat, and there were no other crops or vegetation with growing stages similar to those of winter wheat in March and June. According to the images acquired from the two critical growing periods of winter wheat, a pixel was assumed to be winter wheat when it appeared green in March and yellow in June. The evaluation results show that the overall accuracy of the crop map is 0.95, the precision of winter wheat is 0.89, the recall of winter wheat is 0.83 and the F1 score is 0.86 (the detailed explanations of overall accuracy, precision, recall and F1 score are shown in Section 4.3).

Similarly, we visually interpreted 1000 pixels at the same locations during the 2016–2017 growing season from the Planet Explorer API in the same manner to form our historical testing dataset. These collected samples were used to evaluate the accuracy of the prediction map informed by the trained models using historical (2016–2017) MODIS data.




Figure 2. Two images of the same parcel acquired in March 2018 (a) and June 2018 (b) from the Planet Explorer API. In March, winter wheat enters the green-returning stage, while most of the other vegetation is still not germinated. In June, winter wheat enters the maturation stage and turns yellow, while other vegetation is green. Therefore, it is easy to distinguish wheat areas and other land cover types using images acquired in the two stages via visual interpretation.

2.3. Data Preprocessing

In this study, we built a data preprocessing pipeline to extract training samples from the MODIS time series. First, image mosaics were applied to the five image tiles at each timestep. Then, mosaic images were projected to the Albers equal-area conic coordinate system, which was coincident with the reference data, thus stably maintained the area of each cell. Next, a mask of the Huanghuaihai Region was used to extract the data within the study boundary. Since the spatial resolution of MODIS data was 250 m, we resampled the 250-m-spatial-resolution MODIS data to a spatial resolution of 224 m with the nearest neighbor resampling method and aligned them with the reference data, which have a 16-m spatial resolution. Therefore, 196 pixels of the reference map were aligned using a 0-1 annotation within a single MODIS cell. Furthermore, we counted the quantities of 0s and 1s in each MODIS cell, which were used to distinguish pure cells that were completely filled with 0 or 1 values from mixed cells that were filled with both 0 and 1 values. Some crop map pixels within the MODIS cells inside the border had no values, and we simply removed these incomplete data. To extract training samples, each pure MODIS cell was annotated with 0 or 1, which represented winter wheat or no winter wheat growth, respectively. For mixed MODIS cells, a threshold was selected to determine whether the cell should be labeled as positive or negative, and these cells were labeled as 1_ (mixed pixel wheat) or 0_ (mixed pixel no-wheat). The following inequality shows the labeling process:

y = { 1, (c = 196); 1_, (c ≥ θ); 0_, (c < θ); 0, (c = 0) }    (3)

where y denotes the label of a MODIS cell, c denotes the quantity of wheat pixels of the reference map within one MODIS cell, and θ is the threshold which is used to determine the label of a mixed MODIS cell. We simply set θ to 98. As mentioned in Section 2.2.1, each timestep of a MODIS image has 6 bands,

which are blue, red, near-infrared, middle-infrared, NDVI and EVI. We took the digital value from each MODIS cell throughout the 22 timesteps and the corresponding labels as the training samples; each sample has 132 variables and one class annotation. Throughout the Huanghuaihai Region, we acquired approximately 13 million samples, and the distribution of samples is shown in Table 1.

Table 1. Data samples extracted from Moderate Resolution Imaging Spectroradiometer (MODIS) images.

Pure Wheat (1)   Pure No-wheat (0)   Mixed Pixels Wheat (1_)   Mixed Pixels No-Wheat (0_)   Total
620287           6338524             2679611                   3566889                      13205311
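The labeling rule in Equation (3) can be sketched as a small function (θ = 98 as in the text; the function name and the example counts are ours):

```python
def label_cell(c, theta=98):
    """Label one MODIS cell from the count c of wheat pixels among the 196
    reference-map pixels it covers: pure cells get 1/0, mixed cells are
    thresholded at theta (set to 98 in the paper)."""
    if c == 196:
        return "1"                          # pure wheat
    if c == 0:
        return "0"                          # pure no-wheat
    return "1_" if c >= theta else "0_"     # mixed wheat / mixed no-wheat

print([label_cell(c) for c in (196, 0, 120, 97)])  # → ['1', '0', '1_', '0_']
```

With θ = 98, a mixed cell is labeled wheat exactly when at least half of its 196 subpixels are wheat.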

2.4. Dataset Partition

To train and evaluate our model, the datasets must be partitioned into training sets and testing sets. As the MODIS data have a cell size of 250 m, the spatial correlation might be nonsignificant. We assume that each MODIS cell is independent of the others. The total dataset was partitioned in the following manner:

• Pure-mixed pixel set. The entire dataset was first randomly partitioned into a training set and testing set with a ratio of 4:1. Both the training set and testing set consist of pure-pixel samples and mixed-pixel samples. We call this type of training set a pure-mixed pixel set.
• Pure pixel set. Then, we further selected all the pure-pixel samples (the entire MODIS cell is either covered with winter wheat or there is no winter wheat) from the pure-mixed pixel set as the new training dataset, which was called the pure pixel set, with the intuition that pure-pixel samples might be representative of the characteristics of wheat areas, while mixed-pixel samples might include noise.

We kept the testing dataset unchanged but trained models using the pure-mixed pixel set and pure pixel set for further studies.
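The two training strategies can be sketched as follows; the toy features, labels, and class proportions are illustrative stand-ins for the real extracted samples:

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy stand-in for the extracted samples: 132 variables per cell, with
# labels "1"/"0" for pure cells and "1_"/"0_" for mixed cells.
n = 10_000
X = rng.normal(size=(n, 132))
y = rng.choice(["1", "0", "1_", "0_"], size=n, p=[0.05, 0.48, 0.20, 0.27])

# Pure-mixed pixel set: a random 4:1 split of the entire dataset.
idx = rng.permutation(n)
split = int(0.8 * n)
train_idx, test_idx = idx[:split], idx[split:]

# Pure pixel set: keep only the pure-cell samples from the training portion;
# the testing set is left unchanged so both strategies are compared fairly.
pure_mask = np.isin(y[train_idx], ["1", "0"])
pure_train_idx = train_idx[pure_mask]

print(len(train_idx), len(test_idx), len(pure_train_idx))
```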

2.5. North-South Partition

To evaluate the generalizability of our model to different areas, we also divided our dataset into several parts according to the pixel location from north to south and utilized three parts for training and the remaining parts for testing. As shown in Figure 3, the dataset was equally partitioned into eight parts according to latitude. The number of samples in each part is reported in Table 2. As the numbers of samples in Part 1 and Part 2 were small, we combined Part 1, Part 2 and Part 3 to form the training set and testing set with a ratio of 4:1, and used Part 4, Part 5, Part 6, Part 7 and Part 8 to evaluate the generalizability.

Table 2. The number of samples in each geographically partitioned dataset.

Quantity   Part 1    Part 2    Part 3    Part 4    Part 5    Part 6    Part 7    Part 8
Pure Pos   1069      34154     159010    117132    135805    158393    14014     710
Pure Neg   1313370   773568    816377    829138    736828    924470    685303    259470
Mixed      82532     339855    1034789   785234    1568533   1802182   565074    68301
Total      1396971   1147577   2010176   1731504   2441166   2885045   1264391   328481
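The latitude-based partitioning can be sketched as follows. The cell latitudes are synthetic, and the equal-width banding is an assumption consistent with the equal partitioning described above:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic cell latitudes spanning the study area (values are illustrative).
lat = rng.uniform(30.0, 43.0, size=100_000)

# Eight equal-width latitude bands; Part 1 is the northernmost band and
# Part 8 the southernmost.
edges = np.linspace(lat.min(), lat.max(), 9)
band = np.digitize(lat, edges[1:-1])      # 0 (south) .. 7 (north)
part = 8 - band                           # Part 1 (north) .. Part 8 (south)

# Parts 1-3 form the training pool (split 4:1 into train/test); Parts 4-8
# are held out to evaluate geographic generalizability.
train_pool = np.isin(part, [1, 2, 3])

print(int(part.min()), int(part.max()), int(train_pool.sum()))
```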

Figure 3. Geographical partitioning strategy. The whole region was partitioned into eight regions from north to south.

3. Methods

3.1. RF Method for Crop Identification

The RF method is an ensemble learning algorithm that consists of many decision trees that are combined to create a prediction. Each tree in the ensemble is trained using a bootstrap sampling strategy (samples drawn with replacement), in which the training dataset of an individual tree is a subset randomly picked from the whole dataset [19]. In addition, when splitting a node during the construction of a tree, the chosen split point is the best split among a random subset of the features. The final output is the average of all the trees, which decreases the variance of the results. Although the bias of the forest usually increases slightly due to the randomness in the trees, this increase is smaller than the decrease in variance it compensates for. As a result, the method performs better.

As shown in Section 2.2.1, our datasets are composed of 22 timesteps, and each timestep includes four raw reflectance bands and two vegetation index bands; thus, each sample contains 132 variables with an annotated label. RF operates by constructing multiple random classification trees, and each tree randomly selects several variables to make decision rules. In our experiment, n and k denote the number of trees and the number of selected variables, respectively. The number of trees n represents the complexity and ability of RF to learn patterns from the data. Therefore, n needs to be large enough in case some samples or variables are selected only once or even missed in all subspaces. As n increases, the performance of RF tends to remain unchanged; however, the required computing resources increase. An increase in the number of selected variables k generally improves the performance of an individual tree, but the variance of an individual tree decreases, and the computing resources required for the individual tree increase. Hence, we need to strike a balance and find the two optimal parameters n and k. To do so, we built many RF models with n increasing from 100 to 1000 and k increasing from 2 to 132. The overall accuracy was used to evaluate the performance of the model.
In addition, the minimum number of samples required to split a node was set to 2, while the minimum number of samples in a leaf node was set to 1. The maximum tree depth was not fixed, and the nodes were expanded until all leaves were pure or until all leaves contained fewer than the minimum number of samples. For each individual tree, Gini impurity was used to measure the quality of a split. When making an inference, RF combines all the equally weighted tree classifiers by averaging their probabilistic predictions instead of letting each classifier vote for a single class. The predicted class is the one with the highest probability.
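The RF configuration described above maps directly onto Scikit-learn's `RandomForestClassifier`; this is a minimal sketch on toy data, with n = 500 and k = 40 taken from the fine-tuning result reported in Section 4.1:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy stand-in for the MODIS samples: 22 timesteps x 6 bands = 132 variables.
rng = np.random.default_rng(0)
X = rng.random((200, 132))
y = rng.integers(0, 2, size=200)   # 1 = winter wheat, 0 = no wheat (random here)

rf = RandomForestClassifier(
    n_estimators=500,       # n: number of trees (value chosen by fine-tuning)
    max_features=40,        # k: variables considered at each split
    min_samples_split=2,
    min_samples_leaf=1,
    max_depth=None,         # grow until leaves are pure or below min samples
    criterion="gini",       # Gini impurity measures split quality
    bootstrap=True,         # each tree sees a bootstrap sample
    random_state=0,
)
rf.fit(X, y)

# Inference averages the trees' probabilistic predictions; the predicted
# class is the one with the highest averaged probability.
proba = rf.predict_proba(X)
pred = proba.argmax(axis=1)
```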
3.2. A-LSTM Architecture for Crop Identification

LSTM is a special RNN architecture designed for long-term dependency problems. Both the standard RNN and LSTM have repeated neural units that can be thought of as multiple copies of the same network, taking the form of a chain of repeating modules. In a standard RNN, the repeating module has a very simple structure, such as a single tanh layer, as shown in Figure 4. LSTMs also utilize this chain structure, but the repeating module has a different structure: instead of a single neural network layer, there are four gates, namely the forget gate, input gate, modulation gate, and output gate, as shown in Equations (4)-(7), respectively, and these gates interact in a specific manner [20].

$$f_t = \sigma_f\left(W^f_{data} x_t + W^f_{state} h_{t-1} + b^f\right), \quad (4)$$
$$i_t = \sigma_i\left(W^i_{data} x_t + W^i_{state} h_{t-1} + b^i\right), \quad (5)$$
$$g_t = \sigma_g\left(W^g_{data} x_t + W^g_{state} h_{t-1} + b^g\right), \quad (6)$$
$$o_t = \sigma_o\left(W^o_{data} x_t + W^o_{state} h_{t-1} + b^o\right), \quad (7)$$

Theseand use gates that influence information the to ability create of an LSTM output cells vector. to discard The cell old state information, vector 𝑐 stores gain newthe internal information and usememory that information and is then to updated create anusing output the vector.Hadamard The operator cell state ⊙ vector, whichct storesperforms the elementwise internal memory and ismultiplication, then updated while using the the layerwise Hadamard hidden operator state vector, is which further performs derived from elementwise the LSTM output multiplication, gate vector 𝑜 [20]. while the layerwise hidden state vector is further derived from the LSTM output gate vector ot [20].

$$c_t = f_t \odot c_{t-1} + i_t \odot g_t, \quad (8)$$
$$h_t = o_t \odot \sigma_h(c_t), \quad (9)$$
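Equations (4)-(9) can be sketched as a single numpy step function; the weight shapes and toy dimensions are illustrative, and tanh is used for the modulation and state nonlinearities, a common convention not fixed by the equations above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W_data, W_state, b):
    """One LSTM step following Equations (4)-(9).

    W_data, W_state, b are dicts keyed by gate name ('f', 'i', 'g', 'o').
    """
    f_t = sigmoid(W_data['f'] @ x_t + W_state['f'] @ h_prev + b['f'])   # Eq (4)
    i_t = sigmoid(W_data['i'] @ x_t + W_state['i'] @ h_prev + b['i'])   # Eq (5)
    g_t = np.tanh(W_data['g'] @ x_t + W_state['g'] @ h_prev + b['g'])   # Eq (6)
    o_t = sigmoid(W_data['o'] @ x_t + W_state['o'] @ h_prev + b['o'])   # Eq (7)
    c_t = f_t * c_prev + i_t * g_t        # Eq (8), Hadamard products
    h_t = o_t * np.tanh(c_t)              # Eq (9)
    return h_t, c_t

# Toy dimensions: 6 input variables per timestep, 4 hidden units.
rng = np.random.default_rng(0)
d, n = 6, 4
W_data = {k: rng.standard_normal((n, d)) * 0.1 for k in 'figo'}
W_state = {k: rng.standard_normal((n, n)) * 0.1 for k in 'figo'}
b = {k: np.zeros(n) for k in 'figo'}

h, c = np.zeros(n), np.zeros(n)
for t in range(22):                       # 22 MODIS timesteps
    h, c = lstm_step(rng.standard_normal(d), h, c, W_data, W_state, b)
```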


Figure 4. (a) The simple recurrent neural network (RNN) architecture, (b) the attention-based long short-term memory (A-LSTM) architecture, (c) the unrolled chain-type RNN structure.

To address the original crop identification problem with time-series MODIS images, we proposed an end-to-end encoder-decoder architecture, which is shown in Figure 5. The encoder consists of three bidirectional LSTM layers that are stacked together with the full sequence returned. The input sequence $x$, which is shown in Equation (10), has 22 timesteps and six variables in each step, which is coincident with our time-series data:

$$x = (x_1, \ldots, x_i, \ldots, x_t), \quad x_i \in \mathbb{R}^k \quad (10)$$

where $t$ is the length of the input sequence and $k$ is the dimensionality of the data. In the encoder, each LSTM layer has 128 hidden units and returns the full sequence to the next layer. The output of the three-layer encoder, which is shown in Equation (11), is a sequence of hidden states:

$$h = (h_1, \ldots, h_i, \ldots, h_t), \quad h_i \in \mathbb{R}^k \quad (11)$$

$$h_i = f(x_i, h_{i-1}) \quad (12)$$

where $h_i$ represents the hidden state at time $i$ and $f$ is a nonlinear function.

In the decoder section, considering that it is difficult for the network to address long sequences, we included an attention mechanism in our network [25]. This mechanism looks at the complete encoded sequence to determine which encoded steps to weight highly and generates a context vector $c_j$ for each class. Each encoded step $h_i$ includes the information of the whole input sequence due to the recurrent layers, and it has a strong focus on the parts surrounding the $i$-th step of the input sequence. The context vector $c_j$ is computed by the weighted sum of these annotations $h_i$:

$$c_j = \sum_{i=1}^{t} \alpha_{ji} h_i \quad (13)$$

where the weight $\alpha_{ji}$ for each $h_i$ is computed by

$$\alpha_{ji} = \frac{\exp\left(e_{ji}\right)}{\sum_{i=1}^{t} \exp\left(e_{ji}\right)} \quad (14)$$

where $t$ is the length of the encoded sequence and $e_{ji}$ is an alignment model, shown in Equation (15), that represents how well the output at position $j$ and the hidden annotation $h_i$ match.

$$e_{ji} = g\left(s_j, h_i\right) \quad (15)$$

$$y = \mathrm{softmax}(s) \quad (16)$$

where $s_j$ is the hidden neuron of the output at position $j$ and $y$ represents the probability distribution of wheat and no-wheat.
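Equations (13)-(15) can be sketched in a few lines of numpy; the alignment model g is stubbed with a dot product, the decoder states are random stand-ins, and the max-subtraction is a standard numerical stabilization that does not change Equation (14) algebraically:

```python
import numpy as np

# Toy dimensions: t = 22 encoded steps, k = 128 hidden units, 2 output classes.
rng = np.random.default_rng(0)
t, k, n_classes = 22, 128, 2
h = rng.standard_normal((t, k))           # encoded sequence h_1 .. h_t
s = rng.standard_normal((n_classes, k))   # hypothetical decoder states s_j

e = s @ h.T                               # e_ji = g(s_j, h_i), Eq (15), dot-product stub
e_shift = e - e.max(axis=1, keepdims=True)        # numerical stabilization only
alpha = np.exp(e_shift) / np.exp(e_shift).sum(axis=1, keepdims=True)   # Eq (14)
c = alpha @ h                             # c_j = sum_i alpha_ji * h_i, Eq (13)
```

Each row of `alpha` sums to 1, matching the property used later when the mean weight distribution over the test set is visualized.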



Figure 5. Overview of the A-LSTM architecture. The attention mechanism looks at the complete encoded sequence to determine the encoded steps to weigh highly and generates a context vector. The decoder network uses these context vectors to make a prediction.




3.3. Comparison and Evaluation of the Model Performance

In this paper, we used four metrics to evaluate the performance in the context of the binary classification problem. The precision, recall, overall accuracy, and F1 score were derived from the confusion matrix charted by the predicted and actual classification results [29]. The confusion matrix is shown in Table 3. In the table, positive and negative represent winter wheat and no-wheat, respectively. TP and FP refer to the numbers of predicted positives that were correct and incorrect, and TN and FN refer to the numbers of predicted negatives that were correct and incorrect. The four metrics are defined in Equations (17)-(20), respectively. Specifically, the precision denotes the proportion of predicted positives that are actual positives, while the recall denotes the proportion of actual positives that are correctly predicted. The overall accuracy is the proportion of the total cases that are correctly predicted, while the F1 score represents the harmonic mean of the recall and precision.

$$\mathrm{precision} = \frac{TP}{TP + FP} \quad (17)$$

$$\mathrm{recall} = \frac{TP}{TP + FN} \quad (18)$$
$$\mathrm{overall\ accuracy} = \frac{TP + TN}{TP + FP + FN + TN} \quad (19)$$
$$\mathrm{F1\ score} = 2 \times \frac{\mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} \quad (20)$$

Table 3. Confusion matrix charted by the predicted and actual classification.

                    Positive (predicted)     Negative (predicted)
Positive (actual)   True positives (TP)      False negatives (FN)
Negative (actual)   False positives (FP)     True negatives (TN)
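The four metrics follow directly from the confusion-matrix counts; the numbers below are illustrative, not the paper's results:

```python
# Illustrative confusion-matrix counts (made-up values, not the paper's data).
tp, fp, fn, tn = 70, 25, 30, 875

precision = tp / (tp + fp)                          # Eq (17)
recall = tp / (tp + fn)                             # Eq (18)
overall_accuracy = (tp + tn) / (tp + fp + fn + tn)  # Eq (19)
f1 = 2 * precision * recall / (precision + recall)  # Eq (20)
```

With a large TN count, the overall accuracy (0.945 here) looks far better than the F1 score, which illustrates why F1 is preferred for the imbalanced wheat/no-wheat dataset discussed in Section 4.3.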

First, the two models were trained with the pure-mixed pixel set consisting of pure-pixel samples and mixed-pixel samples. Considering that pure-pixel samples are representative of the characteristics of wheat areas, while mixed-pixel samples might include noise, we also used the pure pixel set to train the two models. We kept the testing set unchanged to evaluate the performances of the two models trained with the two different datasets. Moreover, we evaluated the generalizability of the models via the north-south dataset partitioning strategy. In practice, our models could also be used to identify croplands in historical datasets. Field-collected reference data were used for analysis and evaluation. More details about the experimental results and analysis are provided in the next section.

4. Experiments and Results Analysis

4.1. RF Fine-Tuning

We used the Python Scikit-learn [30] package to implement our RF experiments following the instructions in Section 3.1. Hundreds of RF models were built to fine-tune the RF parameters n (number of trees) and k (number of selected variables) with different combinations, as shown in Figure 6. We used the overall accuracy score to measure the performance of each model. To balance performance and the cost of computation, the parameter combination with the highest score was chosen. Figure 6 shows the changes in model performance with n values from 100 to 1000 and k values from 2 to 132.


When n equals 500 and k equals 40, the performance of the RF method remains stable; thus, we selected this combination for the following experiments.
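The fine-tuning loop can be sketched as a small grid search; the grid and toy data here are truncated stand-ins for the full sweep of n from 100 to 1000 and k from 2 to 132:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Toy data standing in for the 132-variable MODIS samples; the target here
# depends on one arbitrary feature purely so the search has a signal to find.
rng = np.random.default_rng(0)
X = rng.random((300, 132))
y = (X[:, 79] > 0.5).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Truncated grid over n (trees) and k (variables per split), scored by
# overall accuracy on the held-out split.
scores = {}
for n in (100, 500):
    for k in (2, 40):
        rf = RandomForestClassifier(n_estimators=n, max_features=k, random_state=0)
        rf.fit(X_tr, y_tr)
        scores[(n, k)] = rf.score(X_te, y_te)

best_n, best_k = max(scores, key=scores.get)
```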

Figure 6. Fine-tuning of the parameters n and k of the random forest (RF) model: (a) the overall accuracy of the RF trained on the pure-mixed pixel set, (b) the overall accuracy of the RF trained on the pure pixel set. When n equals 500 and k equals 40, the performances of the RF models converge; thus, we selected this combination for further experiments.

4.2. A-LSTM Training and Evaluation

For the A-LSTM method, we used Python TensorFlow [31] and the Keras [32] package to implement our model. The cross-entropy loss was calculated, and the RMSprop (Root Mean Square Prop) algorithm was used to optimize the model [33]. During training, we used the mini-batch strategy and set the batch size to 128. We used an NVIDIA TESLA V100 graphics card to train the model, and after 500 epochs (an epoch is an iteration over the entire dataset), the model converged to the global optimum. Figure 7 shows the training procedure and validation results at each step. Finally, our LSTM model achieved an overall accuracy of 0.85 on the pure-mixed pixel set and 0.82 on the pure pixel set. We discuss the performance thoroughly below.
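A minimal Keras sketch of the training setup follows; the attention decoder is replaced here by a simple pooling layer for brevity, so this reproduces only the stacked bidirectional encoder, the RMSprop optimizer, and the cross-entropy loss described above:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Encoder: three stacked bidirectional LSTM layers (128 units each), with the
# full sequence returned between layers, as in Section 3.2. The attention
# decoder is omitted; GlobalAveragePooling1D is a simple stand-in aggregation.
model = keras.Sequential([
    keras.Input(shape=(22, 6)),                         # 22 timesteps, 6 variables
    layers.Bidirectional(layers.LSTM(128, return_sequences=True)),
    layers.Bidirectional(layers.LSTM(128, return_sequences=True)),
    layers.Bidirectional(layers.LSTM(128, return_sequences=True)),
    layers.GlobalAveragePooling1D(),                    # stand-in for attention
    layers.Dense(2, activation="softmax"),              # wheat vs. no-wheat
])
model.compile(
    optimizer=keras.optimizers.RMSprop(),               # RMSprop, as in the paper
    loss="categorical_crossentropy",                    # cross-entropy loss
    metrics=["accuracy"],
)

# Training would use model.fit(..., batch_size=128, epochs=500); here we only
# run a forward pass on random data to show the expected shapes.
x = np.random.rand(4, 22, 6).astype("float32")
probs = model.predict(x, batch_size=128, verbose=0)
```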

Figure 7. Training logs for the A-LSTM model: (a) model trained on the pure-mixed dataset, (b) model trained on the pure dataset.

4.3. Identification Metrics

The overall accuracy of the RF trained on the pure-mixed pixel set was 0.87 (±0.01), and the overall accuracy of the RF trained on the pure pixel set was 0.85 (±0.01), while the overall accuracies of the A-LSTM trained on these two datasets were 0.85 (±0.01) and 0.82 (±0.01), respectively. The overall accuracy scores were high because the no-wheat class accounts for approximately 85 percent of the dataset; thus, the accuracy scores were dominated by the majority class. However, the F1 score, the harmonic mean of the precision and recall, is a better metric for such an uneven dataset. Specifically, the F1 scores of RF and A-LSTM trained on the pure-mixed pixel set were 0.72 and 0.71, while the scores of the two models trained on the pure pixel set were 0.68 and 0.66. More details regarding the performance scores are shown in Table 4.


Table 4. Precision, recall, overall accuracy and F1 scores for the two models trained on different datasets.

                    Pure Pixel Set                Pure-Mixed Pixel Set
                    RF             A-LSTM         RF             A-LSTM
Precision           0.75           0.74           0.72           0.71
Recall              0.62           0.60           0.71           0.70
Overall Accuracy    0.85 (±0.01)   0.82 (±0.01)   0.87 (±0.01)   0.85 (±0.01)
F1 score            0.68 (±0.01)   0.66 (±0.01)   0.72 (±0.01)   0.71 (±0.01)

In general, the two models behaved better when they were trained on the pure-mixed pixel set. For comparison, when the pure pixel set was utilized, the precision of the two models improved, while the recall worsened. In total, the two models trained with the pure-mixed pixel set were more stable, traded precision for recall, and achieved high F1 scores and overall accuracy, which is highly recommended in practical applications. This result occurred because the data distribution of the test dataset was the same as that of the pure-mixed pixel set. This phenomenon might cause severe overfitting problems when the models are trained on only pure-pixel samples. In some classification cases that require large amounts of manually collected training data, whether the selected samples include mixed pixels needs to be reconsidered, and the impact of including these cells needs to be evaluated.

The classified wheat maps resulting from the RF and A-LSTM models trained with the pure-mixed set are shown in Figure 8a,b, while the reference wheat map is shown in Figure 8c. The numbers of wheat pixels in the three maps were 3296256, 3327509, and 3269754. According to the maps, most of the crop areas were correctly predicted in the plain area, while there were extensive differences between the prediction and reference data at the border between wheat areas and no-wheat areas or in some isolated wheat areas. This occurs because the many mixed cells, which might include crops, buildings, and mountains, result in irregular spectral reflectance. In addition, although the numbers of wheat pixels in the prediction maps and reference map were similar, there was a slight visual difference.
In Figure 8a,b, wheat pixels are more likely to be clustered together, while there are many isolated wheat areas in the reference map. Generally, wheat areas among continuous no-wheat areas or no-wheat areas among continuous wheat areas were very likely to be misclassified.

Figure 8. (a) The prediction map informed by the RF model, (b) the prediction map informed by the A-LSTM model, (c) ground truth winter wheat distribution map.

4.4. Feature Importance

The relative depth of a feature used as a decision node in a tree can be used to assess the relative importance of that feature with respect to the predictability of the target variable. Features used at the top of the tree contribute to the final prediction of a large fraction of the input samples. The expected fraction of samples that these features contribute to can thus be used as an estimate of the relative importance of the features. In the Scikit-learn package, the fraction of samples a feature contributes to is combined with the decrease in impurity from splitting them to create a normalized estimate of the predictive power of that feature. By averaging the estimates of predictive ability over several randomized trees, one can reduce the variance in such an estimate and use it for feature selection. This process is known as the mean decrease in impurity [30].

In this paper, we visualized the feature importance of the 132 variables in RF, which is shown in Figure 9a. The importance scores were the mean scores of all individual trees. The top 10 variables are shown in Table 5; these variables are 2018081_EVI, 2018081_NDVI, 2018097_EVI, 2018081_NIR, 2018097_NIR, 2018161_NDVI, 2018097_NDVI, 2018113_EVI, 2018065_EVI and 2018245_NDVI, where the first part of each variable name represents the day of the year, and the last part represents the band. Furthermore, we summed the six importance scores per timestep and the 22 importance scores per band, which are shown in Figure 9b,c. Generally, variables in March or April 2018 were more important than others. Moreover, EVI and NDVI had higher importance scores than raw reflectance bands. The differences in feature importance among the variables can be explained by several points.
According to the typical growing cycle of winter wheat, wheat enters the green-returning stage in late February or early March, while most other vegetation in the Huanghuaihai Region is deciduous forest, which is gray and still not germinated; this phenomenon leads to a higher importance of data captured in March or April, such as 2018081_EVI, 2018081_NDVI and 2018097_EVI. Regarding the comparison between vegetation index bands and spectral reflectance bands, vegetation indices are more sensitive over dense vegetation conditions and therefore have high importance scores.
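The per-timestep and per-band aggregation of the mean-decrease-in-impurity scores can be sketched with Scikit-learn's `feature_importances_`; the fitted model, the toy data, and the (timestep, band) memory layout are assumptions for illustration:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy stand-in for the fitted RF and the 132-variable MODIS samples.
rng = np.random.default_rng(0)
X = rng.random((300, 132))
y = rng.integers(0, 2, size=300)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# feature_importances_ is the normalized mean decrease in impurity, averaged
# over all trees; it sums to 1. The 132 scores are assumed to be laid out as
# 22 timesteps x 6 bands.
imp = rf.feature_importances_.reshape(22, 6)
per_timestep = imp.sum(axis=1)    # 22 accumulated scores, one per timestep
per_band = imp.sum(axis=0)        # 6 accumulated scores, one per band
```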


Figure 9. (a) Importance scores of 132 variables with the hierarchy indexed by timestep and band, (b) accumulated importance score per timestep, (c) accumulated importance score per band.

Table 5. Top 10 variables in order of decreasing importance in the RF model. The first column lists the index in the 132-variable sequence. The second column represents the feature name, which consists of the day of the year and the band name. The third column represents the importance score.

Index    Band/Feature name    Importance
79       2018081_EVI          0.029
81       2018081_NDVI         0.024
85       2018097_EVI          0.023
82       2018081_NIR          0.023
88       2018097_NIR          0.022
111      2018161_NDVI         0.020
87       2018097_NDVI         0.020
91       2018113_EVI          0.018
73       2018065_EVI          0.015
105      2018245_NDVI         0.014

For the A-LSTM model, the context vector $c_j$ of output class $j$ determines which encoded steps to weigh highly and is calculated as the weighted sum of these encoded steps. As shown in Section 3.2, the weight $\alpha_{ji}$ can be considered the probability that the $j$-th class is aligned to the $i$-th step of the encoded sequence. Since different samples have different weight distributions and the weight distribution of a single sample is usually noisy or not sufficiently representative of the alignment pattern, we simply used the mean of the weight distribution of the whole test dataset for visualization. As shown in Figure 10, the two curves represent the weight distribution along the encoded sequence for the two classes, and the integral of each curve is equal to 1. According to the figure, the wheat class is highly aligned with the 16th step of the encoded sequence, which was acquired in April 2018 (the 113th day of the year 2018). Thus, there is a very high probability that the wheat class is aligned with the 16th step of the encoded sequence, which contains all the information of the input sequence, with a strong focus on the parts surrounding the 16th step of the input sequence.
Compared to the 105variable importance 2018245_NDVI scores of RF, they 0.014 both show that sequence data acquired in April are more important for the winter wheat identification problem using time-series data.The differences of feature importance among variables could be explained by several points. According to the typical growing cycles of winter wheat, wheat enters the green-returning stage in late February or early March, while most other vegetation in the Huanghuaihai Region is deciduous forest, which is gray and still not germinated; this phenomenon leads to a higher importance of data captured in March or April, such as 2018081_EVI, 2018081_NDVI, 2018097_EVI. For the comparison

Remote Sens. 2019, 11, 1665 15 of 21 between vegetation index bands and spectral reflectance bands, vegetation indices are more sensitive over dense vegetation conditions and have high importance scores. For the A-LSTM model, the context vector cj of output class j determines which encoded steps to weigh highly and is calculated as the weighted sum of these encoded steps. As shown in Section 3.2, the weight αji can be considered the probability that the j-th class is aligned to the i-th step of the encoded sequence. Since different samples have different weight distributions and the weight distribution of a single sample is usually noisy or not sufficiently representative of the alignment pattern, we simply used the mean of the weight distribution of the whole test dataset for visualization. As shown in Figure 10, the two curves represent the weight distribution along the encoded sequence for the two classes, and the integral of each curve is equal to 1. According to the figure, the wheat class is highly aligned with the 16th step of the encoded sequence, which was acquired in April 2018 (the 113th day of the year 2018). Thus, there is a very high probability that the wheat class is aligned with the 16th step of the encoded sequence, which contains all the information of the input sequence, with a strong focus on the parts surrounding the 16th step of the input sequence. Compared to the variable importance scores of RF, they both show that sequence data acquired in April are more important for the winter Remotewheat Sens. identification 2019, 11, x FOR problem PEER REVIEW using time-series data. 16 of 22
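The per-timestep and per-band aggregation shown in Figure 9b,c can be read directly from Scikit-learn's `feature_importances_` attribute, which implements the mean decrease in impurity described above. The sketch below uses synthetic data in place of the MODIS samples, and the timestep-major variable ordering (variable index = timestep × 6 + band) is an assumption for illustration:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

N_STEPS, N_BANDS = 22, 6  # 22 timesteps x 6 bands = 132 variables
rng = np.random.default_rng(0)

# Synthetic stand-in for the MODIS training samples (the real inputs are
# surface reflectance bands plus NDVI/EVI per 16-day composite).
X = rng.random((200, N_STEPS * N_BANDS))
y = rng.integers(0, 2, 200)

rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Mean decrease in impurity, normalized so the 132 scores sum to 1.
imp = rf.feature_importances_

# Assumed timestep-major layout: variable index = step * N_BANDS + band.
imp_grid = imp.reshape(N_STEPS, N_BANDS)
per_step = imp_grid.sum(axis=1)  # cf. Figure 9b: one score per timestep
per_band = imp_grid.sum(axis=0)  # cf. Figure 9c: one score per band
print(per_step.argmax())         # most informative timestep
```

On the real samples, the largest `per_step` entries would fall on the March/April composites, mirroring Table 5.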

Figure 10. The mean probability distribution that the target class is aligned with the encoded sequence.
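The curves in Figure 10 are mean attention distributions over the test set. A minimal numpy sketch of that averaging step, assuming the per-sample attention weights have already been extracted from the network as an array of shape (samples, classes, steps):

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_classes, n_steps = 1000, 2, 22

# Stand-in for the per-sample attention weights alpha_ji; each row is a
# softmax output, so it sums to 1 over the encoded steps.
logits = rng.normal(size=(n_samples, n_classes, n_steps))
alpha = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)

# Average over the test set: one distribution per class (the two curves
# in Figure 10), each still integrating to 1 over the sequence.
mean_alpha = alpha.mean(axis=0)
peak_step = mean_alpha.argmax(axis=1)  # step each class attends to most
```

On the real test set, `peak_step` for the wheat class is the 16th encoded step (day 113 of 2018).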

4.5. Generalizability in Different Areas

To evaluate the generalizability in different areas, we divided the entire dataset into eight parts from north to south, the details of which are presented in Section 2.5. The evaluation results are reported in Table 6. None of the precision, recall, and accuracy metrics exhibited significant patterns except for the F1 score. Specifically, the F1 score of the RF model decreased as the evaluation dataset moved away from the training set. Since the F1 score represents the harmonic mean of recall and precision, it is appropriate to use it to represent the generalizability of the model. According to the experiments, the model achieved the best performance when the testing set and training set were in the same part, because they had the same data distribution. When the trained model was required to perform prediction in other areas, the performance worsened. The main reason for this difference is that the winter wheat growing season in the Huanghuaihai Region changes slightly from north to south.

We divided the growing season into six growth stages: sowing, overwintering, green-returning, jointing, flowering, and maturation. The maps of the starting times of the six growth stages are shown in Figure 11. Each raster pixel represents the day of the year on which winter wheat entered the corresponding stage. The first two maps show the growing times in 2017, and the other maps show the growing times in 2018. Contour lines are also shown on the maps. The six raster maps were interpolated from data recorded at 82 agricultural stations distributed in the Huanghuaihai Region. The time interval of the same growing stage from south to north was 30 or 40 days at most. Therefore, to obtain the best model performance, the growing season in the predicting area must be as close as possible to that of the training area, and a new model should be retrained in the prediction area for practical applications if necessary.

Table 6. Generalization performance scores of the RF model.

                           Generalization Evaluation
Metrics          Testing   Fold 4   Fold 5   Fold 6   Fold 7   Fold 8
Precision        0.752     0.724    0.660    0.746    0.747    0.602
Recall           0.682     0.664    0.682    0.603    0.410    0.073
Total accuracy   0.906     0.845    0.759    0.786    0.913    0.972
F1 score         0.715     0.693    0.671    0.667    0.529    0.130
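The F1 scores in the generalization table follow directly from the listed precision and recall as their harmonic mean; for example, the rightmost fold, where high accuracy coexists with a collapsed F1 score:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Rightmost fold: very low recall (0.073) collapses F1 despite
# high overall accuracy.
print(round(f1_score(0.602, 0.073), 3))  # -> 0.13
```

This is why F1, rather than total accuracy, exposes the loss of generalizability as the evaluation area moves away from the training area.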



Figure 11. The winter wheat growing season in the Huanghuaihai Region: (a) sowing; (b) overwintering; (c) green-returning; (d) jointing; (e) flowering; (f) maturation. The cell values of the raster maps represent the day of the year on which winter wheat enters the growing stage. Contour lines are shown on the maps. The first two maps show the growing times in 2017, and the others show the growing times in 2018.

4.6. Inference on Historical Data

To determine the historical winter wheat distribution, we also applied the two models to historical MODIS data from the Huanghuaihai Region. Specifically, we collected time-series data for the 2016–2017 growing season, which were processed by the same data preprocessing pipeline. Using the two models trained on the pure-mixed-pixel set, which is described in Section 2.4, we obtained two historical winter wheat distribution maps of the Huanghuaihai Region, which are shown in Figure 12a,b. To evaluate the accuracy of the two prediction maps, we randomly selected 1000 cells from the Huanghuaihai Region and interpreted them visually via the Planet Explorer API, which was introduced in Section 2.2.2. The distribution of the 1000 cells is shown in Figure 12c. Similarly, we used the annotation strategy described in Section 2.2.2. Then, two confusion matrices with the predicted and reference samples were generated; they are reported in Tables 7 and 8, respectively. For the winter wheat class, the recall, precision and F1 score predicted by the RF model were 0.720, 0.739 and 0.729, while these values for the A-LSTM model were 0.703, 0.741 and 0.721, respectively. Generally, the two models achieved comparable performance in the 2016–2017 growing season, which supports our approach.



Figure 12. Historical winter wheat distribution map (2016–2017 growing season) informed by (a) RF and (b) A-LSTM; (c) the distribution of manually collected evaluation samples from the Planet Explorer API.

Table 7. Confusion matrix of the prediction map informed by RF.

                  Prediction
Reference         Winter Wheat   No-Wheat   Recall
Winter wheat      85             33         0.720
No-wheat          30             852        0.966
Precision         0.739          0.963
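The per-class scores quoted for the RF map can be recomputed from the raw confusion-matrix counts (rows = reference, columns = prediction):

```python
# Counts from the RF confusion matrix: reference rows, prediction columns.
tp, fn = 85, 33   # wheat cells predicted wheat / missed
fp, tn = 30, 852  # non-wheat cells predicted wheat / correctly rejected

recall = tp / (tp + fn)            # 85/118  ~ 0.720
precision = tp / (tp + fp)         # 85/115  ~ 0.739
f1 = 2 * tp / (2 * tp + fn + fp)   # 170/233 ~ 0.729
accuracy = (tp + tn) / (tp + fn + fp + tn)
```

Substituting the A-LSTM counts (83, 35, 29, 853) reproduces its reported 0.703 / 0.741 / 0.721 in the same way.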

Table 8. Confusion matrix of the prediction map informed by A-LSTM.

                  Prediction
Reference         Winter Wheat   No-Wheat   Recall
Winter wheat      83             35         0.703
No-wheat          29             853        0.967
Precision         0.741          0.961

5. Discussion

Our two models achieved F1 scores of 0.72 and 0.71 for identifying the winter wheat distribution in the Huanghuaihai Region. Previous works such as Senf et al. [5], Tuanmu et al. [7], Rodriguez-Galiano et al. [17], Pelletier et al. [18], and Rußwurm et al. [20] regarding land cover classification or crop identification usually focused on only a small area, such as a county or city, which resulted in classification maps that were more detailed than ours. However, our trained models could also easily recognize the approximate winter wheat distribution over a large-scale area, and these models could be especially useful in nearby areas or for historical cropland area identification. All the procedures are very simple to implement, and the results can be obtained rapidly.

Although the RF and A-LSTM methods achieved comparable performance according to our experiments, their computation costs are different. As the score tables above show, the overall accuracy and F1 score of RF are slightly higher than those of A-LSTM, but the computation time required to train an RF model is much greater than that required to train an A-LSTM model. As there are many samples throughout the study area, RF needs to traverse almost all samples and find the optimal split orders and split points to build each decision tree. It required almost 2 h to train a complete RF model on our 24-core workstation with parallel computation. For A-LSTM, a high-performance graphics card could be used to accelerate the computation, and an early-stop strategy [34] (automatically stopping training when the model converges) could be employed in practical applications. In this manner, approximately 50 min are required to train an A-LSTM model with a Tesla V100 GPU card.

Furthermore, the overall accuracies of our two models were 0.87 and 0.85. However, the precision and recall were approximately 0.70, which is not sufficiently precise for the detailed mapping of winter wheat. This result is mainly due to three phenomena. (1) The cell size of MODIS data is 250 m, and a cell might contain multiple land cover types, thus making the reflectance spectra unstable and unpredictable. (2) Our ground truth map, which was provided by the Chinese Academy of Agricultural Sciences, was assumed to be totally accurate in each cell. Since we did not receive any information regarding how the map was produced or its accuracy, we reassessed the data via manually collected field data, as described in Section 2.2.2. The results indicated that the overall accuracy of the ground truth map was 0.95, the precision of winter wheat was 0.89, and the recall was 0.83, which might result in noise in the training sample labels. (3) Although many works have demonstrated the effectiveness of RF and DNNs, they still have limitations in learning the characteristics of such a large and complex area.
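The early-stop strategy [34] mentioned above halts training once the validation loss stops improving for a fixed number of epochs (the "patience"). A framework-agnostic sketch of the rule (the Keras `EarlyStopping` callback implements the same logic); the function and parameter names here are illustrative, not from the paper:

```python
def train_with_early_stopping(train_step, val_loss, max_epochs=200, patience=5):
    """Run train_step() each epoch; stop after `patience` epochs
    without improvement in val_loss()."""
    best, best_epoch = float("inf"), 0
    for epoch in range(max_epochs):
        train_step()
        loss = val_loss()
        if loss < best:
            best, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            break  # converged: no improvement for `patience` epochs
    return best, epoch

# Toy usage: a validation loss that plateaus after epoch 9.
history = iter([1 / (e + 1) if e < 10 else 0.1 for e in range(200)])
best, stopped_at = train_with_early_stopping(lambda: None, lambda: next(history))
```

With a plateau at epoch 9 and patience 5, training stops at epoch 14 instead of running all 200 epochs, which is what keeps the A-LSTM training time near 50 min in practice.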
Using a more complex RF or A-LSTM (a larger number of trees for RF or a deeper network for A-LSTM) could increase the inference ability. However, this usually causes severe overfitting problems, and our experiments showed that the validation score remains almost unchanged once the models reach saturation.

In this paper, we also visualized the feature importance of RF. Generally, the features derived from March or April had high importance scores. In early March, the first joint emerges above the soil line; this joint provides visual evidence that the plants have initiated their reproduction [35]. Then, winter wheat enters a fast-growing period until maturation. Thus, features in this period tend to be significant. Regarding the feature importance of different bands, vegetation indices such as NDVI or EVI represent the reflectance differences between vegetation and soil in the visible and NIR bands. Green vegetation usually exhibits a large spectral difference between the red and NIR bands, thus making a vegetation index more important than a single band. In our study, features were derived from the six bands provided by the MOD13Q1 Collection 6 product. Future work could add additional bands to the models, which might provide better results.

When the trained models are used to make predictions in other areas, nearby areas usually yield reliable results. The main reason behind this result is that the winter wheat growing season varies by area; specifically, in the Huanghuaihai Region, winter wheat crops at the same latitudes likely have similar growing seasons, and the experiments above support this point of view. In practice, there are likely other factors that contribute to the poor performance, such as the terrain, elevation, climate, crop species, or different land cover compositions of the negative samples. Thus, the MODIS data distribution in the prediction area varies compared with that in the training areas.
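The vegetation indices discussed above amplify the red/NIR contrast of green vegetation. Their standard formulations (with the MODIS EVI coefficients G = 2.5, C1 = 6, C2 = 7.5, L = 1) can be sketched as follows; the input reflectance values are hypothetical, chosen only to contrast a dense canopy with sparse vegetation:

```python
def ndvi(nir: float, red: float) -> float:
    """Normalized Difference Vegetation Index."""
    return (nir - red) / (nir + red)

def evi(nir: float, red: float, blue: float) -> float:
    """Enhanced Vegetation Index with MODIS coefficients
    (G=2.5, C1=6, C2=7.5, L=1)."""
    return 2.5 * (nir - red) / (nir + 6 * red - 7.5 * blue + 1)

# Illustrative reflectances: dense green canopy vs. sparse cover.
dense = ndvi(0.50, 0.08)   # large red/NIR contrast -> high index
sparse = ndvi(0.25, 0.20)  # small contrast -> low index
```

The normalization compresses two reflectance bands into one value that separates vegetation from soil more cleanly than either band alone, which is consistent with the higher importance scores of NDVI and EVI in Table 5.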
Our future work will focus on revealing the detailed mechanisms underlying these differences. Fortunately, when we applied our models to historical MODIS data, the performance was stable, as described in Section 4.6. However, exceptions sometimes exist that introduce noise into the predictions, such as improved winter wheat varieties, climate changes, and land cover changes. Regrettably, we did not conduct additional experiments for earlier years because of the extensive labor required to collect validation data.

6. Conclusions

In this paper, we developed two models for large-scale winter wheat identification in the Huanghuaihai Region. To the best of our knowledge, this study was the first to use raw digital numbers derived from time-series MODIS images to implement classification pipelines and make predictions over a large-scale land area. According to our experiments, we can draw the following general conclusions. (1) Both the RF and A-LSTM models were efficient for identifying winter wheat areas, with F1 scores of 0.72 and 0.71, respectively. (2) The comparison of the two models indicates that RF achieved a slightly better score than A-LSTM, while A-LSTM is much faster with GPU acceleration. (3) For time-series winter wheat identification, the data acquired in March or April are more important and contribute more than the data acquired at other times, and vegetation indices such as EVI and NDVI are more helpful than single reflectance bands. (4) Predicting in local or nearby areas is more likely to yield reliable results. (5) The models are also capable of efficiently identifying historical winter wheat areas.

Our models were applied to only winter wheat identification due to the limitations of the crop distribution data, but they could potentially solve multiclass problems. Further research regarding multiple crop types or other land cover types over large regions could be more meaningful and useful. Moreover, due to the scarcity and acquisition difficulties of high-spatiotemporal-resolution images, the cell size of each training sample was 250 m, which is too large to avoid the mixed-pixel problem. We believe that using time-series images with finer scales could help improve the accuracy and generalizability of our models. Finally, it is easy to determine how predictions are made in RF models, but this is difficult in the A-LSTM model.
Future research will likely devote more attention to the inference mechanisms of DNNs.

Author Contributions: T.H. performed the research, analyzed the data and wrote the manuscript; C.X. provided ideas and reviewed the experiments and the manuscript; S.G. acquired the data; G.L. reviewed the research and provided funding support.

Funding: This research was financially supported by the National Key Research and Development Program of China (No. 2017YFD0300403) and the Laboratory Independent Innovation Project of the State Key Laboratory of Resources and Environment Information System.

Acknowledgments: We thank the Chinese Academy of Agricultural Sciences for providing the winter wheat distribution map and growing season data for 2017–2018. We thank the NASA Goddard Space Flight Center for providing the MODIS data.

Conflicts of Interest: The authors declare no conflicts of interest.

References

1. Becker-Reshef, I.; Justice, C.; Sullivan, M.; Vermote, E.; Tucker, C.; Anyamba, A.; Small, J.; Pak, E.; Masuoka, E.; Schmaltz, J.; et al. Monitoring Global Croplands with Coarse Resolution Earth Observations: The Global Agriculture Monitoring (GLAM) Project. Remote Sens. 2010, 2, 1589–1609. [CrossRef]
2. Eerens, H.; Haesen, D.; Rembold, F.; Urbano, F.; Tote, C.; Bydekerke, L. Image time series processing for agriculture monitoring. Environ. Model. Softw. 2014, 53, 154–162. [CrossRef]
3. Atzberger, C. Advances in Remote Sensing of Agriculture: Context Description, Existing Operational Monitoring Systems and Major Information Needs. Remote Sens. 2013, 5, 949–981. [CrossRef]
4. Beeri, O.; Peled, A. Geographical model for precise agriculture monitoring with real-time remote sensing. ISPRS J. Photogramm. Remote Sens. 2009, 64, 47–54. [CrossRef]
5. Senf, C.; Pflugmacher, D.; Van Der Linden, S.; Hostert, P. Mapping rubber plantations and natural forests in Xishuangbanna (Southwest China) using multi-spectral phenological metrics from MODIS time series. Remote Sens. 2013, 5, 2795–2812. [CrossRef]
6. Pittman, K.; Hansen, M.C.; Becker-Reshef, I.; Potapov, P.V.; Justice, C.O. Estimating Global Cropland Extent with Multi-year MODIS Data. Remote Sens. 2010, 2, 1844–1863. [CrossRef]
7. Tuanmu, M.N.; Viña, A.; Bearer, S.; Xu, W.; Ouyang, Z.; Zhang, H.; Liu, J. Mapping understory vegetation using phenological characteristics derived from remotely sensed data. Remote Sens. Environ. 2010, 114, 1833–1844. [CrossRef]
8. Sakamoto, T.; Yokozawa, M.; Toritani, H.; Shibayama, M.; Ishitsuka, N.; Ohno, H. A crop phenology detection method using time-series MODIS data. Remote Sens. Environ. 2005, 96, 366–374. [CrossRef]
9. Zhang, X.; Friedl, M.A.; Schaaf, C.B.; Strahler, A.H.; Hodges, J.C.; Gao, F.; Reed, B.C.; Huete, A. Monitoring vegetation phenology using MODIS. Remote Sens. Environ. 2003, 84, 471–475. [CrossRef]

10. Funk, C.; Budde, M.E. Phenologically-tuned MODIS NDVI-based production anomaly estimates for Zimbabwe. Remote Sens. Environ. 2009, 113, 115–125. [CrossRef]
11. Jönsson, P.; Eklundh, L. Seasonality extraction by function fitting to time-series of satellite sensor data. IEEE Trans. Geosci. Remote Sens. 2002, 40, 1824–1832. [CrossRef]
12. Jönsson, P.; Eklundh, L. TIMESAT—A program for analyzing time-series of satellite sensor data. Comput. Geosci. 2004, 30, 833–845. [CrossRef]
13. Atzberger, C.; Eilers, P.H.C. A time series for monitoring vegetation activity and phenology at 10-daily time steps covering large parts of South America. Int. J. Digit. Earth 2011, 4, 365–386. [CrossRef]
14. Chen, J.; Jönsson, P.; Tamura, M.; Gu, Z.; Matsushita, B.; Eklundh, L. A simple method for reconstructing a high-quality NDVI time-series data set based on the Savitzky-Golay filter. Remote Sens. Environ. 2004, 91, 332–344. [CrossRef]
15. Shao, Y.; Lunetta, R.S.; Wheeler, B.; Iiames, J.S.; Campbell, J.B. An evaluation of time-series smoothing algorithms for land-cover classifications using MODIS-NDVI multi-temporal data. Remote Sens. Environ. 2016, 174, 258–265. [CrossRef]
16. Zhong, L.; Hu, L.; Zhou, H. Deep learning based multi-temporal crop classification. Remote Sens. Environ. 2019, 221, 430–443. [CrossRef]
17. Rodriguez-Galiano, V.F.; Ghimire, B.; Rogan, J.; Chica-Olmo, M.; Rigol-Sanchez, J.P. An assessment of the effectiveness of a random forest classifier for land-cover classification. ISPRS J. Photogramm. Remote Sens. 2012, 67, 93–104. [CrossRef]
18. Pelletier, C.; Valero, S.; Inglada, J.; Champion, N.; Dedieu, G. Assessing the robustness of Random Forests to map land cover with high resolution satellite image time series over large areas. Remote Sens. Environ. 2016, 187, 156–168. [CrossRef]
19. Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32. [CrossRef]
20. Rußwurm, M.; Körner, M. Temporal vegetation modelling using long short-term memory networks for crop identification from medium-resolution multi-spectral satellite images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Honolulu, HI, USA, 21–26 July 2017; pp. 11–19.
21. Rußwurm, M.; Körner, M. Multi-temporal land cover classification with sequential recurrent encoders. ISPRS Int. J. Geo-Inf. 2018, 7, 129. [CrossRef]
22. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems (NIPS'12), Lake Tahoe, NV, USA, 3–6 December 2012; pp. 1097–1105.
23. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1–9.
24. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
25. Bahdanau, D.; Cho, K.; Bengio, Y. Neural machine translation by jointly learning to align and translate. arXiv 2014, arXiv:1409.0473.
26. Long, H.; Yi, Q. Land Use Transitions and Land Management: A Mutual Feedback Perspective. Land Use Policy 2018, 74, 111–120. [CrossRef]
27. Zhang, J.; Sun, J.; Duan, A.; Wang, J.; Shen, X.; Liu, X. Effects of different planting patterns on water use and yield performance of winter wheat in the Huang-Huai-Hai plain of China. Agric. Water Manag. 2007, 92, 41–47. [CrossRef]
28. Huete, A.R.; Didan, K.; Miura, T.; Rodriguez, E.P.; Gao, X.; Ferreira, L.G. Overview of the radiometric and biophysical performance of the MODIS vegetation indices. Remote Sens. Environ. 2002, 83, 195–213. [CrossRef]
29. Powers, D.M.W. Evaluation: From Precision, Recall and F-Measure to ROC, Informedness, Markedness & Correlation. J. Mach. Learn. Technol. 2011, 2, 37–63.
30. Scikit-Learn. Available online: https://scikit-learn.org/ (accessed on 1 March 2019).
31. Tensorflow. Available online: https://www.tensorflow.org/ (accessed on 1 March 2019).
32. Keras. Available online: https://keras.io/ (accessed on 1 March 2019).

33. Ruder, S. An Overview of Gradient Descent Optimization Algorithms. Available online: https://arxiv.org/pdf/1609.04747.pdf (accessed on 3 January 2019).
34. Yao, Y.; Rosasco, L.; Caponnetto, A. On Early Stopping in Gradient Descent Learning. Constr. Approx. 2007, 26, 289–315. [CrossRef]
35. Li, Z.-C. Hyperspectral Features of Winter Wheat after Frost Stress at Jointing Stage. Acta Agron. Sin. 2008, 34, 831–837. [CrossRef]

© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).