
Communication

A Novel Framework Based on Mask R-CNN and Histogram Thresholding for Scalable Segmentation of New and Old Rural Buildings

Ying Li 1, Weipan Xu 1, Haohui Chen 2 , Junhao Jiang 1 and Xun Li 1,*

1 Department of Urban and Regional Planning, School of Geography and Planning, Regional Coordinated Development and Rural Construction Institute, Urbanization Institute, Sun Yat-sen University, Guangzhou 510275, China; [email protected] (Y.L.); [email protected] (W.X.); [email protected] (J.J.)
2 Data61, Commonwealth Scientific and Industrial Research Organisation (CSIRO), Canberra 2601, Australia; [email protected]
* Correspondence: [email protected]

Abstract: Mapping new and old buildings is of great significance for understanding socio-economic development in rural areas. In recent years, deep neural networks have achieved remarkable building segmentation results in high-resolution remote sensing images. However, the scarce training data and the varying geographical environments have posed challenges for scalable building segmentation. This study proposes a novel framework based on Mask R-CNN, named Histogram Thresholding Mask Region-Based Convolutional Neural Network (HTMask R-CNN), to extract new and old rural buildings even when the label is scarce. The framework adopts the result of single-object instance segmentation from the orthodox Mask R-CNN. Further, it classifies the rural buildings into new and old ones based on a dynamic grayscale threshold inferred from the result of a two-object instance segmentation task where training data is scarce. We found that the framework can extract more buildings and achieve a much higher mean Average Precision (mAP) than the orthodox Mask R-CNN model. We tested the novel framework's performance with increasing training data and found that it converged even when the training samples were limited. This framework's main contribution is to allow scalable segmentation by using significantly fewer training samples than traditional machine learning practices. That makes mapping China's new and old rural buildings viable.

Keywords: deep learning; rural buildings; instance segmentation; Mask R-CNN; histogram thresholding

1. Introduction

Monitoring the composition of new and old buildings in rural areas is of great significance to rural development [1]. In particular, China's recent rapid urbanization has tremendously transformed its rural settlements over the last decades [2]. However, unplanned and poorly documented dwellings have posed significant challenges for understanding rural settlements [3,4]. Traditionally, field surveys have been the major solution, but they require intensive labour input and can be time-consuming, especially in remote areas. Recent breakthroughs in remote sensing technologies provide a growing availability of high-resolution remote sensing images such as low-altitude aerial photos and Unmanned Aerial Vehicle (UAV) images. That allows manual mapping of rural settlements at a lower cost and with broader coverage, but it is still time-consuming. Therefore, to map the settlements of nearly 564 million rural residents in China [5], a scalable, intelligent, image-based solution is urgently needed.

Remote sensing-based mapping of buildings has been a popular research topic for decades [6–10].

Since the launch of IKONOS, QuickBird, WorldView and, most recently, UAVs, the spatial resolution of remote sensing images has grown significantly. That encourages applications of machine learning (ML) technologies from computer vision to push the image segmentation frontier. ML-based methods, such as Markov Random Fields [11], Bayesian Networks [12], Neural Networks [13], SVM [14], and Deep Convolutional Neural Networks (DCNN) [15], have achieved impressive results. These supervised methods learn the spatial and hierarchical relationships between pixels and objects, utilizing the resolution gains of remote sensing technologies in recent years.

There are two kinds of image segmentation. Semantic segmentation, such as U-Net [16], SegNet [17], and DeepLab [18], treats multiple objects of the same class as a single entity. In contrast, instance segmentation treats multiple objects of the same class as distinct individual entities (or instances) [19,20]. The understanding of rural settlements involves instance segmentation rather than semantic segmentation, as the numbers and areas of settlements are needed. Amongst the instance segmentation methods, Mask Region-Based Convolutional Neural Network (Mask R-CNN) has been unprecedentedly popular. It predicts bounding boxes for the target objects and then segments the objects inside the predicted boxes [21]. Scholars have used Mask R-CNN to carry out building extraction and to regularize the extraction results. Li et al. [22] improved Mask R-CNN by detecting key points to segment individual buildings and preserve geometric details. Zhao et al. introduced building boundary regularization to the orthodox Mask R-CNN, which benefits many cartographic and engineering applications [23]. In addition, some scholars have used Mask R-CNN for object detection [24] and for mapping other land use objects, such as ice wedges [25,26]. These empirical studies mostly focus on the optimization of object extraction but pay little attention to the challenges where training samples are scarce.

Compared to urban buildings, rural ones have attracted much less attention from the remote sensing community [27,28]. Only a few rural datasets are open to the public. The Wuhan dataset (WHU) [29] extracts 220,000 independent buildings from high-resolution remote sensing images covering 450 km2 in Christchurch, New Zealand. The Massachusetts dataset consists of 151 aerial images of the Boston area, covering roughly 340 km2 [30]. Trained on these datasets, ML-based models have achieved impressive segmentation results [31–33]. However, these data only cover a relatively small area. As a result, the models might not generalize well in other regions where the geographical environments differ significantly. Therefore, the lack of training data specifically for rural environments and the varying geographical environments across regions have posed challenges for building a robust and generalized algorithm. To achieve human-like segmentation for China's vast and varying geography, we might need to build models dedicated to different regions. In this regard, the bottleneck is the manual effort for annotating a large amount of training data.

In this study, we propose a novel framework that can significantly reduce data annotation efforts while retaining the classification capability. Humans annotate the new and old buildings in high-resolution remote sensing images by the difference in pixel color, because the grayscale histogram varies significantly between new and old buildings.
When the grayscale histogram of new and old buildings exhibits a bimodal distribution, the valley point can be used as the threshold for discriminating them. This methodology is called histogram thresholding, which is widely used for image object extraction, such as the extraction of surface water areas [34], built-up areas [35], and brain image fusion [36]. If all building footprints are given, a few predicted labels of new and old buildings in the same remote sensing image can validate whether such a bimodal distribution exists and consequently find the valley point. In this regard, we can reduce the number of training samples while retaining the algorithm's capability. It should be noted that the segmentation of buildings is more manageable than the segmentation of new and old ones in machine learning practice. That is partially due to the reduced effort for labelling one class instead of two. Another reason is that binary classifiers are more accurate than multi-class classifiers in general. For example, classifying dogs is easier than classifying dog breeds. Therefore, the proposed framework uses histogram thresholding as an add-on to the state-of-the-art deep learning algorithm to achieve impressive segmentation results. In the methods section, we will address the proposed framework in detail. The proposed framework's contribution to the building extraction research area is to achieve a promising classification capability while annotation efforts can be significantly reduced. This study uses rural areas in Xinxing County, Guangdong Province, as the case study to test the proposed framework's performance.

2. Study Area and Data

2.1. Study Area

To test the proposed framework's performance, we collected data samples from high-resolution satellite images covering rural Xinxing County, Guangdong Province, China (see Figure 1). Xinxing is a traditional mountainous agricultural county, with a large agricultural population and a relatively complete landscape of farmland and forest. Moreover, Xinxing is a rural revitalization pilot area and has made many rural development and governance achievements [37]. The extraction of new and old buildings is of great significance for understanding rural development in Xinxing.

Figure 1. The study area. (a) The location of Xinxing in Guangdong Province, China; (b) a village of Xinxing; (c) building labels of the same region as in panel (b) (new and old buildings are masked in blue and red, respectively).

2.2. Data2.2. Data Collection Collection and and Annotation Annotation TableTable 1 shows1 shows the the new new and old old buildings buildings in high-resolutionin high‐resolution satellite satellite images. images. Most of Most of the new buildings are brick-concrete structures, with roofs made of cement or colored tiles. the newOld houses buildings are mainly are brick -style‐concrete structures, courtyards inwith Xinxing, roofs where made the of roof cement materials or colored tiles.are Old dark houses tiles. Moreover, are mainly the Cantonese outline of their‐style footprints courtyards is less in clear Xinxing, than new where houses. the New roof mate‐ RemoteRemoteRemoteRemoteRemoteRemoteRemoteRemote Sens.Sens. Sens.Sens. Sens. Sens. 2021 Sens.2021Sens. 20212021 2021 2021,, 132021rials,202113, 1313,, , 13, x,, x , 13 ,x, FOR x, FOR13 13buildings xFOR,,FORare x,FOR, x xFOR PEER PEERFOR FOR PEERPEERdark PEER PEER PEERPEER REVIEWREVIEW REVIEWREVIEW areREVIEWtiles. REVIEW REVIEWREVIEW mostly Moreover, distributed the alongoutline the of streets, their footprints while the old is buildingsless clear still than retain new a houses.44 4 4of of 4ofof 4 of11 4 11 4of11 11 of of11 11 1111


Table 1. The image characteristics of new and old buildings in rural areas.

Type Image Characteristics

old buildings

new buildings

For model training purposes, we collected 68 images with a resolution of 0.26 m. Each image has a size ranging from 900 × 900 to 1024 × 1024 pixels in the RGB color space. We use the open-source image annotation tool VIA [38] to delineate the building footprints (see Figure 2). We edited and checked all the building samples of the original vector file using the ArcGIS™ software to produce a high-quality dataset. All building samples from those 68 images were compiled as a dataset called one-class samples. We annotated only 34 out of 60 images with new and old labels (called two-class samples hereafter). Finally, the annotated imageries were randomly divided into a training, validation and test set (see Table 2).

Figure 2. The construction of the image dataset. (a) a village image; (b) the building labels; (c) the label json file; (d) the mask with building categories.
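For readers who want to reproduce the dataset construction shown in Figure 2, the following is a minimal Python sketch of how VIA polygon annotations could be rasterized into per-building masks. The JSON layout follows the common VIA 2.x export format, but the attribute key "building_type" is a hypothetical name for however the new/old label was stored in the annotation project; it is not taken from the paper.

```python
import json
import numpy as np
from PIL import Image, ImageDraw

def via_polygons_to_masks(via_json_path, height, width):
    """Rasterize VIA polygon annotations into per-building binary masks.

    Assumes the common VIA 2.x export layout: each image entry holds a list of
    "regions" whose "shape_attributes" store the polygon vertices and whose
    "region_attributes" carry the class label. The key "building_type" below is
    a hypothetical attribute name, not taken from the paper.
    """
    with open(via_json_path) as f:
        annotations = json.load(f)

    masks, labels = [], []
    for entry in annotations.values():
        for region in entry.get("regions", []):
            shape = region["shape_attributes"]
            points = list(zip(shape["all_points_x"], shape["all_points_y"]))
            canvas = Image.new("L", (width, height), 0)
            ImageDraw.Draw(canvas).polygon(points, outline=1, fill=1)
            masks.append(np.array(canvas, dtype=np.uint8))
            labels.append(region.get("region_attributes", {}).get("building_type", "building"))
    return masks, labels
```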

Table 2. The training, validation, and test dataset.

            One-Class Samples      Two-Class Samples
                                   New Buildings   Old Buildings   Total
training    54 pic/5359 poly       1340            1081            20 pic/2421 poly
validation  8 pic/817 poly         439             378             8 pic/817 poly
test        6 pic/892 poly         462             454             6 pic/906 poly
total       68 pic/7068 poly       2241            1913            34 pic/4154 poly

3. Methods

3.1. HTMask R-CNN

Mask R-CNN [21] has been proven to be a powerful and adaptable model in many different domains [23,39]. It operates in two phases: generation of region proposals and classification of each generated proposal. In this study, we use Mask R-CNN as our baseline model for benchmarking. As discussed before, we propose a novel segmentation framework that can utilize histogram thresholding and deep learning's image segmentation capability to extract the new and old rural buildings. We call the proposed framework HTMask R-CNN, abbreviating Histogram Thresholding Mask R-CNN. The workflow of the framework is addressed as follows (see Figure 3 for illustration):
a. We built two segmentation models (a one-class model and a two-class model) based on the one-class and two-class samples' training sets (Figure 3a). The one-class model can extract rural buildings, while the two-class model can classify new and old rural buildings. All models used pre-trained weights trained on the COCO dataset [40] as the base weights.
b. A satellite image (Figure 3b) is classified by the one-class and two-class models separately, leading to a map of building footprints (R1 in Figure 3c) and a map of new and old buildings (R2 in Figure 3d).
c. Grayscale histograms were built using the pixels from the new and old building footprints (R2). The average grayscale levels for new and old buildings were computed as N and O, respectively (Figure 3e). A valley point is determined by θ = (N + O)/2.
d. The valley point θ is used as the threshold to determine the type of each building in R1. Finally, we get a map of the old and new buildings in R3 (Figure 3f).
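As an illustration of steps (c) and (d), the sketch below shows how the valley point θ could be computed from the R2 predictions and then applied to the R1 footprints. The function and variable names are ours rather than from the authors' code, and it assumes that new cement or colored-tile roofs appear brighter than old dark-tile roofs, consistent with Table 1.

```python
import numpy as np

def classify_r1_by_threshold(gray_image, r1_masks, r2_new_masks, r2_old_masks):
    """Steps (c)-(d): derive the valley point from R2 and apply it to R1.

    gray_image   : (H, W) grayscale array of the satellite image
    r1_masks     : boolean masks of all building footprints from the one-class model
    r2_new_masks : boolean masks of buildings the two-class model labelled "new"
    r2_old_masks : boolean masks of buildings the two-class model labelled "old"
    """
    # Average grayscale level of pixels predicted as new (N) and old (O) buildings.
    n_mean = np.concatenate([gray_image[m] for m in r2_new_masks]).mean()
    o_mean = np.concatenate([gray_image[m] for m in r2_old_masks]).mean()

    theta = (n_mean + o_mean) / 2.0  # valley point, θ = (N + O) / 2

    # Assumption: new roofs are brighter than old dark-tiled roofs (Table 1);
    # if the opposite held, the comparison would simply be flipped.
    labels = ["new" if gray_image[m].mean() > theta else "old" for m in r1_masks]
    return labels, theta
```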


Figure 3. The workflow of HTMask R-CNN (Histogram Thresholding Mask R-CNN). (a) the training process of two segmentation models based on the one-class and two-class samples' training sets; (b) a sample village image in the validation and test set; (c) R1: the prediction result of the one-class model; (d) R2: the prediction result of the two-class model; (e) the calculation of the threshold from the grayscale histogram; (f) R3: the prediction result of HTMask R-CNN.

The hypothesis is that R3 performs better than R2. Specifically, R3 can take advantage of the capability of R1 while utilizing the grayscale difference between the new and old buildings in R2. The two-class model's performance depends on the number of training samples. Presumably, the more training data are added, the more robust the network training and the better the segmentation results. This study also tests how the number of training samples affects R2 and R3's performance, to evaluate how HTMask R-CNN can save annotation effort while retaining the segmentation capability.

3.2. Experiment

We used R2, the prediction result of the two-class model, as the benchmark; R3 is the result of the proposed framework. We compared R2 and R3 to test how much accuracy the proposed framework gains.

We randomly selected 50% of the images from the one-class and two-class training sets for data augmentation, resulting in 1.5 times the original training size. The augmentations included rotating, mirroring, brightness enhancement, and adding noise points to the images. In the training stage for the one-class and two-class models, 50 epochs with two batches per epoch were applied, and the learning rate was set at 0.0001. The Stochastic Gradient Descent (SGD) optimization algorithm was adopted as the optimizer [41]. We set the weight decay to 0.000. The loss function is shown in Equation (S1). The learning momentum was set at 0.9, which was used to control to what extent the model remains in the original updating direction. We used cross-entropy as the loss function to evaluate the training performance. We performed hyperparameter tuning, and the settings addressed above achieved the best performance (refer to Table S2 for details of the hyperparameter tuning).

To test how HTMask R-CNN can achieve converged performance with a limited amount of training data, the training process involved an incremental number of samples (from 5 to 20 satellite images). Afterward, we compared the baseline Mask R-CNN and the HTMask R-CNN by comparing R2 and R3.
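The paper does not state which Mask R-CNN implementation was used, so the following sketch only illustrates how the stated optimizer settings (SGD, learning rate 0.0001, momentum 0.9, zero weight decay) and COCO pre-trained base weights could be wired up with torchvision's Mask R-CNN; the class count and the head replacement follow the standard torchvision fine-tuning recipe and are an assumption, not the authors' code.

```python
import torch
from torchvision.models.detection import maskrcnn_resnet50_fpn
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.models.detection.mask_rcnn import MaskRCNNPredictor

num_classes = 3  # background + new building + old building (two-class model)

# COCO pre-trained base weights, as in workflow step (a).
model = maskrcnn_resnet50_fpn(pretrained=True)

# Replace the box and mask heads so they predict our class count.
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
in_channels_mask = model.roi_heads.mask_predictor.conv5_mask.in_channels
model.roi_heads.mask_predictor = MaskRCNNPredictor(in_channels_mask, 256, num_classes)

# Optimizer settings reported above: SGD, lr = 0.0001, momentum = 0.9,
# weight decay disabled.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9, weight_decay=0.0)
```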

3.3. Accuracy Assessment

We use the average precision (AP) to quantitatively evaluate our framework on the validation dataset. The AP is the area under the precision-recall (PR) curve, Equation (1). The mAP50 represents the AP value when the threshold of intersection over union (IoU) is 0.5.

Precision = TP/(TP + FP), Recall = TP/(TP + FN), AP = ∫₀¹ P(R) dR (1)

IoU means the ratio of the intersection and union of the prediction and the reference. When a segmentation image is obtained, the value of IoU is calculated according to Equation (2).

IoU = TP/(TP + FP + FN) (2)
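A minimal sketch of Equations (1) and (2): IoU between a predicted and a reference mask, and an all-point average precision over score-ranked predictions. The matching of predictions to ground truth at IoU ≥ 0.5 (which yields the true-positive flags) is assumed to have been done beforehand, and the function names are illustrative rather than the paper's evaluation code.

```python
import numpy as np

def mask_iou(pred_mask, gt_mask):
    """Equation (2): IoU = TP / (TP + FP + FN) for two boolean masks."""
    intersection = np.logical_and(pred_mask, gt_mask).sum()
    union = np.logical_or(pred_mask, gt_mask).sum()
    return intersection / union if union > 0 else 0.0

def average_precision(scores, is_true_positive, num_gt):
    """Equation (1): area under the precision-recall curve for one class.

    scores           : confidence score of each prediction
    is_true_positive : 1 if the prediction matched a ground truth at IoU >= 0.5
    num_gt           : number of ground-truth instances of the class
    """
    order = np.argsort(scores)[::-1]                 # rank by descending confidence
    tp = np.asarray(is_true_positive, dtype=float)[order]
    fp = 1.0 - tp
    recall = np.cumsum(tp) / max(num_gt, 1)
    precision = np.cumsum(tp) / np.maximum(np.cumsum(tp) + np.cumsum(fp), 1e-12)

    # All-point interpolation: enforce a monotonically decreasing precision
    # envelope, then sum the area under the stepwise PR curve.
    mrec = np.concatenate(([0.0], recall, [1.0]))
    mpre = np.concatenate(([0.0], precision, [0.0]))
    for i in range(len(mpre) - 2, -1, -1):
        mpre[i] = max(mpre[i], mpre[i + 1])
    idx = np.where(mrec[1:] != mrec[:-1])[0]
    return float(np.sum((mrec[idx + 1] - mrec[idx]) * mpre[idx + 1]))
```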

4. Results

Figure 4 shows the result of an example image, Site 1 (Figures S1 and S2 present additional examples for other sites with gradually increasing building density). In terms of building footprint mapping, the one-class model identified most of the buildings. More importantly, it can accurately outline individual buildings, and the boundaries of adjacent buildings are correctly separated, which allows the texture of the buildings to be captured.

Figure 4. The result comparison of Site 1 between the baseline Mask R-CNN model and the HTMask R-CNN framework. (a) the satellite image from the test set; the second column shows the annotations, the third column shows the result of the one-class model (R1), and the fourth column shows the grayscale histograms of the new and old building footprints from R2 when training images = 20. (b) the result of the two-class model (R2), with incremental training samples from 5 to 20. (c) the result of the HTMask R-CNN framework (R3) with incremental training samples.


The baseline model (two-class model) performed better and better with the growing numbers of training samples in building extraction (Figure 4b). However, the numbers of buildings in R2 are still significantly fewer than in R1, especially if the buildings are very dense (see Site 3), which aligns with our assumption. In R3, the proposed framework uses R1 as the base map, so the numbers of buildings are equal between R1 and R3. That means the proposed framework outperforms the baseline model. In terms of the new and old building segmentation, R3 is significantly better than R2 at all levels of training samples. When the number of training samples is very limited, e.g., five, the baseline model nearly misidentified most of the new and old buildings, while the proposed framework still produced a reasonable result.

Figure 5 shows the performance of the one-class model. We noticed that the performance of the one-class model converges at the 25th epoch, where it identified most of the buildings. The mAP50 could reach 0.70.

Figure 5. The mAP50 of the one-class model on the test dataset at different training iterations.

The baseline two-class model and the HTMask R-CNN also converge at the 25th epoch (Table 3; Figure 6). When the training size is small (image_num = 5), the mAP50 of the baseline two-class model is very low (0.24), while the HTMask R-CNN can significantly improve the recognition (0.46). With the increasing training size, the performance gap between the baseline two-class model and HTMask R-CNN becomes narrower (see Figure 6d). Finally, the mAP50 of the baseline two-class model reached 0.51 when the training size is 20. More importantly, HTMask R-CNN performs consistently (mAP50 ≈ 0.48) at all levels of training size.

Table 3. The mAP50 between R2 and R3 at all levels of training.

Models\Epoch   1     5     10    15    20    25    30    35    40    45    50
5_2class      0.00  0.01  0.09  0.16  0.17  0.14  0.16  0.22  0.22  0.24  0.24
5_HTM         0.04  0.31  0.33  0.41  0.45  0.45  0.48  0.47  0.45  0.48  0.46
10_2class     0.00  0.06  0.18  0.29  0.33  0.35  0.37  0.37  0.39  0.41  0.41
10_HTM        0.05  0.37  0.32  0.43  0.45  0.46  0.48  0.49  0.46  0.48  0.47
15_2class     0.00  0.11  0.28  0.29  0.30  0.35  0.39  0.39  0.44  0.42  0.44
15_HTM        0.05  0.35  0.34  0.41  0.46  0.47  0.48  0.49  0.46  0.49  0.48
20_2class     0.02  0.16  0.27  0.40  0.43  0.46  0.42  0.49  0.45  0.50  0.51
20_HTM        0.05  0.36  0.34  0.42  0.45  0.47  0.48  0.50  0.46  0.49  0.48


Figure 6. The mAP50 for each model. (a) sample size = 5; (b) sample size = 10; (c) sample size = 15; (d) sample size = 20.

It confirms our assumption that HTMask R-CNN can perform well with a small number of samples, which means that it can significantly reduce annotation efforts while retaining the segmentation capability. In contrast, the baseline two-class model performed poorly in extracting old-new two-category buildings.

5. Discussions

With the advance of deep learning, the extraction of building footprints from satellite imagery has made notable progress, contributing significantly to settlements' digital records. However, the scarcity of training data has always been the main challenge for scaling building segmentation. Therefore, this study proposes a novel framework based on the Mask R-CNN model and histogram thresholding to extract old and new rural buildings even when the label is scarce. We tested the framework in Xinxing County, Guangdong Province, and achieved promising results. This framework provides a viable solution for mapping China's rural buildings at a significantly reduced cost.

Mask R-CNN models have been proven useful in many applications. However, this study found that the orthodox Mask R-CNN model performed poorly in extracting old-new two-category buildings. When the training samples are limited, the mAP50 is only 0.24. We believe the varying geographical environments lead to the poor generalization of the segmentation model when the training samples cannot cover the most distinctive spatial and spectral features. For instance, the model might not classify a building with an open patio as either a new or old building if none of the training samples contains this unique shape. Meanwhile, the single-category classification task using Mask R-CNN could reach an mAP50 of 0.70. That means utilizing the one-class model's capability in mapping building footprints could improve the recall rate for the old-new two-category classification task, especially in high-density areas. Hence, we propose such a novel framework.

While testing the framework with increasing training samples, we found that it converges at a very early stage, when the number of training images is only five. That means the framework can be applied on a large scale to map all rural buildings in China.

Before then, more careful studies should be undertaken to understand the limitations of the framework. We have applied the framework in Fuliang County, Jiangxi Province, China, and found that its performance is worse than that of the benchmark two-class model R2 (see Figure S4). When the histogram of the R2 result does not exhibit a clear valley, the pixel grayscale of the new and old buildings is similar. In this regard, the thresholding method loses its advantage over the orthodox model, and the prediction should come from the output of R2.

Moreover, the polygons produced by the proposed framework have irregular shapes, slightly different from the building footprint boundaries. Therefore, downstream regularization is needed in future studies. The recent advances in multi-angle imaging technologies and vision technologies integrated with deep learning that have emerged in civil engineering provide new opportunities for the 3D reconstruction of rural building models [42,43].

6. Conclusions

Nearly half of the Chinese population lives in the rural areas of China. The lack of a digital record of new and old buildings has made it difficult for governments to assess the socio-economic state of these areas. Under China's central government's current rural revitalization policy, many migrant workers will return to the villages. Therefore, a scalable, intelligent, and accurate building mapping solution is urgently needed. The proposed framework in this study achieved a promising result even when the training samples are scarce. As a result, we can scale the mapping process at a significantly reduced cost. Therefore, we believe this framework could map every settlement in the rural areas, help policymakers establish a longitudinal digital building record, and monitor socio-economics across all rural regions.

Supplementary Materials: The following are available online at https://www.mdpi.com/2072-4292/13/6/1070/s1, Figure S1: The result comparison of Site 2 between the baseline Mask R-CNN model and the HTMask R-CNN framework, Figure S2: The result comparison of Site 3 between the baseline Mask R-CNN model and the HTMask R-CNN framework, Figure S3: The feature maps of the three sites in the two-class model at the 50th epoch, Figure S4: The baseline Mask R-CNN model and the HTMask R-CNN framework were tested in Fuliang County, Jiangxi Province, China, Figure S5: The loss function per training iteration, Table S1: The mAP75 between R2 and R3 at all levels of training, Table S2: Tuning hyperparameters, Equation S1: Loss function of the two-class model R2.

Author Contributions: Conceptualization, X.L.; Data curation, Y.L. and W.X.; Formal analysis, J.J.; Methodology, Y.L. and H.C.; Project administration, W.X. and X.L.; Resources, X.L.; Validation, H.C.; Visualization, J.J.; Writing – original draft, Y.L. and H.C. All authors reviewed and edited the draft, approved the submitted manuscript, and agreed to be listed and accepted the version for publication. All authors have read and agreed to the published version of the manuscript.

Funding: This research was funded by the National Natural Science Foundation of China (41971157) and the Key R&D Programs in Guangdong Province aligned with Major National Science and Technology Projects of China (2020B0202010002).

Institutional Review Board Statement: Not applicable.

Informed Consent Statement: Informed consent was obtained from all subjects involved in the study. Written informed consent has been obtained from the patient(s) to publish this paper.

Data Availability Statement: The data presented in this study are available at https://github.com/liying268-sysu/HTM-R-CNN, accessed on 8 February 2021.

Conflicts of Interest: The authors declare no conflict of interest.

References
1. Zhao, X.; Sun, H.; Chen, B.; Xia, X.; Li, P. China's rural human settlements: Qualitative evaluation, quantitative analysis and policy implications. Ecol. Indic. 2018, 105, 398–405. [CrossRef]
2. Yang, R.; Xu, Q.; Long, H. Spatial distribution characteristics and optimized reconstruction analysis of China's rural settlements during the process of rapid urbanization. J. Rural Stud. 2016, 47, 413–424. [CrossRef]
3. Kuffer, M.; Pfeffer, K.; Sliuzas, R. Slums from space—15 years of slum mapping using remote sensing. Remote Sens. 2016, 8, 455. [CrossRef]
4. Kuffer, M.; Persello, C.; Pfeffer, K.; Sliuzas, R.; Rao, V. Do we underestimate the global slum population? In Proceedings of the 2019 Joint Urban Remote Sensing Event (JURSE); IEEE: New York, NY, USA, 2019; pp. 1–4.
5. National Bureau of Statistics of China. China Statistical Yearbook 2018; China Statistics Press: Beijing, China, 2019.
6. Patino, J.E.; Duque, J.C. A review of regional science applications of satellite remote sensing in urban settings. Comput. Environ. Urban Syst. 2013, 37, 1–17. [CrossRef]
7. Jin, X.; Davis, C.H. Automated building extraction from high-resolution satellite imagery in urban areas using structural, contextual, and spectral information. EURASIP J. Adv. Signal Process. 2005, 2005, 745309. [CrossRef]
8. Huang, X.; Zhang, L. Morphological building/shadow index for building extraction from high-resolution imagery over urban areas. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2011, 5, 161–172. [CrossRef]
9. Ghanea, M.; Moallem, P.; Momeni, M. Building extraction from high-resolution satellite images in urban areas: Recent methods and strategies against significant challenges. Int. J. Remote Sens. 2016, 37, 5234–5248. [CrossRef]
10. Bachofer, F.; Braun, A.; Adamietz, F.; Murray, S.; D'Angelo, P.; Kyazze, E.; Mumuhire, A.P.; Bower, J. Building stock and building typology of Kigali, Rwanda. Data 2019, 4, 105. [CrossRef]
11. Tupin, F.; Roux, M. Markov random field on region adjacency graph for the fusion of SAR and optical data in radargrammetric applications. IEEE Trans. Geosci. Remote Sens. 2005, 43, 1920–1928. [CrossRef]
12. Zhang, L.; Ji, Q. Image segmentation with a unified graphical model. IEEE Trans. Pattern Anal. Mach. Intell. 2010, 32, 1406–1425. [CrossRef] [PubMed]
13. Kurnaz, M.N.; Dokur, Z.; Ölmez, T. Segmentation of remote-sensing images by incremental neural network. Pattern Recognit. Lett. 2005, 26, 1096–1104. [CrossRef]
14. Mitra, P.; Uma Shankar, B.; Pal, S.K. Segmentation of multispectral remote sensing images using active support vector machines. Pattern Recognit. Lett. 2004, 25, 1067–1074. [CrossRef]
15. Audebert, N.; Le Saux, B.; Lefèvre, S. Semantic Segmentation of Earth Observation Data Using Multimodal and Multi-Scale Deep Networks; Lai, S., Lepetit, V., Nishino, K., Sato, Y., Eds.; Asian Conference on Computer Vision; Springer: Cham, Germany, 2016; pp. 180–196. [CrossRef]
16. Guo, M.; Liu, H.; Xu, Y.; Huang, Y. Building extraction based on U-Net with an attention block and multiple losses. Remote Sens. 2020, 12, 1400. [CrossRef]
17. Chen, H.; Lu, S. Building Extraction from Remote Sensing Images Using SegNet. In Proceedings of the 2019 IEEE 4th International Conference on Image, Vision and Computing (ICIVC), Xiamen, China, 5–7 July 2019; pp. 227–230.
18. Chen, L.-C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 834–848.
19. Arnab, A.; Zheng, S.; Jayasumana, S.; Romera-Paredes, B.; Larsson, M.; Kirillov, A.; Savchynskyy, B.; Rother, C.; Kahl, F.; Torr, P.H. Conditional random fields meet deep neural networks for semantic segmentation: Combining probabilistic graphical models with deep learning for structured prediction. IEEE Signal Process. Mag. 2018, 35, 37–52. [CrossRef]
20. Pan, Z.; Xu, J.; Guo, Y.; Hu, Y.; Wang, G. Deep Learning Segmentation and Classification for Urban Village Using a Worldview Satellite Image Based on U-Net. Remote Sens. 2020, 12, 1574. [CrossRef]
21. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, 2017; pp. 2961–2969. Available online: https://openaccess.thecvf.com/content_iccv_2017/html/He_Mask_R-CNN_ICCV_2017_paper.html (accessed on 8 February 2021).
22. Li, Q.; Mou, L.; Hua, Y.; Sun, Y.; Jin, P.; Shi, Y.; Zhu, X.X. Instance segmentation of buildings using keypoints. arXiv 2020, arXiv:2006.03858.
23. Zhao, K.; Kang, J.; Jung, J.; Sohn, G. Building Extraction from Satellite Images Using Mask R-CNN With Building Boundary Regularization. In CVPR Workshops; IEEE: New York, NY, USA, 2018; pp. 247–251. [CrossRef]
24. Mahmoud, A.; Mohamed, S.; El-Khoribi, R.; Abdelsalam, H. Object Detection Using Adaptive Mask RCNN in Optical Remote Sensing Images. Int. Intell. Eng. Syst. 2020, 13, 65–76.
25. Zhang, W.; Liljedahl, A.K.; Kanevskiy, M.; Epstein, H.E.; Jones, B.M.; Jorgenson, M.T.; Kent, K. Transferability of the deep learning Mask R-CNN model for automated mapping of ice-wedge polygons in high-resolution satellite and UAV images. Remote Sens. 2020, 12, 1085.
26. Bhuiyan, M.A.E.; Witharana, C.; Liljedahl, A.K. Use of Very High Spatial Resolution Commercial Satellite Imagery and Deep Learning to Automatically Map Ice-Wedge Polygons across Tundra Vegetation Types. J. Imaging 2020, 6, 137. [CrossRef]
27. Kaiser, P.; Wegner, D.; Lucchi, A.; Jaggi, M.; Hofmann, T.; Schindler, K. Learning Aerial Image Segmentation from Online Maps. IEEE Trans. Geosci. Remote Sens. 2017, 55, 6054–6068. [CrossRef]

28. Maggiori, E.; Tarabalka, Y.; Charpiat, G.; Alliez, P. Can semantic labeling methods generalize to any city? The Inria aerial image labeling benchmark. In Proceedings of the 2017 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), 2017; pp. 3226–3229. [CrossRef]
29. Ji, S.; Wei, S.; Lu, M. Fully convolutional networks for multisource building extraction from an open aerial and satellite imagery data set. IEEE Trans. Geosci. Remote Sens. 2018, 57, 574–586. [CrossRef]
30. Mnih, V. Machine Learning for Aerial Image Labeling; University of Toronto: Toronto, ON, Canada, 2013.
31. Wang, S.; Hou, X.; Zhao, X. Automatic building extraction from high-resolution aerial imagery via fully convolutional encoder-decoder network with non-local block. IEEE Access 2020, 8, 7313–7322. [CrossRef]
32. Kang, W.; Xiang, Y.; Wang, F.; You, H. EU-net: An efficient fully convolutional network for building extraction from optical remote sensing images. Remote Sens. 2019, 11, 2813. [CrossRef]
33. Chen, M.; Wu, J.; Liu, L.; Zhao, W.; Tian, F.; Shen, Q.; Zhao, B.; Du, R. DR-Net: An Improved Network for Building Extraction from High Resolution Remote Sensing Image. Remote Sens. 2021, 13, 294. [CrossRef]
34. Sekertekin, A. A survey on global thresholding methods for mapping open water body using Sentinel-2 satellite imagery and normalized difference water index. Arch. Comput. Methods Eng. 2020, 1–13. [CrossRef]
35. Li, C.; Duan, P.; Wang, M.; Li, J.; Zhang, B. The Extraction of Built-up Areas in Chinese Mainland Cities Based on the Local Optimal Threshold Method Using NPP-VIIRS Images. J. Indian Soc. Remote Sens. 2020, 1–16. [CrossRef]
36. Srikanth, M.V.; Prasad, V.; Prasad, K.S. An improved firefly algorithm-based 2-D image thresholding for brain image fusion. Int. J. Cogn. Inform. Nat. Intell. (IJCINI) 2020, 14, 60–96. [CrossRef]
37. Qi, Z. Rural revitalization in Xinxing County. China Econ. Wkly. 2018, 78–79. (In Chinese)
38. Dutta, A.; Zisserman, A. The VIA annotation software for images, audio and video. In Proceedings of the 27th ACM International Conference on Multimedia; Association for Computing Machinery: New York, NY, USA, 2019; pp. 2276–2279.
39. Wu, T.; Hu, Y.; Peng, L.; Chen, R. Improved Anchor-Free Instance Segmentation for Building Extraction from High-Resolution Remote Sensing Images. Remote Sens. 2020, 12, 2910. [CrossRef]
40. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Zitnick, C.L.; Dollár, P. Microsoft COCO: Common Objects in Context. In European Conference on Computer Vision; Springer: Cham, Germany, 2014; pp. 740–755.
41. Bottou, L. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT'2010; Physica-Verlag HD: Heidelberg, Germany, 2010; pp. 177–186.
42. Luo, L.; Tang, Y.; Zou, X.; Ye, M.; Feng, W.; Li, G. Vision-based extraction of spatial information in grape clusters for harvesting robots. Biosyst. Eng. 2016, 151, 90–104. [CrossRef]
43. Tang, Y.; Li, L.; Wang, C.; Chen, M.; Feng, W.; Zou, X.; Huang, K. Real-time detection of surface deformation and strain in recycled aggregate concrete-filled steel tubular columns via four-ocular vision. Robot. Comput. Integr. Manuf. 2019, 59, 36–46. [CrossRef]