
Article

RGDiNet: Efficient Onboard Object Detection with Faster R-CNN for Air-to-Ground Surveillance

Jongwon Kim and Jeongho Cho *

Department of Electrical Engineering, Soonchunhyang University, Asan 31538, Korea; [email protected] * Correspondence: [email protected]; Tel.: +82-41-530-4960

Abstract: An essential component for the autonomous flight or air-to-ground surveillance of a UAV is an object detection device. It must possess a high detection accuracy and requires real-time data processing to be employed for various tasks such as search and rescue, object tracking and disaster analysis. With the recent advancements in multimodal data-based object detection architectures, autonomous driving technology has significantly improved, and the latest algorithm has achieved an average precision of up to 96%. However, these remarkable advances may be unsuitable for the image processing of UAV aerial data directly onboard for object detection because of the following major problems: (1) Objects in aerial views generally have a smaller size than in an image and they are unevenly and sparsely distributed throughout an image; (2) Objects are exposed to various environmental changes, such as occlusion and background interference; and (3) The payload weight of a UAV is limited. Thus, we propose employing a new real-time onboard object detection architecture, an RGB aerial image and point cloud data (PCD) depth map image network (RGDiNet). A faster region-based convolutional neural network was used as the baseline detection network, and an RGD, an integration of the RGB aerial image and the depth map reconstructed from the light detection and ranging PCD, was utilized as the input for computational efficiency. Performance tests and evaluation of the proposed RGDiNet were conducted under various operating conditions using hand-labeled aerial datasets. Consequently, it was shown that the proposed method has a superior performance for the detection of vehicles and pedestrians compared with conventional vision-based methods.

Keywords: onboard detection; UAV; faster R-CNN; air-to-ground surveillance

1. Introduction

Unmanned aerial vehicles (UAVs), also known as drones, have been developed primarily for military purposes and are used in universities and research institutions for research in computer science and robotics. Studies focus mainly on autonomous flight-based search and rescue [1], shipping [2], agriculture [3] and infrastructure inspection [4]. One of the most important components for the autonomous flight or air-to-ground surveillance of UAVs is object detection, which requires real-time data processing with high accuracy. It is directly utilized in diverse tasks such as situation analysis, traffic monitoring, disaster analysis and object tracking. Particularly, when flying in cities or residential areas, it is extremely important to detect and avoid objects because of safety issues arising from the terrain, obstacles, collisions with other objects, and falls [5].

Object detection is a fundamental topic in computer vision and an important component of sensor-based situational awareness systems for autonomous driving. With the emergence of convolutional neural networks (CNNs), object detection schemes have evolved significantly in the past few years. The development of various detection frameworks, such as faster region-based CNN (R-CNN) [6], RetinaNet [7], and single-shot multi-box detector (SSD) [8], has led to considerable advancement in the newly invented state-of-the-art technologies for object detection and identification. Two-phase detection frameworks including preprocessing steps to generate region proposals, such as faster


R-CNN and mask R-CNN [9,10], have been proposed to increase the detection accuracy. Moreover, single-phase detection frameworks, such as “you only look once” (YOLO) [11], SSD, and CornerNet [12], have been developed to maximize the computational efficiency and enable real-time detection. Despite the clear differences between these two pipeline structures, the ultimate objectives of the modern detectors currently being proposed are to accurately classify objects and estimate their locations in images. Although numerous different object detection frameworks have been concurrently proposed [13–15], designing a robust and accurate detection system is still a challenging task because of the various environmental problems encountered in real scenarios.

Monocular RGB cameras are being utilized to develop numerous object detection techniques, and they are the most common sensors and primary options for detecting various objects, including traffic signs, license plates, pedestrians and cars. However, passive vision systems, such as cameras, cannot be used for night vision, cannot acquire information about depth, and are limited in use in real driving scenarios because of their sensitivity to changes in brightness [16,17]. To resolve the drawbacks of such cameras, employing object detection techniques using multimodal data, instead of using cameras only, has recently been proposed. Several methods have proposed utilizing the depth information collected from stereo vision or infrared (IR) sensors and integrating it with RGB images [18,19]. They have achieved significant performance improvements compared to monocular camera-only-based methods. Furthermore, regarding autonomous driving, three-dimensional (3D) sensors, such as light detection and ranging (LiDAR) sensors, are more common than 2D vision sensors because safety requires higher accuracy [20]. Most conventional techniques first utilize LiDAR to convert point cloud data (PCD) into images by projection [21,22] or into a volumetric grid via quantization, and subsequently apply a convolutional network [23,24]. Recently, it has been proposed to directly deal with point clouds without converting them into other formats [25]. Consequently, the recent advances in multimodal data-based object detection have made significant progress in autonomous driving technology, and the KITTI benchmark [26] site reports that state-of-the-art algorithms can achieve an average precision (AP) of up to 96%.

Although such remarkable advances have already been achieved in object detection, they may be unsuitable for processing the sequences or images captured by UAVs. This may be attributed to the following major problems: (1) objects generally appear at a smaller scale relative to the image; (2) objects are typically nonuniformly and sparsely distributed throughout an image; (3) objects experience various environmental changes, for example, occlusion and background interference; (4) the payload weight of a UAV is limited. The features of object detection in air-to-ground views and on-road scenarios are compared in Table 1. In aerial images, the visual distance from a UAV to an object can be long and constantly changing, so that the object of interest appears small and contains less information about its features. Simultaneously, the field of view of the target scene is wide and the object to be detected experiences various environmental changes, such as background interference, shadows of other objects and occlusions.
Moreover, owing to the payload weight limitations, it is difficult to mount high-performance processors simultaneously with multiple sensors. However, even if they could be mounted, there are few available techniques for locating objects by multimodal data integration, and it is also difficult to ensure robustness against noise. For example, in [27], a simple CNN-based approach for automatic object detection in aerial images was presented, and [28–30] directly extended a faster R-CNN to object detection in aerial images. Similarly, the detection algorithms proposed in [31–33] were based on a single-stage framework, such as SSD and YOLO. Recently, to resolve the ambiguities of viewpoints and visual distances, [34] proposed a scale adaptive proposal network and [35] suggested utilizing reinforcement learning to sequentially select the areas to be detected on a high-resolution scale. These methods aim to efficiently detect vehicles or pedestrians on roads and accurately count the number of objects, and can be employed for the detection of objects in aerial images.


Because object detection based on monocular RGB cameras still has significant limitations, as addressed earlier, some authors have proposed new techniques using thermal imaging sensors [36,37] and depth maps of infrared (IR) sensors [38,39]. However, they also have a limitation in that the distance to an object is unknown; thus, they can only locate an object indoors or within 10 m. Consequently, it is difficult to obtain satisfactory detection results in aerial views using the existing technologies, and the detection schemes, if any, proposed in the literature focus only on vision sensor-based object detection in aerial images. The abovementioned issues motivated us to devise a framework for efficiently detecting objects in air-to-ground views while measuring the distance from a UAV to the target.

Table 1. Comparison of features considered when detecting objects in an air-to-ground scenario.

Features                 On-road      Aviation
Visual distance          Constant     Varies
Field of view            Small        Large
Background complexity    Medium       High
Payload weight           Unlimited    Very limited

Exemplary images

In this paper, we propose a new real-time object detection system based on RGB aerial images and a point cloud data (PCD) depth map image network (RGDiNet), which can provide improved object detection performance as well as the distance to objects in aviation. Moreover, we verify its performance using a UAV embedded with an Nvidia AGX Xavier prototype. The results show that the proposed system can perform powerful real-time onboard detection in aviation for scenes with varying image conditions and various environments. Figure 1 displays the proposed detection system, which comprises a preprocessing filter (PF) for generating a newly integrated color image, a faster R-CNN as the baseline detection network, and a histogram network (HN) for predicting the distance from the UAV to objects. Based on the aerial observations (i.e., RGB images and PCD) captured by the UAV, the PF produces a three-channel fused image consisting of a portion of the RGB channels and a depth map created by the point clouds. The reconstructed image is applied to the faster R-CNN to predict the bounding box, class and confidence score of each object. Subsequently, the distance is calculated using the HN with the predicted bounding box and the PCD.

Compared with the conventional detection methods based on multimodal data, the proposed RGDiNet has several advantages: (1) Utilizing the PF, the learning framework does not require a connection layer, 1 × 1 convolutional layer, or sub-network to integrate the two data sources. Therefore, our approach can significantly reduce the computational costs and improve the detection efficiency; (2) Using the HN, the actual distance between a predicted object and a UAV can be provided; and (3) The 3D PCD can be used with simultaneous localization and mapping (SLAM) [40] for object positioning along with the surrounding environment, if necessary, and it also has robustness against changes in the imaging conditions and various environments.


Detection frameworks combining two different inputs have been proposed [41]; however, contrastingly, we propose a method to substitute one color channel in an RGB image with another signal. This approach is assumed to be efficient in the onboard systems of UAVs owing to the simplification of the network structure. After analyzing the object detection impact diagram for the R, G, and B channels, the less affected channel is substituted by a depth map created by the PCD to reconstruct the input for the faster R-CNN.

The proposed object detection algorithm for autonomous flights or ground surveillance was tested in various operating environments. Hand-labeled aerial datasets containing two categories of objects (pedestrians and vehicles) were used for the evaluation of its performance. The experimental results showed that the proposed system improved the object detection performance by increasing the AP by almost 5% even in a noisy environment or poor conditions in which the brightness varied. Such improved performance is expected to be practical by expanding the proposed object detection scheme to various fields, such as autonomous flying drones and ground surveillance from an air-to-ground view.

Figure 1. Architecture of proposed multimodal aerial data-based object detection system.

2. Methodology

2.1. System Overview

The proposed multimodal data-based detection architecture, RGDiNet, can detect objects in aerial images in which the shapes and dynamic environment of the objects are constantly changing. A LiDAR and a camera are fixed on a UAV, and it is assumed that both the sensors consider only objects present within 40 m. Moreover, the sensors are synchronized. The pipeline of RGDiNet consists of three tasks: the generation of PCD depth maps and RGB images by the PF, object detection and bounding box prediction by the faster R-CNN, and calculation of the distance between the UAV and an object by the HN. The detection is based on the faster R-CNN utilizing a region proposal network (RPN), and the newly integrated image reconstructed by the PF is input into the faster R-CNN to predict the bounding box containing the entity of interest and its confidence score. Subsequently, the coordinates of the bounding box are projected onto the PCD to calculate the actual position of the detected object. Finally, the PCD within the bounding box are quantified to compute the average distance to this object.


2.2. Integration of RGB Image and PCD Depth Map

To use the data obtained from sensors with mismatching dimensions, such as a camera and a LiDAR, preprocessing should be conducted. An RGB image can be used directly from a camera without any processing. However, a LiDAR returns readings in spherical coordinates; therefore, the position of a point in 3D space needs to be converted into a position on the surface of the image in pixel coordinates. To build a two-dimensional (2D) depth map, first, the 3D PCD are transformed into a front view using the rotation transformation expressed in Equation (1) to match the 2D RGB image.

\begin{bmatrix} x' \\ y' \\ z' \end{bmatrix} =
\begin{bmatrix} \cos(\theta_R) & 0 & -\sin(\theta_R) \\ 0 & 1 & 0 \\ \sin(\theta_R) & 0 & \cos(\theta_R) \end{bmatrix}
\begin{bmatrix} x \\ y \\ z \end{bmatrix}.    (1)

Here, [x, y, z]^T is the 3D vector of the PCD and, using Equation (1), the vector is rotated by an angle θ_R about the y-axis. Considering a drone flying over the surface of an image plane, as shown in Figure 2, the PCD are created based on the absolute coordinates, F_PCD. To project them on the image in pixel coordinates, the absolute coordinates, F_PCD, are rotated by the angle, θ_R, to convert them into F_RGB. In this study, the PCD were collected in the direction normal to the ground at a fixed angle of 60° about the y-axis.

Figure 2. Projection of point clouds in 3D space on 2D surface.
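To make Equation (1) concrete, the following NumPy sketch rotates raw LiDAR points by θ_R about the y-axis and then projects them onto the image plane. The pinhole-style projection and the intrinsic parameters (fx, fy, cx, cy) are illustrative assumptions, not the calibration used by the authors; only the rotation itself follows Equation (1).

```python
import numpy as np

def rotate_about_y(points: np.ndarray, theta_r: float) -> np.ndarray:
    """Apply Equation (1): rotate Nx3 LiDAR points by theta_r (radians) about the y-axis."""
    c, s = np.cos(theta_r), np.sin(theta_r)
    rot = np.array([[c, 0.0, -s],
                    [0.0, 1.0, 0.0],
                    [s, 0.0, c]])
    return points @ rot.T

def project_to_pixels(points: np.ndarray, fx: float, fy: float, cx: float, cy: float) -> np.ndarray:
    """Pinhole-style projection of rotated points onto the image plane (assumed intrinsics)."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    keep = z > 0.0                                 # discard points behind the image plane
    u = fx * x[keep] / z[keep] + cx
    v = fy * y[keep] / z[keep] + cy
    return np.stack([u, v, z[keep]], axis=1)       # pixel coordinates plus range per point

# Example with the fixed 60-degree mounting angle mentioned in the text
cloud = np.random.rand(1000, 3) * 40.0             # synthetic points within the 40 m sensing range
front_view = rotate_about_y(cloud, np.deg2rad(60.0))
pixels = project_to_pixels(front_view, fx=960.0, fy=960.0, cx=850.0, cy=350.0)
```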

Assuming that the LiDAR is calibrated by the image plane of the camera, the projection of the points of the LiDAR appears much sparser than that in its associated RGB image. Because the proposed object detection system employs a 16-channel LiDAR, instead of the 64-channel one commonly used on the ground, the scarcity of the data distribution is inevitably worse. Because such limited spatial resolution obtained from a LiDAR sensor makes object detection very challenging, we adopted the bidirectional filter-based strategy proposed in [42] to reconstruct the PCD into a high-resolution depth map. This filtering technique has the merits of high computational efficiency and easy parameter adjustment.

Once the depth map is built, it is concatenated with the RGB image to create a reconstructed input space. RGB, which represents red, green, and blue, is the primary color space of a computer for capturing or displaying images. The human eye is sensitive to these three primary colors [43]. Specifically, we can easily recognize and detect objects of more sensitive wavelengths, such as ambulances and warning signs. Based on this background, we assumed that certain color channels have little influence on object detection. The effect of the selection of the color space, such as RGB, CMY, and YCrCb, on the detectability and


classification of objects was studied in [44]; however, no studies have been conducted on the influence of specific channels on object detection performance. Therefore, we trained the object detection algorithm by removing channels from the RGB image sequentially, and analyzed the effect of each color channel on the object detection. After this analysis, the channel that was least affected was replaced by a depth map, and the reconstructed image was adopted for the object detection. A simple experiment confirmed that removing the blue channel lowered the object detection performance by only up to 3%, in line with human visual sensitivity; therefore, the blue channel was replaced by a depth map. Thus, the proposed object detection network in the UAV, using a depth map instead of the blue channel of RGB, was named RGDiNet. It is the most direct extension of a faster R-CNN through multiple modalities because only a part of the input image is replaced by a depth map. Once the faster R-CNN is trained and the reconstructed input image is prepared by the PF from the extracted depth map, all the object detection preparations are completed.
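As a minimal sketch of how the RGD input can be assembled, the snippet below replaces the blue channel of an RGB frame with a normalized depth map. In the paper the dense depth map comes from the bidirectional-filter reconstruction of [42]; here a simple clip-and-scale of an already dense depth array stands in for that step, and the 40 m normalization range is an assumption taken from the sensing range stated in Section 2.1.

```python
import numpy as np

def build_rgd_image(rgb: np.ndarray, depth: np.ndarray, max_range: float = 40.0) -> np.ndarray:
    """Replace the blue channel of an RGB image with a normalized depth map (RGD input).

    rgb   : HxWx3 uint8 aerial image
    depth : HxW float32 dense depth map in meters (assumed already reconstructed
            from the sparse LiDAR projection; here simply clipped and scaled)
    """
    depth_8bit = np.clip(depth / max_range, 0.0, 1.0) * 255.0
    rgd = rgb.copy()
    rgd[..., 2] = depth_8bit.astype(np.uint8)   # channel order R, G, B: blue replaced by depth
    return rgd

# Toy usage with synthetic data at the dataset resolution of 1700 x 700
rgb = np.random.randint(0, 256, (700, 1700, 3), dtype=np.uint8)
depth = np.random.rand(700, 1700).astype(np.float32) * 40.0
rgd = build_rgd_image(rgb, depth)
```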

2.3. Object Detection in Aerial Image Using Faster R-CNN

As described earlier, the proposed RGDiNet utilizes a faster R-CNN to detect objects based on a reconstructed RGD image. Figure 3 shows the object detection pipeline using the faster R-CNN. As a backbone architecture for learning image features, a network is built based on VGG-16 [45] and is initialized with pretrained weights using the ImageNet dataset [46]. The core concept of the faster R-CNN is the RPN. While inheriting the existing fast R-CNN structure, the selective search is removed and the region of interest (ROI) is calculated by the RPN. Overall, a feature map is first extracted by the CNN and subsequently sent to the RPN to calculate the ROIs. The RPN is a fully convolutional network, which effectively generates a wide range of region proposals that are delivered to the detection network. Region proposals are rectangular in shape and may or may not include candidate entities. A detection network is connected after the RPN to refine the proposals, and several ROIs are passed as inputs. A fixed-length feature vector is extracted for each ROI by an ROI pooling layer. The network creates region proposals by sliding a small window on the shared convolutional feature map. In the last convolutional layer, 3 × 3 windows are selected to reduce the dimensionality of the representation, and the features are provided to two 1 × 1 convolutional layers. One of them is for localization and the other is to categorize the box as the background or the foreground [47]. Herein, the faster R-CNN provides the probabilities for the two classes of vehicles and pedestrians viewed from the UAV as well as the coordinates of their bounding boxes. The RPN generates a candidate list of bounding boxes from the RGD aerial image, each associated with a confidence score that measures the memberships of selected parts of an image against a set of object classes versus the background.
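The detection stage can be sketched with an off-the-shelf faster R-CNN as follows. Note that the paper builds its backbone on VGG-16 with ImageNet pretraining, whereas this sketch uses torchvision's ResNet-50-FPN variant simply because it is readily available; the three-class head (background, vehicle, pedestrian) and the 0.5 score threshold are illustrative choices, not the authors' settings.

```python
import torch
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

def build_detector(num_classes: int = 3):
    """Generic faster R-CNN with a head for background + vehicle + pedestrian."""
    model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
    return model

@torch.no_grad()
def detect(model, rgd_image, score_threshold: float = 0.5):
    """Run the detector on an HxWx3 uint8 RGD array and keep confident detections."""
    model.eval()
    tensor = torch.from_numpy(rgd_image).permute(2, 0, 1).float() / 255.0  # CHW in [0, 1]
    output = model([tensor])[0]
    keep = output["scores"] >= score_threshold
    return output["boxes"][keep], output["labels"][keep], output["scores"][keep]
```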

Figure 3. Faster region-based convolutional neural network (R-CNN) pipeline for object detection by integration of a camera and light detection and ranging (LiDAR).

2.4. Estimation of Visual Distance Using Histogram Network

The 3D PCD from the LiDAR are projected onto the 2D image plane to obtain the visual distance to the target. However, it is impossible to project directly onto the image plane because the PCD are essentially created in the spatial coordinate system of 3D point clouds. Thus, the coordinate points of the bounding box predicted based on the image are projected onto the PCD plane using data voxelization [48–50], a classical method commonly used for quantifying point clouds.

In the proposed RGDiNet, the PCD use a two-point-five-dimensional self-centered occupancy grid map, in which each voxel covers a small ground patch of 0.1 m × 0.1 m. Each cell stores a single value indicating the occupation of the voxel by the object or background. Figure 4 shows a schematic of the estimation of the visual distance to the target. After voxelization, the voxelized PCD are normalized to match the projected points of the bounding box, {x_b, y_b, x_t, y_t}, and the image coordinate system to the PCD plane. {x_b, y_b} and {x_t, y_t} are the coordinates of the bottom left and top right sides of the bounding box, respectively. Subsequently, the average of the occupation values of the voxels in the projected bounding box is considered as the distance to the target.
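A minimal sketch of the distance estimate described above is given below, assuming the point cloud has already been rotated and projected to pixel coordinates as in Section 2.2. Points falling inside the predicted box are quantized into 0.1 m cells and their mean range is returned; this simplified binning is an illustrative reading of the 2.5D occupancy grid, not the authors' exact implementation.

```python
import numpy as np

def estimate_distance(projected_points: np.ndarray, box, cell_size: float = 0.1) -> float:
    """Average quantized range of LiDAR returns inside a predicted bounding box.

    projected_points : Nx3 array of (u, v, range_m), pixel coordinates plus range per point
    box              : (x_b, y_b, x_t, y_t) bottom-left and top-right corners in pixels
    """
    x_b, y_b, x_t, y_t = box
    u_lo, u_hi = min(x_b, x_t), max(x_b, x_t)
    v_lo, v_hi = min(y_b, y_t), max(y_b, y_t)
    u, v, r = projected_points[:, 0], projected_points[:, 1], projected_points[:, 2]
    inside = (u >= u_lo) & (u <= u_hi) & (v >= v_lo) & (v <= v_hi)
    if not np.any(inside):
        return float("nan")                       # no LiDAR support for this detection
    # Quantize ranges into 0.1 m cells (a simple stand-in for the 2.5D occupancy grid)
    cells = np.round(r[inside] / cell_size) * cell_size
    return float(np.mean(cells))
```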


Figure 4. Schematic of estimation of visual distance to target.

3. Experimental Results

3.1. Assessment Preliminaries

The major benefit of the proposed RGDiNet is that it can detect on-road objects and measure distances from an air-to-ground view by integrating the images from heterogeneous sensors. This enables the accurate detection of objects in various aviation environments, ultimately facilitating ground surveillance and safe autonomous flights. The proposed algorithm was comprehensively evaluated using a hand-labeled dataset, called the SCHF dataset, to verify its use by mounting on an actual autonomous flight system. The SCHF dataset consists of 2428 densely labeled (i.e., pedestrian and vehicle) pairs of aligned RGB and RGD multimodal images. Once the faster R-CNN model was trained on an offline workstation, the optimized model was installed on an aerial platform to evaluate the real-time detection performance. Assuming a virtual flight, the test dataset was used to validate the performance of the proposed system. The airborne platform used in this study was a DJI Matrice 600 Pro quadcopter integrated with a stabilized gimbal.
Note that to stabilize the training process of the faster R-CNN, we adopted the standard loss function with the same hyperparameters as in [48]. RGB images were acquired by a GoPro Hero 2019 edition camera mounted on the UAV, having a video resolution of 1920 × 1080 and a frame rate of 30 frames per second (FPS). In addition, the PCD were collected by a VLP-16 LiDAR mounted on the UAV; the LiDAR had 16 channels and a frame rate of 10 FPS. A high-performance embedded system module, NVIDIA JETSON Xavier, was installed onboard the UAV for flight condition monitoring as well as signal processing of the PF, detectors, and HN comprising the proposed RGDiNet. The prototype UAV bus and payload based on NVIDIA JETSON Xavier is shown in Figure 5.

Figure 5. Unmanned Aerial Vehicle (UAV) designed for real-time data collection, processing, and test evaluation using RGDiNet.


The SCHF dataset was prepared to train the proposed RGDiNet and conduct its performance evaluation. The following three main aspects were ensured when collecting the data: (1) a fixed sensor angle of 60°; (2) collection of images of objects of various scales and aspect ratios from 15 to 30 m away; and (3) a fixed resolution of 1700 × 700 by image calibration. The video shot on the UAV covered ten different parking lots and two-lane roads. For each location, a 10-min-long video was recorded and “PNG” image files were generated at 1-s intervals. The SCHF dataset contains 3428 frames with 5020 object labels for two categories: pedestrians (1638) and vehicles (3382). In addition, two types of images, RGB and multimodal (RGD), were created for comparison with the results of an object detection system utilizing only RGB images. The dataset was split into two sets: 75% for training and 25% for testing. The details of the dataset are summarized in Table 2, and example images are shown in Figure 6.

Table 2. Details of SCHF dataset.

Sensor                     Image Frames    Resolution    # of Vehicles    # of Pedestrians
RGB camera + 16-CH LiDAR   3428            1700 × 700    3382             1638


Figure 6. Example images from SCHF dataset: (a) RGB image; (b) RGD image.

3.2. Performance Evaluation

Generally, object detection performance is evaluated to verify the presence of an object in an image as well as the ability to determine the accurate location of the object in the image [51,52]. Three common indicators were used to assess the detection performance of the proposed RGDiNet: intersection over union (IoU), precision-recall curve (PRC), and AP. For discrete scores, each detection is considered a true positive if its IoU with a one-to-one matched ground-truth annotation is greater than 0.5 or 0.7. By changing the detection score threshold, a set of true positives, true negatives, false positives, and false negatives is created, based on which the PRC and the AP are calculated. Figure 7 depicts the definitions of IoU, precision, and recall.
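The IoU matching criterion and the precision/recall definitions illustrated in Figure 7 can be computed as follows; the sketch follows the standard definitions rather than the authors' evaluation code.

```python
def iou(box_a, box_b) -> float:
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def precision_recall(tp: int, fp: int, fn: int):
    """Precision = TP / (TP + FP); Recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# A detection counts as a true positive when its IoU with the matched ground truth
# exceeds the chosen threshold (0.5 or 0.7 in this evaluation).
print(iou((0, 0, 10, 10), (5, 5, 15, 15)))   # prints approximately 0.143
```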


Figure 7. Illustrations of intersection over union (IoU), precision and recall.

Two types of comparative experiments were conducted to evaluate the performance of the proposed system. Specifically, RGDiNet was compared to a general faster R-CNN model that only uses RGB images, under normal conditions and in a noisy environment. The evaluation results are summarized in Table 3. The IoU was fixed at 0.5 and the APs of the two models were compared for two objects (vehicle and pedestrian). RGDiNet showed an AP at least 1.5% points higher than the vision-based faster R-CNN, confirming its superiority. By including the PF and the HN, the processing time of RGDiNet was measured to be 0.1 s slower than that of the vision-based faster R-CNN. However, this delay is not long enough to be critical for data processing, and thus, real-time operation is possible.

Table 3. AP comparison of vision-based faster R-CNN and RGDiNet when IoU = 0.5 for various objects.

Method                      Processing Time (s)    Vehicle AP (%)    Pedestrian AP (%)
Vision-based faster R-CNN   0.8                    83.22             84.80
This study (RGDiNet)        0.9                    84.78             88.33

In addition, more detailed comparisons were made using the PRC. Figure 8 compares the PRCs of the conventional vision-based faster R-CNN and RGDiNet with varying IoUs. In the figure, the following are observed: (1) RGDiNet surpasses the existing vision-based faster R-CNN model for all the IoU thresholds. Specifically, at a high IoU, such as 0.7, RGDiNet can still locate objects very accurately without using other networks; (2) The precisions of RGDiNet and the existing model are similar at very high recall rates; and (3) Because RGDiNet can measure the visual distance using the PCD, it has a unique advantage of providing the information of the visual distance to real objects.

Figure 8. AP comparison of vision-based faster R-CNN and RGDiNet with varying IoUs: (a) IoU = 0.3; (b) IoU = 0.5; (c) IoU = 0.7.


LiDAR has a high level of robustness against environmental changes. To confirm its intrinsic robustness even in the proposed simple input convergence network structure, additional experiments were conducted in noise. Gaussian and Poisson noises, with zero mean and unit variance scaled by 0.1, which may occur during video and electronic signal processing, were considered and their effects on the performance were examined. We also tested the robustness of the proposed system against light noise by adding randomly generated values for all the pixels in the original image, considering the changes in the ambient brightness due to weather conditions. The change in the brightness was classified into two categories: change from −75 to 75 (weak) and change from −125 to 125 (strong). The results are summarized in Table 4. The image distortion caused by the changes in the ambient brightness degraded the overall performance of all the detectors compared to that under normal conditions. However, the proposed RGDiNet demonstrated a more robust noise tolerance than the vision-based faster R-CNN. A higher robustness was found particularly when the Gaussian noise was delivered. Figures 9 and 10 show the detection results for each class in a noisy environment. It is confirmed, as before, that the existing vision-based detection method is sensitive to the environmental changes of increased darkness or brightness, and thus, there were numerous cases of failure in the detection of vehicles or pedestrians.

Table 4. Performance comparison of object detectors in noisy environment (AP, IoU = 0.5, %).

Method                      Class        N_gauss (Weak)   N_poiss (Weak)   N_gauss (Strong)   N_poiss (Strong)
Vision-based faster R-CNN   Vehicle      57.56            88.43            54.08              60.54
                            Pedestrian   54.69            71.26            35.40              57.18
This study (RGDiNet)        Vehicle      62.71            89.06            58.27              60.94
                            Pedestrian   59.62            73.22            36.58              57.91

Summarizing, the proposed RGDiNet demonstrated its superiority by showing 1–2% higher AP at all the thresholds of the IoU than the conventional vision-based object detection method, based on the performance test results using hand-labeled aerial data. The processing speed per image was less than 1 s on average, ensuring the possibility of real-time application. In addition, the proposed RGDiNet maintained an AP 5% higher than the existing vision-based model and confirmed its superiority in the performance evaluation performed with noise addition to verify the robustness against environmental changes.
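The noise conditions used in this robustness test can be reproduced approximately with the sketch below. The Poisson perturbation is centered to zero mean and unit variance before scaling, and the brightness change is applied as random per-pixel offsets within the stated limits; both details are assumptions based on the description in the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def add_gaussian_noise(image: np.ndarray, scale: float = 0.1) -> np.ndarray:
    """Zero-mean, unit-variance Gaussian noise scaled by 0.1 (in [0, 1] intensity units)."""
    noisy = image.astype(np.float32) / 255.0 + scale * rng.standard_normal(image.shape)
    return (np.clip(noisy, 0.0, 1.0) * 255.0).astype(np.uint8)

def add_poisson_noise(image: np.ndarray, scale: float = 0.1) -> np.ndarray:
    """Poisson(1) perturbation shifted to zero mean and unit variance, then scaled by 0.1."""
    noise = rng.poisson(1.0, size=image.shape).astype(np.float32) - 1.0
    noisy = image.astype(np.float32) / 255.0 + scale * noise
    return (np.clip(noisy, 0.0, 1.0) * 255.0).astype(np.uint8)

def shift_brightness(image: np.ndarray, limit: int = 75) -> np.ndarray:
    """Add random values in [-limit, limit] to the pixels (limit 75 = weak, 125 = strong)."""
    offsets = rng.integers(-limit, limit + 1, size=image.shape, dtype=np.int16)
    return np.clip(image.astype(np.int16) + offsets, 0, 255).astype(np.uint8)
```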




Figure 9. Comparison of vehicle detection results by vision-based faster R-CNN (a) and RGDiNet (b).





Figure 10. Comparison of pedestrian detection results by vision-based faster R-CNN (a) and RGDiNet (b).

4. Conclusions

Recent advances in multimodal data-based object detection frameworks have led to significant improvements in autonomous driving technologies, and the KITTI benchmark site reports that the latest algorithm can achieve an AP of up to 96%. However, these remarkable advances may be unsuitable for image processing for direct object detection in aerial views. It is difficult to obtain satisfactory detection results from an air-to-ground view using the existing technologies, and detecting objects in aerial images based only on vision sensors, the focus of most studies, has the drawback of being very sensitive to external environmental changes.

In this paper, we proposed RGDiNet, a new multimodal data-based real-time object detection framework, which can be mounted on a UAV and is insensitive to changes in the brightness of the surrounding environment. In addition, the proposed scheme uses PCD to provide the visual distance to a target, unlike other existing aerial detection systems. The object detection algorithm embedded in the signal processor of the UAV is designed based on an RGD, a combination of the RGB aerial images and the depth maps reconstructed from the LiDAR PCD, for computational efficiency. The developed algorithm for autonomous flight or ground surveillance is installed on a prototype based on Nvidia AGX Xavier to process the sensor data. The evaluation of RGDiNet on hand-labeled aerial datasets under various operating conditions confirms that its performance for the detection of vehicles and pedestrians is superior to that of conventional vision-based methods, while processing the data at a real-time level of 1–2 FPS. Therefore, practical applications of autonomous flying drones can be expected using the proposed detection technology. In the near future, we plan to improve the integration process of the sensor data more efficiently and improve the performance by network restructuring. These will be utilized and evaluated in applications, such as delivery and search-and-rescue, which are expected to be used in practice.
In the tions of autonomous flying drones can be expected using the proposed detection technol- near future, we plan to more efficiently improve the integration process of the sensor data ogy. In the near future, we plan to more efficiently improve the integration process of the and improve the performance by network restructuring. These will be actually utilized sensor data and improve the performance by network restructuring. These will be actually and evaluated in applications, such as delivery and search-and-rescue, which are expected utilized and evaluated in applications, such as delivery and search-and-rescue, which are to be used in practice. expected to be used in practice. Author Contributions: J.K. and J.C. took part in the discussion of the work described in this paper. Author Contributions: J.K. and J.C. took part in the discussion of the work described in this paper. All authors have read and agreed to the published version of the manuscript. All authors have read and agreed to the published version of the manuscript. Funding: This research was funded by a National Research Foundation of Korea (NRF) grant funded Funding: This research was funded by a National Research Foundation of Korea (NRF) grant by the Korean government (MOE) (No.2018R1D1A3B07041729) and the Soonchunhyang University funded by the Korean government (MOE) (No.2018R1D1A3B07041729) and the Soonchunhyang Research Fund. University Research Fund. Institutional Review Board Statement: Not applicable. Institutional Review Board Statement: Not applicable. Informed Consent Statement: Not applicable. Informed Consent Statement: Not applicable. Data Availability Statement: The data presented in this study are available on request from the Data Availability Statement: The data presented in this study are available on request from the corresponding author. corresponding author Acknowledgments: The authors thank the editor and anonymous reviewers for their helpful com- Acknowledgments: The authors thank the editor and anonymous reviewers for their helpful com- ments and valuable suggestions. ments and valuable suggestions.


Conflicts of Interest: The authors declare that they have no competing interests.
