
Article

Real-Time Human Detection and Gesture Recognition for On-Board UAV Rescue

Chang Liu 1,* and Tamás Szirányi 1,2,*

1 Department of Networked Systems and Services, Budapest University of Technology and Economics, BME Informatika épület, Magyar tudósok körútja 2, 1117 Budapest, Hungary
2 Machine Perception Research Laboratory of Institute for Computer Science and Control (SZTAKI), Kende u. 13-17, 1111 Budapest, Hungary
* Correspondence: [email protected] (C.L.); [email protected] (T.S.); Tel.: +36-70-675-4164 (C.L.); +36-20-999-3254 (T.S.)

Abstract: Unmanned aerial vehicles (UAVs) play an important role in numerous technical and scientific fields, especially in wilderness rescue. This paper carries out work on real-time UAV-based human detection and on the recognition of body and hand rescue gestures. We use body-featuring solutions to establish biometric communication, such as yolo3-tiny for human detection. When the presence of a person is detected, the system enters the gesture recognition phase, where the user and the drone can communicate briefly and effectively, avoiding the drawbacks of speech communication. A dataset of ten body rescue gestures (i.e., Kick, Punch, Squat, Stand, Attention, Cancel, Walk, Sit, Direction, and PhoneCall) has been created with a UAV on-board camera. The two most important gestures are the novel dynamic Attention and Cancel, which represent the set and reset functions, respectively. When the body rescue gesture is recognized as Attention, the drone gradually approaches the user with a larger resolution for hand gesture recognition. The system

achieves 99.80% accuracy on the testing data of the body gesture dataset and 94.71% accuracy on the testing data of the hand gesture dataset using deep learning methods. Experiments conducted on real-time

UAV cameras confirm that our solution can achieve our expected UAV rescue purpose.

Keywords: unmanned aerial vehicles (UAVs); search and rescue (SAR); UAV human communication; body gesture recognition; hand gesture recognition; neural networks; deep learning

1. Introduction

With the development of science and technology, especially computer vision technology, the application of unmanned aerial vehicles (UAVs) in various fields is becoming more and more widespread, for example in photogrammetry [1], agriculture [2], forestry [3], remote sensing [4], monitoring [5], and search and rescue [6,7]. Drones are more mobile and versatile, and therefore more efficient, than surveillance cameras with fixed angles, proportions, and views. With these advantages, combined with state-of-the-art computer vision technology, drones are finding important applications in a wide range of fields. Researchers have produced numerous significant results in these two intersecting areas, for example, vision-based methods for UAV navigation [8], UAV-based computer vision for airboat navigation in paddy fields [9], deep learning techniques for estimating the yield and size of citrus fruits using a UAV [10], drone-based pedestrian detection [11], and hand gesture recognition for UAV control [12]. It is also essential to apply the latest computer vision technology to the field of drone wilderness rescue. The layered search and rescue (LSAR) algorithm was proposed for multi-UAV search and rescue missions [13]. An embedded system was implemented with the capability of detecting open-water swimmers by deep learning techniques [14]. The detection and monitoring of forest fires have been achieved using unmanned aerial vehicles to reduce the number of false alarms of forest fires [15]. The use of a drone with an on-board voice


recognition system to detect victims in earthquakes was realized in [16]. A UAV can overcome the problem of fixed coverage and can also reach areas that are difficult to access. Therefore, it can provide great help to human beings in need of rescue.

Drone rescue generally takes place in a wilderness environment, and rescue work via speech has certain drawbacks: speech recognition [17] depends strongly on the external environment, and we cannot avoid some of the noise [18] generated by it (e.g., rotor noise), which makes it impossible to carry out rescue work effectively. Another disadvantage of speech communication between drones and humans on the ground in noisy environments is that many different languages may be spoken at touristic sites, and even the same language can carry different meanings in some cases [19], making it impossible for drones to understand the questions posed by humans in some situations. Due to these problems, a limited and well-oriented dictionary of gestures can force humans to communicate briefly. Therefore, gesture recognition is a good way to avoid some communication drawbacks, but for our rescue gestures we need to choose the most representative gestures according to different cultural backgrounds [20].

Human gesture recognition technology [21–24] is an emerging topic in drone applications. Compared to wearable sensor-based approaches [25,26], automated methods for video analysis based on computer vision technology are almost non-invasive. The control of drones by gesture recognition has already been implemented [27]. However, most of the datasets available in this field are still limited to indoor scenarios, and therefore it is necessary to develop more outdoor UAV datasets. Many researchers are currently addressing the lack of such outdoor drone datasets, for example, an outdoor drone video dataset for action recognition [28], an outdoor dataset for UAV control and gesture recognition [29], and a dataset for object detection and tracking [30], among others. However, until now there has been no suitable outdoor dataset describing the generic gestures that humans make in a wilderness environment.

In this work, a dataset of ten body rescue gestures (i.e., Kick, Punch, Squat, Stand, Attention, Cancel, Walk, Sit, Direction, and PhoneCall) has been created with a UAV on-board camera. The number ten is an approximate choice based on the cited literature and lies within the range of effective communication. The two most important dynamic gestures are the novel Attention and Cancel, which represent the set and reset functions, respectively. We use this newly created dataset (detailed in Section 2.2) and the hand gesture dataset (detailed in Section 2.3) for human gesture recognition, combining overall body gestures with local hand gestures for better rescue results.

The motivation for this paper is as follows: the first step is to find human bodies, and the second step is body gesture recognition in order to enable human-UAV interaction through the Attention and Cancel gestures. A person coming to the foreground and making the dynamic Attention gesture is selected for further investigation. The last step, further communication by recognizing the hand, only happens when the user shows the Attention body gesture. Communication between the user and the drone is achieved through the user's body gesture recognition.
Short and effective user feedback during this communication process can greatly improve the efficiency of the rescue. Based on the 10 basic body rescue gestures created in this work, we have chosen a pair of dynamic gestures, a two-handed waving motion (Attention) and a one-handed waving motion (Cancel) [31], as the two most basic communication vocabularies, well separated from the static gesture patterns. When the user extends both arms to call the drone, the drone will issue a warning and go into help mode. The system then moves to the next stage, where the drone slowly approaches the user in high resolution for localized hand gesture recognition. When a human extends only one arm, it means that the user wants to cancel communication with the drone; in other words, the user does not need any help and the system is switched off. The dynamic gestures Attention and Cancel thus take on the set and reset functions of the system, respectively. For people who do not want to interact with the drone (e.g., standing people), no alarm will be issued. The cancellation gesture idea comes from a user-adaptive hand gesture recognition system with interactive training [31]. These dynamic gestures

have been introduced in our paper [31] to avoid the problem of needing an impractical pre-trained gesture-pattern database, since they allow modification and restart. In paper [32], a temporal "freezing" was used for the reinforcement/cancellation procedure. Following our earlier solution, in this paper we try to make it easy for untrained users to use our system, based on general gesture languages.

The novelties and main issues of the methodology in this paper are:
• A limited and well-oriented dictionary of gestures can force humans to communicate with the UAV briefly during the rescue, so gesture recognition is a good way to avoid some communication drawbacks for UAV rescue.
• A dataset of ten basic body rescue gestures (i.e., Kick, Punch, Squat, Stand, Attention, Cancel, Walk, Sit, Direction, and PhoneCall) has been created with a UAV camera, which is used to describe some of the body gestures of humans in a wilderness environment.
• The two most important dynamic gestures are the novel Attention and Cancel, which represent the set and reset functions, respectively, well separated from the static gesture patterns.
• The combination of whole-body gesture recognition at a distance and local hand gesture recognition at close range makes drone rescue more comprehensive and effective. At the same time, the creation and application of these datasets provide a basis for future research.

In the subsequent sections, Section 2 presents the technical background and related work, including machine specifications, UAV connectivity, and gesture data collection strategies. In Section 3, the proposed methodology is presented, covering human detection, pose extraction, human tracking and counting, body rescue gesture recognition, and proximity hand gesture recognition, along with a description of the relevant models, training, and system information. Section 4 discusses the training results of the models and the experimental results. Conclusions and future work are drawn in Section 5.

2. Technical Background and Related Work

2.1. Machine Specification and UAV Connection

Figure 1 shows that the practical implementation of this work was done on an on-board UAV with a Jetson Xavier GPU. A stand-alone on-board system is crucial, since in the wilderness there is no network to rely on. From Sabir Hossain's experiments [33] on different GPU systems, it was evident that the Jetson AGX Xavier is powerful enough to work as a replacement for a ground-station GPU system; this is why the Jetson Xavier GPU was chosen. For the experiments in Section 4, we could not go to the field to fly the UAV for external reasons, so to ensure reliable testing conditions we simulated the field environment in the lab and prepared for UAV motion control. The system for the testing part was changed accordingly, as shown in Figure 2. The testing part was done on a 3DR SOLO drone based on the Raspberry Pi system [34], which relies on a desktop ground station with a GTX Titan GPU. The drone communicates with the computer through a local network. The comparison of the two GPUs is given in Table 1 [35]. In Section 4 we also tested the real running time of the program on the proposed architecture.
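As a simple illustration of how the running-time comparison between the two platforms can be obtained, the sketch below times the per-frame processing; process_frame and get_frame are hypothetical placeholders for the full recognition pipeline and the camera capture, not functions from the paper.

```python
import time

# Minimal sketch (not from the paper): measure the average per-frame processing
# rate of the recognition pipeline on a given platform, e.g. the ground-station
# GTX Titan versus the Jetson AGX Xavier.
def measure_fps(process_frame, get_frame, num_frames: int = 100) -> float:
    start = time.perf_counter()
    for _ in range(num_frames):
        process_frame(get_frame())
    elapsed = time.perf_counter() - start
    return num_frames / elapsed   # frames per second over the measured run
```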

Table 1. Comparison of NVIDIA GeForce GTX TITAN and Jetson AGX Xavier.

                              NVIDIA GeForce GTX TITAN    Jetson AGX Xavier
Pipeline                      2688                        512
CUDA cores                    2688                        512
Core clock speed/MHz          830                         854
Boost clock/MHz               876                         1137
Transistor count              7080 million                9000 million
Power consumption (TDP)/Watt  250                         30

Figure 1. Practical application of an on-board unmanned aerial vehicle (UAV) with Jetson Xavier GPU for rescue.

Figure 2. Training and local testing configuration of a Raspberry Pi system UAV with a GPU-based ground station for rescue.

For the testing part in the lab, the ground station computer is equipped with an NVIDIA GeForce GTX Titan GPU and an Intel(R) Core(TM) i7-5930K CPU, which is also used for model training. The UAV is a Raspberry Pi drone, i.e., a single-board computer with a camera module and a 64-bit quad-core ARMv8 CPU. The camera is a 1080P 5MP 160° fish eye surveillance camera module for Raspberry Pi with IR night vision. Table 2 presents the specifications of the 3DR SOLO drone and the camera.

The resolution of the camera is changed according to the different steps of operation of the system. The resolution of the drone camera is set to 640 × 480 for human detection and body gesture recognition and to 1280 × 960 for hand gesture recognition. In the tests, the drone flies at an altitude of about 3 m in the laboratory with the camera resolution set as above. The higher the resolution of the drone's camera, the higher the altitude at which the drone can fly, considering the minimal resolution needed for recognition. Therefore, the system can also work well at heights of more than ten meters with a high-resolution UAV camera sensor.
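As an illustration of the resolution switching described above, the following minimal OpenCV sketch sets the capture resolution for the two operating stages; the helper name, the camera index, and the use of cv2.VideoCapture are assumptions for illustration rather than the paper's actual implementation.

```python
import cv2

# Stage-dependent camera resolutions described in the text.
BODY_STAGE_RES = (640, 480)    # human detection and body gesture recognition
HAND_STAGE_RES = (1280, 960)   # close-range hand gesture recognition

def set_stage_resolution(cap: cv2.VideoCapture, stage: str) -> None:
    # Switch the capture resolution when the system moves between stages.
    width, height = HAND_STAGE_RES if stage == "hand" else BODY_STAGE_RES
    cap.set(cv2.CAP_PROP_FRAME_WIDTH, width)
    cap.set(cv2.CAP_PROP_FRAME_HEIGHT, height)

cap = cv2.VideoCapture(0)           # camera index 0 assumed for the Pi camera
set_stage_resolution(cap, "body")   # wide, low-resolution view at take-off
```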


Table 2. Specification of the UAV and camera for testing.

Fish eye surveillance camera
  Still picture resolution: 2592 × 1944
  Viewing angle: 160 degrees
  Video: supports 1080p@30fps, 720p@60fps
  Size: approx. 23 × 22 mm / 0.90 × 0.86 inch

Raspberry Pi 3 [27]
  CPU: 1.2 GHz 64-bit quad-core ARMv8
  GPU: Broadcom VideoCore 4
  Memory: 1 GB
  Storage: MicroSD support

UAV: 3DR SOLO
  Height: 10 in. (25 cm)
  Motor-to-motor dimension: 18 in. (46 cm)
  Payload: 1.1 lbs (500 g)
  Range: 0.5 miles ** (0.8 km)
  Maximum altitude: 328 ft (100 m)
  Estimated flight time: 25 min *

** Range varies with location, antenna orientation, background noise, and multi-path. * Flight time varies with payload, wind conditions, elevation, temperature, humidity, flying style, and pilot skill. The listed flight time applies to elevations less than 2000 ft above sea level.

2.2. Body Gesture Data-Set Collection

OpenPose [36] is a real-time multi-person framework developed by the Perceptual Computing Lab of Carnegie Mellon University (CMU) to identify human body, hand, facial, and foot key points jointly on single images. Based on the robustness of the OpenPose algorithm and its flexibility in extracting key points, we use it to detect the human skeleton and obtain skeletal data for the different body gestures, thus laying the data foundation for subsequent recognition. The key idea of OpenPose is to employ a convolutional neural network to produce two heat-maps, one for predicting joint positions and the other for associating the joints into human skeletons. In brief, the input to OpenPose is an image and the output is the skeletons of all the people the algorithm detects. Each skeleton has 18 joints, including head, neck, arms, and legs, as shown in Figure 3. Each joint position is represented in image coordinates by its x and y values, so there is a total of 36 values per skeleton. Figure 3 shows the skeleton data and key point information.
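A minimal sketch of how one skeleton returned by OpenPose can be turned into the 36-value feature vector described above; the joint ordering, the handling of missing joints, and the normalization by image size are our assumptions for illustration.

```python
import numpy as np

def skeleton_to_feature(joints, image_w: int, image_h: int) -> np.ndarray:
    """joints: list of 18 (x, y) pixel coordinates; undetected joints given as (0, 0)."""
    feature = np.zeros(36, dtype=np.float32)   # 18 joints * (x, y) = 36 values
    for i, (x, y) in enumerate(joints[:18]):
        feature[2 * i] = x / image_w           # normalized x coordinate
        feature[2 * i + 1] = y / image_h       # normalized y coordinate
    return feature
```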

Figure 3. OpenPose skeleton data and joint information.

There are no publicly available, relevant datasets in the field of wilderness rescue of humans by drones. To deal with this problem, based on our preliminary work [37], we created a new dataset dedicated to describing brief and meaningful body rescue gestures made by humans in different situations. Considering that people in different countries have different cultural backgrounds, some gestures may carry different meanings. Therefore, we have selected and defined ten representative rescue gestures that convey clear and concrete messages without ambiguity in different scenarios. These gestures include Kick, Punch, Squat, Stand, Attention, Cancel, Walk, Sit, Direction, and PhoneCall. This dataset can of course be extended to a larger dataset.

In our dataset, the emphasis is on two dynamic gestures, Attention and Cancel, well separated from the static gesture patterns, as these represent the set and reset functions of the system. The system will only alert when these two gestures are recognized. Attention represents the need for the user to establish communication with the drone, which will then fly toward the user for further hand gesture recognition. Conversely, Cancel signals that the user does not need to establish contact, and the system is automatically switched off. The system will not sound an alarm when other rescue gestures are recognized. Except for Attention and Cancel, the remaining eight body gestures are considered signs of normal human activity and therefore do not trigger further interaction with the drone. However, this is not absolute; for example, we can also set the PhoneCall gesture as an alarming gesture according to actual demand. When the user is recognized to be in the PhoneCall gesture, the drone quickly issues an alarm and later recognizes the phone number given by the user through hand gestures, which can also achieve the rescue purpose. However, this specific function will not be discussed in this paper, because the hand gesture dataset we collected is limited and no recognition of numbers has been added. From the gesture signs of usual human activity, we can build up a limited but effective and clear vocabulary set for communicating simple semantics. This is not the task of the present paper but will be developed in the future; for now, the emphasis is on gesture-based communication.

The datasets were collected using a 1080P 160° fish eye surveillance camera module for Raspberry Pi on the 3DR SOLO UAV system. Six people from our lab participated in the UAV body rescue gesture data collection and real-time prediction: four males and two females, aged between 22 and 32 years. They performed each rescue gesture with all possible variations. The proposed system recognizes the ten common body rescue gestures listed above in real time. Among these ten body gestures, we collected as many samples as possible of Attention and Cancel, to make the system's set and reset functions more robust. It is important to note that this dataset describes the gesture signs of usual human activity that humans would make in a wilderness environment; not all gestures will sound an alarm for rescue. Table 3 describes the details of each UAV body rescue gesture. Table 4 describes the details of the UAV body rescue dataset.

Table 3. UAV body rescue gestures and corresponding key points (gesture illustrations not reproduced).

Number  Name                   Number  Name
1       Kick                   6       Cancel (Dynamic)
2       Punch                  7       Walk
3       Squat                  8       Sit
4       Stand                  9       Direction
5       Attention (Dynamic)    10      PhoneCall

Table 4. UAV body rescue gestures dataset details.

Number  Name       Number of Skeleton Data
1       Kick       784
2       Punch      583
3       Squat      711
4       Stand      907
5       Attention  1623
6       Cancel     1994
7       Walk       722
8       Sit        942
9       Direction  962
10      PhoneCall  641

2.3. Hand Gesture Data-Set Collection

Based on the description in Section 2.2, when the user sends a distress signal to the drone, in other words, when the drone recognizes the human gesture as Attention, the

system enters the final stage for hand gesture recognition, and the drone automatically adjusts the resolution to 1280 × 960 while slowly approaching the user who needs assistance. For hand gesture recognition, there are already many widely used datasets; the dataset for hand gesture recognition in this work is partly derived from GitHub [38] and partly defined by ourselves. We have adapted some of the gesture meanings to our needs. An outstretched palm means Help is needed, an OK gesture is made when the rescue is over, and the gestures Peace and Punch are also invoked. In addition to these four hand gestures, we added the category Nothing for completeness; for it we collected blank images, arm images, partial arm images, head images, and partial head images. Table 5 shows the details of the hand gesture dataset.



Table 5. UAV hand rescue gesture dataset (gesture illustrations not reproduced).

Number  Name                      Number of Images
1       Help                      803
2       Ok                        803
3       Nothing (something else)  803
4       Peace                     803
5       Punch                     803

We use hand gesture recognition to allow the drone to go further in discovering the needs of the user. Whole-body gesture recognition happens at a distance, while hand gesture recognition happens at close range, and the combination of the whole-body gesture and the partial hand gesture makes the rescue work more adequate and meaningful. We chose the limited gesture vocabularies for rescue gesture recognition to force the user to communicate briefly and effectively with the drone under certain conditions. Compared to speech-recognition-based rescue, gesture recognition is a better way to avoid interference from the external environment.

Here we have selected only five hand gestures for recognition, but of course we could also include more gestures, such as numbers. As we discussed in Section 2.2, when the user's body gesture recognition in the previous phase results in a PhoneCall, the user can also give the phone number to the drone by using hand number gesture recognition. We refer here to the result of Chen [39], which will be developed in a future phase. From the gesture signs of usual human activity, we can build up a limited but effective and clear vocabulary set for communicating simple semantics. In the hand gesture recognition phase, we can also capture new hand gestures given by the user and later enter the names of the new gestures. By retraining the network, we obtain a new gesture recognition model that includes the user's new hand gestures. Hand gesture prediction is shown in Section 4 by a real-time prediction bar chart.

3. Methodology

The system framework proposed in this paper is based on rescue gesture recognition for UAV-human communication. In this section, human detection, counting, and tracking are described, and body gesture recognition with the set and reset functions and hand gesture recognition at close range are explained in detail. Figure 4 shows the framework of the whole system. First, the server on the on-board action unit on the drone side is switched on and the initial resolution of the drone camera is set. The input to the system is the live video captured by the drone's camera, and the process is as follows: in the first step, human detection is performed, and when a person is detected by the drone, the system proceeds to the next step of rescue gesture recognition. In the second step, pose estimation is performed by OpenPose and the human is tracked and counted. The third step is the recognition of the body rescue gestures. Feedback from the human is crucial to the UAV rescue; the cancellation gesture idea comes from our user-adaptive hand gesture recognition system with interactive training [31]. When the user's body gesture recognition results in Attention, the system proceeds to the final step of hand gesture recognition.

If the user's body gesture is recognized as a cancellation, the system switches off directly and automatically. The system uses gesture recognition technology to force the user to communicate briefly, quickly, and effectively with the drone in specific environments.
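The following control-loop sketch summarizes the processing order of Figure 4 and the set/reset behaviour of Attention and Cancel; every helper passed in (frame capture, detector, pose estimator, gesture classifiers, flight command) is a placeholder for the components described in Sections 3.1-3.3, not code from the paper.

```python
def rescue_loop(get_frame, detect_person, estimate_poses,
                classify_body_gesture, classify_hand_gesture, approach_user):
    """Sketch of the on-board pipeline; all arguments are placeholder callables."""
    help_mode = False                        # set by Attention, reset by Cancel
    while True:
        frame = get_frame()
        if not detect_person(frame):         # stage 1: yolo3-tiny person detection
            continue                         # stay here until someone appears
        skeletons = estimate_poses(frame)    # stage 2: OpenPose + Deep SORT tracking
        for skeleton in skeletons:
            gesture = classify_body_gesture(skeleton)   # stage 3: DNN body gestures
            if gesture == "Attention":       # set: warn and switch to help mode
                help_mode = True
                approach_user()              # fly closer, raise camera resolution
            elif gesture == "Cancel":        # reset: end the interaction
                return
        if help_mode:
            classify_hand_gesture(frame)     # stage 4: close-range CNN recognition
```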

Figure 4. Framework of the whole system.

3.1. Human Detection

YOLO [40,41] is an open-source, state-of-the-art object detection framework for real-time processing. Compared to earlier region-proposal detection and classification systems, YOLO performs detection and prediction in a completely different way. Region-proposal classification systems perform detection by applying the model to an image with multiple predictions over different image regions and scales, and high-scoring regions are considered detections. YOLO instead uses a one-stage detector methodology, and its design is similar to a fully convolutional neural network. The advantage of YOLO for real-time object detection is the improvement of deep-learning-based localization, and in our system high speed is required. Previous YOLO versions apply a softmax function to convert scores into probabilities that sum to 1.0. Instead, YOLOv3 [42] uses multi-label classification, replacing the softmax function with independent logistic classifiers to calculate the probability of an input belonging to a specific label. Hence, the model makes multiple predictions over different scales, with higher accuracy regardless of the predicted object's size.

Considering the real-time requirements of our proposed system, this paper selects yolo3-tiny [42] for human detection. The dataset used in this method is the widely used COCO dataset [43], which contains a total of 80 object categories. As a reduced variant of YOLO, yolo3-tiny treats detection somewhat differently by predicting boxes at two different scales, while features are extracted from the base network. Its higher performance compared to YOLO was the most important reason for its selection. The model's architecture consists of thirteen convolutional layers with an input size of 416 × 416 images. Although it can detect the 80 object classes provided by the COCO dataset very well, in our system we only need to detect people. When the object category detected by the UAV is a person, the system will issue an alarm and then proceed to the next stage of human gesture recognition. The main aim of the first stage is to find the human; if no human is detected, the system remains in this stage until the drone detects one.



3.2. Body Gesture Recognition

Figure 5 shows the flowchart for human body gesture recognition. The OpenPose algorithm is adopted to detect the human skeleton in the video frames. These skeleton data are used for feature extraction and are then fed into a classifier to obtain the final recognition result. We perform the real-time pose estimation with OpenPose through a pre-trained model as the estimator [44]. OpenPose is followed by a Deep Neural Network (DNN) model to predict the user's rescue gesture. The Deep SORT algorithm [45] is used for human tracking in the multiple-people scenario. The main reason for choosing this recent method is that tracking is based not only on distance and velocity but also on what a person looks like. The main difference from the original SORT algorithm [46] is the integration of appearance information based on a deep appearance descriptor. The Deep SORT algorithm allows us to add this feature by computing deep features for every bounding box and using the similarity between deep features in the tracking logic.

After OpenPose skeleton extraction and Deep SORT human tracking, we can obtain information about the observed human beings. By counting the number of people, we distinguish the following three scenarios: nobody, individual, and multiple people. If the drone does not detect anyone, then communication between the drone and the user will not be established and gesture recognition is fruitless. If the drone detects one or more people, then it enters the gesture recognition phase for those people and shows different recognition results based on the users' body gestures, achieving communication between the user and the drone to assist humans. We are mainly concerned with the two gestures Attention and Cancel, which represent the set and reset functions respectively, so when these two gestures appear, the system will show a warning and turn on help mode or cancel the interaction.
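The three scenarios can be derived directly from the number of confirmed Deep SORT tracks, as in the short sketch below; the track representation is an assumption for illustration.

```python
def classify_scene(active_track_ids) -> str:
    """active_track_ids: ids of currently confirmed Deep SORT tracks."""
    count = len(set(active_track_ids))
    if count == 0:
        return "nobody"       # no communication is established
    if count == 1:
        return "individual"   # a single user enters gesture recognition
    return "multiple"         # each tracked person is classified separately
```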

Figure 5. Workflow of the human body gesture recognition.

Compared to other gesture recognition methods, such as 3D convolutional neural networks [47], we chose the skeleton as the basic feature for human gesture recognition. The reason is that the features of the human skeleton are concise and intuitive and make it easy to distinguish between different human gestures; in contrast, a 3DCNN is both time-consuming and difficult to train as a large neural network. As for the classifiers, we experimented with four different classifiers: kNN [48], SVM [49], a deep neural network [50], and a random forest [51]. The implementation of these classifiers was taken from the Python library "sklearn". After testing the different classifiers, the DNN was finally chosen, as it showed the best results.

The DNN model has been programmed using the Keras Sequential API in Python. There are four dense layers with batch normalization behind each one, with 128, 64, 16, and 10 units in the dense layers sequentially. The last layer of the model has Softmax activation and 10 outputs. The model is applied to the recognition of body rescue gestures. Based on the above DNN model, the next step is training. The model is compiled using Keras with the TensorFlow backend. The categorical cross-entropy loss function is utilized because of its suitability for measuring the performance of the fully connected layer's output with Softmax activation.
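A minimal Keras sketch of the body-gesture DNN described above (dense layers of 128, 64, 16, and 10 units with batch normalization, Softmax output, categorical cross-entropy, and Adam with a 1e-4 learning rate); the hidden-layer activation and the exact position of the last batch-normalization layer are assumptions not stated in the text.

```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import BatchNormalization, Dense
from tensorflow.keras.optimizers import Adam

def build_body_gesture_dnn(input_dim: int = 36, num_classes: int = 10) -> Sequential:
    # Skeleton feature vector in, probabilities over the 10 rescue gestures out.
    model = Sequential([
        Dense(128, activation="relu", input_shape=(input_dim,)),
        BatchNormalization(),
        Dense(64, activation="relu"),
        BatchNormalization(),
        Dense(16, activation="relu"),
        BatchNormalization(),
        Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer=Adam(learning_rate=1e-4),
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```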

The DNN model has been programmed using the Keras Sequential API in Python. There are four dense layers with 128, 64, 16, and 10 units, respectively, each followed by batch normalization. The last layer of the model uses Softmax activation and has 10 outputs. The model is applied to the recognition of body rescue gestures. Based on the above DNN model, the next step is training. The model is compiled using Keras with the TensorFlow backend. The categorical cross-entropy loss function is utilized because of its suitability for measuring the performance of the fully connected layer's output with Softmax activation. The Adam optimizer [52] with an initial learning rate of 0.0001 is utilized to control the learning rate. The model has been trained for 100 epochs on a system with an Intel i7-5930K CPU and an NVIDIA GeForce GTX TITAN X GPU. The total training dataset is split into two sets: 90% for training and 10% for testing. Specific information such as the final body gesture recognition model accuracy and loss is described in Section 4.
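A sketch of this architecture in Keras is given below. The input dimension (36, i.e., 18 skeleton joints × 2 coordinates) is an assumption, and the description above can be read as placing batch normalization after every dense layer or only after the hidden ones; the sketch uses the latter reading.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_body_gesture_dnn(num_features=36, num_classes=10):
    """DNN classifier for the ten body rescue gestures, following the description above."""
    model = models.Sequential([
        layers.Input(shape=(num_features,)),   # num_features is an assumption (18 joints x 2)
        layers.Dense(128, activation="relu"),
        layers.BatchNormalization(),
        layers.Dense(64, activation="relu"),
        layers.BatchNormalization(),
        layers.Dense(16, activation="relu"),
        layers.BatchNormalization(),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
        loss="categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model

# Training roughly as described in the text: 90/10 split, 100 epochs
# (X_train, y_train, X_test, y_test assumed prepared, labels one-hot encoded).
# model = build_body_gesture_dnn()
# model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=100, batch_size=32)
```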
3.3. Hand Gesture Recognition

Further interaction with the drone is established by the user through the Attention body gesture. Whether there is a single person or a group of people, the drone enters help mode whenever a user is recognized in the Attention body gesture. The camera resolution is automatically adjusted to 1280 × 960 as the drone slowly approaches the user. This is the final stage of the system: hand gesture recognition.

Figure 6 shows the flowchart for this stage. Hand gesture recognition is implemented using a convolutional neural network (CNN) [53]. The 12-layer CNN model is compiled using Keras with the TensorFlow backend. The CNN model can recognize five pre-trained gestures: Help, Ok, Nothing (i.e., when none of the other gestures is input), Peace, and Punch. The system predicts the user's gesture based on these pre-trained classes, and a histogram of the real-time predictions can also be drawn. The combination of recognizing the overall body gesture at a distance and the hand gesture at close range makes drone rescue more comprehensive and effective. Although the gestures that can be recognized at this stage are limited, the system can also capture and define new gestures given by the user as needed and obtain a new model by retraining the CNN. As an example, we can add the recognition of numbers from human hand gestures, as described in Section 2.3; when the body gesture recognition in the previous stage results in PhoneCall, the two can be used in combination, and the user can provide the drone with the phone number to be dialed via hand gesture recognition, which also serves the rescue purpose.

The dataset has a total of 4015 gesture images in 5 categories, with 803 image samples in each category. The total dataset is split into two sets: 80% for training and 20% for testing. After training for 20 epochs, the model achieves 99.77% precision on training data and 94.71% accuracy on testing data.
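The text fixes the depth (12 layers) and the five output classes but not the exact layer composition, so the Keras sketch below should be read as one plausible layout under those constraints; the convolution/pooling/dropout arrangement and the 200 × 200 grayscale input are assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_hand_gesture_cnn(input_shape=(200, 200, 1), num_classes=5):
    """CNN classifier for the five hand gestures (Help, Ok, Nothing, Peace, Punch).

    The layer stack below is an illustrative assumption, not the exact 12-layer model.
    """
    model = models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(32, 3, activation="relu"),
        layers.Conv2D(32, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Dropout(0.25),
        layers.Conv2D(64, 3, activation="relu"),
        layers.Conv2D(64, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Dropout(0.25),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Training as described in the text: 80/20 split, 20 epochs.
# model = build_hand_gesture_cnn()
# model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=20)
```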

Figure 6. WorkflowFigure of the 6. humanWorkflow hand of gesture the human recognition. hand gesture recognition.


4. Experiment

In this section, the model and performance of the proposed human detection and rescue gesture recognition system for UAVs are described. Based on the introduction in Section 2, the testing phase of the designed system was done in the laboratory in a simulated field environment, and Table 6 shows the real running time required for each phase of the program on the proposed Jetson AGX Xavier GPU-based UAV. It should be noted that the results below are cropped images; the original images are in a 4:3 ratio. We tried to recreate the field environment, but to exclude clutter such as tables and chairs that we did not want to be included, we cut a fixed area of the output video. Figure 7 shows the results of human detection via yolo3-tiny. It is worth pointing out that although we simulated wild forest scenarios in the lab, the detector can of course detect humans in other scenarios as well. We can see that, based on the COCO dataset, plants as well as squatting and standing persons can be detected. If no person is detected, the system does not display a warning. Immediately after the warning appears, the system goes into the recognition phase of the human rescue body gestures.

Table 6. Real running time of the proposed Jetson AGX Xavier GPU-based UAV platform.

Phase                        Real Running Time
Human Detection              10 ms
Body Gesture Recognition     25 ms
Hand Gesture Recognition     20 ms
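For reference, person detection with yolo3-tiny trained on COCO can be run with OpenCV's DNN module roughly as follows. The file names, the 416 × 416 input size, and the 0.5 confidence threshold are assumptions for illustration and not necessarily the settings of our on-board implementation.

```python
import cv2
import numpy as np

# Standard Darknet configuration and weight files (file names are assumptions).
net = cv2.dnn.readNetFromDarknet("yolov3-tiny.cfg", "yolov3-tiny.weights")
out_names = net.getUnconnectedOutLayersNames()
PERSON_CLASS_ID = 0  # "person" is class 0 in the COCO label list used by YOLO

def detect_people(frame, conf_threshold=0.5):
    """Return bounding boxes (x, y, w, h) of detected persons in a BGR frame."""
    h, w = frame.shape[:2]
    blob = cv2.dnn.blobFromImage(frame, 1 / 255.0, (416, 416), swapRB=True, crop=False)
    net.setInput(blob)
    boxes = []
    for output in net.forward(out_names):
        for det in output:              # det = [cx, cy, bw, bh, objectness, class scores...]
            scores = det[5:]
            class_id = int(np.argmax(scores))
            confidence = float(scores[class_id] * det[4])
            if class_id == PERSON_CLASS_ID and confidence > conf_threshold:
                cx, cy, bw, bh = det[0] * w, det[1] * h, det[2] * w, det[3] * h
                boxes.append((int(cx - bw / 2), int(cy - bh / 2), int(bw), int(bh)))
    return boxes
```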


Figure 7. Human detection standing (a) and squatting (b) by using yolo3-tiny.

Based on the body rescue gesture dataset created in Table 3, we trained the model through a deep neural network to finally obtain the accuracy and loss of the body gesture recognition model. The changes in accuracy and loss over the course of training are shown in Figure 8. At first, the training and testing accuracies increase quickly; growth then slows between 10 and 20 epochs, and the curves merge after 25 epochs. The accuracy and loss approach their asymptotic values after about 40 epochs, with minor noise in between. The weights of the best-fitting model, i.e., the one with the highest test accuracy, are preserved. Both the training and the testing loss diminished consistently and converged, showing a well-fitting model.
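Preserving only the best-fitting weights can be done with a standard Keras callback, as sketched below; the file name and the monitored metric are illustrative assumptions.

```python
from tensorflow.keras.callbacks import ModelCheckpoint

# Keep only the weights of the epoch with the highest validation accuracy,
# mirroring "the weights of the best-fitting model ... are preserved".
checkpoint = ModelCheckpoint(
    "best_body_gesture_model.h5",   # output path is an assumption
    monitor="val_accuracy",
    save_best_only=True,
    mode="max",
    verbose=1,
)

# history = model.fit(X_train, y_train,
#                     validation_data=(X_test, y_test),
#                     epochs=100, callbacks=[checkpoint])
```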



Figure 8. Body gesture recognition model accuracy (a) and loss (b) over the epochs.

After training for 100 epochs, the model achieves 99.79% precision on training data and 99.80% accuracy on testing data. In Figure 9, the diagram on the left presents the confusion matrix, with predicted labels on the X-axis and true labels on the Y-axis, for predictions of our model on the training dataset; the diagram on the right presents the corresponding confusion matrix on the testing dataset. The high density on the diagonal shows that most of the body rescue gestures were predicted correctly, and the performance is close to perfect for most of the gestures. In the confusion matrix, we can see that the amount of data for Attention and Cancel is relatively large. This is because, in the data collection part, we collected the largest amount of data for Attention and Cancel. These two gestures are dynamic body gestures and are well separated from the static gesture patterns; they represent the set and reset functions, respectively. In Figure 10, the diagram on the left presents the standard (row-normalized) matrix, with predicted labels on the X-axis and true labels on the Y-axis, for our model on the training dataset; the diagram on the right presents the standard matrix on the testing dataset. The standard matrix is a scale for correctly identified gestures and mistakes. It shows that all body gestures in the training set reach 1.00; in the test set, all body gestures reach 1.00 except Punch (0.98), Attention (0.99), and Walk (0.99). The sum of each row in a balanced and normalized confusion matrix is 1.00, because each row sum represents 100% of the elements of a particular gesture.
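For completeness, the row-normalized ("standard") matrix of Figure 10 can be computed from the predictions as follows, assuming the true and predicted gesture labels are available as arrays.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def normalized_confusion(y_true, y_pred, labels):
    """Row-normalized confusion matrix ("standard matrix"): each row sums to 1.00,
    i.e., 100% of the samples of one true gesture class."""
    cm = confusion_matrix(y_true, y_pred, labels=labels).astype(np.float64)
    return cm / cm.sum(axis=1, keepdims=True)
```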
In addition to using the confusion matrix as an evaluation metric, we also analyzed the performance of the model with another standard metric, the macro-average, calculated with the equations below. Based on the true positives (TP), false positives (FP), false negatives (FN), and true negatives (TN) of the samples, we calculate the P value (precision) and the R value (recall), respectively, and the resulting macro F1 value is mostly close to 1.00.

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}$$

$$\mathrm{macro}\,P = \frac{1}{n}\sum_{i=1}^{n} P_i, \qquad \mathrm{macro}\,R = \frac{1}{n}\sum_{i=1}^{n} R_i$$

$$\mathrm{macro}\,F1 = \frac{2 \times \mathrm{macro}\,P \times \mathrm{macro}\,R}{\mathrm{macro}\,P + \mathrm{macro}\,R}$$
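The macro-averaged scores defined above can be computed directly from the confusion matrix, as in the sketch below; the label arrays are assumed to come from the test split, and every class is assumed to appear at least once among the predictions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def macro_scores(y_true, y_pred, labels):
    """Macro-averaged precision, recall, and F1, following the equations above."""
    cm = confusion_matrix(y_true, y_pred, labels=labels).astype(np.float64)
    tp = np.diag(cm)
    fp = cm.sum(axis=0) - tp   # predicted as class i but actually another class
    fn = cm.sum(axis=1) - tp   # actually class i but predicted as another class

    precision = tp / (tp + fp)   # per-class P_i
    recall = tp / (tp + fn)      # per-class R_i
    macro_p = precision.mean()
    macro_r = recall.mean()
    macro_f1 = 2 * macro_p * macro_r / (macro_p + macro_r)
    return macro_p, macro_r, macro_f1

# Note: sklearn.metrics.f1_score(y_true, y_pred, average="macro") averages the
# per-class F1 values instead, a closely related macro metric.
```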




Figure 9. Confusion matrix with predicted labels on the X-axis and true labels on the Y-axis tested on the body gesture training set (a) and testing dataset (b).

Figure 10. Standard matrix with predicted labels on the X-axis and true labels on the Y-axis tested on the body gesture training set (a) and testing dataset (b).

As communication between the drone and the GPU-based ground station in the lab depends on the local network, requests sent from the client side and accepted by the server directly reduce the FPS value, causing the system to run very slowly: it only reaches approximately 5 FPS in real-time operation. Running directly on a drone loaded with a Jetson Xavier GPU would solve this problem, i.e., the practical application scenario shown in Figure 1. Such a drone has a Jetson Xavier GPU as powerful as the ground station (GTX Titan GPU) and does not need to communicate over the local network, so it will be fast enough to meet practical needs.
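The reported frame rate can be measured by timing the end-to-end loop; a minimal sketch is given below, where process_frame and the frame source are placeholders for the actual pipeline rather than parts of our code.

```python
import time

def measure_fps(process_frame, frames):
    """Measure the end-to-end frame rate of the recognition pipeline.

    `process_frame` is the per-frame pipeline (detection + skeleton + classification);
    `frames` is any iterable of video frames. Both are placeholders.
    """
    start = time.time()
    count = 0
    for frame in frames:
        process_frame(frame)
        count += 1
    elapsed = time.time() - start
    return count / elapsed if elapsed > 0 else float("inf")
```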



In the laboratory tests, the drone was always flown at an oblique position above the person, approximately 2 to 3 m away from the user in the hand-gesture recognition (close) position. The oblique position ensures that the entire human body can be recognized with a higher probability than when flying directly above the user's head and looking down vertically. Because the work is based on the human skeleton, the flying position of the drone places some limitations on the recognition results.

Figure 11 shows the recognition of the Cancel gesture and the Attention gesture with warning messages in real time. Figure 11 also gives information about the number of people, the time, the frame, and the FPS. Next are the recognition display and detailed description of two basic gestures that we randomly selected from the dataset. In Figure 12, the diagram on the left shows that when a user points in a specific direction, the purpose is to alert the drone to look in the direction the person is pointing to. For example, when someone is lying on the ground in the pointed direction, this gesture solves the problem that the UAV cannot recognize the skeleton information of the lying person well due to the flight position of the drone. The Direction gesture is also helpful for fainted or unconscious people: when there is a group of people, those who can move can use the Direction gesture to give instructions to the drone to save those who cannot. Practically, as the main issue, our proposed system is meant to help people in a bad situation, but we do not want to disturb persons who do not want to, or cannot, interact. The on-board system may send messages to the central station about non-moving people, but we leave them in peace if they are simply having a rest. In Figure 12, the diagram on the right shows the user's gesture for making a phone call, which can be linked to hand gesture number recognition at a later stage. When the user poses to make a call, we can perform hand number recognition in the later stage to get the phone number the user wants to dial.

Figure 11. Attention gesture recognition (a) and Cancel gesture recognition (b).

Figure 12. Direction gesture recognition (a) and PhoneCall gesture recognition (b).

During human body gesture recognition, Attention and Cancel are dynamic gestures that function as set and reset, respectively, and could therefore confuse the on-board recognition during the frame-by-frame check. When either of these two gestures is detected, the system immediately gives an alert. Figure 13 shows that when there are multiple people, one of them sends an Attention gesture to the drone. At this point, the drone sends a warning to inform that someone needs help. We can also see in Figure 13 that other people's gestures are recognized well in addition to that of the person making the Attention gesture. In our recognition system, about 10 people can be recognized at the same time during human body gesture recognition. Figure 13 also shows the basic gesture recognition of multiple people without a warning: we can see some people standing, some walking, and some kicking. The number of people, the time, the frame, and the FPS are also displayed. It should be noted that if a person is not fully visible to the drone camera, he or she will not be recognized. People's movements are generated continuously in real time, and Figure 13 is a photo taken from the video, so there will be some inaccurate skeleton information. Of course, if a person's gesture is not in our dataset, that person's gesture will not be recognized and the recognition result shown above it will be blank.





Figure 13. Multiple people with (a) and without (b) communication with the UAV.

When the result given by the user in the previous stage is the Attention body gesture, the drone adjusts the resolution to 1280 × 960 and slowly approaches the user to perform hand gesture recognition. We selected two representative hand gesture recognition results to show, a Help gesture and an Ok gesture, where the user has established further communication with the drone through the Attention body gesture in the previous stage. In this last, close-range hand gesture recognition stage, the user can inform the drone that he or she needs help through the Help hand gesture, and when the drone has finished helping the user, the user can inform it through the Ok hand gesture. Figure 14 shows the results of the recognition of the Help and Ok gestures. From the displayed results, we can see that the user's hand gestures are well predicted, as shown by the histogram. Of course, we can also capture and define new gestures for the user on a case-by-case basis and add them to the gesture dataset by retraining the network. In Figure 15, the diagram on the left presents the confusion matrix, with predicted labels on the X-axis and true labels on the Y-axis, for predictions of our model on the hand gesture training dataset; the diagram on the right presents the corresponding confusion matrix on the testing dataset. The high density on the diagonal shows that most of the gestures were predicted correctly, and the performance is close to perfect for most of them. In Figure 16, the diagram on the left presents the standard matrix, with predicted labels on the X-axis and true labels on the Y-axis, for our model on the training dataset; the diagram on the right presents the standard matrix on the testing dataset. The standard matrix shows that the corresponding values for the five categories of hand gestures reach 0.99 or 1.0 on the training set and 0.9 or more on the test set.
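The set/reset interaction described in this section (Attention switching the drone into help mode and close-range hand gesture recognition, Cancel or a final Ok returning it to the default state) can be summarized as a small state machine. The sketch below only illustrates that logic; the class, constants, and printed messages are assumptions, not the actual control code of the UAV.

```python
# Resolutions used in the text: 640x480 for body gestures, 1280x960 for hand gestures.
BODY_MODE, HAND_MODE = "body", "hand"

class RescueInteraction:
    """Minimal state machine for the set/reset logic: Attention switches the drone to
    help mode and close-range hand gesture recognition, Cancel resets it."""

    def __init__(self):
        self.mode = BODY_MODE
        self.resolution = (640, 480)

    def on_body_gesture(self, gesture):
        if gesture == "Attention":        # set: warning + help mode
            self.mode = HAND_MODE
            self.resolution = (1280, 960)
            print("WARNING: someone needs help, approaching user")
        elif gesture == "Cancel":         # reset: back to long-range body recognition
            self.mode = BODY_MODE
            self.resolution = (640, 480)
            print("Interaction cancelled")

    def on_hand_gesture(self, gesture):
        if self.mode == HAND_MODE and gesture == "Ok":
            # User confirms the help is done; return to the default state.
            self.on_body_gesture("Cancel")
```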






Figure 14. Help hand gesture recognition (a) and Ok hand gesture recognition (b).

Figure 15. Confusion matrix with predicted labels on the X-axis and true labels on the Y-axis tested on the hand gesture training set (a) and testing dataset (b).

Figure 16. Standard matrix with predicted labels on the X-axis and true labels on the Y-axis tested on the hand gesture training set (a) and testing dataset (b).





5. Conclusions and Future Work

In this paper, we propose a real-time human detection and gesture recognition system for on-board UAV rescue. Practical application and laboratory testing are two different systems. The system not only detects people, tracks them, and counts the number of people, but also recognizes human rescue gestures in a dynamic system. First of all, the drone detects the human at a longer distance with a resolution of 640 × 480, and the system issues an alarm and enters the recognition stage when a person is detected. A dataset of ten basic body rescue gestures (i.e., Kick, Punch, Squat, Stand, Attention, Cancel, Walk, Sit, Direction, and PhoneCall) has been created with a UAV's camera. The two most important dynamic gestures are the novel Attention and Cancel, which represent the set and reset functions, respectively, and through which users can establish communication with the drone. After the Cancel gesture is recognized, the system automatically shuts down, and after the Attention gesture is recognized, the user can establish further communication with the drone. People coming to the foreground and making the "Attention" dynamic gesture is a topic for further investigation. The system then enters the final hand gesture recognition stage to assist the user. At this point, the drone automatically adjusts the resolution to 1280 × 960 and gradually approaches the user for close-range hand gesture recognition. From a drone rescue perspective, we did well in getting feedback from users, and this work lays some groundwork for subsequent user rescue route design. The detection of the human body is achieved through yolo3-tiny. A rescue dataset of 10 gestures was collected by using a fisheye surveillance camera for 6 different individuals in our lab. The OpenPose algorithm is used to capture the user's skeleton and detect their joints. We built a deep neural network (DNN) to train and test the model; after training for 100 epochs, the framework achieves 99.79% precision on training data and 99.80% accuracy on testing data. For the final stage of hand gesture recognition, we use data collected online combined with our own definitions to obtain a relevant dataset, which is trained with a convolutional neural network to obtain a model for hand gesture recognition. Gestures can also be added or removed as required. The drone flies at an altitude of approximately 3 m and is flown diagonally above the user rather than directly overhead. However, there are some difficulties and limitations when the system is applied in the real wilderness.
In practice, the proposed system is subject to extreme weather conditions and resolution issues. Another limitation is the flying position of the UAV: the system proposed in this paper requires the drone to fly above people at an angle in order to detect the human body gestures more accurately, rather than from a position vertically above the user's head. Gathering enough experimental data requires more time, and battery lifetime limits real-life data gathering; for this reason, real-life data are only used for the demonstration in Figure 1, while the exhaustive testing needed a laboratory-based environment.

The main innovations and contributions of this paper are as follows. First, it is worth affirming that gesture recognition for wilderness rescue can avoid the interference of the external environment, which is its biggest advantage compared with voice recognition for rescue. A limited and well-oriented dictionary of gestures can force humans to communicate briefly, so gesture recognition is a good way to avoid some communication drawbacks. Second, a dataset of ten basic body rescue gestures (i.e., Kick, Punch, Squat, Stand, Attention, Cancel, Walk, Sit, Direction, and PhoneCall) has been created with a UAV's camera, which is used to describe some of the body gestures of humans in the wild. For the gesture recognition dataset, not only whole-body gestures but also local hand gestures were combined to make the recognition more comprehensive. Finally, the two most important dynamic gestures are the novel Attention and Cancel, which represent the set and reset functions, respectively. Because they are dynamic, they could confuse the on-board recognition when checking frame by frame, which is why they are coupled with a system warning. The system switches to a warning help mode when the user shows Attention to the UAV, and the user can also cancel the communication with the UAV at any time as needed.

In future work, more generic rescue gestures and larger hand gesture datasets could be included. The framework can be executed in real-time recognition with self-training: the system can automatically retrain the model on new data in a very short time to obtain a new model with new rescue gestures. Last but not least, we also need to conduct outdoor tests on a drone carrying a Jetson Xavier GPU. The interpretation of gesture-based communication without a predetermined vocabulary and with unknown users will be a great challenge for linguistic research; the Attention and Cancel dynamic gestures will have a main role in generating such dynamic linguistic communication.

Author Contributions: C.L. designed and implemented the whole human detection and rescue gesture recognition system, wrote and edited the paper, and created the body rescue gesture dataset with her colleagues. T.S. had the initial idea of this project, worked on the paper, and supervised and guided the whole project. All authors have read and agreed to the published version of the manuscript.

Funding: This research was funded by the Stipendium Hungaricum scholarship and the Scholarship Council. The research was supported by the Hungarian Ministry of Innovation and Technology and the National Research, Development and Innovation Office within the framework of both the National Lab for Autonomous Systems and the National Lab for Artificial Intelligence.

Institutional Review Board Statement: The study was conducted according to the guidelines of the Declaration of Helsinki and approved by the Ethics Committee of the Institute for Computer Science and Control (H-1111 Budapest, Kende utca 13, approved on 4 September 2020).

Informed Consent Statement: Informed consent was obtained from all subjects involved in the study.

Data Availability Statement: The data presented in this study are available on request from the author.

Acknowledgments: The work was carried out at the Institute for Computer Science and Control (SZTAKI), Hungary. The authors would like to thank their colleague László Spórás for the technical support and also the helpful scientific community of the Machine Perception Research Laboratory.

Conflicts of Interest: The authors declare no conflict of interest.

References 1. Gonçalves, J.; Henriques, R. UAV photogrammetry for topographic monitoring of coastal areas. ISPRS J. Photogramm. Remote. Sens. 2015, 104, 101–111. [CrossRef] 2. Rokhmana, C.A. The potential of UAV-based remote sensing for supporting precision agriculture in Indonesia. Procedia Environ. Sci. 2015, 24, 245–253. [CrossRef] 3. Torresan, C.; Berton, A.; Carotenuto, .; Di Gennaro, S.F.; Gioli, B.; Matese, A.; Miglietta, F.; Vagnoli, C.; Zaldei, A.; Wallace, L. Forestry applications of UAVs in Europe: A review. Int. J. Remote. Sens. 2016, 38, 2427–2447. [CrossRef] 4. Min, B.C.; Cho, C.H.; Choi, .M.; Kim, D.H. Development of a micro quad-rotor UAV for monitoring an indoor environment. Adv. Robot. 2009, 6, 262–271. [CrossRef] 5. Samiappan, S.; Turnage, G.; Hathcock, L.; Casagrande, L.; Stinson, P.; Moorhead, R. Using unmanned aerial vehicles for high- resolution remote sensing to map invasive Phragmites australis in coastal wetlands. Int. J. Remote. Sens. 2017, 38, 2199–2217. [CrossRef] 6. Erdelj, M.; Natalizio, E.; Chowdhury, K.R.; Akyildiz, I.F. Help from the sky: Leveraging UAVs for disaster management. IEEE Pervasive Comput. 2016, 16, 24–32. [CrossRef] 7. Al-Kaff, A.; Gómez-Silva, M.J.; Moreno, F.M.; De La Escalera, A.; Armingol, J.M. An appearance-based tracking algorithm for aerial search and rescue purposes. Sensors 2019, 19, 652. [CrossRef] 8. Lu, Y.; Xue, Z.; Xia, G.-S.; Zhang, L. A survey on vision-based UAV navigation. Geospat. Inf. Sci. 2018, 21, 21–32. [CrossRef] 9. Liu, Y.; Noguchi, N.; Liang, L. Development of a positioning system using UAV-based computer vision for an airboat navigation in paddy field. Comput. Electron. Agric. 2019, 162, 126–133. [CrossRef] 10. Apolo-Apolo, .; Martínez-Guanter, J.; Egea, G.; Raja, P.; Pérez-Ruiz, M. Deep learning techniques for estimation of the yield and size of citrus fruits using a UAV. Eur. J. Agron. 2020, 115, 126030. [CrossRef] 11. Aguilar, W.G.; Luna, M.A.; Moya, J.F.; Abad, V.; Parra, H.; Ruiz, H. Pedestrian detection for UAVs using cascade classifiers with meanshift. In Proceedings of the 2017 IEEE 11th International Conference on Semantic Computing (ICSC), San Diego, CA, USA, 30 January–1 February 2017; pp. 509–514. [CrossRef] 12. Hu, B.; Wang, J. Deep learning based hand gesture recognition and UAV flight controls. Int. J. Autom. Comput. 2019, 17, 17–29. [CrossRef] 13. Alotaibi, E.T.; Alqefari, S.S.; Koubaa, A. LSAR: Multi-UAV collaboration for search and rescue missions. IEEE Access 2019, 7, 55817–55832. [CrossRef] 14. Lygouras, E.; Santavas, N.; Taitzoglou, A.; Tarchanidis, K.; Mitropoulos, A.; Gasteratos, A. Unsupervised human detection with an embedded vision system on a fully autonomous UAV for search and rescue operations. Sensors 2019, 19, 3542. [CrossRef] [PubMed] 15. Sudhakar, S.; Vijayakumar, V.; Kumar, C.S.; Priya, V.; Ravi, L.; Subramaniyaswamy, V. Unmanned aerial vehicle (UAV) based forest fire detection and monitoring for reducing false alarms in forest-fires. Comput. Commun. 2020, 149, 1–16. [CrossRef] 16. Yamazaki, Y.; Tamaki, M.; Premachandra, C.; Perera, C.J.; Sumathipala, S.; Sudantha, B.H. Victim detection using UAV with on-board voice recognition system. In Proceedings of the 2019 Third IEEE International Conference on Robotic Computing (IRC), , , 25–27 February 2019; pp. 555–559. [CrossRef]

17. Oneață, D.; Cucu, H. Kite: Automatic speech recognition for unmanned aerial vehicles. arXiv 2019, arXiv:1907.01195. 18. Jokisch, O.; Fischer, D. Drone Sounds and Environmental Signals—A First Review. 2019. Available online: www.essv.de/paper.php?id=84 (accessed on 9 February 2021). 19. Zhang, Y.; Ding, H. The effect of ambiguity awareness on second language learners’ prosodic disambiguation. Front. Psychol. 2020, 11, 3520. [CrossRef] 20. Navarro, J. The Dictionary of : A Field Guide to Human Behavior; William Morrow: New York, NY, USA, 2018. 21. Liu, H.; Wang, L. Gesture recognition for human-robot collaboration: A review. Int. J. Ind. Ergon. 2018, 68, 355–367. [CrossRef] 22. Nalepa, J.; Grzejszczak, T.; Kawulok, M. Wrist localization in color images for hand gesture recognition. Adv. Hum. Factors Bus. Manag. Train. Educ. 2014, 3, 79–86. [CrossRef] 23. Sharma, A.; Mittal, A.; Singh, S.; Awatramani, V. Hand gesture recognition using image processing and feature extraction techniques. Procedia Comput. Sci. 2020, 173, 181–190. [CrossRef] 24. Xu, P. A real-time hand gesture recognition and human-computer interaction system. arXiv 2017, arXiv:1704.07296. 25. Asokan, A.; Pothen, A.J.; Vijayaraj, R.K. ARMatron—A wearable gesture recognition glove: For control of robotic devices in disaster management and human rehabilitation. In Proceedings of the 2016 International Conference on Robotics and Automation for Humanitarian Applications (RAHA), Kerala, India, 18–20 December 2016; pp. 1–5. [CrossRef] 26. Kim, M.; Cho, J.; Lee, S.; Jung, Y. IMU sensor-based hand gesture recognition for human-machine interfaces. Sensors 2019, 19, 3827. [CrossRef] 27. Niu, R.G.C. UAV Gesture Control—Python. 2020. Available online: github.com/RobertGCNiu/UAV-Gesture-Control_Python (accessed on 9 February 2021). 28. Perera, A.G.; Law, Y.W.; Chahl, J. Drone-action: An outdoor recorded drone video dataset for action recognition. Drones 2019, 3, 82. [CrossRef] 29. Perera, A.G.; Wei Law, Y.; Chahl, J. UAV Gesture: A Dataset for UAV Control and Gesture Recognition. Available online: https://openaccess.thecvf.com/content_eccv_2018_workshops/w7/html/ (accessed on 9 February 2021).

30. Du, D.; Qi, Y.; Yu, H.; Yang, Y.; Duan, K.; Li, G.; Zhang, W.; Huang, Q.; Tian, Q. The unmanned aerial vehicle benchmark: Object detection and tracking. Lect. Notes Comput. Sci. 2018, 30, 375–391. [CrossRef] 31. Licsár, A.; Szirányi, T. User-adaptive hand gesture recognition system with interactive training. Image Vis. Comput. 2005, 23, 1102–1114. [CrossRef] 32. Licsar, A.; Sziranyi, T.; Kovacs, L.; Pataki, B. A folk song retrieval system with a gesture-based interface. IEEE Multimed. 2009, 16, 48–59. [CrossRef] 33. Hossain, S.; Lee, D.-J. Deep learning-based real-time multiple-object detection and tracking from aerial imagery via a flying robot with GPU-based embedded devices. Sensors 2019, 19, 3371. [CrossRef][PubMed] 34. 1 Raspberry Pi 3 Model B. Available online: www.raspberrypi.org/products/raspberry-pi-3-model-b/ (accessed on 9 February 2021). 35. GeForce GTX TITAN vs Jetson AGX Xavier. Available online: https://technical.city/en/video/GeForce-GTX-TITAN-vs-Jetson- AGX-Xavier (accessed on 9 February 2021). 36. Cao, Z.; Hidalgo, G.; Simon, T.; Wei, S.-E.; Sheikh, Y.A. OpenPose: Realtime multi-person 2D pose estimation using part affinity fields. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 43, 172–186. [CrossRef] 37. Chang, L.; Szirányi, T. Gesture Recognition for UAV-based rescue operation based on Deep Learning. Improve 2021, accepted. 38. Asingh33 CNN Gesture Recognizer. Available online: github.com/asingh33/CNNGestureRecognizer/tree/master/imgfolder_b (accessed on 9 February 2021). 39. Chen, Z.-H.; Kim, J.-T.; Liang, J.; Zhang, J.; Yuan, Y.-B. Real-time hand gesture recognition using finger segmentation. Sci. World J. 2014, 2014, 1–9. [CrossRef] 40. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. arXiv 2016, arXiv:1506.02640, 779–788. 41. Redmon, J.; Farhadi, A. YOLO9000: Better, faster, stronger. arXiv 2016, arXiv:1612.08242. 42. Redmon, J.; Farhadi, A. YOLO v3: An incremental improvement. arXiv 2018, arXiv:1804.02767. 43. Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common objects in context. computer vision. ECCV 2014, 6, 740–755. [CrossRef] 44. Kim, I. Ildoonet—Tf Pose Estimation. 2021. Available online: github.com/ildoonet/tf-pose-estimation (accessed on 25 January 2021). 45. Wojke, N.; Bewley, A.; Paulus, D. Simple online and realtime tracking with a deep association metric. In Proceedings of the 2017 IEEE International Conference on Image Processing (ICIP), Beijing, China, 17–20 September 2017; pp. 3645–3649. [CrossRef] 46. Bewley, A.; Ge, Z.; Ott, L.; Ramos, F.; Upcroft, B. Simple online and realtime tracking. In Proceedings of the 2016 IEEE International Conference on Image Processing (ICIP), Phoenix, AZ, USA, 25–28 September 2016; pp. 3464–3468. [CrossRef] 47. Carreira, J.; Zisserman, A. Quo vadis, action recognition? A new model and the kinetics dataset. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 22–25 July 2017; pp. 4724–4733. 48. Guo, G.; Wang, H.; Bell, D.; Bi, Y.; Greer, K. KNN model-based approach in classification. In OTM Confederated International Conferences “On the Move to Meaningful Internet Systems”; Springer: Berlin/Heidelberg, Germany, 2003. 49. Mavroforakis, M.E.; Theodoridis, S. A geometric approach to support vector machine (SVM) classification. IEEE Trans. Neural Netw. 2006, 17, 671–682. [CrossRef][PubMed] 50. 
Liu, W.; Wang, Z.; Liu, X.; Zeng, N.; Liu, Y.; Alsaadi, F.E. A survey of deep neural network architectures and their applications. Neurocomputing 2017, 234, 11–26. [CrossRef] 51. Pal, M. Random forest classifier for remote sensing classification. Int. J. Remote. Sens. 2005, 26, 217–222. [CrossRef] 52. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. 53. Kalchbrenner, N.; Grefenstette, E.; Blunsom, P. A convolutional neural network for modelling sentences. arXiv 2014, arXiv:1404.2188.