Conservation AI: Live Stream Analysis for the Detection of Endangered Species Using Convolutional Neural Networks and Drone Technology

C. Chalmers, P. Fergus, Serge Wich and Aday Curbelo Montanez

Abstract— Many different species are adversely affected by poaching. In response to this escalating crisis, efforts to stop poaching using hidden cameras, drones and DNA tracking have been implemented with varying degrees of success. Limited resources, costs and logistical limitations are often the cause of most unsuccessful poaching interventions. The study presented in this paper outlines a flexible and interoperable framework for the automatic detection of animals and poaching activity to facilitate early intervention practices. Using a robust pipeline, a convolutional neural network is trained and implemented to detect rhinos and cars (considered an important tool in poaching for fast access and artefact transportation in natural habitats) within live video streamed from drones. Transfer learning with the Faster-RCNN Resnet 101 is performed to train a custom model with 350 images of rhinos and 350 images of cars. Inference is performed using a frame sampling technique to address the required trade-off between precision and processing speed and to maintain synchronisation with the live feed. Inference models are hosted on a web platform using Flask web serving, OpenCV and TensorFlow 1.13. Video streams are transmitted from a DJI Mavic Pro 2 drone using the Real-Time Messaging Protocol (RTMP). The best trained Faster-RCNN model achieved a mAP of 0.83 @IOU 0.50 and 0.69 @IOU 0.75 respectively. In comparison, an SSD-mobilenet model trained under the same experimental conditions achieved a mAP of 0.55 @IOU 0.50 and 0.27 @IOU 0.75. The results demonstrate that using a Faster-RCNN and off-the-shelf drones is a promising and scalable option for a range of conservation projects.

Index Terms— Conservation, Deep Learning, Convolutional Neural Networks, Inferencing, Drone Technology

I. INTRODUCTION

Poaching is a global issue that impacts many animal species [1]. For instance, none more so than the Rhinocerotidae (rhinoceros or rhino for short), where approximately 6000 rhinoceroses have been illegally hunted in South Africa since 2008¹. The current level of poaching and the subsequent loss of animals as a result of poaching activity has left rhinoceros species at real risk of becoming extinct in the near future [2]. The driving factor behind rhino poaching is the illegal trading of rhino horns, which has increased exponentially from $4700 per kilogram in 1993 to $65,000 per kilogram in 2012 [3]. Demand for rhino horn is largely driven by the belief that it can be used in medicinal treatments and by its role as a status symbol within many cultures. The market is mainly focused in Asia, particularly China and Vietnam, which has a growing consumer base. Existing solutions for preventing rhino poaching have unfortunately shown marginal effect and so rhino poaching continues. For example, GPS trackers have been attached to animals to track and locate them in natural habitats – when a loss of GPS movement is detected, a suspected occurrence of poaching is inferred and later confirmed when the injured or dead animal is found [4]. This is a reactive approach that does not prevent the poaching event [5]. Preventive efforts are in place to detect poachers using drones [6], [7], including AirShepherd². However, they either require in-field human analysis or off-line processing of video footage. Therefore, the goal is to use machine learning to automate the detection process and identify objects of interest (Lambda et al. 2019).

1 https://www.environment.gov.za/mediarelease/molewa_highlightsprogress_againstrhinopoaching
2 https://airshepherd.org/

The use of computer vision has increased dramatically, largely due to advancements in deep learning (DL), high performance hardware and data availability. Examples of computer vision applications include those in medicine [8], manufacturing [9], and engineering [10]. DL in particular has facilitated developments in image processing, none more so than through the availability of frameworks such as TensorFlow [11] and the Cognitive Tool Kit (CNTK) [12], among others [13]. Convolutional Neural Networks (CNNs) have become the gold standard for image processing tasks, in particular object detection and segmentation, which are now considered important techniques for analysing animals in video footage.

The computational requirements to train and host CNNs, such as the Faster Region-based Convolutional Neural Network (Faster-RCNN), are a significant challenge, particularly when real-time video processing is required, due to their complex network architectures. Graphical processing units (GPUs) help; however, enterprise-level cards are expensive and trade-offs are required depending on the application and services required. Nonetheless, object detection has been incorporated into numerous conservation projects [14]. Solutions are typically focused on low-resolution images and lightweight models such as You Only Look Once (YOLO) [15] and the Single Shot Multibox Detector (SSD)-mobilenet [16].

While they have a useful place in conservation, these lightweight models have limited utility when detailed image analysis is required. RCNNs incorporate region-based proposals which significantly improve the accuracy of object detections overall, particularly in noisy images affected by distance, terrain and occlusion [17]. This is often compounded by the fact that analysis is conducted offline by domain experts.

In this study we propose an alternative approach that combines off-the-shelf drones with near real-time detection of rhinos and cars for different conservation tasks.

The remainder of the paper is structured as follows. A background discussion on current anti-poaching solutions and their associated limitations is introduced in Section 2. Section 3 details the proposed methodology before the results are presented in Section 4. Section 5 discusses the results before the paper is concluded and future work is presented in Section 6.

II. RELATED WORK

Wildlife management systems have become increasingly popular in poaching prevention strategies. Sophisticated, efficient and cost-effective solutions exist, yet there is still little evidence to show that poaching is being effectively controlled [18], [19]. The remainder of this section provides a discussion of some of the current systems in operation today and highlights the key limitations in each approach that this paper aims to address.

A. Current Solutions

Camera systems are used extensively in conservation to observe wildlife and detect poaching activity [20]. The identification of objects, such as animals, in video footage has typically been undertaken by humans. However, there is now significant interest in automating this process using computer vision technology. While significant progress has been made in model accuracy, inferencing speed has become the focal point within the conservation domain. For example, in [6] a Faster-RCNN was implemented to identify poachers and animals using videos obtained from drones. The results for video processing with CPU, GPU and cloud processing are presented to benchmark inferencing time. The key findings show that a Faster-RCNN model inferences at approximately 5 frames per second (fps) using a single K40 GPU. A live video stream, on the other hand, runs at 25 fps. This creates a bottleneck in the detection pipeline and suffers from issues surrounding video synchronisation. Similar results were reported in [21] using the ImageNet dataset. However, the low-resolution images used are unsuitable for natural environments and conservation tasks due to the inability to capture the necessary features for both training and inferencing.

Faster-RCNNs are powerful models in object detection, but meeting the computational requirements needed for training and near real-time inference is costly. This is why many conservation strategies utilise lightweight models, such as YOLO and SSD-mobilenet, as they are much easier to train and inference on. For example, [22] use a YOLO model installed on an Nvidia Jetson TX1 attached to a drone, trained to detect pedestrians. The results show it was possible to inference at 2 fps with an image input size of 256 x 256 pixels. In practice such systems would provide limited utility, as higher altitudes would be required in conservation for safety and conservational needs. In a similar approach, [23] implemented an optimal dense YOLO method to detect vehicles at greater altitudes. The drone was flown between 60-70 metres off the ground and, using YOLOv3, the model could reach 86.7% mAP with a frame rate of 17 fps, compared with YOLOv2, which achieved a 49.3% mAP and a frame rate of 52 fps. The paper does not include the IOU metrics, making it difficult to evaluate how well the model performed.

An ensemble model is presented in [24] comprising a lightweight image region detector to identify objects of interest. Each identified frame is processed in a cloud infrastructure to classify objects. The local node decides if the current frame contains an object of interest using the Binarized Normed Gradients (BING) algorithm to measure the objectness of every input frame. An image resolution of 320 x 240 pixels at 60 fps was used for the initial detection. The paper reports a reasonable inference time of 1.29 seconds using a cloud-hosted Faster-RCNN. However, the results were only possible when the drone was flown a few metres above the object of interest. Again, for safety and conservational needs this approach would find limited utility in complex conservation projects.

B. Limitations

The studies previously discussed combine drones, computer vision and machine learning to provide new and interesting tools for use in conservation studies. While they all provide some interesting results, several challenges still remain. In particular, studies that utilise lightweight models show that while inference time is fast, the overall accuracy of the objects detected is poor. This is a particular challenge in conservation, where flying at higher altitudes is necessary to ensure that animals are not disturbed and poachers are not alerted. It has been widely reported that YOLO suffers from spatial constraints and this consequently limits the number of nearby objects that can be detected [25]. In other words, YOLO-based models perform poorly when detecting small objects or objects that appear in groups or which are occluded. YOLO down-samples images to 448 x 448 pixels and lower. So, while it is possible to inference in excess of 45 fps using a Titan X GPU [25] and support real-time inferencing, down-sampling results in a significant loss of attainable features, which in turn decreases the overall accuracy of the model.

Networks such as Faster-RCNN are now being used to address many of these issues through region-based proposals that act like sliding windows during the feature extraction process. This helps to significantly improve overall detection accuracy, but it dramatically increases inferencing time due to the network and underlying pipeline being significantly more complex than YOLO-based solutions [26]. Advancements have been made since the original R-CNN was proposed in terms of speed and efficiency; however, region-based proposal networks are still problematic for real-time inferencing. Integrating region-based models into the detection pipeline has led to a greater reliance on GPU compute. This allows for much faster inference but brings with it additional challenges. It is difficult to implement and use GPU-based solutions in the field due to their unique power consumption needs; thus, wide area communications are needed to support remote inference, which is not always possible. Furthermore, real-time inference is costly and often requires a trade-off between application-specific requirements and the amount of GPU compute needed. In this paper we investigate this issue further and propose a viable solution for live object detection in natural environments to support complex conservation field trials and application-specific requirements.

III. METHODOLOGY

This section describes the data collection and training process for the object detection model. Training parameters and associated techniques are discussed along with the proposed frame sampling technique required for inferencing and to support different application requirements. This is followed by a discussion on the drone and communication protocol used for live video streaming and object detection on local and remote servers. Finally, the evaluation metrics used to assess both the trained model's performance and the frame sampling technique are formalised and discussed.

A. Data capture

The dataset comprises RGB and thermal (colour and grey scale palettes) images. The data contains two classes: rhinos and cars. Each class has 350 images with resolutions between 300 x 147 and 3840 x 2160 pixels. To maintain sufficient variance, aerial footage is combined with close exposure and thermal images as shown in Figure 1. The aerial RGB images are captured using a drone, while both the thermal and grey scale images were captured using a ground-based camera. To supplement our own field trial acquired images, RGB data is batch downloaded from Google Images. Thermal and grey scale images are captured using a FLIR One smartphone thermal camera. The resolution for both the thermal and grey scale images is set to 640 x 512 pixels. These images were collected by Liverpool John Moores University at Knowsley Safari.

Tagging of the data is undertaken using the Visual Object Tagging Tool (VoTT) version 1.7.0³. Bounding boxes are used to identify regions of interest. All of the tagged regions in each image are exported as Extensible Mark-up Language (XML) in TensorFlow Pascal VOC format. The generated XML is converted into Comma Separated Values (CSV) using the Python libraries Pandas and XML to parse the XML data. Using TensorFlow 1.13.1 and Pillow, both the CSV and image data are converted into the TFRecord format ready for training.

3 https://github.com/microsoft/VoTT
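The conversion script itself is not included in the paper; the following is a minimal sketch of the XML-to-CSV step described above, assuming VoTT's Pascal VOC export layout and hypothetical file paths.

```python
# Minimal sketch (not the authors' script) of the Pascal VOC XML to CSV step.
# Assumes one exported XML file per image; directory and output paths are hypothetical.
import glob
import xml.etree.ElementTree as ET
import pandas as pd

def voc_xml_to_csv(xml_dir, csv_path):
    rows = []
    for xml_file in glob.glob(f"{xml_dir}/*.xml"):
        root = ET.parse(xml_file).getroot()
        filename = root.findtext("filename")
        width = int(root.findtext("size/width"))
        height = int(root.findtext("size/height"))
        # One row per tagged region (rhino or car) in the image.
        for obj in root.findall("object"):
            box = obj.find("bndbox")
            rows.append({
                "filename": filename,
                "width": width,
                "height": height,
                "class": obj.findtext("name"),
                "xmin": int(float(box.findtext("xmin"))),
                "ymin": int(float(box.findtext("ymin"))),
                "xmax": int(float(box.findtext("xmax"))),
                "ymax": int(float(box.findtext("ymax"))),
            })
    pd.DataFrame(rows).to_csv(csv_path, index=False)

voc_xml_to_csv("annotations/", "train_labels.csv")
```

The resulting CSV would then feed the TFRecord generation step noted above (TensorFlow 1.13.1 and Pillow).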

Fig. 1: Example training data with variance

B. Model selection

Faster-RCNN

The Faster-RCNN network architecture is implemented to perform object detection in two distinct stages [27]. The first stage uses Region Proposal Networks (RPNs) to identify and extract features from the selected layers. This allows the model to estimate bounding box locations. The second stage adjusts the localisation of the bounding box by minimising the selected loss function. Both the region proposal and object detection tasks are undertaken by the same CNN. This method offers improvements in terms of speed and accuracy over earlier R-CNN networks, where region proposals were input at the pixel level as opposed to the feature map level. The Faster-RCNN gains further speed improvements by replacing selective search with an RPN. Figure 2 shows the basic architecture of a Faster-RCNN.

Fig. 2: Faster-RCNN Architecture

The restricted nature in which the model must operate (e.g. small or partially occluded objects) means that other non-region-based proposal networks would find it difficult to attain a high degree of accuracy. Additionally, the down-sampling undertaken in such models reduces the features available in images.

C. Transfer learning

Transfer learning allows us to adopt a pre-trained model (trained on millions of images) and fine-tune the learned parameters during the training process using our rhino and car images [28]. This is an important technique, as training CNNs on small datasets leads to extreme overfitting due to low variance. The base model adopted in this study for the transfer learning task is the Faster-RCNN Resnet 101 model, which is pre-trained using the COCO dataset. COCO is a large object detection dataset containing 330 thousand images and 1.5 million object instances.

D. Model training

Model training is conducted on an HP ProLiant ML 350 Gen 9 server. The server has two Xeon E5-2640 v4 series processors and 768GB of RAM. An additional GPU stack comprising four NVidia Quadro M4000 graphics cards with a combined total of 32GB of DDR5 RAM is installed. TensorFlow 1.13.1, CUDA 10.0 and CuDNN version 7.5 form the software aspects of the training pipeline. In the pipeline.config file used by TensorFlow, the following training parameters are set:

• The aspect ratio resizer minimum and maximum coefficients are set to 1500 x 1500 pixels respectively. This minimises the scaling effect on the acquired data. Increasing the resolution further would facilitate greater accuracy but would hit the computational limitations of the training platform.

• The default for the feature extractor coefficient is retained to provide a standard 16-pixel stride length and maintain a high-resolution aspect ratio.

• The batch size coefficient is set to one to stay within GPU memory limits.

The Resnet 101 model implements the Adam optimiser to minimise the loss function [27]. Unlike other optimisers such as Stochastic Gradient Descent (SGD), which maintains a single learning rate (alpha) throughout the entire training session, Adam calculates moving averages of the gradient m_t and squared gradient v_t and uses the parameters beta1/beta2 to dynamically adjust the learning rate. Adam as described in [28] is defined as:

m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t, \quad v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2   (1)

where m_t and v_t are the estimates of the first and second moments of the gradients. Both m_t and v_t are initialised with 0's. Biases are corrected by computing the first and second moment estimates [28]:

\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}   (2)

Parameters are updated using the Adam update rule:

\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon}\,\hat{m}_t   (3)

The ReLU activation function provides improvements during training over other functions such as the sigmoid or hyperbolic tangent (tanh) activations, which saturate away from the mid-point of their input. This reduces the amount of available tuning, and their use in deep multi-layered networks results in ineffective training caused by a vanishing gradient [29]. ReLU as defined in [30] is:

g(x) = \max(0, x)   (4)
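To make the update rule concrete, the following is a small illustrative sketch of equations (1)-(4) in NumPy (our addition, not code from the paper); the learning rate and beta values shown are hypothetical, and in practice these steps are handled internally by TensorFlow's Adam implementation.

```python
# Illustrative sketch of the Adam update in equations (1)-(4); not the authors' code.
import numpy as np

def relu(x):
    # Equation (4): g(x) = max(0, x)
    return np.maximum(0.0, x)

def adam_step(theta, grad, m, v, t, lr=1e-4, beta1=0.9, beta2=0.999, eps=1e-8):
    # Equation (1): moving averages of the gradient and squared gradient.
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad ** 2
    # Equation (2): bias-corrected first and second moment estimates.
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    # Equation (3): parameter update.
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Single toy step on a one-parameter "model".
theta, m, v = np.array([0.5]), np.zeros(1), np.zeros(1)
theta, m, v = adam_step(theta, grad=np.array([0.2]), m=m, v=v, t=1)
print(theta)
```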

For baseline comparison we also train an alternative SSD-mobilenet v2 model. The model is evaluated under the same experimental conditions. The pipeline.config file is unaltered, therefore the default training parameters are used:

• The aspect ratio resizer (image_resizer) minimum and maximum coefficients are set to 300 x 300 pixels respectively.

The model is trained over the same number of epochs to ensure consistency in the reported results.

E. Inferencing pipeline

The proposed object detection system interfaces with a variety of camera systems using the Real-Time Messaging Protocol (RTMP). The Mavic Pro 2 drone system is used in this study, which is capable of transmitting 4K video at 30fps over a distance in excess of 7 kilometres (km). The drone is connected to a linked controller using the OcuSync 2.0 protocol. Video streams are re-directed from the controller using a local Wi-Fi connection to a field laptop, or to a remote server using 4G. Object detection on video frames is then performed on the laptop or remote server. Figure 3 illustrates the end-to-end inferencing pipeline.

Fig. 3: Object Detection Pipeline

The video stream from the controller is transmitted to an RTMP server hosted by NGINX. Each frame is processed in Python using OpenCV and serialised with a Flask web server. Each video frame is inferenced with a TensorFlow backend using a loaded saved model. The bounding box detections are returned to the Flask object detection site where the operator can monitor the results. Where local processing is used, the complete framework is run locally on a field laptop. In this study local inferencing is performed using a Dell XPS 15 laptop with an Intel i9 CPU and Nvidia 1050Ti GPU. In the case where 4G is used, online inferencing is undertaken on a custom-built server containing an Intel Xeon E5-1630 v3 CPU, 64GB of RAM and an NVidia Quadro RTX 8000 GPU. TensorFlow 1.13.1, CUDA 10 and CuDNN 7.5 represent the software components used for inferencing. All of the required services and components are available through a public facing Conservation AI portal⁴ developed by the research team.

4 www.conservationai.co.uk
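The paper does not list the inference code itself; the sketch below illustrates the kind of per-frame detection call implied by the pipeline above, assuming a TensorFlow 1.x frozen inference graph exported by the TensorFlow Object Detection API (the tensor names are that API's standard export names; the file path is hypothetical).

```python
# Sketch of a single-frame detection call as implied by the pipeline above.
# Assumes a frozen graph exported by the TensorFlow 1.x Object Detection API;
# 'frozen_inference_graph.pb' is a hypothetical path.
import cv2
import numpy as np
import tensorflow as tf

graph = tf.Graph()
with graph.as_default():
    graph_def = tf.GraphDef()
    with tf.gfile.GFile("frozen_inference_graph.pb", "rb") as f:
        graph_def.ParseFromString(f.read())
    tf.import_graph_def(graph_def, name="")
sess = tf.Session(graph=graph)

def detect(frame_bgr, threshold=0.5):
    """Run the detector on one OpenCV frame and return boxes above threshold."""
    rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
    boxes, scores, classes = sess.run(
        ["detection_boxes:0", "detection_scores:0", "detection_classes:0"],
        feed_dict={"image_tensor:0": np.expand_dims(rgb, axis=0)})
    keep = scores[0] >= threshold
    return boxes[0][keep], classes[0][keep], scores[0][keep]
```

In the remote configuration described above, a Flask route would call a function like detect() on frames pulled from the NGINX RTMP endpoint and return the boxes to the monitoring page.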

To overcome the inferencing limitations associated with Faster-RCNNs and the requirement to provide real-time object detection, a frame sampling technique is proposed. A single adjustable parameter specifies the number of frames to be inferenced within a given time period. This results in frame skipping, which maintains video speed integrity and accurate bounding box overlay, i.e. real-time synchronisation with live video streams. This is a configurable parameter that can be adjusted within the object detection site to meet the needs of the underlying GPU architecture. The frame sampling technique is outlined in Algorithm 1.

Algorithm 1: Dynamic Frame Sampling
Data: Frame Read Correctly ret, Current Video Frame frame, Inferencing Parameter fpsLimitStream, Start Time startTime, Now Time nowTime
Result: Tracked bounding boxes inferenced in frame
  Read ret, frame
  Set startTime
  While ret = TRUE
    Set nowTime
    if nowTime - startTime > fpsLimitStream
      Run inference;
      Set startTime;
    ret, frame(next);
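Read directly from Algorithm 1, the sampling loop maps naturally onto an OpenCV capture loop. The following is a minimal sketch of one possible implementation (our illustration, with a hypothetical RTMP URL and a placeholder run_inference function, rather than the authors' exact code).

```python
# Minimal sketch of Algorithm 1 (dynamic frame sampling); not the authors' code.
# The RTMP URL and run_inference() are placeholders for illustration.
import time
import cv2

def run_inference(frame):
    # Placeholder for the Faster-RCNN detection call (see the sketch above).
    return []

def sample_and_infer(stream_url="rtmp://example.local/live/drone",
                     fps_limit_stream=0.05):
    cap = cv2.VideoCapture(stream_url)
    ret, frame = cap.read()            # Read ret, frame
    start_time = time.time()           # Set startTime
    frames_analysed = 0                # TVFA counter (see evaluation metrics below)
    while ret:                         # While ret = TRUE
        now_time = time.time()         # Set nowTime
        if now_time - start_time > fps_limit_stream:
            run_inference(frame)       # Run inference on the sampled frame
            frames_analysed += 1
            start_time = now_time      # Set startTime
        ret, frame = cap.read()        # ret, frame(next)
    cap.release()
    return frames_analysed
```

Note that the realised inference rate also depends on the latency of each detection call, so the same decode setting can yield different processed-frame rates on different GPUs, which is the trade-off the technique is designed to expose.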

F. Evaluation metrics

The model's performance is evaluated using mAP (mean average precision), which is a standard metric for measuring the performance of an object detection model. mAP is defined as:

mAP = \frac{\sum_{q=1}^{Q} AveP(q)}{Q}   (5)

where Q is the number of queries in the set and AveP(q) is the average precision (AP) for a given query q.

The mAP is calculated on the bounding box locations for the final two checkpoints. Two Intersection over Union (IOU) thresholds, 0.50 and 0.75, are used to assess the overall performance of the model. The IOU is a useful metric for determining the accuracy of the bounding box location. This is achieved by measuring the percentage ratio of the overlap between the predicted bounding box and the ground truth bounding box [24]. A threshold of 0.50 measures the overall detection accuracy, while the upper threshold of 0.75 measures localisation accuracy.

To assess the performance of the frame sampling technique and determine the optimal configuration for the underlying GPU, the following processing metrics are used:

• Decode Setting (DC) describes the total number of frames to be analysed within a specified time period. The coefficient value can be set between 1 and 0.0001, thereby controlling the number of frames to be inferenced. The higher end of the range reduces the number of frames serialised for inferencing, which increases playback speed. The lower end increases the number of frames to inference, therefore decreasing the playback speed. The trade-off between the two is application specific.

• Video Frame Rate specifies the frequency of the consecutively captured frames from the video source.

• Total Video Frames provides the total number of frames transmitted from the feed based on the current playback time.

• Total Video Frames Analysed (TVFA) is the number of frames processed for inferencing within the total duration of the video. This metric is calculated in Algorithm 1 by counting each of the frames submitted to the object detection model.

• Percentage of Frames Analysed (PFA) is the TVFA x 100 / Total Video Frames.

• Runtime (s) indicates how long the framework took to process all of the analysed frames (TVFA).

• Total Video Time (s) is the total length of the video.

• Processed Frames per Second (PFPS) is the Total Video Frames divided by the runtime.
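As an illustration of the IOU definition above (our addition, not code from the paper), the overlap ratio between a predicted and a ground-truth box can be computed as follows, using boxes in (xmin, ymin, xmax, ymax) pixel coordinates.

```python
# Illustrative IOU computation for two boxes in (xmin, ymin, xmax, ymax) form;
# our addition, not taken from the paper.
def iou(box_a, box_b):
    # Coordinates of the intersection rectangle.
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# This detection counts as correct at the 0.50 threshold but not at 0.75:
print(iou((100, 100, 200, 200), (130, 100, 230, 200)))  # ~0.54
```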

IV. EVALUATION

In this section, the performance of the model and the associated frame sampling technique is evaluated using the metrics outlined in the methodology section. The experiments were performed during a field trial at Knowsley Safari in Liverpool in the United Kingdom.

A. Training

In the case of the Faster-RCNN, model training was performed over 3457 epochs to minimise the cost function through backpropagation. The total loss at epoch 3457 was 0.6539, which increased slightly from 0.6395 at epoch 3071, suggesting an overshoot of the global minimum. Figure 3 illustrates the total loss for the training session, with the loss on the y-axis and the epochs on the x-axis.

Fig. 3: Total Loss Faster-RCNN

In the case of the SSD-mobilenet, model training was performed over 4000 epochs to ascertain whether the total loss could be reduced. As shown in Figure 4, the loss at 4000 epochs was 5.733, while the total loss was 5.439 at epoch 3457, showing minimal improvement.

Fig. 4: Total Loss SSD-mobilenet

The mAP for the Faster-RCNN model was evaluated on the bounding box locations for the final two checkpoints (epochs 3071 and 3457) respectively. At epoch 3457, the model attained a mAP @.50IOU of 0.83 and 0.69 @.75IOU. At epoch 3071 the mAP was 0.80 @.50IOU and 0.68 @.75IOU. The performance of the model is shown in Figures 5 and 6, where the mAP is on the y-axis and the epochs on the x-axis.

Fig. 5: mAP @.50IOU Faster-RCNN

Fig. 6: mAP @.75IOU Faster-RCNN

The mAP for the SSD-mobilenet is evaluated at epoch 3457 for direct comparison with the Faster-RCNN model. The model attained a mAP @.50IOU of 0.55 and 0.27 @.75IOU. The performance of the model is highlighted in Figures 7 and 8, where the mAP is shown on the y-axis and the epochs on the x-axis.

Fig. 7: mAP @.50IOU SSD-mobilenet

Fig. 8: mAP @.75IOU SSD-mobilenet

Table I: Frame Processing Results

DC       PFA (%)   Runtime (s)   PFPS    TVFA
1        0.1       3             308     5
0.1      5.9       21            44      55
0.08     7.1       29            31      66
0.07     8.6       42            22      80
0.06     10.07     32            28.9    99
0.05     11.13     45            20.55   103
0.01     52        154           6       482
0.001    53        170           5.4     480
0.0001   52        187           4.9     482

B. Inferencing

To assess the frame sampling technique and to determine the underlying performance of the GPU architecture, a test video was uploaded to the object detection pipeline. A sample video classification through the system is shown in Figure 9. The video used to test the framework has a total video time of 37 seconds, consisting of 925 frames at a resolution of 1280 x 720. The frame rate is set at 25fps. A pre-recorded video was used to baseline the performance of the underlying hardware before real-time inferencing was conducted.
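As a quick sanity check on the metric definitions in Section III-F (our addition, not part of the paper), the PFA and PFPS columns of Table I can be reproduced from the TVFA, runtime and total frame count; for example, for the DC = 0.05 row:

```python
# Reproducing the PFA and PFPS values of the DC = 0.05 row in Table I from the
# metric definitions (our sanity check, not code from the paper).
total_video_frames = 925   # 37-second test video at 25 fps
tvfa = 103                 # Total Video Frames Analysed for DC = 0.05
runtime_s = 45             # Runtime reported for DC = 0.05

pfa = tvfa * 100 / total_video_frames    # Percentage of Frames Analysed
pfps = total_video_frames / runtime_s    # Processed Frames per Second

print(round(pfa, 2), round(pfps, 2))     # ~11.14 and ~20.56 (Table I: 11.13 and 20.55)
```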

Fig. 9: Conservation AI Portal

Using the video configuration previously discussed, several experiments were performed to evaluate the precision/speed trade-off and the overall video classification performance. These metrics allow us to specify the necessary hardware requirements and the overall compute time for live inferencing in production-ready systems, as well as those for offline video processing tasks. Table 1 presents the results for the frame sampling technique. These results are discussed in more detail later in the paper.

Referencing the results in Table 1, the optimal coefficient value is 0.05. This provides sufficient frame processing and synchronisation with the live video (i.e. no notable lag is detected). To demonstrate the configuration, a field trial was conducted at Knowsley Safari in the UK.

C. Knowsley Safari Field Trial

A DJI Mavic Pro 2 was used to stream 4K video at 30fps of the rhino enclosure at Knowsley Safari from an altitude of 60m. Using the decode setting (DC) of 0.05, we were able to conduct object detection at 2fps while maintaining synchronisation with the live video stream and ensuring detection boxes remain fixed around objects. The frame sampling technique is the mechanism that allows us to maintain real-time synchronisation with the live video stream whilst conducting live inference. Figure 10 shows the detection of rhinos and cars within the enclosure.

Fig. 10: Knowsley Safari Field Trial

In total, 11% of the frames collected over the duration of the flight were analysed, which equates to approximately 2fps. Note that the two classes of object detected within Figure 10 are rhinos and cars. Cars were included during training as these are often a key component in poaching activities for fast access and the transportation of rhino horn.

V. DISCUSSION

In this study we aimed to test an off-the-shelf drone in combination with near real-time object detection to allow for the scalability of such solutions in the field. Our system differs from previous solutions such as the bespoke drone in [6], [7]. We discuss our main results here. Firstly, a limited set of 350 images per class was used to train our model. The images in the dataset contained large amounts of variance, representing both distance and close-up shots as well as thermal colour and grey scale representations. The mAP and IOU are therefore competitive, where Figures 5 and 6 show that the model was capable of attaining a mAP @.50IOU of 0.83 and 0.69 @.75IOU. Second, the strategic use of transfer learning shows that the detailed features provided in the Faster-RCNN model can be combined with those extracted from our images to detect new kinds of objects (in this case rhinos and cars). Training on the 350 images alone would not support the mAP and IOU results produced in this paper. Lastly, using Faster-RCNN and high-quality images without the need to down-sample facilitates improved detection at higher altitudes, as shown in Figures 9 and 10 and the corresponding IOU. This overcomes the limitations of approaches reported in this and many studies, particularly those that use YOLO and SSD, which are widely applied in conservation field trials. The results obtained from our experiments confirm that the use of SSD and similar networks is impractical for many conservation tasks due to both the distance and the likelihood of occluded objects.

One of the main novel contributions outlined in this paper is the framework's ability to maximise precision and throughput while maintaining video detection box synchronisation. This provides a unique platform for integrating systems with different hardware and compute capabilities. These range from high-end compute infrastructures to laptops deployed in the field. This allows the regulation of video frames and provides processing support to a wide variety of camera systems such as drones and camera traps. The results presented in Table 1 provide a clear metric between the desired number of frames to analyse versus the compute time needed. For example, a DC of 0.05 facilitates the processing of approximately 11% of captured frames using the trained model. In the example shown in Table 1, a 37-second video containing 925 frames would allow you to process 103 frames over 45 seconds. In contrast, processing 482 frames (roughly half) would take 187 seconds. The results presented in Table 1 were achieved using an NVidia Quadro RTX 8000 GPU costing roughly $10,000. As an example, using industry standard architecture such as an NVidia DGX-1 would cost in excess of $100,000.

The system presented in this paper is capable of working on any infrastructure and will support any necessary compute requirements. Therefore, achieving faster inferencing above and beyond what is shown in this paper will incur increased cost in terms of infrastructure and compute requirements, in particular the use of enterprise systems such as the NVidia DGX-1/2 or cloud services. Scaling out to faster computing infrastructure also removes the flexibility to conduct onsite inferencing. This said, there would be no need to inference faster than 2fps in most conservation projects and we therefore argue that the reported methodology is sufficient.

The feasibility of the system was evaluated in a field trial at Knowsley Safari. Our initial test demonstrated the real-time inferencing capabilities of our approach as shown in Figure 7. Due to poor 4G coverage, the DC had to be set at 0.1, which allowed us to process 6% of the total frames acquired. Nonetheless, due to the flexibility of the framework, we were also able to switch to local inferencing using an on-board RTMP server and a laptop containing an NVidia 1050Ti GPU located in the field. This allowed us to set the DC back to 0.05 while maintaining video synchronisation. Note that we use the NVidia Quadro RTX 8000 GPU for hosting multiple parallel services, which is not possible using non-enterprise GPUs; therefore comparable speeds between different GPUs may not be that noticeable.

VI. CONCLUSIONS AND FUTURE WORK

Many species are adversely affected by poaching and reducing this remains difficult [31]. This paper proposes a novel approach that builds on recent advances in computer vision, machine learning and drone technology. Using advanced neural networks and transfer learning, the system facilitates near real-time detection of rhinos and cars, as evidenced in our field trial conducted at Knowsley Safari in the UK. The framework allows us to personalise the system to the unique properties of the environment. This was demonstrated with the poor 4G coverage experienced in the field trial, whereby seamlessly moving to a local solution for inferencing was possible.

The mAP and IOU show that the model was sufficiently trained, which is encouraging. Furthermore, the inferencing results highlight the strength of the model in detecting objects in pre-recorded video and live streams. The existing features provided by the frozen graph, combined with the features from the dataset using transfer learning, significantly improve the classification results. More importantly, the results demonstrate that region-based networks overcome the limitations found with existing approaches such as YOLO and SSD-mobilenet. The proposed baseline solution will be extended in future work. Concentrated efforts will focus on developing the system further to support complex conservation tasks, of which one of the key factors will be scalability. Furthermore, the processing of more complex images is seen as a vital component for better understanding animals and the unique behaviours they exhibit. Therefore, we will also focus research around Neural Architecture Search (NAS) networks, which offer high resolution processing and advanced object detection capabilities. Along with advancements in camera and drone technology, this will allow us to monitor animals at much higher altitudes, which will help to minimise disruption in natural habitats. In addition, work will be undertaken to include a broader representation of endangered species in object detection tasks.

Several challenges still remain. First, the system outlined in the study relies on a live video stream. This is challenging over large distances. Thus other solutions, such as directly uploading images from the drone to a GSM network (if available) to reach a remote server, need to be explored. In areas where no GSM network is available and live video feedback is not an option, onboard processing of images can be a viable solution as long as results can be transmitted back to rangers on the ground and detections are accurate.

ACKNOWLEDGMENTS

The authors would like to thank Naomi Davies at Knowsley Safari for co-ordinating the field trial and providing access to the rhino enclosure.

REFERENCES

[1] A. M. Hübschle, "The social economy of rhino poaching: Of economic freedom fighters, professional hunters and marginalized local people," Curr. Sociol., vol. 65, no. 3, pp. 427–447, May 2017.
[2] H. Koen, J. P. de Villiers, H. Roodt, and A. de Waal, "An expert-driven causal model of the rhino poaching problem," Ecol. Modell., vol. 347, pp. 29–39, Mar. 2017.
[3] R. Witter and T. Satterfield, "Rhino poaching and the 'slow violence' of conservation-related resettlement in Mozambique's Limpopo National Park," Geoforum, vol. 101, pp. 275–284, 2019.
[4] E. Alphonce Massawe, M. Kisangiri, S. Kaijage, and P. Seshaiyer, "An intelligent real-time wireless sensor network tracking system for monitoring rhinos and elephants in Tanzania national parks: A review," 2017.
[5] T. F. Tan, S. S. Teoh, J. E. Fow, and K. S. Yen, "Embedded human detection system based on thermal and infrared sensors for anti-poaching application," in 2016 IEEE Conference on Systems, Process and Control (ICSPC), 2016, pp. 37–42.
[6] E. Bondi et al., "Near Real-Time Detection of Poachers from Drones in AirSim," in IJCAI, 2018, pp. 5814–5816.
[7] E. Bondi et al., "Spot poachers in action: Augmenting conservation drones with automatic detection in near real time," in Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[8] S. Saria, A. Butte, and A. Sheikh, "Better medicine through machine learning: What's real, and what's artificial?," Public Library of Science, 2018.
[9] P. Martinez, R. Ahmad, and M. Al-Hussein, "A vision-based system for pre-inspection of steel frame manufacturing," Autom. Constr., vol. 97, pp. 151–163, 2019.
[10] S. Jakobs, A. Weber, and D. Stapp, "Reliable Object Detection for Autonomous Mobile Machines," ATZheavy duty Worldw., vol. 12, no. 2, pp. 44–49, 2019.
[11] L. Rampasek and A. Goldenberg, "TensorFlow: Biology's gateway to deep learning?," Cell Syst., vol. 2, no. 1, pp. 12–14, 2016.
[12] D. S. Banerjee, K. Hamidouche, and D. K. Panda, "Re-designing CNTK deep learning framework on modern GPU enabled clusters," in 2016 IEEE International Conference on Cloud Computing Technology and Science (CloudCom), 2016, pp. 144–151.
[13] K. Hazelwood et al., "Applied machine learning at Facebook: A datacenter infrastructure perspective," in 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA), 2018, pp. 620–629.
[14] E. Bondi et al., "Automatic Detection of Poachers and Wildlife with UAVs," Artif. Intell. Conserv., p. 77, 2019.
[15] J. Parham, C. Stewart, J. Crall, D. Rubenstein, J. Holmberg, and T. Berger-Wolf, "An Animal Detection Pipeline for Identification," in 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), 2018, pp. 1075–1083.
[16] S.-J. Hong, Y. Han, S.-Y. Kim, A.-Y. Lee, and G. Kim, "Application of Deep-Learning Methods to Bird Detection Using Unmanned Aerial Vehicle Imagery," Sensors, vol. 19, no. 7, p. 1651, 2019.
[17] S. Schneider, G. W. Taylor, and S. Kremer, "Deep Learning Object Detection Methods for Ecological Camera Trap Data," in 2018 15th Conference on Computer and Robot Vision (CRV), 2018, pp. 321–328.
[18] S. L. Pimm et al., "Emerging technologies to conserve biodiversity," Trends Ecol. Evol., vol. 30, no. 11, pp. 685–696, 2015.
[19] J. Kamminga, E. Ayele, N. Meratnia, and P. Havinga, "Poaching Detection Technologies—A Survey," Sensors, vol. 18, no. 5, p. 1474, May 2018.
[20] M. Zeppelzauer and A. S. Stoeger, "Establishing the fundamentals for an elephant early warning and monitoring system," BMC Res. Notes, vol. 8, no. 1, p. 409, 2015.
[21] C. L. Zitnick and P. Dollár, "Edge boxes: Locating object proposals from edges," in European Conference on Computer Vision, 2014, pp. 391–405.
[22] D. Zhang Sr et al., "Using YOLO-based pedestrian detection for monitoring UAV," in Tenth International Conference on Graphics and Image Processing (ICGIP 2018), 2019, vol. 11069, p. 110693Y.
[23] Z. Xu, H. Shi, N. Li, C. Xiang, and H. Zhou, "Vehicle Detection Under UAV Based on Optimal Dense YOLO Method," in 2018 5th International Conference on Systems and Informatics (ICSAI), 2018, pp. 407–411.
[24] J. Lee, J. Wang, D. Crandall, S. Šabanović, and G. Fox, "Real-time, cloud-based object detection for unmanned aerial vehicles," in 2017 First IEEE International Conference on Robotic Computing (IRC), 2017, pp. 36–43.
[25] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 779–788.
[26] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," in Advances in Neural Information Processing Systems, 2015, pp. 91–99.
[27] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[28] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, "GANs trained by a two time-scale update rule converge to a local Nash equilibrium," in Advances in Neural Information Processing Systems, 2017, pp. 6626–6637.
[29] S. Kanai, Y. Fujiwara, and S. Iwamura, "Preventing gradient explosions in gated recurrent units," in Advances in Neural Information Processing Systems, 2017, pp. 435–444.
[30] J. Ba and B. Frey, "Adaptive dropout for training deep neural networks," in Advances in Neural Information Processing Systems, 2013, pp. 3084–3092.
[31] B. R. Scheffers, B. F. Oliveira, I. Lamb, and D. P. Edwards, "Global wildlife trade across the tree of life," Science, vol. 366, no. 6461, pp. 71–76, Oct. 2019.

Dr Carl Chalmers is a Senior Lecturer in the Department of Computer Science at Liverpool John Moores University. Dr Chalmers's main research interests include the advanced metering infrastructure, smart technologies, ambient assistive living, machine learning, high performance computing, cloud computing and data visualisation. His current research area focuses on remote patient monitoring and ICT-based healthcare. He is currently leading a three-year project on smart energy data and dementia in collaboration with Mersey Care NHS Trust. The current trial involves monitoring and modelling the behaviour of dementia patients to facilitate safe independent living. In addition, he is also working in the area of high performance computing and cloud computing to support and improve existing machine learning approaches, while facilitating application integration.

Dr Paul Fergus is a Reader (Associate Professor) in Machine Learning. Dr Fergus's main research interests include machine learning for detecting and predicting preterm births. He is also interested in the detection of foetal hypoxia, electroencephalogram seizure classification and bioinformatics (polygenetic obesity, Type II diabetes and multiple sclerosis). He is also currently conducting research with Mersey Care NHS Foundation Trust on the use of smart meters to detect activities of daily living in people living alone with dementia, by monitoring the use of home appliances to model habitual behaviours for early intervention practices and safe independent living at home.

Professor Serge Wich is a professor in primate biology. His research focuses on primate behavioural ecology, tropical rain forest ecology and the conservation of primates and their habitats. Research is strongly focused on the Indonesian islands of Sumatra and Borneo and uses a mixture of observational and experimental fieldwork. At present the key species Serge studies is the Sumatran orangutan, where he is involved in research at various field sites of wild and reintroduced orangutans. Serge is also involved in island-wide surveys and analyses of orangutan distribution and density and the impact of land-use changes on their populations. Together with Dr. Lian Pin Koh he founded ConservationDrones.org and uses drones for conservation applications.

Dr Casimiro Curbelo Montañez is a Research Assistant in the Department of Computer Science at Liverpool John Moores University (LJMU), UK. Dr Curbelo received his B.Eng. in Telecommunications in 2011 from Alfonso X el Sabio University, Madrid (Spain). In 2014, he obtained an MSc in Wireless and Mobile Computing from LJMU. He completed his PhD in 2019 also at LJMU. He is currently part of a research team, working on strategies to detect anomalies utilising the smart metering infrastructure. His research interests include various aspects of data science, machine learning and their applications in Bioinformatics.