MSc Artificial Intelligence Master Thesis

Open the black box: Visualize Yolo

by Peter Heemskerk 11988797

August 18, 2020

36 EC credits autumn 2019 till summer 2020

Supervisor: Dr. Jan-Mark Geusebroek Assessor: Prof. Dr. Theo Gevers

University of Amsterdam

Contents

1 Introduction
  1.1 Open the neural network black box
  1.2 Project Background
    1.2.1 Wet Cooling Towers
    1.2.2 The risk of Legionellosis
    1.2.3 The project

2 Related work
  2.1 Object Detection
  2.2 Circle detection using Hough transform
  2.3 Convolutional Neural Networks (ConvNet)
  2.4 You Only Look Once (Yolo)
    2.4.1 Yolo version 3 - bounding box and class prediction
    2.4.2 Yolo version 3 - object recognition at different scales
    2.4.3 Yolo version 3 - network architecture
    2.4.4 Feature Pyramid Networks
  2.5 The Black Box Explanation problem
  2.6 Network Visualization

3 Approach
  3.1 Aerial imagery dataset
  3.2 Yolo version 3
    3.2.1 Pytorch Implementation
    3.2.2 Tuning approach
  3.3 Evaluation
    3.3.1 Training, test and validation sets
    3.3.2 Evaluation Metrics
  3.4 Network Visualization
    3.4.1 Introduction
    3.4.2 Grad-CAM
    3.4.3 Feature maps

4 Experiment
  4.1 Results
    4.1.1 Hough Transform prediction illustration
    4.1.2 Yolo Prediction illustration
    4.1.3 Yolo Tuning
    4.1.4 Yolo Validation
    4.1.5 DCMR Wet Cooling Tower prediction
    4.1.6 Yolo Visualization - Grad-CAM
    4.1.7 Yolo Visualization - Feature maps
  4.2 Discussion

5 Conclusion

6 Acknowledgement

7 Attachment

1 Introduction

This thesis is based on the project work to automatically detect Wet Cooling Towers on aerial imagery using a deep neural network. The theme of the thesis is to open the neural network black box.

1.1 Open the neural network black box

Neural networks, and more specifically deep convolutional networks, have shown amazing results in image classification and object detection. Deep neural networks break a problem like object detection down into millions of pieces and combine them to generate predictions. The human brain does not work that way, and therefore we may have problems interpreting the way an algorithm has reached its conclusion. We tend to regard the internal behaviour of the neural network as a black box [22]. Algorithms based on neural networks have gained an important role in real-life decision making. Medical doctors accept computer-based advice based on radiological or MRI image patterns, and autonomous vehicles make continuous critical decisions based on what is captured by video and other sensors. But sometimes the neural network makes prediction errors humans would not make, and there are numerous examples where neural networks trained on real-life data show a bias in their decisions. For automated neural networks it is therefore important that we can give insight into the way the network comes to its conclusion. By visualizing what is happening in the neural network, we aim to give users of the network some evidence that decisions are made on correct assumptions.

This thesis aims to demonstrate that Yolo version 3, a modern deep convolutional network architecture, is capable of the task of Wet Cooling Tower object detection. We aim to optimize results through the use of different imagery types and the use of transfer learning. The results from Yolo version 3 are compared with a classical and simpler object detection method, the circle Hough Transform. The reasoning behind the predictions of the Yolo deep neural network is complex and extensive. By visualizing the inner working of the deep convolutional network, evidence is provided that the network decisions are based on the right pieces of information. Opening the neural network black box is the aim. This thesis describes the Grad-CAM and Feature Map techniques for visualisation. To our knowledge, the use of these visualisation techniques on a Yolo version 3 architecture is novel.

1.2 Project Background

1.2.1 Wet Cooling Towers

Cooling towers are used to cool down a large building or a part of an industrial process with a sizeable cooling need. With the first versions originating in the 19th century around steam engines, two main types of cooling methods have emerged since the early 20th century. The Wet Cooling Tower operates on the principle of evaporative cooling and is an open-circuit cooler. When liquid is converted into vapor it consumes thermal energy, and therefore the temperature of the surrounding air drops. The heat transfer principle is much like sweating. It differs from a common refrigerator, where the converted vapor is collected in a sealed system and compressed back to liquid. Dry Cooling Towers, on the other hand, are closed-circuit towers where the working fluid is separated from the ambient air and cooled using convection. Wet Cooling Towers are more efficient than Dry Cooling Towers given the higher heat transfer of water compared to air [47]. Hybrid types of cooling systems also exist. Both of these methods require air to be drawn along the point of heat transfer. Well known are the Dutch-invented hyperboloid towers [25] that use a natural draft of warm air rising as in a normal chimney. We see these massive towers as part of energy plants. For this report a different and more frequently used type of cooling tower is considered, one that uses a fan to induce a draft mechanically. See figure 1.

1.2.2 The risk of Legionellosis

The use of water evaporation in open cooling systems induces the risk of growing Legionella bacteria, which may cause the Legionellosis disease [28]. Legionellosis symptoms include cough, shortness of breath, fever, muscle pains and headaches. Treatment is done with antibiotics and hospitalisation is often required. Approximately 10% of infected people die. It is known as Legionnaires' disease since the first known outbreak was at a convention of the American Legion, US military veterans, in 1976, where over 200 sickened and 34 died. The cause appeared to be Legionella bacteria, which live in nature in low concentrations but can grow in man-made equipment in a specific environment, including stagnant areas and a temperature between 20 and 45 degrees Celsius. When Legionella-containing water is distributed, infection occurs when people breathe air containing aerosols, small drops of water with the bacteria.


Figure 1: Cooling Towers (a) This power plant’s cooling tower is a typical example of natural air draft by a hyperboloid tower. Source: Paharpur. (b) Cooling Tower with a fan for mechanical draft. This type is used for this report. Source: SPX Cooling Technologies

Legionella occurs in swimming pools, spas and showers, but the most commonly described source of Legionella is cooling towers. A 2003 Legionellosis outbreak in Pas-de-Calais, France, resulting in 18 deaths, was investigated and traced to a cooling tower at a distance of 6 km. In Amsterdam, the Netherlands, a large Legionella outbreak occurred during the summer of 2006, with 29 people sick and 2 deceased. The source of the outbreak appeared to be the wet cooling tower of the Post CS building. Since 2010, Dutch law requires owners of Wet Cooling Towers to understand and reduce the risks of cooling towers [27]. Since 2017, by the 'Besluit Omgevingsrecht', the Dutch environmental agencies have the obligation to map Wet Cooling Towers and their operating companies [57]. By estimation there are 4000 Wet Cooling Towers in the Netherlands, but for most towers the exact location or holding company is unknown [59].

1.2.3 The project

This project aims to automatically identify and map Wet Cooling Towers in a specific area using computer vision and machine learning techniques on publicly available aerial imagery. The project is a cooperation between the Utrecht-based data science company Ynformed [62] and one of the 29 Dutch environmental agencies, DCMR Milieudienst Rijnmond. See figure 2 for the DCMR working area.

Figure 2: DCMR working area is the larger Rotterdam and Rijnmond area.

2 Related work

2.1 Object Detection

Humans are well able to quickly detect and identify objects in an image. With the current development of autonomous cars and robots, there is also a need for fast and accurate algorithms that let computers identify objects. Already in 1959, Paul Hough wrote: 'Many people have suggested that a modern digital computer should be able to recognize a fairly complex pattern of tracks in a bubble chamber photograph' [43]. He was right. Using automated algorithms and large amounts of data, object detection has been proven to deliver very successful results. Object detection is the task of classifying and localizing objects in an image or in a video, and is now a core problem in computer vision. Due to large variations in viewpoints, poses, occlusions and lighting conditions, image object detection has been difficult to solve. Traditionally, the task of object detection has been divided into the following main subtasks.

1. feature extraction. Extracting a set of features from the image is an important step in detection pipelines. In 1972 the Hough Transform method for image line and circle features was proposed [42]. During the 1990s and 2000s several methods have been developed for representing local key points in an image, the most commonly known being HAAR [55] [38], SIFT [13], and HOG [7]. More recently it has become clear that one can also rely on learned features, that moderately deep unsupervised models outperform state-of-the-art gradient-based features [16], and that with the use of backpropagation [36] deep convolutional networks could learn features relevant for object recognition [53].

2. informative region selection. If no information is available on the location of the object in the image, it is natural to search everywhere. Since this is an expensive approach, sliding window techniques use a coarse search grid and fixed aspect ratios [39][9]. Other techniques include segmentation, given the fact that there is often a hierarchy among objects in an image [30]. Mixed models have also been proposed [11].

3. classification. With a classifier like a support vector machine [58] the target object is distinguished from other objects.

Several approaches have been proposed to combine and integrate these tasks. With the Deformable Parts Model (DPM) [15] the tasks of feature extraction with HOG, region selection and classification are combined but still disjoint. R-CNN and its improvements Fast R-CNN and Faster R-CNN use a convolutional network. R-CNN and Fast R-CNN use selective search to determine a predefined number of informative regions and use a convolutional network to extract 4096 features. Faster R-CNN uses a separate convolutional network to determine the informative areas [46] [32]. Multigrasp is an integrated approach for detecting robotic grasps. Developed by Redmon [1], it is the predecessor of the You Only Look Once approach (Yolo) that is the subject of this thesis. With Yolo [4], feature extraction and region detection are done in one go using a convolutional network.

2.2 Circle detection using Hough transform

Given the specific circular features of Wet Cooling Tower objects, the circle Hough Transform is considered. The circle Hough Transform has been studied in other real-life situations [35].

The Hough transform aims to find aligned points in imperfect images that form lines or circles. The Hough transform can be performed on binary images, such as images processed with an edge detector. For a line it works as follows. A line can be represented by its angle θ and its intersection ρ, and so every line can be represented as a point in the 2-dimensional Hough parameter space defined by θ and ρ. Each point on the image edges can be translated into a curve in the (θ, ρ) space. If all the points of the image edges are translated, certain points in the (θ, ρ) space will have received more votes than others. The points in (θ, ρ) space with more votes than a certain threshold are regarded as a line. For a circle the analogous procedure applies. In Hough parameter space a circle is considered as a point defined by its center coordinates x and y and its radius r. Optimization in a 3D (x, y, r) space will, however, not easily converge to a point. For circle detection the Hough gradient method is used [29]. To make the procedure successful, Gaussian blurring and edge-detecting pre-processing are applied. The circle Hough Transform as implemented in Python's OpenCV includes the latter by means of the Canny edge detector.
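As an illustration, the sketch below shows what this could look like with OpenCV's circle Hough Transform. The blur and Hough parameter values follow the settings reported with figure 14; the file name and remaining settings are illustrative, not the project's actual pipeline.

```python
import cv2
import numpy as np

image = cv2.imread("tile.png")                 # a 256 x 256 x 3 aerial tile (illustrative name)
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# Pre-processing: Gaussian blurring suppresses noise before edge detection.
blurred = cv2.GaussianBlur(gray, (9, 9), 1.7)

# HOUGH_GRADIENT applies the Canny edge detector internally (param1 is its
# upper threshold) and votes in (x, y, r) space (param2 is the accumulator threshold).
circles = cv2.HoughCircles(blurred, cv2.HOUGH_GRADIENT, 1, 20,
                           param1=50, param2=33, minRadius=5, maxRadius=32)

if circles is not None:
    for x, y, r in np.around(circles[0]).astype(int):
        cv2.circle(image, (int(x), int(y)), int(r), (0, 0, 0), 2)  # draw detections
```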

2.3 Convolutional Neural Networks (ConvNet)

The swift progress of recent years in image object detection is mostly based on the use of large-scale deep neural networks and, more specifically, convolutional networks. Convolutional neural networks are a specific type of neural network. In a regular, fully connected neural network every neuron of a layer is connected with all neurons of the previous layer. Convolutional networks take an image as the input matrix, generate bounding boxes at different locations and sizes, and train a network that is a combination of convolutional filters and pooling filters. In a convolutional filter the neurons in a layer are connected only to a small portion of the layer before it. A convolutional filter slides over the input image and produces a 2-dimensional activation map which gives the responses of that filter at every spatial position. Compared to the regular network, these filters consist of learnable weights which are shared over the spatial area. See figure 3. We typically use multiple filters in every layer. At the end stage, classification of the object is done with a fully connected network. Bounding boxes with the highest object probability are kept. Two things matter in ConvNets. Firstly, by sharing weights, the number of network parameters is much smaller than in a regular fully connected network. This is important given the fact that an image is a large input vector. Images easily have a size of 416 x 416 x 3, and if we would fully connect such an input image over many layers, the number of network parameters would explode. Secondly, by using these filters we learn spatial structures, which are local in space, along the width and height of the image, but full along the color channels of the input image. A convolutional network architecture usually consists of several modules, with in each module a convolutional layer and a pooling layer which reduces the size of the layers.
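As a small illustration of this weight sharing, the sketch below (with illustrative sizes, not the Yolo configuration) shows that a bank of convolutional filters applied to a 416 x 416 x 3 image needs only a few hundred weights, whereas a fully connected layer on the flattened image would already need millions.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 416, 416)        # one 416 x 416 x 3 image (illustrative size)

# A single bank of 32 3x3 filters: the same small kernel is reused at every
# spatial position, so the parameter count is independent of the image size.
conv = nn.Conv2d(in_channels=3, out_channels=32, kernel_size=3, padding=1)
print(sum(p.numel() for p in conv.parameters()))   # 3*3*3*32 + 32 = 896 weights
print(conv(x).shape)                               # torch.Size([1, 32, 416, 416])

# A fully connected layer from the flattened image to just 32 hidden units
# would already need 416*416*3*32, roughly 16.6 million weights.
```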


Figure 3: ConvNet (CNN). (a) A regular, fully connected 3-layer Neural Network. (b) A ConvNet arranges its neurons in three dimensions (width, height, depth). Every layer of a ConvNet transforms the 3D input volume to a 3D output volume of neuron activations. In this example we see in red an image of size 32 x 32 x 3 and 5 convolutional filters, each filter with a certain height and width and a channel depth equal to 3. Each filter slides over the image, resulting in a hidden layer output of 32 x 32 x 5. In this example the convolution is designed in such a way, by smartly choosing stride and padding, that the spatial size (height and width) is preserved. [51]

2.4 You Only Look Once (Yolo)

In 2015 the You Only Look Once architecture (Yolo) was introduced. In Yolo, bounding boxes and object detection are generated in one go. This architecture greatly improves the speed of object recognition. Where previous object detection methods use a pipeline to first determine windows/frames and then assign probabilities to each frame for a certain object class, Yolo does this all in one step. Object detection is reframed as a single regression problem, straight from image pixels to bounding box coordinates and class. Since its inception, several improvements have been made. The version used here is Yolo version 3, which includes techniques such as batch normalization, residual functions and feature pyramids. At the start of the project this technique was best in class in qualitative performance and speed [4].

In spring 2020 Yolo version 4 was published [26]. Yolo version 4 includes many new techniques such as weighted residual connections, self-adversarial training and mosaic data augmentation, and is best in class in average precision and speed. Yolo version 4 is not discussed further in this thesis.

2.4.1 Yolo version 3 - bounding box and class prediction

In Yolo version 3 [3], as in Yolo version 2 [2], an image is divided into S x S grid cells. In each grid cell B bounding boxes are predicted, with for each bounding box its 4 coordinates and a box confidence score. The box confidence score is equal to the probability that a box contains an object times the Intersection over Union (IoU) between the predicted box and the ground truth box. For each box a set of C conditional class probabilities is predicted. Therefore, the prediction output tensor has dimensions S x S x B x (4 + 1 + C). In Yolo version 3, B = 3, so three bounding boxes are predicted for each grid cell. The bounding box sizes are not learned from random initialisation but initialised at the most common box sizes. These are selected using K-means clustering on the COCO data set. To make a final prediction, we keep the boxes with high box confidence scores (usually larger than 0.25). Refer to figures 4 and 5 for some typical examples of a Yolo prediction.

Figure 4: Yolo example with B = 2

Figure 5: Yolo example with B = 2, different anchor boxes

For training, the loss function seeks to maximize the overlap between predicted and ground truth boxes, to maximize the confidence that there is an object in the box, and to maximize the probability that the object belongs to one of the classes. The loss function is therefore the sum of several parts:

1. loss related to the predicted bounding box position x, y

$$\lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right]$$

2. loss related to the box width w and height h

$$\lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right]$$

3. loss related to the box confidence score C for each bounding box predictor

$$\sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left(C_i - \hat{C}_i\right)^2 + \lambda_{no\text{-}obj} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{no\text{-}obj} \left(C_i - \hat{C}_i\right)^2$$

4. classification loss, with $p_i(c)$ the conditional class probability

$$\sum_{i=0}^{S^2} \mathbb{1}_{i}^{obj} \sum_{c \in classes} \left(p_i(c) - \hat{p}_i(c)\right)^2$$
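The sketch below illustrates how these four terms could be computed for a single scale and a single class. The tensor layout, the λ values and the assumption that predicted widths and heights are already positive are illustrative choices and do not reproduce the actual Yolo implementation.

```python
import torch

def yolo_loss_single_scale(pred, target, obj_mask,
                           lambda_coord=5.0, lambda_noobj=0.5):
    """Illustrative computation of the four loss terms for one scale, one class.

    pred, target : (S, S, B, 6) tensors with x, y, w, h, confidence C and
                   class probability p per box; w and h are assumed positive.
    obj_mask     : (S, S, B) boolean tensor, True where a ground truth object
                   is assigned to the box predictor.
    """
    noobj_mask = ~obj_mask

    # 1. position loss on the box centres x, y
    xy = ((pred[..., 0:2] - target[..., 0:2]) ** 2).sum(-1)
    xy_loss = lambda_coord * xy[obj_mask].sum()

    # 2. size loss on the square roots of width and height
    wh = ((pred[..., 2:4].sqrt() - target[..., 2:4].sqrt()) ** 2).sum(-1)
    wh_loss = lambda_coord * wh[obj_mask].sum()

    # 3. confidence loss, split over boxes with and without an object
    conf = (pred[..., 4] - target[..., 4]) ** 2
    conf_loss = conf[obj_mask].sum() + lambda_noobj * conf[noobj_mask].sum()

    # 4. classification loss on the conditional class probability
    cls = (pred[..., 5] - target[..., 5]) ** 2
    cls_loss = cls[obj_mask].sum()

    return xy_loss + wh_loss + conf_loss + cls_loss
```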

2.4.2 Yolo version 3 - object recognition at different scales

The earlier Yolo versions had difficulties finding small objects. The Yolo version 3 network is designed to detect objects at three different scales. When the image is divided into S x S grid cells, S takes three different values, depending on the input image size. With input image size 416 x 416, the scale values are S = 52, S = 26 and S = 13. The initialized box sizes are different at each level. Figure 6 illustrates the resulting process flow. At each of the three scales a grid of S x S is considered. In each grid cell three bounding boxes are predicted, and for each of these bounding boxes the following features are predicted: the 4 bounding box coordinates, the box confidence score of the bounding box, and for each class the conditional class probability. For 80 classes, 3 x 85 features are predicted; for 1 class, 3 x 6 = 18 features are predicted. For each grid cell, the prediction with the maximum box confidence score above a certain threshold (e.g. > 0.25) is kept.
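A short calculation makes these numbers concrete; the grid sizes simply follow from dividing the input size by 8, 16 and 32 per scale, consistent with the dimensions listed in figure 7.

```python
# Illustrative bookkeeping of the prediction sizes per scale.
B = 3                                        # boxes per grid cell
for C in (80, 1):                            # COCO classes vs. the single WCT class
    print(C, B * (4 + 1 + C))                # 80 -> 255 features, 1 -> 18 features

for img_size in (416, 256):                  # grid size = input size / stride per scale
    print(img_size, [img_size // stride for stride in (8, 16, 32)])
    # 416 -> [52, 26, 13], 256 -> [32, 16, 8]
```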

Figure 6: The Yolo version 3 network delivers predictions at three different scales, scale 1 for small objects, scale 3 for larger objects. For each scale for each grid, three boxes are predicted. Each box has 4+1+80 features predicted for 80 classes, and 4+1+1 features for 1 class.

2.4.3 Yolo version 3 - network architecture

Figure 7 presents the Yolo version 3 network, which consists of 78 convolutional layers [3]. The architecture consists of a DARKNET-53 backbone that delivers convolutional features at three scales. In each of the convolutional blocks, residual connections are added to improve accuracy [33]. See figure 8. The three output layers take the outputs at three scales from the backbone and combine these following the principles of the Feature Pyramid Network, which is described in paragraph 2.4.4. Finally, the channel detection results of each output layer are combined. Then thresholding and non-maximum suppression are performed, resulting in the final bounding box predictions.

2.4.4 Feature Pyramid Networks

ConvNets have suffered in detecting objects at very different scales, in particular small objects. In order to solve this problem one may feed the same image at different scales to the detector, but training such a pipeline is very time and memory consuming. A more efficient way is to make use of the feature maps of a ConvNet. Typical ConvNets consist of a combination of convolutional and pooling layers and so are equipped to deliver features at different scales. To predict smaller objects as well, one can predict based on feature maps at different network depths, in other words at different scales. See figure 9a. Lin et al. [49] and Tiyara [34] proposed to go a step further and combine the feature maps at one scale with up-sampled feature maps from deeper levels in a so-called Feature Pyramid Network (FPN). See figure 9b.
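The sketch below gives a minimal impression of one such merge step. The channel sizes match the backbone outputs listed in figure 7, but the 1x1 convolutions and the element-wise addition follow the generic FPN of Lin et al. rather than the exact Yolov3 head, which merges its up-sampled maps via concatenated route connections.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FPNMerge(nn.Module):
    """Minimal FPN building block: merge a deep (coarse) map into a shallower one."""

    def __init__(self, deep_channels, shallow_channels, out_channels):
        super().__init__()
        # 1x1 convolutions bring both maps to a common channel depth
        self.lateral = nn.Conv2d(shallow_channels, out_channels, kernel_size=1)
        self.reduce = nn.Conv2d(deep_channels, out_channels, kernel_size=1)

    def forward(self, deep, shallow):
        # up-sample the deeper (coarser) map to the resolution of the shallow map
        up = F.interpolate(self.reduce(deep), scale_factor=2, mode="nearest")
        return self.lateral(shallow) + up          # element-wise merge

merge = FPNMerge(deep_channels=1024, shallow_channels=512, out_channels=256)
deep = torch.randn(1, 1024, 13, 13)                # scale 3 feature map
shallow = torch.randn(1, 512, 26, 26)              # scale 2 feature map
print(merge(deep, shallow).shape)                  # torch.Size([1, 256, 26, 26])
```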

Figure 7: The Yolo version 3 network is based on DARKNET-53. The architecture consists of 78 layers and captures images at 3 scales. In this thesis we count the backbone blocks in the gray box at the left as layers 0 till 5. Layer 3 produces features for scale 1: S x S = 52 x 52 and 256 channels, layer 4 produces features for scale 2: S x S = 26 x 26 and 512 channels. Layer 5 produces features for scale 3: S x S = 13 x 13 and 1024 channels. S is determined by the input size of 416 x 416. In our experiment the image size was 256 x 256, thus giving other S values. In beige output layer 6 (scale 1), in purple output layer 7 (scale 2) and in blue output layer 8 (scale 3).

Figure 8: Features of the Yolo version 3 network architecture. In a resnet building block an identity connection is added. The network now trains on the residual function F(x), which is easier to train than the total feature function F(x) + x. This residual trick generally improves training for deeper networks by avoiding vanishing gradients in back propagation.


Figure 9: Pyramid features (a) Pyramidal feature hierarchy. For predicting at different scales we could use the features at different scales which are produced by a ConvNet. (b) In a feature pyramid network (FPN) building block a feature map at a certain depth is merged with upsampled feature maps at deeper depths.

2.5 The Black Box Explanation problem

Deep neural networks such as Yolo break a problem like object detection down into millions of pieces and combine them to generate predictions. The human brain does not work that way, and therefore we may have problems interpreting the way an algorithm has reached its conclusion. We tend to regard the internal behaviour of the neural network as a black box [22]. Neural networks are trained by providing them with sufficient examples, and at prediction time the trained network comes to its conclusion in a seemingly mystical way. Many examples exist where deep neural algorithms give wrong answers that humans would not give. A canonical example is shown in figure 10 (a), where a husky is wrongly classified as a wolf since it sits in a snowy environment [8]. In figure 10 (b) a funny example shows that an algorithm cannot always distinguish a skier from a mountain [54]. Numerous examples exist of decision algorithms that appear to have gender or racial bias in facial recognition, recruiting or judicial advisory systems [56].


Figure 10: Some failures in image processing (a) An example where the network classifies a Husky wrongly as a wolf since the Husky is surrounded by snow. (b) An example where Google Photo confuses a skier with a mountain when stitching images together to one panoramic picture.

The human brain is not perfect in decision making either, but humans have the power to open up their black box, the working of their brains, and explain their reasoning. A medical doctor can explain the findings behind the suspicion of cancer on a visual scan of tissue. We usually can explain why we stopped a car (because a child started running after a ball). So also for automated neural networks it is important that we can give insight into why the network came to its conclusion. This is the Black Box Explanation problem. By explaining and visualizing what is happening in the neural network, we aim to give users of the network some evidence that decisions are taken on correct assumptions.

The Black Box Explanation problem is still not solved. In 2015 Pasquale [22] pointed out that the last decade has witnessed the rise of a black box society. Black box AI systems map a user's features into a class predicting the behavioural traits of individuals, such as credit risk or health status, without exposing the reasons why. This is problematic not only for lack of transparency, but also for possible biases inherited by the algorithms from human prejudices and collection artifacts hidden in the training data, which may lead to unfair or wrong decisions. The Black Box Explanation problem can be decomposed into three problems [14]:

• model explanation, when the explanation involves the whole (global) logic of the black box classifier;

• outcome explanation, when the target is to (locally) understand the reasons for the decision on a given record;

• model inspection, when the objective is to understand how the black box behaves internally when the input is changed, by means of a visual tool.

An example of outcome explanation that has been investigated is multi-modal artificial intelligence [41]. The use of two modes, visual and textual, for training and for generating explanations of what is happening in an image, contributes to higher trust in the model. This thesis focuses on model inspection in one mode, the visual mode.

2.6 Network Visualization

So explainability matters. This thesis aims to visualize what happens in a deep neural network. In recent years much effort has been put into making visible what is happening within the layers of a convolutional network. Already the first publications on ConvNets visualised the learned kernel weights [12], which is especially insightful for the first, low-level layers. Later, established clustering techniques such as nearest neighbour were used to visualise the high-dimensional last layers of a neural net. Features and referencing images can be visualised in two dimensions [23]. From 2014, Erhan [44] and Simonyan [5] visualised the hidden feature layers of a trained network by carrying out a numerical optimisation in the image space. Simonyan proposed an image-specific saliency map using a single back propagation pass through a ConvNet. For ConvNet visualization, Zeiler [45] proposed a deconvolution network to reconstruct an input image based on the output: the strongest activation of a specific layer and channel. Springenberg [40] added to that work with the use of guided backpropagation. Inspired by the creator of the Keras library [20], in 2017 Gradient-weighted Class Activation Mapping (Grad-CAM) was introduced [10]. This method uses the gradient flowing into the last convolutional layer to visualise the most important features.

3 Approach

Goal of the experiment is to predict the existence and location of Wet Cooling Towers on aerial imagery using a Yolo-based deep convolutional neural network.

3.1 Aerial imagery dataset

Given the quite distinctive visual features of these Wet Cooling Towers, with a fan visible from above, the assumption is that these cooling tower objects can be detected using aerial imagery. Object detection in aerial or satellite imagery plays a significant role in different types of applications, such as detection of geological hazards, urban planning, land use and cover mapping, environmental monitoring, geographic information systems and agriculture. Using aerial imagery for analysis is not new, but with the availability of high resolution satellite and aerial images, having sub-meter resolution, different ranges of objects can be recognized. Object detection from aerial imagery can now be used for the analysis of roads, buildings, solar panels, vehicles [37], and in this work, cooling towers. In figure 11 some aerial imagery sources are illustrated. For this project the imagery from Publieke Dienstverlening op de Kaart (PDOK) [60] is used, being the best known freely available aerial imagery in the Netherlands with 25 cm resolution.


Figure 11: A Nijmegen-based Wet Cooling Tower example from three different imagery sources. (a) Google Maps, resolution usually better than 50 cm. (b) PDOK aerial imagery with a resolution of 25 cm [60]. (c) Satellietdataportaal offers public satellite imagery with 50 cm resolution, based on the recently launched TripleSat (panchromatic resolution 0.8 m) and SuperView (panchromatic resolution 0.5 m) satellites [61].

With a set of longitude/latitude coordinates, a data set of aerial images (so-called 'tiles') is downloaded at the highest PDOK zoom factor 19. By choosing the highest zoom factor it is avoided that the object to detect becomes too small a part of the image. Each image has size 256 x 256 x 3. The object may be difficult to detect if it is situated on the tile's edge and only partially visible. Therefore, for each tile, tiles were also produced with 50% overlap in each direction. This way, the data was augmented by a factor of 9. See figure 12. For tuning the model, the ground truth is generated manually from a list of 113 known Wet Cooling Tower addresses in the DCMR area, manually converted to longitude/latitude coordinates with the use of Google Maps.

Based on these known coordinates, tiles are downloaded from PDOK for each of the years 2016, 2017, 2018 and 2019. Having images of the same location over multiple years gives more variation in lighting and shading conditions during training. For each tile its 8 neighbours are downloaded and generated. On each of the center and surrounding tiles the ground truth boxes are set manually with the aid of a Python annotation tool [24]. This amounts in total to 4 x 9 x 113 = 4068 RGB images, but due to geographical overlap the total tuning data set is smaller and comprises 3492 labeled RGB images.
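A rough sketch of this 9-fold augmentation is given below. It assumes a 512 x 512 mosaic of neighbouring tiles centred on the object and crops the nine half-overlapping 256 x 256 tiles from it; the actual project downloads the shifted tiles directly from PDOK, so the code is only illustrative.

```python
import numpy as np

def overlapping_tiles(mosaic, tile=256, step=128):
    """Return 9 crops of `tile` pixels, shifted by half a tile in x and y."""
    crops = []
    for dy in (0, step, 2 * step):          # shift by half a tile in y
        for dx in (0, step, 2 * step):      # and in x: 3 x 3 = 9 crops
            crops.append(mosaic[dy:dy + tile, dx:dx + tile])
    return crops

mosaic = np.zeros((512, 512, 3), dtype=np.uint8)   # placeholder mosaic of tiles
tiles = overlapping_tiles(mosaic)
print(len(tiles), tiles[0].shape)                   # 9 (256, 256, 3)
```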

Figure 12: Center tile (dark) with its eight surrounding tiles, each overlapping 50%.

For validation a second set of 72 coordinates is used that does not overlap with the tuning data set. Tiles are downloaded for the same 4 years and ground truth is added. After augmenting and overlap reduction, a validation set of 1152 images remains. In addition to RGB, PDOK also delivers exactly the same tiles in the Colored-Infrared (CIR) spectrum. As can be seen from figure 13, this captures a broader range of the electromagnetic spectrum in 3 channels. One may hypothesise that the information in this CIR bandwidth, being broader than the visible RGB bandwidth, gives access to more feature information and improves prediction. For tuning the Yolo version 3 parameters only RGB images are used. For the final validation both RGB and CIR images are used, separately and combined.

Figure 13: In Colored-Infrared (CIR) the capturing of visible light (0.38 µm to 0.7 µm) is extended to near infrared (0.70 µm to 1.0 µm). Near infrared is shifted to the Red channel, and the actual colors Red, Green and Blue are shifted to Green, Blue and Black (none) respectively. On CIR imagery vegetation typically appears red and water appears black. Near infrared does not overlap with thermal infrared (3.5 µm to 20 µm) [50].

3.2 Yolo version 3

3.2.1 Pytorch Implementation

The Yolo version 3 experiments are based on Liu's implementation in Pytorch [6]. Training is done on an MSI GeForce RTX 2070 Aero 8G GPU.

Other Yolo frameworks that have been tested in the Wet Cooling Towers project are the Yolo version 2 implementation by Yolo's inventor Redmon [2][31] and Van Etten's SIMRDWN framework [19][17][18]. Redmon's Yolo version 2 implementation is based on the Darknet backbone, which is implemented in C/CUDA. Van Etten's SIMRDWN is a framework which combines the deep learning architectures Faster R-CNN, R-FCN, SSD, and You Only Look Twice. You Only Look Twice is an extension of Yolo version 2 that includes smart stitching of aerial imagery.

Although both frameworks could be installed and got working quickly, they have not been used for this report. For this report the Yolo version 3 Pytorch implementation is chosen, not only because of the advancement of the version 3 methods compared to Yolo version 2, but also because with Pytorch we were able to amend the backbone, which provided more flexibility in implementing transfer learning and visualisation.

3.2.2 Tuning approach

Most of the experiments are based on standard training, where we used the provided initialised weights darknet53_weights_pytorch.pth and trained from there. Backbone and output layers were trained on a training set of labeled images. Image width and height were set to 256, the size of the input images. Parameters are set as in table 1. We tuned the number of epochs and the learning rate.

Parameter                 tr/tst    Value    Tuning Range
number of classes C       tr/tst    1
batch size                tr/tst    4
width                     tr/tst    256
height                    tr/tst    256
freeze backbone           tr        False
decay gamma               tr        0.1
decay step                tr        10
nr of epochs              tr        30       5 .. 200
backbone learning rate    tr        0.02     0.001 .. 0.04
other learning rate       tr        0.02     0.005 .. 0.02
confidence threshold      tst       0.4      0.1 .. 0.9

Table 1: Yolo version 3 parameters for standard learning. The number of epochs and the learning rate are tuned. At testing and validation the confidence threshold is varied to obtain different precision/recall values.

For a separate transfer learning experiment, training was started from the provided COCO-pretrained weights darknet53_weights_pytorch.pth for 80 classes. To enable this, Liu's Pytorch implementation was amended to be able to import the COCO weights into the Yolo convolutional backbone. By rebuilding the Yolo output layers from scratch with 18 channels, these pretrained COCO weights could be used for 1-class prediction.
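A sketch of how such a partial weight transfer could look in Pytorch is given below; the constructor name is a hypothetical stand-in for the amended implementation, and the checkpoint is assumed to store a plain state dict.

```python
import torch

# `build_yolov3` is a hypothetical constructor standing in for the amended
# Pytorch implementation; the checkpoint file name follows section 3.2.2.
model = build_yolov3(num_classes=1)
pretrained = torch.load("darknet53_weights_pytorch.pth", map_location="cpu")

# Keep only tensors whose name and shape match the 1-class model: the
# Darknet-53 backbone transfers, the rebuilt 18-channel output layers do not.
model_state = model.state_dict()
transferable = {k: v for k, v in pretrained.items()
                if k in model_state and v.shape == model_state[k].shape}
model_state.update(transferable)
model.load_state_dict(model_state)
print(f"transferred {len(transferable)} of {len(model_state)} tensors")
```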

3.3 Evaluation

3.3.1 Training, test and validation sets

For tuning, the RGB tuning dataset of 3492 images is split into 80% for training and 20% for testing. For validation we retrained the network with the tuned parameters on 100% of the tuning dataset. Validation is done on the separate validation set of 1152 images. Retraining and validation are done for all 9 combinations of RGB, CIR and RGB+CIR images.

3.3.2 Evaluation Metrics

For calculating Precision and Recall, the usual definitions are used:

$$Recall = \frac{TruePositives}{TotalGroundTruth}$$

$$Precision = \frac{TruePositives}{TotalPredicted}$$

These values are calculated at box level. At evaluation, a box is counted if the overlap between prediction and ground truth is larger than 30%. Only the box with the highest overlap is counted. By evaluating the model under different values of the confidence threshold, different precision and recall figures are obtained. Average Precision is defined as the area under the Precision/Recall curve:

$$AP = \sum_{n} (r_{n+1} - r_n) \, p_{interp}(r_{n+1}), \quad \text{with } p_{interp}(r_{n+1}) = \max_{\tilde{r} \ge r_{n+1}} p(\tilde{r}),$$

where p(r) is the precision value at recall value r. The precision value p is smoothed by taking the maximum of the values at higher r into $p_{interp}$, thus removing the zig-zags typically seen in Precision/Recall curves.
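The small function below sketches this interpolated Average Precision computation; the input precision/recall pairs are assumed to come from sweeping the confidence threshold as described above.

```python
import numpy as np

def average_precision(recalls, precisions):
    """Interpolated AP as the area under the smoothed precision/recall curve."""
    order = np.argsort(recalls)                        # sort by increasing recall
    r = np.concatenate(([0.0], np.asarray(recalls, float)[order]))
    p = np.concatenate(([0.0], np.asarray(precisions, float)[order]))

    # p_interp(r_n) = max precision at any recall >= r_n (removes the zig-zags)
    for n in range(len(p) - 2, -1, -1):
        p[n] = max(p[n], p[n + 1])

    # AP = sum over n of (r_{n+1} - r_n) * p_interp(r_{n+1})
    return float(np.sum((r[1:] - r[:-1]) * p[1:]))

print(average_precision([0.2, 0.5, 0.9], [1.0, 0.8, 0.4]))
# 0.2*1.0 + 0.3*0.8 + 0.4*0.4 = 0.60
```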

3.4 Network Visualization

3.4.1 Introduction

To open the neural network black box, two network visualization methods have been investigated. Both methods have been proven on mainstream ConvNets like VGG. Their applicability to the Yolo version 3 architecture is novel.

3.4.2 Grad-CAM

Gradient-weighted Class Activation Mapping (Grad-CAM) is a localisation technique applicable to convolutional networks without having to change the architecture or retrain the network. Ulyanin [52] published a Grad-CAM algorithm for VGG. In this experiment this procedure is amended to work with Yolo version 3. The algorithm aims to identify those pixels in an image which have the most impact on the prediction. The assumption is that the gradient flowing into the last convolutional layer gives information on the features that matter when classifying. The algorithm works as follows:

1. identify the last convolutional layer $A^k$ we want to study. For Yolov3 we have taken each of the three backbone output layers at the different scales.

2. freeze the weights in the network.

3. make a prediction for an image. A Yolov3 prediction consists of 3 box predictions for each cell in the S x S grid. In this project we have only one class. We selected the prediction p with the highest probability, and ignored the class probability.

4. calculate the gradients $\frac{\partial p}{\partial A^k_{ij}}$ of this highest predicted probability with respect to the activations of each channel k of the convolutional layer of study. Following Ulyanin, in a Pytorch implementation this can be accomplished using a hook.

5. pool the gradients over the spatial positions. The result is $\alpha^k$, which shows the importance of feature map k for the prediction: $\alpha^k = \frac{1}{N} \sum_{ij} \frac{\partial p}{\partial A^k_{ij}}$.

6. weight the activations of the convolutional layer with the corresponding pooled gradients: $H_{Grad\text{-}CAM} = \mathrm{ReLU}\left(\sum_k \alpha^k A^k\right)$.

7. the result is a heatmap of the same spatial size as the channels of the convolutional layer. Resize this heatmap to the image size.

The result is an overlay on the image from which we can identify the pixels that are most relevant for the prediction.
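The sketch below summarises these steps in Pytorch for a single backbone output layer. The hook-based layer access and the way the highest box confidence is read from the model output (last-dimension index 4) are assumptions about the amended implementation, not the exact project code.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, backbone_layer, image, out_size=256):
    """Grad-CAM heatmap for the highest box confidence at one backbone scale."""
    store = {}
    h1 = backbone_layer.register_forward_hook(
        lambda m, i, o: store.update(A=o))             # step 1: activations A^k
    h2 = backbone_layer.register_full_backward_hook(
        lambda m, gi, go: store.update(dA=go[0]))      # step 4: gradients dp/dA^k

    model.eval()                                        # step 2: weights stay fixed
    pred = model(image)                                 # step 3: forward pass
    p_best = pred[..., 4].max()                         # highest box confidence (assumed layout)
    p_best.backward()                                   # triggers the backward hook
    h1.remove(); h2.remove()

    alpha = store["dA"].mean(dim=(2, 3), keepdim=True)               # step 5: pooled gradients
    heatmap = F.relu((alpha * store["A"]).sum(dim=1, keepdim=True))  # step 6: weighted sum + ReLU
    return F.interpolate(heatmap, size=(out_size, out_size),
                         mode="bilinear")               # step 7: resize to image size
```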

The procedure above is for a one-class prediction. Multi-class prediction is outside the scope of this project. For multi-class prediction, a slightly more complicated approach is needed for step 3. The procedure should take place for each box prediction selected by Yolov3: for each grid cell, the box with the maximum box confidence score above the threshold. For each of these selected boxes the gradient should be taken of the highest class confidence score with respect to the activations of the convolutional layer of study. Note that the class confidence score is equal to the box confidence score times the conditional class probability. We would then get a heatmap for each box prediction in an image.

3.4.3 Feature maps

Feature maps can be made visible based on the work of Erhan [44] and the blog of Graetz [21]. The procedure is performed on a trained network with frozen weights. The main idea is to start with an input image consisting of noise, forward this image through the network, and change the values of the image in such a way that it maximizes the activation of the layer you want to study. The optimization aims to maximize the activation of a certain layer. The loss function is therefore defined as minus the mean of the activations $A^k$ of the channel k in the layer we want to investigate:

$$Loss = -\frac{1}{N} \sum_{ij} A^k_{ij}$$

The input image is created with the Pytorch setting requires_grad = True. When loss.backward() is then applied in an otherwise frozen network, it calculates the gradient with respect to the pixels of the noisy input image. We optimize the input image with an Adam optimizer. This process does create learned images that visualize feature maps, but it is difficult to see structure in these images, given the relatively high frequency of the structures. The visualised feature maps may be perceived as grainy noise. For later reference, this method is called no scaling.

In order to also find lower-frequency structures in the feature map, the procedure is performed with a low resolution (e.g. 32 x 32) noisy image; the resulting image is then up-scaled by some factor and the procedure is performed again. By up-scaling step by step, the resulting image has features at different resolutions. For the experiment, we started at 32 pixels and up-scaled 8 times with a factor 1.2, resulting in a 448 x 448 image. The Pytorch implementation of our Yolov3 network only accepts sizes that are a multiple of 32 x 32. This method is called up-scaling.

Up-scaling may cause aliasing effects, meaning that high spatial frequencies may return 'aliased' as low frequencies [48]. Aliasing can be prevented by applying a low-pass filter before the up-scaling. Therefore we applied blurring in each up-scaling step. Blurring with a box filter and with a Gaussian filter are compared. Gaussian blur should give better results: applying a Fourier transform to a box filter results in a sinc-like output with high frequency structures, while applying a Fourier transform to a Gaussian filter results in a Gaussian [48]. For box filtering a normalized box filter (sizes: 3 x 3 or 9 x 9 pixels) is used. Gaussian blur used filters with σ = 0.8, filter size 3 x 3 and with σ = 1.7, filter size 9 x 9. In a third method the image is flattened to gray scale just before up-scaling.
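The following sketch combines these ingredients: activation maximisation with Adam, step-wise up-scaling rounded to a multiple of 32, and graying plus Gaussian blurring at each step. The layer access, the number of iterations and the learning rate are illustrative assumptions rather than the exact settings used in the experiment.

```python
import torch
import torch.nn.functional as F
import torchvision.transforms.functional as TF

def visualize_channel(model, layer, channel, steps=8, iters=30, lr=0.05):
    """Activation maximisation with up-scaling, graying and Gaussian blurring."""
    store = {}
    handle = layer.register_forward_hook(lambda m, i, o: store.update(A=o))
    model.eval()
    for p in model.parameters():                   # frozen network weights
        p.requires_grad_(False)

    img = torch.rand(1, 3, 32, 32)                 # low resolution noise image

    for _ in range(steps):
        img = img.detach().requires_grad_(True)
        optimizer = torch.optim.Adam([img], lr=lr)
        for _ in range(iters):
            optimizer.zero_grad()
            model(img)                             # fills store["A"] via the hook
            loss = -store["A"][0, channel].mean()  # maximise the mean activation
            loss.backward()
            optimizer.step()

        # gray, blur (anti-aliasing) and up-scale before the next round
        img = img.detach().mean(dim=1, keepdim=True).repeat(1, 3, 1, 1)
        img = TF.gaussian_blur(img, kernel_size=3, sigma=0.8)
        size = ((int(img.shape[-1] * 1.2) + 31) // 32) * 32   # stay a multiple of 32
        img = F.interpolate(img, size=(size, size), mode="bilinear")

    handle.remove()
    return img.detach()
```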

Of interest is whether these activation maps differ when optimizing the activations of different Yolov3 layers. With the use of the Pytorch Yolov3 implementation, any convolutional layer in the Yolov3 backbone can be accessed for optimization.

4 Experiment

In this chapter the main experimental results are outlined and discussed.

4.1 Results

4.1.1 Hough Transform prediction illustration

Predicting Wet Cooling Towers with the circle Hough Transform gives meager results. As an illustration, figure 14 shows 9 example tiles with ground truth and predicted circles. We see that some ground truth Wet Cooling Towers are detected, some fitting accurately, some not. Non-existing circles are also predicted. The Precision/Recall curves of figure 15 show that precision is very low at higher recall values. Higher recall values only occur when parameters are tuned in such a way that circles are predicted randomly everywhere on the image.

Figure 14: examples with ground truth boxes around Wet Cooling Towers (red) and circle Hough Transform predictions (black). Tuned with Gaussian kernel size = 9 and sigma = 1.7, upper threshold for the Canny edge detector (OpenCV param1) = 50, Hough accumulator threshold (OpenCV param2) = 33 and maximum radius 32 pixels.


Figure 15: Precision/Recall curves for the Hough illustrative experiments. (a) Gaussian blur kernel = 9, param1 = 50, param2 in a range between 30 and 45, delivering different P/R values. (b) Gaussian blur kernel = 11, param1 = 50, param2 in a range between 30 and 45, delivering different P/R values.

4.1.2 Yolo Prediction illustration

Figure 16 illustrates that Yolo predicts with higher precision. In many cases the green prediction boxes overlap the red ground truth boxes.

Figure 16: Examples with ground truth boxes around Wet Cooling Towers (red) and Yolo predictions (green, with the label 'nat').

4.1.3 Yolo Tuning

The Yolo version 3 network is tuned on the RGB tuning data set. Figure 17 shows that Yolo version 3 needs only 20 epochs to converge with standard training. Standard means without transfer learning. For evaluation 30 epochs is taken as the default.

It can also be seen that the average precision is best at a backbone learning rate of 0.02 (tuned between 0.01 and 0.04). We kept the learning rate of the output layers at 0.01. Using higher learning rates for the other layers gave unstable learning, making a structured tuning approach difficult.

With transfer learning (on the pre-trained COCO weights), average precision results are comparable. Transfer learning is not used for further validation.


Figure 17: Precision/Recall curves for Yolo version 3 tuning. (a) Average Precision (AP) at different numbers of epochs. AP stabilizes after 20 epochs of training at AP = 0.95. (b) Average Precision (AP) at different backbone learning rates. Best results at learning rate 0.020. (c) Average Precision (AP) at different percentages of the tuning data set. Best results require at least 50% of the tuning data set of 3492 images. (d) Precision/Recall curve with 30 epochs of training and learning rate 0.020. AP = 0.98. Elbow at Recall = 0.94 and Precision = 0.95.

4.1.4 Yolo Validation

Table 2 shows the average precision values at validation. Training is done on the RGB, CIR or combined RGB+CIR tuning data set. Validation is done with the RGB, CIR or combined RGB+CIR validation data set. In figure 18 the Precision/Recall curves are shown for two situations. The average precision values are comparable in all these combinations, around 0.85.

Validation \ Training    RGB      CIR      RGB+CIR
RGB                      0.853    0.861    0.848
CIR                      0.849    0.854    0.847
RGB+CIR                  0.851    0.857    0.847

Table 2: Validation: Average Precision for validation of different training sets. Each training set is a column.


Figure 18: Precision/Recall Curves for Yolo version 3 - Validation. (a) is RGB validation set after training on RGB tuning set. AP=0.853. (b) is RGB validation set after training on CIR tuning set. AP=0.861.

4.1.5 DCMR Wet Cooling Tower prediction

The validated Yolo model is used to predict Wet Cooling Towers 'for real' in the DCMR area. Based on the aerial imagery of the whole DCMR area, the 600 Wet Cooling Tower predictions with the highest confidence levels have been presented to DCMR.

From this list DCMR has confirmed 137 previously unknown Wet Cooling Towers.

The other predictions most often were not Wet Cooling Towers but a different circular construction of about the same size. Another reason for not confirming a new Wet Cooling Tower was overlap with previously known Wet Cooling Towers. Here it must be taken into account that DCMR was looking for addresses of companies with Wet Cooling Towers and not for individual Wet Cooling Towers.

4.1.6 Yolo Visualization - Grad-CAM

In figures 19 and 20 the result of the Grad-CAM technique is shown for two images, each at the three output scales of the Yolo version 3 backbone. These heatmaps show the area of the image that is most important for the Wet Cooling Tower detection. In these examples, we see the gradient maximized at the location of the Wet Cooling Towers. We also see some activation at some other edges and structures. To be more specific about the scale level that Yolo picks for the prediction, the heatmaps were renormalised over all three scales. In figures 19 and 20 these are shown in the rightmost column. In both examples we see that scale 2 is most important for the prediction. The heatmap values for scales 1 and 3 nearly disappear.


Figure 19: Heatmaps of the Grad-CAM procedure, showing the most important gradients at the Yolov3 backbone layers of each of the three scales. Refer to figure 7, where the gray Yolov3 backbone at the left has outputs at three scales. With input 256 x 256, the output dimensions S x S of the backbone are 8 x 8 (scale 3, large objects), 16 x 16 (scale 2) and 32 x 32 (scale 1, small objects). On each line we see at the left the heatmap, in the middle the heatmap merged into the image, and at the right the heatmap renormalised over all scales merged into the image. (a)/(b)/(c) Scale 3 8x8 heatmap, and merged into the image: normal and renormalised heatmap. (d)/(e)/(f) Scale 2 16x16 heatmap and merged into the image: heatmap and renormalised heatmap. (g)/(h)/(i) Scale 1 32x32 heatmap and merged into the image: heatmap and renormalised heatmap.


Figure 20: Heatmaps of the Grad-CAM procedure, showing the most important gradients at the Yolov3 backbone layers of each of the three scales. Refer to figure 7, where the gray Yolov3 backbone at the left has outputs at three scales. With input 256 x 256, the output dimensions S x S of the backbone are 8 x 8 (scale 3, large objects), 16 x 16 (scale 2) and 32 x 32 (scale 1, small objects). On each line we see at the left the heatmap, in the middle the heatmap merged into the image, and at the right the heatmap renormalised over all scales merged into the image. (a)/(b)/(c) Scale 3 8x8 heatmap, and merged into the image: normal and renormalised heatmap. (d)/(e)/(f) Scale 2 16x16 heatmap and merged into the image: heatmap and renormalised heatmap. (g)/(h)/(i) Scale 1 32x32 heatmap and merged into the image: heatmap and renormalised heatmap.

4.1.7 Yolo Visualization - Feature maps

Feature maps visualize the structures in an image that maximize the network's activations. These feature maps are created by starting from a noisy image; the pixel values are then altered in such a direction that they maximize the activation of a certain channel in a certain layer. In figures 21, 22 and 23 the feature maps from the network as trained on Wet Cooling Towers are shown. Figure 21 shows feature structures maximizing the activations of different layers of the Yolov3 convolutional backbone without up-scaling. Figure 22 shows the same layers and channels with up-scaling. It is shown that without up-scaling it is difficult to distinguish structures in the feature maps, but that the up-scaling procedure does generate feature maps with distinguishable structures. In the first layers of the convolutional backbone these structures are relatively simple. As expected, moving to deeper backbone layers, the features become more expressive.


Figure 21: Activation features - method no scaling - mapped on images at Yolov3 convolutional backbone layers. (a) layer 0, channel 10 (b) layer 1, channel 12 (c) layer 2, channel 1 (d) layer 3, channel 1 (e) layer 4, channel 3 (f) layer 5, channel 12. Little structure is seen; higher layers show more feature structures. Note that each layer has many more channels (layer 0: 32, layer 1: 64, layer 2: 128, layer 3: 256, layer 4: 512, layer 5: 1024); in this figure one channel per layer is picked. The channels correspond with figure 22.


Figure 22: Activation features - method up-scaling - mapped on images at Yolov3 convolutional backbone layers. (a) layer 0, channel 10 (b) layer 1, channel 12 (c) layer 2, channel 1 (d) layer 3, channel 1 (e) layer 4, channel 3 (f) layer 5, channel 12. Higher layers show somewhat more features, but little structure is seen. Note that each layer has many more channels (layer 0: 32, layer 1: 64, layer 2: 128, layer 3: 256, layer 4: 512, layer 5: 1024); in this figure one channel per layer is picked. The channels correspond with figure 21.

Additional blurring improves the results. In figure 23 the different additional blurring methods are compared on the feature maps of channel 3 of backbone layer 4. The different blurring methods are all compared with 8-step up-scaling. Blurring combined with graying delivers better visible structures. The best and proposed smoothing method is a combined method, where at each up-scaling step the image is grayed and Gaussian smoothed. The result is shown in figure 23(i). For the Gaussian smoothing, the kernel size = 3 and σ = 0.8. The σ value is chosen small enough to have the full Gaussian filter fit within the kernel.

Layer 4 produces the convolutional features for scale 2, which is most determinant for the Wet Cooling Tower prediction. Figure 24 shows more feature map examples from this layer 4, generated with the proposed smoothing method. We see nice features, some of which resemble the circular features of Wet Cooling Towers. More examples of feature maps of other layers of the Yolo backbone, generated with the proposed smoothing method, can be found in the Attachment in figures 25 and 26.

Following the same procedure, feature maps can be produced for the Yolo output layers (layers 6, 7 and 8). Some care needs to be taken since in these layers each of the 18 channels has a specific meaning (e.g. bounding box coordinates), and minimizing the loss as defined in section 3.4.3 does not make sense for every channel. Therefore only feature maps for the channels 4, 10 and 16, which represent the IoU confidence level, are presented. In the attachment in figure 27 these activation maps can be found. The feature maps generated on the Yolo output layers show less visible structure.


Figure 23: Activation features mapped on images for Yolo layer 4, channel 3. All with up-scaling, with different blurring methods. (a) no blurring (b) initial Gaussian blurring of the noisy image - the effect is minimal (c) graying at each up-scaling step - structures become more eminent (d) 3-pixel box filter blurring (e) 9-pixel box filter blurring - higher frequencies occur (f) Gaussian blurring at each up-scaling step, σ = 0.8, kernel size = 3 - circular structures become better visible (g) Gaussian blurring, σ = 1.1, kernel size = 9 (h) Gaussian blurring at each up-scaling step, σ = 1.9, kernel size = 15 (i) proposed combined method: at each up-scaling step graying and Gaussian blurring, σ = 0.8, kernel size = 3.

Figure 24: Feature maps from 12 picked channels of layer 4. Layer 4 has 512 channels.

4.2 Discussion

The Yolo version 3 network produces good prediction results, with a fairly limited number of approximately 100 images of size 256 x 256 x 3 for training and limited tuning effort. On the tuning dataset an average precision of 0.98 is reached. This number of 0.98 is relatively high due to the object-dense tuning set and our augmentation method. For every aerial image, the 8 half-overlapping surrounding aerial images were added to our tuning set. So for tuning the model it has been accepted that test images may include image parts also used in training images. Validating our tuned model on a non-overlapping validation set resulted in an Average Precision of 0.85.

In 'real life', the model predicted 137 previously unknown locations with Wet Cooling Towers in the DCMR area, which doubled their list of 113 known locations. So the model shows its added value for DCMR. The DCMR organisation is supporting a campaign to use the Ynformed prediction model for all Dutch environmental agencies.

The experiment aimed to improve results by the use of Colored-Infrared images and by the use of transfer learning. Colored-Infrared (CIR) covers a broader range of the electromagnetic spectrum, and may therefore contain more information about the object to detect. We compared the combinations of RGB, CIR and the combination of RGB and CIR for training and validating. Table 2 shows that there is no significant difference in results. This suggests that the near-infrared may have added extra color information, but that it has not added the extra gradient information that is needed for activating the Yolo convolutions.

Transfer learning may improve results by making use of weights that are pre-trained on a similar data set that is not specifically trained for Wet Cooling Towers. For the transfer learning test we used weights that are trained on 80 classes of the COCO data set, and stripped these weights for use in the Wet Cooling Tower prediction, which has only 1 class. This appeared to be a Pytorch challenge, so getting it done is already a technical result. Figure 17 shows that transfer learning gives similar results compared to standard learning. When applying transfer learning, only the pretrained weights in the convolutional backbone are preserved and the Yolo output layers have to be rebuilt for only 1 class. Getting comparable results may indicate that the Yolo output layers make the difference for prediction; the convolutional backbone apparently delivers sufficiently rich features in both standard and transfer learning.

On the matter of explainability, a well-defined detection method like the classical circle Hough Transform would be preferable. Users know in advance that 'circles' are what they are looking for. Figure 15, however, shows meager prediction results. At a somewhat acceptable precision level (> 0.5) the recall remains low (< 0.1). This may be explained by the fact that the aerial imagery is not perfectly perpendicular: the circles appear as ovals. Shadows can destroy the circular image. And the human eye and brain may construct circular structures around a fan which are simply not there in pixel space.

The more involved Yolo convolutional neural network does give good results, and with the Pytorch implementation we could amend the code to open the black box and visualize the working of the network. The Grad-CAM heatmaps of figures 19 and 20 visualise that the network's activation is high at the pixels on and directly adjacent to the Wet Cooling Tower locations. It is also visualised that the network uses the second of three scales to base its prediction on. This procedure may be of aid when detecting other objects in future projects.

The feature maps of figures 22 and 23 show that with up-scaling, combined with blurring and graying, feature structures can be nicely distinguished. It appeared important not to cut off the Gaussian kernel too soon. The presented maps of channel 3 of layer 4 in figure 23 do resemble Wet Cooling Tower structures. Further feature maps of the convolutional layers of the Yolo backbone can be seen in the attachment. Figures 25, 24 and 26 show great, arty structures when maximizing activations in the Yolo backbone convolutional layers 3, 4 and 5. So this novel procedure gives useful results for the convolutional layers of the Yolo backbone. However, for the Yolo output layers this procedure gives less satisfying results. It must be noted that the output layers consist of 3 x 6 channels for three prediction boxes. 4 out of the 6 channels are coordinates, and thus not useful when producing feature maps by maximizing the activation, as described in paragraph 3.4.3. Figure 27 shows that when maximising the activation of the confidence level, a non-coordinate channel, the features still do not give satisfactory results. This matter needs further investigation.

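The feature-map procedure of paragraph 3.4.3 can be summarised along the same lines: an input image is optimised by gradient ascent until a chosen channel fires maximally, and the image is repeatedly up-scaled and blurred during the optimisation before being shown in grayscale. The sketch below only illustrates that recipe under simplifying assumptions (a fully convolutional model that accepts variable input sizes; layer, channel and hyperparameters are placeholders) and is not the exact project code.

```python
import torch
import torch.nn.functional as F
from torchvision.transforms.functional import gaussian_blur


def maximise_channel_activation(model, layer, channel,
                                steps=100, lr=0.1, start_size=56, upscales=4):
    """Gradient ascent on an input image so that one channel of `layer` fires maximally.

    Starting small and repeatedly up-scaling plus blurring keeps the optimisation
    from locking onto high-frequency noise; the result is returned as a grayscale
    map, which makes the structures easier to read.
    """
    acts = []
    hook = layer.register_forward_hook(lambda m, i, o: acts.append(o))
    img = torch.rand(1, 3, start_size, start_size)
    model.eval()
    for p in model.parameters():              # only the image is optimised
        p.requires_grad_(False)
    try:
        for _ in range(upscales):
            img = img.detach().requires_grad_(True)
            optimiser = torch.optim.Adam([img], lr=lr)
            for _ in range(steps):
                optimiser.zero_grad()
                acts.clear()
                model(img)
                loss = -acts[0][0, channel].mean()   # negative mean activation
                loss.backward()
                optimiser.step()
            # Up-scale and blur before the next round; a Gaussian kernel that is
            # cut off too soon leaves blocky artefacts in the final map.
            img = F.interpolate(img.detach(), scale_factor=1.5,
                                mode="bilinear", align_corners=False)
            img = gaussian_blur(img, kernel_size=5, sigma=1.0)
    finally:
        hook.remove()
    return img.detach().mean(dim=1).squeeze()        # grayscale map for display
```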
5 Conclusion

We conclude that a convolutional neural network such as Yolo version 3 detects objects well. The project has demonstrated that the network can relatively easily be trained for predicting Wet Cooling Towers from aerial imagery at 25 cm resolution. Yolo version 3 has good prediction power, and has demonstrated its value by more than doubling the number of known Wet Cooling Towers in the Rotterdam/Rijnmond area.

Yolo version 3 is an extensive network working at three scales. For the prediction of Wet Cooling Towers, figures 19 and 20 show that mostly the middle scale (scale 2) is activated. A question for further investigation is therefore whether, for objects of such uniform size, this extensive and relatively slow Yolo version 3 network actually predicts better than simpler convolutional networks. Given the offline character of predicting Wet Cooling Towers, training and prediction speed is not an issue for this specific project. Yolo version 3 may distinguish itself more from simpler network architectures in the detection of smaller objects, since it offers detection at different scales; this may require aerial imagery with a resolution higher than 25 cm.

On the matter of the Black Box Explanation problem, we have been able to visualize the inner working of the Yolo version 3 deep convolutional network. With Grad-CAM we demonstrate that the network 'looks' at the right places in the image. Feature maps visualise the structural patterns in an image on which the model activates. The application of these techniques to the Yolo version 3 architecture is, to our knowledge, novel. Visual inspection of the feature maps demonstrates that with our proposed combination of re-scaling, blurring and graying we improved on the technique as presented in the literature.

Where good results are reached in visualising the Yolo convolutional layers, that is not the case for the feature maps of the Yolo output layers. Further investigation is needed to visualise the feature maps of these output layers in an understandable way.

For this specific use case, detecting Wet Cooling Towers, the results can easily be verified before critical decisions are taken. In other situations, such as automated driving or image-driven medical advice, we have to trust the outcome of the decision process in critical situations. Using the presented tools for model inspection may help open the black box in this type of critical decision making.

6 Acknowledgement

I thank my fellow Ynformed project members Med and Ruben, who have been very supportive in getting the implementation on the GPU server to work. The Transfer Learning experiment is based on code that Annanina provided. Furthermore I thank my daughter Sanne, who has supported me greatly in the tedious work of labeling the Wet Cooling Tower images.

7 Attachment

Figure 25: feature maps from 12 channels of layer 3. Layer 3 has 256 channels.

Figure 26: feature maps of 12 channels from layer 5. Layer 5 has 1024 channels.

Figure 27: feature maps from output layer 6 (top), 7 (middle), 8 (bottom). For each layer three channels are shown: channel 4 (left), channel 10 (middle), channel 16 (right). These channels represent the IoU confidence in the Yolo output. It is difficult to see any structure.
