A Data-Driven Approach to Object Classification through Fog

by Alisha Saxena

Submitted to the Department of Electrical Engineering and Computer Science in Partial Fulfillment of the Requirements for the Degree of

Master of Engineering in Electrical Engineering and Computer Science

at the

Massachusetts Institute of Technology

June 2018

© 2018 Alisha Saxena. All rights reserved.

The author hereby grants to M.I.T. permission to reproduce and to distribute publicly paper and electronic copies of this thesis document in whole and in part in any medium now known or hereafter created.

Author: ______Department of Electrical Engineering and Computer Science May 25, 2018

Certified by: ______Ramesh Raskar, Associate Professor, Thesis Supervisor May 25, 2018

Accepted by: ______Katrina LaCurts, Chair, Master of Engineering Thesis Committee

A Data-Driven Approach to Object Classification Through Fog

by

Alisha Saxena

Submitted to the Department of Electrical Engineering and Computer Science on May 25, 2018, in Partial Fulfillment of the Requirements for the Degree of Master of Engineering in Electrical Engineering and Computer Science

ABSTRACT

Identifying objects through fog is an important problem that is difficult even for the human eye. Solving this problem would make autonomous vehicles, drones, and other similar systems more resilient to changing natural weather conditions. While there are existing solutions for dehazing images occluded by light fog, these solutions are not effective in cases of very dense fog. Hence, we present a system that combines time-resolved sensing, using Single-Photon Avalanche Diode (SPAD) cameras, with convolutional neural networks to detect and classify objects when imaged through extreme scattering media like fog. This thesis describes our three-pronged approach to solving this problem: (1) building simulation software to gather sufficient training data, (2) verifying and benchmarking the output of the simulation with real-life fog data, and (3) training deep learning models to classify objects occluded by fog.

Thesis Supervisor: Ramesh Raskar Title: Associate Professor


Acknowledgements

I would firstly like to thank Professor Ramesh Raskar for giving me the opportunity to work on this project for my MEng. A huge thanks to Guy Satat for providing invaluable guidance for my research and patiently helping me overcome roadblocks in my work. His mentorship has allowed me to become a better researcher and learner. Without his support, I would not have been able to start, let alone complete, this thesis.

I'd also like to thank my fellow members of the Camera Culture group, in particular, Tristan Swedish, Otkrist Gupta, and Matt Tancik, for providing useful insights for a variety of tasks. Every single one of the group members has been helpful to me in some way throughout my time in the lab. It was a privilege to work with the Camera Culture group, and the whole team made this master's degree an amazing learning experience.

Thanks to Maggie Church and Maggie Cohen, who have been instrumental in managing the logistics of completing this thesis. A special thanks also to Anne Hunter for helping me through my BS and MEng degrees with her encyclopedic knowledge about the nuances of the EECS department.

Lastly, I would like to thank my family, who have always supported me and helped me dream big. This milestone would not have been possible without their years of sacrifice and dedication, and for that I am very grateful.



Table of Contents

1 Introduction ...... 11

2 Related Work ...... 12

2.1 Fog Properties ...... 12

2.2 Simulation Software ...... 12

2.3 Combined Approach ...... 13

3 Overview of Approach ...... 15

4 Simulation Design ...... 16

4.1 Technical Details ...... 16

4.2 Setup ...... 17

4.3 Input Parameters ...... 18

4.3.1 Fog properties ...... 19

4.3.2 Light sources ...... 20

4.3.3 Target ...... 21

4.3.4 SPAD Camera ...... 23

4.3.5 Lens ...... 23

4.4 Algorithm ...... 24

4.5 Output ...... 24

4.6 Challenges ...... 25

4.6.1 Lens ...... 25

4.6.2 Speed ...... 26

5 Verification ...... 27

5.1 Unit testing ...... 27

5.2 Visual testing ...... 28

5.3 Verifying with lab measurements ...... 29

6 Data Generation ...... 32

7 Deep Learning ...... 34

7.1 Input ...... 34

7.2 VGG ...... 35

7.2.1 Architecture ...... 36

7.2.2 Training ...... 37

7.2.3 Results ...... 37

7.3 AlexNet ...... 39

7.3.1 Architecture ...... 39

7.3.2 Training ...... 40

7.3.3 Results ...... 40

7.4 C3D ...... 41

7.4.1 Architecture ...... 41

7.4.2 Training ...... 42

7.4.3 Results ...... 43

8 Future Work ...... 45

8.1 Simulation ...... 45

8.1.1 Target ...... 45

8.1.2 Speed ...... 46

8.2 Data Generation ...... 47

8.3 Deep Learning ...... 48

9 Contributions ...... 49

10 Conclusion ...... 50

Bibliography ...... 51

Table of Figures

Figure 1: Two possible layouts for the components in the simulator...... 17

Figure 2: Each line shows the path for a photon traveling through a different level of fog. The photons all originate from a single point source at the bottom, traveling upwards. The breaks in the lines represent points of scattering. As the total interaction coefficient (µt) is increased, scattering events occur more frequently, meaning the fog is denser. ...... 19

Figure 3: Different light sources are used to illuminate a light-absorbing object (MNIST handwritten "3") in transmission mode with zero fog. The cone light source on the left produces a circular imprint on the detector that is more concentrated towards the center because the angles of the photons are randomly sampled. The plane light source on the right produces a rectangular imprint that is uniform because all the photons are traveling in the same direction directly towards the detector. ...... 21

Figure 4: The figure on the left shows the 3d voxel representation of an object in our simulation, while the right figure shows the 2d representation generated from an image. ...... 22

Figure 5: Starting from the top going left to right, this figure shows how the object becomes more obscured as the fog becomes denser. ...... 29

Figure 6: From left to right: A bar chart and line plot showing the fraction of total photons detected at each timestep, and a kernel density estimate of the simulated data overlaid on the frequency line plot of the experimental lab data. From top to bottom: Comparison graphs for different levels of fog from low to high density. ...... 30

Figure 8: Layout of components used to collect training data in simulator. ...... 32

Figure 9: Example model input of 8 time-resolved frames from the simulation. The target object is a handwritten "0" from the MNIST dataset, and the photons are traveling through mild fog (µt = 0.01). ...... 35

Figure 10: VGG model architecture from [8], with input modifications to support 64 time-resolved frames. ...... 36

Figure 11: These graphs show the progression of accuracy and loss for each training iteration. The red line is for pre-trained VGG, which trained exceptionally fast and had a very high accuracy. The blue line is for VGG trained from scratch, which had a very low accuracy and did not learn, as evidenced by the loss line remaining stagnant. ...... 38

Figure 12: The modified 3d AlexNet model, based off the architecture from [10]. ...... 39

Figure 13: The accuracy and loss of different 3d AlexNet trials as the models trained. The loss of all the models exponentially exploded, as shown in the right graph, meaning that the models did not learn much. ...... 41

Figure 15: The top row of images shows 8 frames of photons arriving at the detector with zero fog. The bottom row shows the arrival of photons in light fog, where µt = 0.01. ...... 42

Chapter 1

Introduction

Fog is well known as a driving hazard because it can severely limit visibility and distort drivers' visual cues. With the increasing interest and speed of innovation in autonomous cars and drones, it is important to ensure that these complex systems are equipped to safely handle extreme environments before they can be released to the public.

Many such autonomous vehicles are outfitted with a comprehensive kit of LiDAR or radar-based sensors to cover certain cases of limited visibility. However, in extreme cases of dense fog, LiDAR simply does not work and radar has poor resolution. This leads to the problem statement:

Is it possible to reliably classify objects through dense fog?

This thesis presents a data-driven object classification system that uses measurements from a time-resolved camera, a Single-Photon Avalanche Diode (SPAD), coupled with deep learning software to detect and classify objects occluded by fog. Using time-resolved sensing allowed us to supplement existing fog dynamics algorithms and classify objects through the fog. We developed a toolkit to simulate photons traveling through scattering media and reflecting off objects in the scene. We then used this toolkit to rapidly generate thousands of foggy time-resolved videos labeled by the type of target object behind the fog. Finally, we used these video frames to train a convolutional neural network to classify the occluded object.

Chapter 2

Related Work

This project combines approaches from prior work done in a variety of fields, including physics, hardware, and software, to find an efficient and effective solution to classification through fog. In this section, we introduce the relevant concepts and preceding research that was especially helpful in building background knowledge and developing our approach.

2.1 Fog Properties

To learn how to model fog, we looked at work published by the American Meteorological Society that presents the microphysical and optical properties of 19 different types of clouds and aerosol components [1]. The scattering and absorption coefficients they specified for various levels of real-life fog were helpful for benchmarking and comparing simulation performance to real-world data.

2.2 Simulation Software

Fang and Boas presented a GPU-accelerated Monte Carlo simulation software for photons traveling through turbid media [2]. Their software, called Monte Carlo eXtreme (MCX), is highly optimized for efficiency, demonstrating 300-400x speedups over a single-threaded process run on a CPU. However, MCX is not easy to customize for special cases like non-homogeneous turbid media and complicated objects in the scene. This presents a need for a simulation software that can model complicated scenes while still maintaining high efficiency.

Eric Dumont has used Monte Carlo light tracing simulations to model road visibility under foggy conditions [3]. He successfully added simulated fog to images of roads, demonstrating that Monte Carlo simulations of fog are representative of actual fog. This validates our approach of using artificially generated fog simulations to train our models, described in further detail in the following sections.

2.3 Combined Approach

Satat et al. have demonstrated that it is possible to classify objects through scattering media using a data-driven, deep learning approach [4]. They simulate photon scattering using a Monte Carlo model and feed sets of frames showing the detected photons and their locations at given times into a convolutional neural network model. Their approach achieves 76.6% accuracy on real-world measurements, which shows that this approach has promise. Their work is focused specifically on imaging through a single sheet of paper, which means that there is only one scattering event between the light source and target object. We extend this work and apply it to fog, a more complex and dynamic medium, in which there are multiple scattering events between the light source and target object.

In another work, Satat et al. show the benefits of using ultrafast measurements from Single-Photon Avalanche Diode (SPAD) cameras for imaging through dense fog [5]. Using time-resolved photon measurements, they are able to computationally subtract the noise from fog from the measurement and recover the depth and image of a target object. This method is based on a probabilistic algorithm that does not rely on prior knowledge about the fog and supports various levels of fog densities. However, because their approach is based on single-pixel measurements, a data-driven, deep-learning approach would be more invariant to spatial scattering.

Chapter 3

Overview of Approach

The goal was to develop cost-effective imaging software that enables the detection and classification of objects through environmental scattering media such as fog. Deep learning has proven to be very useful for classifying images; however, it is very difficult to train these neural networks without sufficient training data. Collecting tens of thousands of images through real fog is infeasible and, as discussed in the related work section, there does not exist a simulation software with the flexibility to rapidly model photons traveling through fluctuating fog dynamics and interacting with complicated target objects. Hence, we took a three-pronged approach that involved the following steps:

1. Developing a simulation software that allows us to rapidly generate thousands of videos of photons traveling through fog

2. Verifying and benchmarking data generated from the simulation against real-life fog data

3. Using the data to train deep learning models to identify the target objects obscured by dense fog in simulation

The following sections delve deeper into the feature specifications of each portion and their implementation details.

Chapter 4

Simulation Design

Due to the sheer volume of data needed to train deep neural networks, we needed to artificially generate data describing photons scattering through fog. Our approach, based on the work of [4], depends on time-resolved video frames being used to train the models. Although photon simulators like MCX exist, they are either too slow or too specific [2]. Instead, we built a customizable, efficient simulation that renders photon measurements across the x, y, and time axes.

4.1 Technical Details

To maximize the utility of the simulation, we needed to make the software easy to understand and customize while still maintaining highly efficient performance. This would allow us to use the simulation to run a variety of experiments in the future as well. We chose Python to program the simulator because it maximizes readability and supports rapid editing. Python also has libraries like Numba, which makes it easy to utilize GPU processing power, and NumPy, which simplifies mathematical operations.

We used a Monte Carlo simulation method to simulate photons propagating through a randomly scattering medium. In this approach, scattering angles and distances are randomly sampled at each step for every photon. Our detector is single photon sensitive, so the Monte Carlo method was the best way to render photon-specific measurements. A benefit of using this method is the ease of parallelization, as each photon is modeled completely independently of the others.

We made the simulation efficient with the Numba library for Python, which compiles Python functions into CUDA kernels and thereby allows for parallel processing on GPUs. It also pre-compiles the Python code to reduce overhead during run time. With Numba, we distributed the processing across 256 blocks and 512 threads on each GPU, resulting in a significant speedup where we could simulate 10 billion photons in 30 seconds on average. We also ran a profiler on the code to identify potential bottlenecks. The math operations were the biggest resource drains, so we simplified the math as much as possible in the code, resulting in an additional 10% speedup over our first iteration.
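To make the launch pattern concrete, the sketch below shows one way the kernel configuration described above (256 blocks of 512 threads, with each thread looping over several photons) can be expressed with Numba. This is a minimal illustration under our own assumptions, not the thesis code; the kernel body is a toy one-dimensional random walk rather than the full scattering model.

import math
import numpy as np
from numba import cuda
from numba.cuda.random import create_xoroshiro128p_states, xoroshiro128p_uniform_float32

@cuda.jit
def simulate_photons(rng_states, mu_t, n_photons, out):
    tid = cuda.grid(1)                       # global thread index
    stride = cuda.gridsize(1)                # total number of launched threads
    for i in range(tid, n_photons, stride):  # each thread handles several photons
        x = 0.0
        for _ in range(100):                 # toy walk: sum of exponential step lengths
            u = xoroshiro128p_uniform_float32(rng_states, tid)
            x += -math.log(u + 1e-12) / mu_t
        out[i] = x

blocks, threads = 256, 512
n_photons = 10_000_000
rng_states = create_xoroshiro128p_states(blocks * threads, seed=42)
out = cuda.device_array(n_photons, dtype=np.float32)
simulate_photons[blocks, threads](rng_states, 0.01, n_photons, out)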

4.2 Setup

Figure 1: Two possible layouts for the components in the simulator. (a) Reflection mode: the detector and light source are on the same side, so that only photons that are reflected off the object and/or fog are detected. (b) Transmission mode: the detector directly faces the light source, so we can easily debug without the added uncertainty of reflections from the target.

To replicate realistic data collection in fog, we simulated a setup consisting of a light source, target object, and detector. All three objects are assumed to be within the scattering medium of fog. The software simulates photons originating from the light source, traveling towards a target object while scattering through fog, reflecting off the target, and finally scattering back through the fog to be detected by a camera.

The simulation allows the flexibility to orient these 3 components in various layouts.

We primarily used two different layouts for testing and generating data: reflection mode and transmission mode.

Reflection mode: In most cases, we can assume the light source and detector are near each other. This is the case when the photons detected by the camera must have reflected off the target object or fog particles. This setup is most practical for the proposed use case on an autonomous vehicle, where the headlights and driver/front camera are positioned close to each other, relative to the object being detected. The basic layout of the component setup is shown in Figure 1(a).

Transmission mode: We used transmission mode mainly for testing the simulator, where the detector and light source are positioned directly opposite each other. In this case, photons travel through the scattering medium for a specified distance and are detected at the other end. This simplifies the scenario for testing purposes because there is no longer reflection off an object involved and travel is only one-way rather than two-way. The layout of this setup is also shown in Figure 1(b).

4.3 Input Parameters

The input parameters of the simulation can be tweaked to fine tune the behavior of the simulation and meet certain experiment requirements. In addition to moving the location of the simulation components and specifying the number of photons to simulate, as shown in the previous section, we can also adjust the behavior of each individual element of the simulation. The following subsections describe the variables that can be controlled in more detail.

4.3.1 Fog properties

The primary parameter that we use to manipulate the level of fog is the total interaction coefficient, or µt. This coefficient is the sum of the scattering and absorption coefficients and characterizes the density of the fog. The higher µt is, the more scattering events per unit length there will be. The effects of perturbing µt are shown in Figure 2.

Figure 2: Each line shows the path for a photon traveling through a different level of fog. The photons all originate from a single point source at the bottom, traveling upwards. The breaks in the lines represent points of scattering. As the total interaction coefficient (µt) is increased, scattering events occur more frequently, meaning the fog is denser.

The other parameter that affects the behavior of photons during scattering events is the scattering anisotropy, g, which must be between -1 and 1. This number represents the tendency of a photon to scatter forward during a randomized scattering event. The closer g is to 1, the more likely it is to travel in the forward direction.

4.3.2 Light sources

Our simulation allows the user to select from three different types of light sources: a point source, cone source, and plane source. The specifics of each type of source can also be tuned as needed.

1. The point source emits a single, straight beam of photons in the positive z direction. This source can be placed anywhere in x, y space, but to maintain consistency and simplicity, it is assumed that all light starts at z = 0 and points in the positive z direction.

2. The cone source acts similarly to a point source in that all of the photons begin from a single point, the location of which is also an input. However, the photons exit the source at randomly sampled angles. The max apex angle of the cone must be specified to indicate the boundaries of the angle sample space. The larger the apex angle, the wider the light beam. The angles are sampled based on the following equations, where $\theta$ is the apex angle of the cone and $\phi$ is the base angle of the cone:

$$\theta = \mathrm{rand} \cdot \theta_{\max}, \qquad \phi = \mathrm{rand} \cdot 2\pi$$
$$k_x = \sin\theta \cos\phi, \qquad k_y = \sin\theta \sin\phi, \qquad k_z = \cos\theta$$

3. The plane source is the same as placing multiple point sources next to each other. It is a rectangular source, with dimensions and locations that can be specified. The starting location for each photon is randomly sampled from the area of the plane source, and each photon begins with a direction straight in the positive z direction.

The difference in photons detected from the cone source vs. the plane source is shown in Figure 3.

Figure 3: Different light sources are used to illuminate a light-absorbing object (MNIST handwritten "3") in transmission mode with zero fog. The cone light source on the left produces a circular imprint on the detector that is more concentrated towards the center because the angles of the photons are randomly sampled. The plane light source on the right produces a rectangular imprint that is uniform because all the photons are traveling in the same direction directly towards the detector.
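As an illustration of the sampling just described, the sketch below draws a direction for a cone-source photon and a starting position for a plane-source photon. The function and parameter names are ours, not taken from the thesis code.

import numpy as np

def sample_cone_direction(max_apex_angle, rng):
    # Polar (apex) and azimuthal (base) angles, sampled as in the equations above.
    theta = rng.random() * max_apex_angle
    phi = rng.random() * 2.0 * np.pi
    return np.array([np.sin(theta) * np.cos(phi),
                     np.sin(theta) * np.sin(phi),
                     np.cos(theta)])

def sample_plane_origin(center, width, height, rng):
    # Uniformly sample a launch point on the rectangular source; direction is +z.
    x = center[0] + (rng.random() - 0.5) * width
    y = center[1] + (rng.random() - 0.5) * height
    return np.array([x, y, center[2]])

rng = np.random.default_rng(0)
direction = sample_cone_direction(np.radians(15.0), rng)    # 15-degree apex angle
origin = sample_plane_origin(np.zeros(3), 0.02, 0.02, rng)  # 2 cm x 2 cm plane source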

4.3.3 Target

Besides setting the location of the target object in the simulation, the main source of variability was the shape of the object itself. We needed to figure out how to represent complicated object shapes in 3d space and then calculate the appropriate reflection angle such that the photons are not accidentally sent through a solid object. For simplicity, we created two options for the representation which optimize for different cases. Figure 4 shows the difference between the two approaches.

The first option is to break down a 3d shape into smaller 3d voxels. The user specifies the center point and radius of any number of cubic voxels that can be placed adjacent to each other to build a larger object, or even placed separately to represent multiple target objects. This requires smooth, curved surfaces to be broken down into small pixelated cubes, making it much easier to calculate the point of intersection between the photon vector and object plane. However, this method trades off efficiency for accuracy. Looping through the individual voxels to determine which one was intersected is a computationally expensive process, but it might be worth it in some cases to model more realistic scenarios.

Figure 4: The figure on the left shows the 3d voxel representation of an object in our simulation, while the right figure shows the 2d representation generated from an image.

The other option is to represent the object as a 2d image, where high pixel values in the 2d array represent solid, reflecting areas. The simulator reads the object as an image, scales it, and places it in 3d space, treating pixel values that are above a specific threshold as part of the object. This approach is much more efficient because we can simply calculate the intersection between the photon vector and the plane of the image and then check to see if the intersection coordinate is above the threshold value in the 2d image array. We no longer need to check for intersections across 6 different faces in 3d space. However, by flattening the object into two dimensions, we lose the ability to model more complex and natural targets.
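The sketch below illustrates the 2d-image hit test described above: intersect the photon's direction vector with the image plane, map the intersection point to a pixel, and compare that pixel against the threshold. The names, the centering convention, and the threshold value are illustrative assumptions rather than details taken from the thesis code.

import numpy as np

def hits_2d_target(pos, direction, target_z, target_img, pixel_size, threshold=0.5):
    # Return True if the photon ray crosses the target plane inside a "solid" pixel.
    if direction[2] == 0.0:
        return False                            # ray is parallel to the target plane
    t = (target_z - pos[2]) / direction[2]      # parametric distance to the plane
    if t <= 0.0:
        return False                            # plane is behind the photon
    hit = pos + t * direction                   # 3d intersection point
    h, w = target_img.shape
    col = int(hit[0] / pixel_size + w / 2)      # map x, y to pixel indices,
    row = int(hit[1] / pixel_size + h / 2)      # assuming the image is centered on the z axis
    if not (0 <= row < h and 0 <= col < w):
        return False
    return target_img[row, col] > threshold     # above threshold -> reflect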

Regardless of how the object was modeled in the simulation, we used the Lambertian reflectance model when calculating the angle of reflection off the object. Lambertian reflectance is most often used to model reflectance off matte or diffusing objects, so it acts as a good approximation when the reflectance properties of an object are unknown. We implemented this by randomly sampling a reflection angle back in the negative z direction for photons that intersect with an object face. This algorithm is very similar to the way angles are sampled for the cone light source.

4.3.4 SPAD Camera

For the detector, we modeled a Single-Photon Avalanche Diode (SPAD) camera, which is able to measure the time-resolved arrival of individual photons. The user can specify the time resolution of the camera to suit their needs for the experiment.

4.3.5 Lens

We had to add a lens to the detector to focus the photon readings. Using the thin lens assumption, we adjust the (x, y) position, angle of incidence, and time of arrival based on equations derived from the thin lens equation. The position and angle are updated as

$$x \rightarrow x\left(1 - \frac{d}{f}\right) + d \cdot k_x, \qquad k_x \rightarrow k_x - \frac{x + d \cdot k_x}{f}$$

and the time of arrival, t, is advanced by the additional optical path length that the lens introduces for that photon, divided by the speed of light, c. In these equations, we must specify the focal length of the lens, f, the index of refraction of the lens, n, and the object distance from the lens, so that we can derive the image distance, d, the radius of curvature of the lens, R, and the thickness of the lens, s, using the lensmaker's equation. The equations for x and kx are also applied to the y and ky measurements so the photons are focused in both directions.

4.4 Algorithm

The basic workflow of the simulation algorithm, based on [2], is as follows:

1. Launch photon based on type of light source

2. Determine step size based on scattering coefficient (the larger the coefficient, the smaller the step size)

3. Move photon

4. Check to see if target was hit or detector was hit

   a) If target hit, calculate new photon direction such that it bounces off the target

   b) If detector hit, record all data and terminate simulation

   c) If neither, calculate random scattering angle

5. Repeat steps 2-4 until the photon times out or is detected

6. Repeat steps 1-5 until all photons have been simulated
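A compact, CPU-only sketch of this loop for a single photon in transmission mode (no target) is shown below. It assumes exponential step lengths with mean 1/µt and, for brevity, isotropic re-scattering; the actual simulator samples anisotropic scattering angles governed by g and also handles target reflections.

import numpy as np

def trace_photon(mu_t, detector_z, max_events=10_000, rng=None):
    rng = rng or np.random.default_rng()
    pos = np.zeros(3)                              # step 1: launch at the origin
    k = np.array([0.0, 0.0, 1.0])                  # initial direction: +z
    path_length = 0.0
    for _ in range(max_events):
        step = -np.log(1.0 - rng.random()) / mu_t  # step 2: denser fog -> shorter steps
        pos = pos + step * k                       # step 3: move the photon
        path_length += step                        # time of arrival = path_length / c
        if pos[2] >= detector_z:                   # step 4b: detector hit
            return pos, k, path_length
        # step 4c: re-scatter (isotropic here; the simulator uses the anisotropy g)
        cos_theta = 2.0 * rng.random() - 1.0
        phi = 2.0 * np.pi * rng.random()
        sin_theta = np.sqrt(1.0 - cos_theta ** 2)
        k = np.array([sin_theta * np.cos(phi), sin_theta * np.sin(phi), cos_theta])
    return None                                    # step 5: photon timed out

result = trace_photon(mu_t=0.01, detector_z=1.0)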

4.5 Output

The simulation stores a limited amount of data for each photon, as it would be infeasible to save all of the path details for each of the billions of photons simulated. Hence, the simulation only saves the following for each detected photon:

• Final (x, y, z) position at the point of detection

• Final (kx, ky, kz) angle of incidence upon the detector

• Time of arrival

These measurements are needed for generating time-resolved video frames to train the convolutional neural network. For testing and debugging purposes, we also stored the following:

• Number of scattering events

• Last time of interaction with target (if at all)

Unlike the time of arrival and spatial measurements, the number of scattering events and time of target interaction are only used for debugging and verification purposes. The scattering events metric is helpful in verifying that the photons scatter more frequently as the density of the fog is increased with the µt metric. It is also useful to know what percentage of the detected photons have interacted with the target. If it is a low percentage, that means most of the photons have been reflected early because of the high fog density. In this case, there is minimal chance of correctly classifying the target object because the signal received from the object is minimal.

4.6 Challenges

Balancing the tradeoffs between efficiency and accuracy was especially challenging when building the simulator. Minor changes in the algorithm could have a major impact on speed because the algorithm has to be run repeatedly for billions of photons. This section describes some of the ways we balanced this tradeoff.

4.6.1 Lens

To avoid needing to simulate tens of billions of photons, we made the lens size very large (10 x 10 cm). This large detector picked up a higher percentage of the total number of photons simulated, so we were able to reduce the data generation time by 85% on average by simply simulating fewer photons.

However, because the lens was so large, the time of arrival needed to be focused through the lens as well. Otherwise, in our simple case with no fog, photons that focused to the same place had vastly different times of arrival, which did not make physical sense.

4.6.2 Speed

Using the Numba library to parallelize the process on GPUs was complicated. It is impossible to run each simulated photon in a separate thread because GPUs cannot handle billions of threads running in parallel. Instead, we had to use a counterintuitive programming model, where each thread ran several loops within itself. This results in a program that is not fully parallelized but is many times more efficient than trying to run billions of threads. Tuning the GPU parameters like the number of blocks and threads took a lot of trial and error. In addition, we ran a profiler on the code locally to identify potential bottlenecks, but the Numba library pre-compiles functions under the hood, so attempts to speed up the computation by simplifying the algorithms were minimally effective. Our final runtimes were around 40 seconds per 100 million photons simulated on GPUs with light fog (µt = 0.01), and 90 seconds per 100 million photons simulated with dense fog (µt = 0.1). We discuss methods to improve these run times in the Future Work section.

Chapter 5

Verification

There were a few different techniques used to verify that the simulator was working as intended. We began with unit testing to make sure that the mathematics were correct when finding intersection points and calculating scattering angles. We also used visual plotting to confirm that the setup of the 3D coordinate system was correct and that tweaking the various parameters affected the system as expected. Lastly, the most important step was to match the measurements from the simulator with a few actual lab measurements through synthetic fog. This helped us visually benchmark the seemingly arbitrary µt numbers against approximate fog densities and verify that the simulator produces realistic results.

5.1 Unit testing

We wrote unit tests to verify all of the math and physics equations in the simulator. These simple checks of isolated portions of the code were useful during the iterative process of writing the code for the simulator because we would immediately know if a change caused a bug in our code. This resulted in a faster development loop of testing and iterating on the simulator code. Below are the two most important test cases we used:

• Intersection of a vector and a plane: We sampled a random point, p, and a random direction vector and vector size. After generating two end points of a line that goes through the random point using that random direction and size, we computed the intersection of the line and the z-plane of p using an isolated function from our simulation code. The intersection point should be equal to point p (see the sketch after this list).

• Lens focus: In transmission mode with zero fog, when a cone light source is pointed directly into the camera, the lens should be able to focus all of the photons to a single point. The test checked to make sure the position coordinates and timestamps of all the photons are close to each other.
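A pytest-style sketch of the first test is shown below. The helper intersect_with_z_plane stands in for the isolated simulator function under test; its name and signature are illustrative.

import numpy as np

def intersect_with_z_plane(p0, p1, z):
    # Point where the line through p0 and p1 crosses the horizontal plane at height z.
    t = (z - p0[2]) / (p1[2] - p0[2])
    return p0 + t * (p1 - p0)

def test_line_plane_intersection():
    rng = np.random.default_rng(0)
    p = rng.uniform(-1.0, 1.0, size=3)          # random point; the plane is z = p[2]
    direction = rng.uniform(-1.0, 1.0, size=3)
    direction[2] += 2.0                         # keep the line from being parallel to the plane
    length = rng.uniform(0.5, 2.0)
    p0 = p - length * direction                 # two end points of a line through p
    p1 = p + length * direction
    hit = intersect_with_z_plane(p0, p1, p[2])
    np.testing.assert_allclose(hit, p, atol=1e-9)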

5.2 Visual testing

Although unit tests are very helpful for checking that the logic in the simulation works as expected, they are not as useful for making sure that the logic is the correct logic for the specific scenario. We plotted the complete photon path on a 3d axis for a few photons to make sure the photons scattered and reflected off objects in a manner that makes physical sense. Figure 4 in Section 4.3.3 shows an example photon path in the 3d debugger. As shown in Sections 4.3.2 and 4.3.3 above, this method of testing was also good for verifying that the different types of light sources and objects were placed and functioning correctly.

A lot of the testing was done with the fog parameter set to zero, just to make sure that the output video frames were focused properly. As we slowly increased the fog level, the image became fuzzier until the fog was so dense that no clear image of the target object could be detected. This progression is shown in Figure 5. At this step, the timestamp of last interaction with the target object becomes useful. If the majority of the photons detected have no timestamp of interaction with the target, that means the fog is too dense and the photons are scattering too much for the measurement to be useful.


Figure 5: Starting from the top going left to right, this figure shows how the object becomes more obscured as the fog becomes denser.

5.3 Verifying with lab measurements

The most critical part of this project was ensuring that the simulator can match real-life lab measurements of fog. We received measurements of lab-generated fog data from the work of [5]. We replicated their lab setup in the simulator, with the detector and light source 10 cm apart and facing the same direction, and varied µt and g to model different fog properties for many runs. We generated a lot of data for different fog levels and used trial and error to find the best matches between the simulated and experimental lab data.

The lab measurements consist of photon counts at a time interval of 56 picoseconds. We plotted the same metrics using data from the simulation and compared the general shapes of both plots. To make the comparison more even, we normalized the photon counts from both datasets by dividing by the total number of photons detected in each.

Due to the noisy measurements from the lab and the simulator, we had to visually compare the plots to find matches. We overlaid the graphs in two formats, bar and line plots, so the comparison would be easier to make. We also plotted kernel density estimates for the simulation data to reduce some of the noise in the plots. Figure 6 shows the best plots found for three different levels of fog.

Figure 6: From left to right: A bar chart and line plot showing the fraction of total photons detected at each timestep, and a kernel density estimate of the simulated data overlaid on the frequency line plot of the experimental lab data. From top to bottom: Comparison graphs for different levels of fog from low to high density.
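The sketch below illustrates the comparison procedure just described: normalize both time histograms to fractions of detected photons and overlay a kernel density estimate of the simulated arrival times on the lab curve. The variable names and plotting details are illustrative, not taken from the thesis code.

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

BIN_PS = 56.0                                    # detector time resolution in picoseconds

def compare(sim_arrival_ps, lab_counts):
    # sim_arrival_ps: simulated arrival times (ps); lab_counts: per-bin lab photon counts (array)
    bins = np.arange(len(lab_counts) + 1) * BIN_PS
    sim_counts, _ = np.histogram(sim_arrival_ps, bins=bins)
    sim_frac = sim_counts / sim_counts.sum()     # normalize by total detected photons
    lab_frac = lab_counts / lab_counts.sum()
    centers = bins[:-1] + BIN_PS / 2
    kde = gaussian_kde(sim_arrival_ps)           # smooth estimate of simulated arrivals
    plt.plot(centers, lab_frac, label="lab")
    plt.plot(centers, sim_frac, label="simulation")
    plt.plot(centers, kde(centers) * BIN_PS, "--", label="simulation KDE")
    plt.xlabel("time of arrival (ps)")
    plt.ylabel("fraction of photons")
    plt.legend()
    plt.show()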

After finding some good combinations of µt and g that model fog similar to the fog from the lab measurements, we perturbed the values of µt and g slightly and re-generated data. The new plots, shown in Figure 7, were very similar to each other, meaning that the best values we found were not noise or local minima/maxima. The output signal of the simulation is relatively predictable as the fog gets denser, meaning that the signal to noise ratio (SNR) of the system is high.

Figure 7: Line plots for time vs. fraction of photons detected, comparing the simulated data at slightly perturbed total interaction coefficients (µt) and scattering anisotropies (g) to the experimental lab data. The center plot is the original baseline found experimentally. From left to right, g is increased. From top to bottom, µt is increased.

Chapter 6

Data Generation

Once we were confident that the simulator produces realistic data and has all the functionalities needed to customize the scene, we began generating large volumes of training data. For the scope of this thesis, we trained models to classify different target objects when they are obscured by fog in simulation.

We used digits from the MNIST (Modified National Institute of Standards and Technology) dataset as our target objects [6]. MNIST is a large dataset of handwritten digits consisting of 60,000 training samples and 10,000 test samples. The digits have been size-normalized and centered on a fixed-size image. MNIST is one of the most popular datasets and has a high classification success rate, with state-of-the-art models reaching over 99% accuracy on the dataset, making it ideal for our use case.

Figure 8: Layout of components used to collect training data in simulator.

The layout of the light source and detector is shown in Figure 8. This is identical to the lab setup from which we collected physical data, but with an MNIST handwritten digit added in as a target. It was important to be able to replicate the simulation settings in real life so that we could use real data to evaluate our classification models at the end.

We generated simulated data for 4 discrete levels of fog, starting with zero fog as a baseline dataset, and gradually increasing the amount of fog until the target object was no longer visible to the human eye from the measurements. Besides changing the level of fog by varying µt and changing the target object, all of the other parameters were kept constant.

For each level of fog, it took 3-4 days to run the simulation through the entire MNIST training set (60,000 samples) on 8 NVIDIA Titan X GPUs.

Chapter 7

Deep Learning

After generating multiple datasets for different levels of fog, we began training deep neural networks to try to correctly identify the target object through fog. We used the open source TensorFlow framework by Google on NVIDIA GPUs to allow for rapid training and iteration. A key benefit to using TensorFlow was the built-in TensorBoard dashboard, which helped visually analyze the loss and accuracy of the models in real-time.

We chose convolutional neural networks (CNNs) for this task because CNNs have consistently been the top-performing models for image classification. They are also robust to slight variations such as translations or rotations in the images. We experimented with a few popular convolutional neural networks that have performed well on classification challenges like the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), which involves classifying images into 1000 categories and is used industry-wide to benchmark the best classification models [7]. In this section, we discuss the pros and cons of the three different models we tried: VGG, AlexNet, and C3D.

7.1 Input

The training data for the deep learning models consists of time-resolved video frames of photons arriving at the detector. We generated these frames by building a 3d histogram out of the (x, y) position and time data saved for each photon from the simulation. An example of the frames is shown in Figure 9.


Figure 9: Example model input of 8 time-resolved frames from the simulation. The target object is a handwritten "0" from the MNIST dataset, and the photons are traveling through mild fog (µt = 0.01).

The dimensions of each frame are 58 x 58 pixels, with 64 frames representing intervals of 56 picoseconds, and only one channel in each of the images, which contains the raw number of photons detected for that pixel in that frame. The dataset was divided into ten label classes, where the label for each set of video frames was the corresponding MNIST digit that was used as the target. In other words, the model should be able to recognize which digit (0-9) is behind the fog.
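As a sketch of how these frames can be produced from the saved photon records, the snippet below bins the (x, y, t) data into a 58 x 58 x 64 stack with np.histogramdd. The spatial extent of the detector used here is a placeholder value, not the thesis setting.

import numpy as np

def build_frames(x, y, t, extent=0.05, t_bin=56e-12, n_xy=58, n_t=64):
    # Bin detected photons into an (n_t, n_xy, n_xy) stack of time-resolved frames.
    edges = (
        np.linspace(-extent, extent, n_xy + 1),   # x bin edges (meters, assumed extent)
        np.linspace(-extent, extent, n_xy + 1),   # y bin edges
        np.arange(n_t + 1) * t_bin,               # 64 bins of 56 ps each
    )
    hist, _ = np.histogramdd(np.stack([x, y, t], axis=1), bins=edges)
    return np.transpose(hist, (2, 0, 1))          # index frames by time first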

We started by training each of the models with the dataset generated with no fog. This was equivalent to training the models on the raw MNIST dataset with an extra dimension of time. If the models did not perform well on data with no fog, it was highly unlikely that they would succeed in the more complicated cases where the objects are occluded with fog. Using the data without fog as a test helped speed up the iterative training process because we could quickly rule out models that were unlikely to work. Once a model performed well on the simple case, we trained it further with data generated through increasingly higher levels of fog.

7.2 VGG

We began by extending the approach from [4], where the VGG model was successful for classifying objects through a single sheet of paper. We trained the VGG 16-layer convolutional neural network to try to establish a baseline performance on our zero-fog dataset [8].

7.2.1 Architecture

Figure 10: VGG model architecture from [8], with input modifications to support 64 time-resolved frames.

The architecture of this model is shown in Figure 10. We used the 16-layer version of VGG, which has 13 convolutional layers and 3 fully connected layers at the end for performing the classification step. We modified the last fully connected layer to output ten weights, corresponding to the ten different classes in the dataset. VGG has more layers than other common models, which helps it achieve generally better results, as exemplified by a 93.2% top-5 accuracy on the ILSVRC challenge. However, the increased depth makes the network very large and bulky. The model contains around 134 million parameters; hence, training the model is computationally expensive and inefficient.

7.2.2 Training

We first tried training the model from scratch, with weights initialized randomly using the Xavier initialization method [9]. This allowed us to customize the input layers to match the shape of our three-dimensional data, as the VGG architecture is designed for 2d images that are 256 x 256 pixels with 3 channels for RGB data. We changed the input layer to handle images that are 58 x 58, with 64 channels representing our video frames. The VGG model is so large that it requires a lot of time and data to train it from scratch, so attempting to do so was probably a futile effort considering the limited amount of data and resources we had.

As a sanity check, we trained the model on our data with pre-trained weights from the ILSVRC submission and resized our input data to match the VGG dimensions. We did this by upsampling the frame size to 256 x 256 using the nearest neighbors method, and compressing the time data to just 3 frames instead of 64.
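A sketch of this resizing step is shown below: each 58 x 58 frame is upsampled to 256 x 256 with nearest-neighbor indexing, and the 64 time bins are collapsed into 3 channels. The thesis does not specify how the time bins were grouped, so the even three-way split used here is an assumption.

import numpy as np

def to_vgg_input(frames):
    # frames: (64, 58, 58) photon counts -> (256, 256, 3) pseudo-RGB image.
    n_t, h, w = frames.shape
    groups = np.array_split(frames, 3, axis=0)   # 3 coarse time channels (assumed grouping)
    channels = [g.sum(axis=0) for g in groups]
    rows = (np.arange(256) * h) // 256           # nearest-neighbor row indices
    cols = (np.arange(256) * w) // 256           # nearest-neighbor column indices
    upsampled = [c[np.ix_(rows, cols)] for c in channels]
    return np.stack(upsampled, axis=-1)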

7.2.3 Results

The loss and accuracy while training the two models are shown in Figure 11. We ruled out training VGG from scratch because the loss of the model would not decrease, meaning the model was learning extremely slowly, if at all. Pre-trained VGG showed great results with an accuracy of 98%, but the input data format does not make sense for our training data.

The high level of accuracy shows that the VGG model could work on our data if we either trained for long enough from scratch or changed our input format to match the pre-trained model. Neither seemed like a viable option. Collecting more training data and training for weeks for a single level of fog would significantly slow down progress on the project, and there was no guarantee that this would be successful. Changing the input format involved reducing the number of frames and up-sampling the frame size, which would both decrease the amount of useful data and add unnecessary noise by scaling up the images. This could work for a case with little to no fog, but would break down as more fog is added because all of the valuable time of arrival data has to be collapsed into only 3 frames. Based on these results, we decided that VGG may not be the right fit for this specific problem and began looking at other models to see if we could get a better performance elsewhere.

Figure 11: These graphs show the progression of accuracy and loss for each training iteration. The red line is for pre-trained VGG, which trained exceptionally fast and had a very high accuracy. The blue line is for VGG trained from scratch, which had a very low accuracy and did not learn, as evidenced by the loss line remaining stagnant.

7.3 AlexNet

After an unsuccessful attempt at training VGG from scratch, we experimented with AlexNet because it is a shallower network that is relatively easy to train with limited resources.

7.3.1 Architecture

AlexNet is much more lightweight than VGG, as it only has 5 convolutional layers and 3 fully connected layers [10]. However, we realized that we needed to perform 3d convolutions to take full advantage of the 3d data we have. In standard AlexNet and VGG models, the convolutions are 2d, meaning that each video frame would be treated as independent of the others. This is not ideal for our dataset because there is a lot of variability in our stochastic model that affects time of arrival. Our model needs to be invariant to minor fluctuations across all 3 dimensions, meaning that we must use 3d convolutions in the model. We used the same architecture as AlexNet, but simply made all the convolutional and max pooling layers 3d operations instead of the original 2d. The modified AlexNet architecture is shown in Figure 12.

Figure 12: The modified 3d AlexNet model, based off the architecture from [10].
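The snippet below sketches, in TensorFlow 1.x style, what swapping a 2d convolution and pooling pair for their 3d counterparts looks like. The filter counts and kernel sizes are illustrative and are not the exact values from Figure 12.

import tensorflow as tf

def first_conv_block_3d(frames):
    # frames: (batch, 64, 58, 58, 1) time-resolved input volume.
    conv = tf.layers.conv3d(frames, filters=96, kernel_size=(3, 11, 11),
                            strides=(1, 4, 4), padding="same",
                            activation=tf.nn.relu)        # convolve over t, x, and y
    pool = tf.layers.max_pooling3d(conv, pool_size=(2, 3, 3),
                                   strides=(2, 2, 2), padding="same")
    return pool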

7.3.2 Training

Once again, we used input data of shape 58 x 58 with a depth of 64 frames that each had one channel. Unfortunately, after training for thousands of iterations, this model had trouble converging. The loss would explode almost immediately to unreasonably high numbers on the order of 10^9. We used a few techniques to try to decrease the loss, as described below:

• Decaying learning rate: Usually when the loss explodes, it means that the learning rate is too high and that the gradients have overshot the optimal point. To solve this, we tried decaying the learning rate so that the model would be likelier to converge (see the sketch after this list).

• Different optimizers: We were initially using the RMSProp optimization function, which we felt might also be overshooting the gradient, so we substituted it for the Adam optimizer, which usually performs better in character recognition tasks such as this one [9].

• Reducing network depth: Removing layers from the network is sometimes necessary when the model goes too deep and tries finding meaning in noise. AlexNet is already a relatively shallow network, so we were not sure how the results would turn out with even fewer layers. We tried this anyway by removing the final two convolutional layers and feeding the output of the first three convolutional layers directly into the fully connected layers.
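The sketch below shows one way the first two mitigations can be wired together in TensorFlow 1.x: an exponentially decaying learning rate fed into the Adam optimizer. The initial rate and decay schedule are placeholder values; the thesis does not report the exact hyperparameters used.

import tensorflow as tf

global_step = tf.train.get_or_create_global_step()
learning_rate = tf.train.exponential_decay(
    learning_rate=1e-4,        # initial learning rate (assumed)
    global_step=global_step,
    decay_steps=1000,          # decay every 1000 training steps (assumed)
    decay_rate=0.9,
    staircase=True)
optimizer = tf.train.AdamOptimizer(learning_rate)
# `loss` is the classification loss defined elsewhere in the training graph:
# train_op = optimizer.minimize(loss, global_step=global_step)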

7.3.3 Results

We could not get AlexNet with our 3d modification to converge on the dataset with zero fog. Figure 13 shows the loss and accuracy for different runs with the tuning described in the previous section.


Figure 13: The accuracy and loss of different 3d AlexNet trials as the models trained. The loss of all the models exponentially exploded, as shown in the right graph, meaning that the models did not learn much.

Decaying the learning rate surprisingly did not have much of an effect on the model.

However, coupling a decaying learning rate with the Adam optimizer did help decrease the loss by a few orders of magnitude, although it was still exceptionally high, on the order of 10^5. Our final attempt, removing some layers, also helped slightly, but nowhere near to the extent that was needed. Overall, AlexNet 3d did not perform well at all, as shown by the high losses and unstable accuracies while training the models.

7.4 C3D

Converting AlexNet to 3d convolutions did not work, so we looked for a network architecture that was specifically built to recognize features from videos using 3d convolutions across the x, y, and time domains.

7.4.1 Architecture

C3D is a little bit deeper than AlexNet, with 8 convolutional layers and 3 fully connected layers [11]. We hoped the added depth was the key to taking full advantage of the extra dimension of data, especially because this model was developed explicitly for the purpose of recognizing objects in videos and has been successful in doing so. We had to modify the input layer to match the shape of our data again, but otherwise, we kept the model the same. The architecture of C3D is shown in Figure 14.

Figure 14: The C3D architecture adapted from [11].

7.4.2 Training

Surprisingly, this model trained faster than 3d AlexNet even though there were more layers and hence more weights in the network. The model converged rapidly, so we were able to train with different datasets that had higher levels of fog. As soon as the model would stabilize and converge, we saved the weights, swapped the training dataset to the new level of fog, and resumed training with the saved weights. We trained this model on two levels: zero fog and medium fog. The differences in input are shown in Figure 15.

Figure 15: The top row of images shows 8 frames of photons arriving at the detector with zero fog. The bottom row shows the arrival of photons in light fog, where µt = 0.01.

7.4.3 Results

The C3D model converged very rapidly, as shown in Figure 16. Once the loss stabilized, at around 3000 training iterations, we switched datasets from zero fog to medium fog (µt = 0.01). When we made the switch, the loss shot up, as expected, but converged again soon after, although the accuracy was 2-3 percentage points lower on the second level of fog. The accuracy on the medium fog was 95%. As we continue training with denser fog, it will be interesting to see how well the model is able to classify the data.

Figure 16: Accuracy (top) and loss (bottom) for the C3D model as it trains. The training and testing lines track each other almost identically, showing that there is minimal overfitting on the models. The split in the curves at the 3000 step mark shows the change in dataset from zero fog to medium fog. The model converged with a 95% accuracy on medium fog as well.

Chapter 8

Future Work

This thesis lays the groundwork for object identification through scattering media with a focus on fog. Given more time, this work can be improved to better represent the dynamic nature of fog and provide a more rigorous validation method. This section presents some ideas for building on the simulation, data collection process, and training methods described in this thesis.

8.1 Simulation

The simulation is already very generalizable so that it can be used for a variety of applications beyond simulating fog. However, the following sections describe some assumptions we made to simplify the simulation. These may need to be adjusted if we wish to make a more rigorous simulator. Additionally, simulation speed is a big limiting factor in our experiments, so we propose a workaround for detecting a higher percentage of the photons simulated.

8.1.1 Target

As mentioned in Section 4.3.3, both ways of modeling target objects have drawbacks: the 3d voxels method is slow, and the 2d image method is a special case that would rarely occur in the real world. Ideally, some combination of both methods would be used such that the simulator is both efficient and accurate.

8.1.2 Speed

While generating data for high levels of fog, we ran into issues where we detected very few photons that had interacted with the target. To receive any strong signal from the target, we would need to simulate more than 10^12 photons to detect a few thousand photons, and it would require weeks to generate a dataset of 60,000 samples. This is infeasible, so we plan to implement a stochastic technique where we artificially send back more photons originating from the target towards the detector. This would reduce the number of photons lost to the fog on the way to the target. The workflow would be as follows:

1. Run the simulation as usual with no target to record a signal of photons interacting with fog.

2. Run the simulation with targets of different sizes for a few cases to record the ratio of photons that have interacted with the target vs. photons that have interacted only with fog. Also record the times at which photons reach the plane of the target object that will be placed later.

3. For each target in the MNIST database, randomly sample a desired photon ratio based on the size of the target, and figure out how many photons from the target are required for a reasonable signal.

4. Randomly sample locations on the target and originate photons from the target headed towards the detector. Using the times recorded in step 1, randomly sample a start time for each photon as well. Keep simulating until the number of detected photons that have interacted with the target meets the ratio determined in step 3.

5. Randomly sample fog photon data from step 1 and combine with data from step 4 to produce a combined signal of photons that have interacted with the target and those that have not.

By running the full simulator only a few times in steps 1 and 2, we are able to generate probabilistic distributions that we can artificially sample from to generate data for the full set much more rapidly. Hopefully, this technique would greatly reduce the number of photons that are simulated but not detected.

8.2 Data Generation

For the scope of this thesis, we limited data collection to a few levels of fog, where the only changed variable was the µt total interaction coefficient, which affects the density of the fog. The setup was also optimized to produce data with minimal noise to make the initial experiments easier. For example, the lens was focused exactly on the target object, even though in reality, the system would not know where to focus because the locations of the objects are not known. This can be remedied by collecting more data with slight, randomized perturbations in each of the parameters, as described in [4]. Below are some of the areas where noise can be added in the data:

• Time jitter: to simulate measurement noise from the SPAD camera, we can add a very small randomly sampled time jitter to the measured time of arrival for every photon.

• Lens focus: for each run of the simulation, the focus distance can be selected at random because in a realistic scenario we would not know where the target object is located.

• Dynamic fog: instead of having a uniform total interaction coefficient (µt) and scattering anisotropy (g) for the entire run of the simulation, these values can also be randomly sampled from a small distribution around a specified base value each time the photon is scattered. This would be a more accurate representation of the constantly changing nature of fog.

8.3 Deep Learning

As new data is collected, it follows that new models would also need to be trained and tuned. However, a key technique that is critical to proving the viability of this system in real-life scenarios would be to test the models on real-life fog data. By training the models on simulation data and testing on real lab measurements, we can get a sense for how realistic the simulation is and whether the data is similar enough to generate useful results.

Chapter 9

Contributions

The contributions of this thesis include the following:

• A proof of concept that object classification through random scattering media is possible using a SPAD camera and deep learning

• A robust and widely-applicable simulator for light traveling through any type of scattering media

• An evaluation of various deep learning model architectures to recognize handwritten digits that are occluded with fog, with the best model evaluated achieving 95% accuracy on light fog

Chapter 10

Conclusion

Imaging through dense fog is a problem that must be solved before autonomous vehicles can be released to the public. This thesis presents a framework for rapidly generating training data for any fog level and evaluating deep learning architectures for this specific use case. This work is an important stepping stone to enabling seeing through fog in extreme scenarios. The proposed future work on this project will hopefully help bring this research out of a simulation and into the scale of the real world.

Bibliography

[1] M. Hess, P. Koepke and I. Schult, "Optical Properties of Aerosols and Clouds: The Software Package OPAC," in Bulletin of the American Meteorological Society, 1998.

[2] Q. Fang and D. A. Boas, "Monte Carlo Simulation of Photon Migration in 3D Turbid Media Accelerated by Graphics Processing Units," in Optics Express, 2009.

[3] E. Dumont, "Semi-Monte Carlo Light Tracing Applied to the Study of Road Visibility in Fog," 2000.

[4] G. Satat, M. Tancik, O. Gupta, B. Heshmat and R. Raskar, "Object Classification through Scattering Media with Deep Learning on Time Resolved Measurement," in Optics Express, 2017.

[5] G. Satat, M. Tancik and R. Raskar, "Towards Photography Through Realistic Fog," in ICCP, 2018.

[6] Y. LeCun, C. Cortes and C. J. Burges, "The MNIST Database of Handwritten Digits".

[7] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg and L. Fei-Fei, "ImageNet Large Scale Visual Recognition Challenge," in International Journal of Computer Vision, 2015.

[8] K. Simonyan and A. Zisserman, "Very Deep Convolutional Networks for Large-Scale Image Recognition," in ICLR, 2015.

[9] S. Ruder, "An overview of gradient descent optimization algorithms," in CoRR, 2016.

[10] A. Krizhevsky, I. Sutskever and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012.

[11] D. Tran, L. Bourdev, R. Fergus, L. Torresani and M. Paluri, "Learning Spatiotemporal Features with 3D Convolutional Networks," in CoRR, 2014.
