
Eye Tracking for the iPhone using Deep Learning

by Harini D. Kannan

S.B., Massachusetts Institute of Technology (2016)

Submitted to the Department of Electrical Engineering and Computer Science in Partial Fulfillment of the Requirements for the Degree of

Master of Engineering in Electrical Engineering and Computer Science

at the

Massachusetts Institute of Technology

February 2017

© Massachusetts Institute of Technology 2017. All rights reserved.

The author hereby grants to MIT permission to reproduce and to distribute publicly paper and electronic copies of this thesis document in whole and in part in any medium now known or hereafter created.

Author: ______ Department of Electrical Engineering and Computer Science, February 3, 2017

Certified by: ______Professor Antonio Torralba, Thesis Supervisor February 3, 2017

Accepted by: ______ Christopher J. Terman, Chairman, Master of Engineering Thesis Committee

Eye Tracking for the iPhone using Deep Learning

by Harini Kannan

Submitted to the Department of Electrical Engineering and Computer Science on February 3, 2017, in partial fulfillment of the requirements for the degree of Master of Engineering in Computer Science and Engineering

Abstract

Accurate eye trackers on the market today require specialized hardware and are very costly. If eye tracking could be made available for free to anyone with a camera phone, the potential impact could be great. For example, free eye-tracking assistive technology could help people with paralysis regain control of their day-to-day activities, such as sending email. The first part of this thesis describes the software implementation and the current performance metrics of the original iTracker neural network, which was published in the CVPR 2016 paper "Eye Tracking for Everyone." This original iTracker network had a 1.86 centimeter error for eye tracking on the iPhone. The second part of this thesis describes the efforts towards creating an improved neural network with a smaller centimeter error. A new error of 1.66 centimeters (an 11% improvement over the previous benchmark) was achieved using ensemble learning with the ResNet10 model with batch normalization.

Thesis Supervisor: Antonio Torralba
Title: Professor

Acknowledgments

Firstly, I would like to thank my mentors, Aditya Khosla and Bolei Zhou. Aditya was a very helpful mentor during the implementation phase of this project, when I worked on implementing the iTracker neural network on the iPhone GPU. When I worked on the modeling side of the project in order to improve the centimeter error, Bolei provided lots of insightful guidance, introducing me to new concepts and ideas. Secondly, I would like to thank my thesis supervisor, Professor Torralba, for many helpful discussions, especially when I was stuck. Finally, I would like to thank my parents and my brother for their unwavering support throughout my MIT career.

Contents

1 Introduction
1.1 Motivations for a software-only eye tracker
1.2 Previous work
1.2.1 GazeCapture dataset
1.2.2 iTracker model
1.3 Goals and direction of M. Eng research project

2 Implementing the iTracker model on the iPhone GPU
2.1 Objectives
2.2 Discussion of software challenges encountered and their solutions
2.3 Implementation details in iOS

3 Dataset analysis
3.1 Analysis of bias
3.2 Reasons behind bias
3.3 Effects of training set bias

4 Improving the model
4.1 First attempt: Classification loss
4.1.1 Motivations
4.1.2 Implementation
4.2 Second attempt: End-to-end model
4.2.1 Motivations
4.2.2 Discussion of incremental improvements
4.2.3 Cropping on the fly
4.2.4 Changing the loss function
4.2.5 Changing the model: Original architecture to AlexNet to ResNet10 with batch normalization
4.2.6 Implementing rectangle crops in Caffe to preserve aspect ratio
4.2.7 Increasing image resolution
4.2.8 Ensemble learning approach

5 Results and Visualizations
5.1 Comparison to random performance
5.2 Visualizations
5.2.1 Generating model attention heatmaps with CAM
5.2.2 Observations

6 Conclusion
6.1 Further work
6.2 Conclusion

List of Figures

1-1 Structure of the iTracker neural network
2-1 Flowchart illustrating the Caffe to iOS conversion process
2-2 Screenshot of the current iPhone app
3-1 3D histogram of ground truth points from the GazeCapture dataset
3-2 3D histogram of predictions on the test set
4-1 Five grids produced by four different shifts
4-2 Centimeter error as a function of image size, shown for both full frame images and cropped face images
4-3 Testing phase of the iterative model, which requires only a full frame image as input
5-1 Full frame, Example 58 in Visualization 2
5-2 Cropped face, Example 58 in Visualization 2
5-3 Example 54 in Visualization 2
5-4 Example 24 in Visualization 3

Chapter 1

Introduction

Eye-tracking is a technique that tracks a user’s eye movements in order to determine where they are looking on a screen. This thesis describes research done towards an accurate eye tracker implemented with only software, in contrast to the multitude of specialized and expensive hardware eye trackers currently on the market.

1.1 Motivations for a software-only eye tracker

Current state-of-the-art eye trackers include the Tobii X2-60, the Eye Tribe, and the Tobii EyeX, all of which are costly and require specialized hardware (such as sensors). Moreover, many eye trackers on the market today work accurately in controlled experimental conditions, but not in real-world conditions. An accurate, free eye tracker has many applications. For example, it could impact education technology by enabling online educators to assess attention patterns while students watch online lecture videos, thereby helping educators to tweak and improve their content. It could impact medical and behavioral studies, as many researchers use eye tracking to study how people respond to various stimuli. Furthermore, eye tracking has the power to transform many people's lives, especially through assistive technology. For people who can no longer use their hands, eye tracking is an alternative way to control electronic devices and resume daily activities, such as sending email. For example, there exist accurate eye trackers to help those with ALS, MND, and other diseases that cause paralysis, but these eye trackers can cost thousands of dollars. This can be an impossible amount of money to set aside for those who are already faced with astronomical health care costs. Inspired by these compelling applications, the aim of this thesis project was to create an accurate eye tracker that only requires software, thereby making it completely free to anyone who already has a normal camera phone. State-of-the-art methods in deep learning make it possible for computers to identify minute details from images. Because of this, deep learning seemed like a natural tool to predict where a user is looking on a screen by extracting minute details from a camera stream of a person.

1.2 Previous work

1.2.1 GazeCapture dataset

Prior to this M. Eng thesis, Krafka et al. made progress towards a software-only eye tracker [4]. They developed a large-scale eye-tracking dataset called GazeCapture, which consisted of images from 1,500 unique test subjects captured by an iPhone or an iPad, along with the cropped face and eye images belonging to each original full-frame image. The images were labeled with an (x, y) coordinate corresponding to the point on the screen that the user looked at when the photo was taken. These images were captured using the iOS GazeCapture application, which flashed a dot that the user was instructed to look at. To ensure that the user actually was looking at the dot, a letter ("L" or "R") was displayed in the dot, and the user was asked to tap the left or right side of the screen depending on the letter he or she saw. Many prior datasets had been collected by inviting participants to a physical location such as a research lab, but this method was hard to scale up and often led to little variation in the dataset. Instead, online crowd-sourcing was used to overcome these difficulties: most of the participants in the dataset were recruited through Amazon Mechanical Turk, a popular crowd-sourcing platform, and others were recruited through studies at the University of Georgia.

Figure 1-1: Structure of the iTracker neural network

1.2.2 iTracker model

A deep learning model called iTracker [4] was trained with Caffe [3] using the GazeCapture dataset. Figure 1-1 shows the structure of the full iTracker neural network, a large neural network constructed from mini neural networks. There are four inputs in total: a right eye image, a left eye image, a face image, and a facegrid. The facegrid is a 25x25 grid of 1's and 0's that indicates where the face is located relative to the rest of the camera view: the 25x25 grid represents the whole camera view, and the region shaded in with 1's represents where the face is located. Each of these four inputs is fed into its own mini neural network. The outputs from the left eye neural network and the right eye neural network are concatenated and fed into a fifth neural network. Finally, the output from this fifth neural network is concatenated with the outputs from the face neural network and the facegrid neural network to form the input for the sixth neural network. This sixth and last neural network outputs an (x, y) coordinate pair, which is the predicted location of the user's gaze. The iTracker model had an average error of 1.86 centimeters on the test set for iPhones, and a 1.71 centimeter error from a 25-crop average. Because it is unrealistic in the real-time case to generate 25 random crops and average the results from all of them, this M. Eng thesis chose to use the 1.86 single-crop centimeter error as the benchmark to improve upon.
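To make the facegrid input concrete, below is a minimal Python sketch of how such a grid can be computed from a face bounding box. The function name, the pixel-coordinate box format, and the use of numpy are illustrative assumptions; this is not the GazeCapture code.

import numpy as np

def make_facegrid(frame_w, frame_h, face_x, face_y, face_w, face_h, grid=25):
    """Return a grid x grid binary mask marking where the face lies in the frame."""
    g = np.zeros((grid, grid), dtype=np.uint8)
    # Scale the face box from pixel coordinates into grid coordinates.
    x0 = int(face_x / frame_w * grid)
    y0 = int(face_y / frame_h * grid)
    x1 = int(np.ceil((face_x + face_w) / frame_w * grid))
    y1 = int(np.ceil((face_y + face_h) / frame_h * grid))
    g[y0:y1, x0:x1] = 1  # the cells covered by the face are shaded with 1's
    return g

# Example: a 640x480 frame with a 180x180 face box at pixel (200, 120).
facegrid = make_facegrid(640, 480, 200, 120, 180, 180)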

1.3 Goals and direction of M. Eng research project

Given the previous work done in this area, the goal of this M. Eng research project was twofold:

1. To implement an iOS application that ran the iTracker model on the iPhone GPU, as this had not been done before.

2. To bring the single-crop error down from 1.86 centimeters.

This thesis will describe the work done to complete both goals. Chapter two describes the software work done to implement the iTracker model on the iPhone. Chapter three provides an analysis of the dataset used and its biases. Chapter four describes the steps taken to bring the error down from 1.86 centimeters to 1.66 centimeters. Chapter five provides visualizations and analysis of the new 1.66 centimeter result. Chapter six concludes the thesis and contains some suggestions for further work.

Chapter 2

Implementing the iTracker model on the iPhone GPU

2.1 Objectives

The objective of this part of the project was to write software to run the iTracker model on the iPhone GPU for real-time eye tracking. Since most mobile deep learning occurs on the back end by communicating with an external server that has a GPU, there are very few deep learning libraries that actually run a model on the iPhone GPU. Part of the challenge of this project was to find and tweak such a library for the purposes of implementing a real-time eye tracker that ran on the iPhone GPU. There are two main advantages to implementing a real-time eye-tracking library with the deep learning done locally on the phone itself, instead of on a backend server:

1. The eye-tracking library would not need the internet and would therefore be much more flexible.

2. In the age of privacy concerns, users would be much less willing to adopt an application that constantly streamed their faces back to a server. Having the deep learning done locally erases this concern and would help users feel more confident about their privacy.

The ultimate goal was to create an eye-tracking library that could theoretically be used in any iOS app. For the purposes of this project, the eye-tracking library was used in a very simple app that simply displayed a dot where the user was looking on the screen.

2.2 Discussion of software challenges encountered and their solutions

Many software challenges were encountered on the way to implementing the real-time eye tracker. This section describes the issues, along with a discussion of the solutions that were implemented and their trade-offs.

No direct Caffe to iOS conversion

Firstly, there were no deep learning libraries that directly and accurately converted a model trained with Caffe to a model that could run on iOS. There was a very new library called DeepLearningKit 1 that did offer a direct Caffe to iOS conversion, but it had many bugs and did not work properly. A different library called DeepBeliefSDK 2 was chosen instead for its correctness, lack of bugs, and optimized performance. However, unlike DeepLearningKit, it did not offer a direct Caffe to iOS conversion. To solve this issue, the Caffe model was first converted to the format of another intermediate library called ccv 3. Then, the intermediate ccv model was used with the DeepBeliefSDK library, which converted it to a model that could be used by iOS. The main challenge of this solution was in writing the scripts to perform the manual conversions between libraries correctly (Caffe to ccv to iOS). For example,

1 https://github.com/DeepLearningKit/DeepLearningKit
2 https://github.com/jetpacapp/DeepBeliefSDK
3 https://github.com/liuliu/ccv

each library had its own conventions for the ordering of the 4D tensor dimensions (e.g. width x height x depth x channels). Since there was no documentation online about the ccv library's conventions vs. DeepBeliefSDK's conventions, all 24 permutations of the ordered dimensions needed to be tested. Other discrepancies between the libraries that were similarly undocumented were the type of mean file (e.g. dimensions of 224x224x3 as opposed to 3x224x224) and the color channels (e.g. a BGR convention as opposed to RGB). Figure 2-1 illustrates this manual conversion process among the libraries. First, a Matlab script was used to convert a caffemodel file into a SQLite file. Then, a script modified from Pete Warden's forked version of the ccv library 4 converted the SQLite file into a .ntwk file, which is the format used by DeepBeliefSDK. Many tests were written to ensure that the large iTracker model was being preserved correctly in the manual conversion process. In the end, the final iOS model outputted (x, y) coordinates that were slightly different from the original Caffe outputs, but since the differences were around 0.01 cm, they were deemed negligible.

Figure 2-1: Flowchart illustrating the Caffe to iOS conversion process
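To illustrate the permutation testing described above, the following Python sketch enumerates all 24 axis orderings of a 4D weight tensor until one reproduces a reference output. Both the function and check_fn are hypothetical stand-ins; the actual conversion scripts were written against the Matlab, ccv, and DeepBeliefSDK tooling.

import itertools
import numpy as np

def find_layout(weights_4d, check_fn):
    """Try every axis ordering of a 4D weight tensor until one passes check_fn.

    check_fn(candidate) is assumed to load the candidate weights into the
    converted network, run it on a test input, and compare the output
    against the reference Caffe output.
    """
    for perm in itertools.permutations(range(4)):
        candidate = np.transpose(weights_4d, perm)
        if check_fn(candidate):
            return perm  # e.g. (0, 2, 3, 1) would map NCHW to NHWC
    raise ValueError("no axis ordering reproduced the reference output")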

Multi-input networks not supported

The iTracker model had four different inputs, and the DeepBeliefSDK library only supported neural networks with one input. To resolve this issue, the iTracker model was split up into mini neural network components. Each of these components was converted one at a time into a corresponding DeepBeliefSDK network. As Figure 1-1 shows, there were six mini neural networks that could be

4 https://github.com/petewarden/ccv

constructed this way: three for the left eye, right eye, and face, and three for the facegrid, eye concatenation, and final concatenation of all inputs. The first set of three neural networks (left eye, right eye, and face) all had convolutional layers, while the second set of three (facegrid, eye concatenation, and final concatenation) did not. Therefore, there were two types of mini neural networks that needed to be implemented: one with convolutional layers, and one without.

Five out of six mini neural networks not supported by ccv

The ccv library had the following two requirements for networks:

1. Networks needed to have at least one convolutional layer (i.e. no networks with only fully connected layers)

2. Each convolutional layer in a network needed to be followed by a fully connected layer.

Two out of the six mini networks did not follow the second requirement (the networks for the left eye and right eye, whose final convolutional layers were not followed by a fully connected layer), and three other networks did not follow the first requirement (the fully-connected-only networks for the facegrid, eye concatenation, and final concatenation). Therefore, in the beginning, five out of the six mini neural networks could not be handled by the ccv library. One possible solution to this issue was to change the ccv library itself, but after multiple days of trying this, it became clear that changing the entire library would be too time-consuming. Instead, the following solutions were implemented. To satisfy the first requirement, fully connected layers were re-implemented within DeepBeliefSDK by reading in the weights of the fully connected layers from a text file and then performing optimized matrix operations. This solution worked since the networks in question were made up of only fully connected layers. However, the tradeoff

in implementing this solution was a larger memory footprint for the app, since the weights of the layers needed to be stored in memory. To satisfy the second requirement, the iTracker model was retrained with an extra fully connected layer after the left eye convolutional network, and a second extra fully connected layer after the right eye convolutional network. The original concern with this solution was that adding unnecessary layers could increase the number of parameters in the model and lead to overfitting (and a larger error on the test set), but the error stayed roughly the same (within 0.01 cm), so this solution was kept.
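The first of the two solutions above, re-implementing a fully connected layer as a plain matrix operation on weights read from a text file, can be sketched as follows. The flat file layout (all weights, then all biases) is an assumption for illustration, not DeepBeliefSDK's actual format.

import numpy as np

def load_fc_layer(path, n_out, n_in):
    """Load a fully connected layer dumped as a flat text file of numbers."""
    data = np.loadtxt(path)
    W = data[: n_out * n_in].reshape(n_out, n_in)  # one row per output neuron
    b = data[n_out * n_in :]
    return W, b

def fc_forward(x, W, b, relu=True):
    """Fully connected forward pass as a single matrix operation: y = Wx + b."""
    y = W @ x + b
    return np.maximum(y, 0.0) if relu else y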

Dependence on external libraries slows the app down

The iTracker model requires a lot of preprocessing, including detecting and extracting the face and eyes from the original frame image. This dependence on external libraries slows the app down, which was one of the motivations for the research towards a newer, single-input, end-to-end model described at the end of this thesis.

2.3 Implementation details in iOS

After the model conversion from Caffe to iOS, what remained was to connect the rest of the app 5. Apple's native face and eye detection libraries were used to gather three of the four inputs: the face image, left eye image, and right eye image. The facegrid was computed separately from the position of the face relative to the rest of the full-frame image. The input images were then converted to a format that could be read by the neural network, which then ran on the GPU of the Apple device and produced the final coordinates. To remain device independent, the final coordinates that the neural network predicted were relative to the camera position on the Apple device. Therefore,

5 https://github.com/harini-kannan/EyeTrackerDemo

there had to be some post-processing of the coordinates to adapt the output to the iPhone 6s that was being used for testing. Finally, after the post-processing, the location of a red dot rendered by a CALayer object was updated with each frame, creating a smoothly moving dot that followed a user's gaze. The work done for this app was published as part of the CVPR 2016 paper "Eye Tracking for Everyone" [4]. Figure 2-2 is a screenshot of the iPhone app. The three image inputs (face, left eye, and right eye) are shown on the screen, along with a red dot that indicates the location of the current gaze. Below are the current performance metrics of the app:

1. Speed - 3 frames per second

2. Memory usage - 500 MB

3. Centimeter error - 1.86 centimeters

A video demo 6 of this app was also created, illustrating the movement of the dot among points on the screen: the upper left corner, upper middle, upper right corner, bottom right corner, bottom middle, bottom left corner, and the center.

6 http://people.csail.mit.edu/hkannan/eye_tracking_demo.html

Figure 2-2: Screenshot of the current iPhone app

Chapter 3

Dataset analysis

Prior to efforts to improve the model, it was important to analyze the original GazeCapture dataset to better understand its characteristics, since the plan was to use it to train a new model. During this analysis, a strong bias in the dataset was discovered: the same small set of twenty-six points appeared over and over again, as shown in Figure 3-1.

3.1 Analysis of bias

The training data for the model in question (the model trained for iPhones in portrait mode) contained 10,677,300 data points that should have been spread out over the 6.836 cm by 12.154 cm possible screen area (the screen size of the largest iPhone, the iPhone 6 Plus). These ground truth data points were plotted on a 3D histogram, where each histogram bucket was 0.12 cm by 0.12 cm, for 56 x 100 = 5600 total buckets. As Figure 3-1 shows, there are 26 "peaks" in the data, where each peak is well over 100,000 points. The rest of the locations on the grid have very few points (on the order of 100 or so). The total number of points contained in these 26 peaks (or 26 grid cells) is 3,804,375. Given that each grid cell is 0.12 cm by 0.12 cm, the total

area represented by these 26 grid cells is 0.3744 square centimeters, while the total area represented by the entire possible screen size is 6.836 cm x 12.154 cm, or 83.085 square centimeters. This means that 3,804,375 / 10,677,300 = 35.6% of the entire dataset is concentrated in 0.3744 / 83.085 = 0.45% of the total area. These 26 peaks were found by looking at the histogram buckets with more than 100,000 samples each. When looking at the histogram buckets with more than 10,000 samples each, there were 48 peaks. Similar calculations for these 48 peaks show that 42.6% of the entire dataset is concentrated in 0.83% of the total area. Furthermore, out of the 5600 histogram buckets, 1116 had no training data at all. This means that 20% of the total area was not covered by the training set.
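These concentration figures can be reproduced with a short numpy sketch. The histogram dimensions follow the analysis above, while the points array (one (x, y) gaze label per row, in centimeters) is an assumed input format.

import numpy as np

def concentration(points, thresh=100_000, w=6.836, h=12.154, nx=56, ny=100):
    """Fraction of gaze data falling in histogram cells with > thresh samples."""
    # 56 x 100 buckets over the screen, each roughly 0.12 cm on a side.
    hist, _, _ = np.histogram2d(points[:, 0], points[:, 1],
                                bins=[nx, ny], range=[[0, w], [0, h]])
    peaks = hist > thresh
    frac_data = hist[peaks].sum() / hist.sum()   # 35.6% in the thesis analysis
    frac_area = peaks.sum() / hist.size          # 0.45% of the screen area
    frac_empty = (hist == 0).sum() / hist.size   # 20% of buckets held no data
    return frac_data, frac_area, frac_empty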

3.2 Reasons behind bias

Upon further inspection, it was observed that the GazeCapture app sent thirteen calibration points to each user. Since there were two supported iPhone screen sizes (a larger and a smaller one), the thirteen calibration points appeared at a different set of locations on each screen size. This resulted in a total of 26 points that appeared over and over again, which is the pattern seen in Figure 3-1.

3.3 Effects of training set bias

The initial concern with this training set distribution was that the original iTracker model had simply been trained to recognize those twenty-six calibration points. Moreover, since the testing set was drawn from the same distribution as the training set and therefore had the same bias, the concern was that the numbers reported in the paper were also biased. In other words, the testing error that was reported may have been smaller than the actual "real life" error. However, when the 3D histogram of the predictions on the test set was plotted, as shown in Figure 3-2, the distribution was relatively flat, and no sharp peaks were observed as in Figure 3-1. This likely happened because the model learned to interpolate well, even though it was fed the same set of twenty-six points over and over again and the training set was not very diverse. Because the effects of the dataset bias were small, the same training set with all of the repeated points was used for the new model. Creating a new training set by removing all the repeated points would have almost halved the training set size, so it was convenient that the effects of the dataset bias were small, allowing the expensive and time-consuming process of re-collecting all the data to be avoided.

Figure 3-1: 3D histogram of ground truth points from the GazeCapture dataset

Figure 3-2: 3D histogram of predictions on the test set

Chapter 4

Improving the model

As described in the introduction, the second goal of this M. Eng project was to improve upon the baseline model's error of 1.86 centimeters.

4.1 First attempt: Classification loss

4.1.1 Motivations

The original iTracker model used a regression loss to predict the (x, y) coordinate of the user's gaze. The first attempt to improve the centimeter error was to use a classification loss instead, motivated by the work of Recasens et al. [7]. Their work implemented a model for gaze following 1, or identifying the objects in a photo that a person in the photo is looking at. Since that problem is similar to the problem of eye tracking, the classification loss used in their model seemed like a good approach for improving the iTracker model's error.

1 http://gazefollow.csail.mit.edu/

4.1.2 Implementation

The idea behind the classification loss function was to divide the output space up into a grid. In the case of eye tracking, the output space is the iPhone screen. A softmax classification loss would be used to predict which square in the grid the user is looking at. In other words, for each square in each grid, the classification loss function would calculate the probability that the location of the user's gaze is in that particular square. The center of the square with the highest probability would be used as the model's prediction. With just one grid, the number of locations the model could predict would be equal to the number of square centers. The regression model has a clear advantage over this, since the regression model's output space is continuous. To mitigate this issue, the original grid was shifted in four different directions (up, down, left, and right) to produce four additional grids (for a total of five grids). Figure 4-1 illustrates this process. For each of these five grids, a softmax classification loss layer was used to predict which square in the grid the user was looking at. The centers of these five chosen squares were weighted by their respective probabilities and then summed to give the final (x, y) prediction. The original grid dimensions were 5x3, so with a screen size of 12.154 cm x 6.836 cm, each grid square was 2.4308 cm x 2.2787 cm. The four shifted grids were created by shifting the grid by half of a square's dimension along the shift axis; this corresponded to a horizontal shift of 1.2154 cm and a vertical shift of 1.13935 cm. However, the final error of this classification model was 1.93 centimeters, which was worse than the baseline iTracker model. Many variations on this classification model were tried, such as 4x6 grids, 7x5 grids, grids shifted by thirds, adding dropout layers of various percentages after a variety of different layers, removing layers to reduce overfitting, shrinking layers to reduce overfitting, trying different loss layers (Euclidean, Cross-Entropy, and both),

26 Figure 4-1: Five grids produced by four different shifts.

trying many different hyperparameter values (for learning rate, weight decay, step size, batch size), and trying different weight initialization methods. However, none of these models did better than the baseline model’s error of 1.86 centimeters.
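For reference, the five-grid readout described above can be sketched in numpy. The axis conventions (x across the 6.836 cm dimension, y along the 12.154 cm dimension) and the final normalization by the total weight are assumptions for illustration; the actual implementation was a set of Caffe layers.

import numpy as np

def grid_centers(w, h, nx, ny, dx, dy):
    """Centers of an nx-by-ny grid over a w-by-h (cm) screen, shifted by (dx, dy)."""
    xs = (np.arange(nx) + 0.5) * (w / nx) + dx
    ys = (np.arange(ny) + 0.5) * (h / ny) + dy
    gx, gy = np.meshgrid(xs, ys)
    return np.stack([gx.ravel(), gy.ravel()], axis=1)  # (nx*ny, 2)

def combine_grids(softmaxes, shifts, w=6.836, h=12.154, nx=3, ny=5):
    """Weight each grid's most probable square center by its probability."""
    pred, total = np.zeros(2), 0.0
    for probs, (dx, dy) in zip(softmaxes, shifts):
        centers = grid_centers(w, h, nx, ny, dx, dy)
        k = int(np.argmax(probs))
        pred += probs[k] * centers[k]
        total += probs[k]
    return pred / total  # final (x, y) prediction in cm

# e.g. shifts = [(0, 0), (w/nx/2, 0), (-w/nx/2, 0), (0, h/ny/2), (0, -h/ny/2)]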

One reason that the classification loss did poorly could be that it was unable to interpolate as well as the regression loss. As discussed in the previous chapter, the training set was heavily skewed towards a set of twenty-six repeated calibration points. Such a training set requires a model that can interpolate well given the limited number of points. Since the regression function has a perfectly continuous and uniform output space, while the classification function does not, the regression function is likely better suited to such interpolation. Because of this, the classification loss was discarded, and the regression loss was kept for future iterations of the model.

4.2 Second attempt: End-to-end model

4.2.1 Motivations

One issue with the baseline model was that it required a lot of preprocessing, since it needed to extract a left eye, right eye, face, and facegrid from a full-frame image. This meant that on the software side, a separate library was needed to perform real-time face detection and eye detection. All of this preprocessing hindered the performance of the app, making the dot movements slow and discrete-looking. This motivated the idea of a neural network that could accurately predict the location of a user's gaze using just a person's face, or even just a full-frame image. This would remove the need for manual feature extraction (extracting the eyes and facegrid, for example). Even if this new end-to-end neural network could only match the existing accuracy, it would cut down the preprocessing time considerably and make the eye tracker run more smoothly. Advantages of such an end-to-end model include the following:

1. Better performance on the app

2. A more elegant model - the full frame already contains all the information the model would need, such as the picture of the eyes, pose of the head, etc. There is not really a need to parse out these individual components and do the feature extraction ourselves if we can build a good enough model.

3. Potential for visualizations - The model could be used to visualize which areas of the image are being utilized the most by the network. For example, the model could show that some neurons learn to activate on the eyes, or that some neurons activate on certain key points on the face to determine head pose. These results would be interesting on their own to shed light on which features are important for eye tracking.

4.2.2 Discussion of incremental improvements

The new model described in the subsequent sections decreased the error from 1.86 centimeters in the baseline model to 1.66 centimeters in the new model, which is an 11% improvement from the baseline. The following sections describe the incremental improvements used to achieve this result: from various improvements on the data side (cropping on the fly, implementing rectangular crops in Caffe, increasing image resolution) to various improvements on the model side (using AlexNet, then ResNet10 and batch normalization, and then an iterative ensemble learning model).

4.2.3 Cropping on the fly

The first improvement concerned the implementation of the training pipeline itself. The previous baseline model had a cropped face training set that had already been augmented in five different ways before being fed into the Caffe model: the cropped face had been moved up, down, left, and right, and kept in the center. The issue with this was twofold. Firstly, the lmdbs took up much more storage than necessary, since they were five times larger than they needed to be. Secondly, pre-augmenting the lmdb trains the model to recognize only the same types of augmentations (up, down, left, right, and center). Randomly cropping images on the fly during training, on the other hand, produces a new crop (and therefore a new position) of the face each time a training image passes through the model. As the models were trained for 50 epochs, this meant that the model saw 50 unique crops of each training image, one for each time it saw the image. This leads to less overfitting and better generalization to the "real-life" test set.
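A minimal sketch of such an on-the-fly random crop is shown below. The real change lived inside Caffe's data layer, so this standalone numpy version is purely illustrative.

import numpy as np

def random_crop(img, crop_h, crop_w, rng=np.random):
    """Crop an HxWxC image to crop_h x crop_w at a random offset."""
    h, w = img.shape[:2]
    top = rng.randint(0, h - crop_h + 1)   # a fresh offset on every call
    left = rng.randint(0, w - crop_w + 1)
    return img[top:top + crop_h, left:left + crop_w]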

4.2.4 Changing the loss function

The original iTracker architecture on 224x224 full face images resulted in a baseline error of 3.14 cm. The first improvement that decreased the centimeter error was to use a custom L1 centimeter loss instead of Caffe's Euclidean loss [3]. Caffe's Euclidean loss is a measure of the L2 distance, not the L1 distance 2. This decision was made because an L1 loss function is more resistant to outliers: the L2 loss function squares the error, so the model is more likely to prioritize fixing the outlier cases rather than the more average examples. A custom L1 loss layer was created in Caffe to implement this change, and the code for this can be found on the author's Github 3. As a result of this new loss layer, the error improved to 2.99 centimeters (a 0.15 centimeter improvement).
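To make the difference concrete, here are the forward pass and gradient of a per-coordinate L1 loss in plain numpy. The author's actual layer was written inside Caffe, so this is a sketch of the math rather than the implementation.

import numpy as np

def l1_loss(pred, label):
    """Mean L1 distance (in cm) between predicted and true (x, y) pairs."""
    return np.mean(np.sum(np.abs(pred - label), axis=1))

def l1_loss_grad(pred, label):
    """Gradient w.r.t. the predictions: sign(error), so outliers are not
    amplified the way they are by the squared (L2) loss."""
    return np.sign(pred - label) / pred.shape[0]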

4.2.5 Changing the model: Original architecture to AlexNet to ResNet10 with batch normalization

The second improvement was to use AlexNet [5] on the full face images instead of the original iTracker architecture. The original iTracker architecture was loosely modeled after the first four layers of AlexNet. As deeper neural networks generally learn better features, using all nine layers of AlexNet would likely result in a more accurate model. This was shown to be true, as the result from using AlexNet was a 2.10 cm error, a 0.89 centimeter improvement. The third improvement was to use the ResNet10 model [1] with batch normalization [2]. ResNet is based on building blocks of conv-relu-conv layers, the output of which is denoted by some F(x). Then, the identity x is added to F(x) to create a new function H(x) = F(x) + x, which is easier for the model to optimize. The ResNet model, made from these building blocks, won 1st place in the ILSVRC 2015 competition 4: AlexNet's top-5 test error was 15.4%, while ResNet's top-5 test error was 3.6%. Therefore, it seemed clear that ResNet would learn better features than AlexNet, which was the main motivation for using the ResNet model. The ten-layer version

2 http://caffe.berkeleyvision.org/doxygen/classcaffe_1_1EuclideanLossLayer.html
3 http://www.github.com/harini-kannan
4 http://image-net.org/challenges/LSVRC/2015/

of ResNet was used in order to train the models more quickly 5, and the rest of this thesis will refer to this version of ResNet as ResNet10. The ResNet10 model was used with batch normalization [2], a fairly new technique that helps to improve performance. Batch normalization works by seeking to reduce internal covariate shift. Covariate shift is when the distribution of inputs into a model changes, and internal covariate shift is when the distribution of neuron activations changes from layer to layer inside a neural network. LeCun et al. [6] and Wiesler et al. [9] showed that whitened inputs, or inputs with a mean of 0 and a variance of 1, make neural networks converge faster than non-whitened inputs. Motivated by this, a batch normalization layer whitens the inputs between each layer in a neural network, which drastically reduces internal covariate shift. During the training phase of the ResNet10 neural network, the batch normalization layers computed their statistics on mini-batches, while during the testing phase, they used the global statistics aggregated during training. Using the ResNet10 model with batch normalization brought the error down to 1.94 centimeters, a 0.16 centimeter improvement.
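For illustration, a single ResNet-style building block with batch normalization can be written as follows in PyTorch. The thesis models were defined in Caffe, and the conv-BN-ReLU-conv-BN ordering here is the standard published form rather than a copy of the thesis architecture.

import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)  # normalizes activations per mini-batch
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        f = self.bn2(self.conv2(self.relu(self.bn1(self.conv1(x)))))
        return self.relu(f + x)  # the residual sum H(x) = F(x) + x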

4.2.6 Implementing rectangle crops in Caffe to preserve aspect ratio

The fourth improvement was to change the input image size to keep the original aspect ratio. All of the original images were 640x480 pixels, but since Caffe only supported square crops, the images needed to be resized to squares. Images resized to 256x256 were used for the model that produced the 1.94 centimeter error. Resizing a rectangular image to a square causes distortion along one axis, which could make the model unable to find key details in the picture. As a result, Caffe was modified to allow rectangular crops: two new parameters (called "crop_h" and "crop_w") were implemented within Caffe. The code for this modified version of Caffe

5 https://github.com/cvjena/cnn-models

can be found on the author's Github 6. When images that were 320x240 were used with a rectangular crop size of 288x224, the result improved to 1.90 centimeters, a 0.04 centimeter improvement.

4.2.7 Increasing image resolution

The fifth improvement was to use an increased resolution for the input images. As current state-of-the-art eye trackers like the Tobii Pro show 7, it is important to have high resolution images so that models can identify minute details in the eyes in order to make a prediction about the user's gaze. The original images were all 640x480 pixels, and input image sizes of 256x256 (with a crop size of 224x224), 384x384 (with a crop size of 352x352), and 480x480 (with a crop size of 448x448) were all tried. Figure 4-2 is a graph illustrating the relationship between image input size and centimeter error: the error decreases roughly linearly as the input image size increases. The best result was the 480x480 image with a crop size of 448x448, which resulted in a 1.767 centimeter error, a 0.13 centimeter improvement.

4.2.8 Ensemble learning approach

The sixth improvement was to use an ensemble learning approach. One important observation from Figure 4-2 is that of the two types of input images tried, full frames and cropped faces, the cropped face inputs always performed better. The cropped face input images were created by first cropping out the face from the full frame, and then enlarging the cropped face to the desired image size. The best full frame model (448x448 input images) resulted in a 1.767 centimeter error, while the best cropped face model (448x448 input images) resulted in a 1.752 centimeter error. When averaging the predictions from both of these models to create an ensemble, the resulting

6 http://www.github.com/harini-kannan
7 http://www.tobiipro.com/learn-and-support/learn/eye-tracking-essentials/how-do-tobii-eye-trackers-work/

error was 1.657 centimeters. This represents around an 11% improvement from the baseline model's error of 1.86 centimeters.

Figure 4-2: Centimeter error as a function of image size, shown for both full frame images and cropped face images

Creating an ensemble model by averaging the predictions from both of these models likely led to a better result because the ensemble model was able to use the strengths of both. The strength of the full frame model was that it provided extra location information about where the face was, similar to the function of the facegrid in the original baseline model. The strength of the cropped face model was that it allowed the model to focus on fine-grained details of the eyes. By averaging these two predictions, the strengths of both models were combined, making it less likely that either one of them would make an extreme prediction that would drive the error higher.

One point that needed to be taken into consideration was the memory usage for

the high resolution images. While the 448x448 input images produced the best result, they also led to a large increase in memory usage. To fit an entire mini-batch of images into a single GPU with 12 GB of memory, the batch size needed to be reduced from 64 to 32. To account for the halving of the batch size, the learning rate was increased by a factor of √2. This slowed down training significantly - both the cropped face and the full frame model took around 3 days to train. Each model was trained for 50 epochs, with a training set size of around 400,000 images and a test set size of approximately 50,000 images.
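The learning-rate adjustment is a one-line calculation; the base rate below is an illustrative placeholder, not the value used in the thesis.

import math

base_lr, base_batch, new_batch = 0.001, 64, 32  # base_lr is a made-up example
# Halving the batch size; the thesis scales the learning rate up by sqrt(2).
new_lr = base_lr * math.sqrt(base_batch / new_batch)  # = base_lr * 1.414...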

This result suggested that it could be possible to merge the full frame and the cropped face models into one model with performance around 1.66 centimeters. Inspired by the work of Rosenfeld and Ullman [8], an iterative model was used, constructed from two smaller models. During the training phase, the first model took in three input lmdbs (an lmdb for the 640x480 full frames, an lmdb for the ground truth gaze labels, and an lmdb for the ground truth face bounding boxes). The face bounding boxes were represented with two points: one for the top left corner, and one for the bottom right corner. Three losses were used: a Euclidean loss measured against the ground truth gaze label, a Euclidean loss measured against the top left corner of the face bounding box, and a Euclidean loss measured against the bottom right corner of the face bounding box. A weighted average of these three losses was used to train the model.

The second model for the training phase was exactly the same as the 480x480 cropped face model. This model took in two input lmdbs (an lmdb for the 480x480 previously cropped faces and an lmdb for the ground truth gaze labels). The output loss was a Euclidean loss on the ground truth labels.

The key novelty of this model is in the testing phase, whose process Figure 4-3 illustrates. For the testing phase, the first model only needs the full frame image - with this, it produces both the bounding box for a cropped face and the coordinates of the user’s gaze. The testing phase of the second model only needs a cropped face,

and with this, it produces the coordinates of the user's gaze. Then, the two sets of predicted gaze coordinates (from the first model and the second model) are averaged together to produce the final result. The novelty of this model lies in the fact that no preprocessing is needed to achieve a better result than the baseline: the model only needs the full frame image as input. In contrast, the baseline model needed a cropped face, left eye, right eye, and a facegrid as input, which required a lot of preprocessing. This new model eliminates the need for any preprocessing and is designed to make real-time performance faster when implemented in an app.
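The testing phase can be summarized with the following hedged sketch, where frame_model and face_model stand in for the two trained networks and the nearest-neighbor resize is a placeholder for a real image library call.

import numpy as np

def resize(img, size):
    """Nearest-neighbor resize (placeholder for a proper image resize)."""
    ys = np.linspace(0, img.shape[0] - 1, size[0]).astype(int)
    xs = np.linspace(0, img.shape[1] - 1, size[1]).astype(int)
    return img[ys][:, xs]

def predict_gaze(frame, frame_model, face_model, face_size=(480, 480)):
    """Run the iterative model's testing phase on one full frame image."""
    gaze1, top_left, bottom_right = frame_model(frame)  # gaze + face box
    (x0, y0), (x1, y1) = top_left, bottom_right
    face = resize(frame[int(y0):int(y1), int(x0):int(x1)], face_size)
    gaze2 = face_model(face)                            # gaze from cropped face
    return (np.asarray(gaze1) + np.asarray(gaze2)) / 2.0  # averaged prediction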

Figure 4-3: Testing phase of the iterative model, which requires only a full frame image as input

Chapter 5

Results and Visualizations

The final result achieved with the new deep learning model described in this thesis was a 1.66 centimeter error, which improved upon the previous baseline of 1.86 centimeters by 11%. Many visualizations were created in order to better understand how this new model makes its predictions, and the analysis of these visualizations is described in Section 5.2.

5.1 Comparison to random performance

Many machine learning models compare their results against random performance as one baseline. Here, two types of random models were used for comparison:

1. Model that randomly picks one of the twenty-six calibration points described in Chapter 3: 4.1 centimeter error

2. Model that picks a point purely at random: 6.6 centimeter error

As shown, the new model with a 1.66 centimeter error does much better than random performance.
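As a sanity check, the purely random baseline can be approximated with a quick Monte Carlo simulation. Sampling both labels and guesses uniformly is a simplifying assumption (the real labels cluster at the calibration points, which is why the thesis figure of 6.6 centimeters differs from this uniform-uniform estimate).

import numpy as np

rng = np.random.default_rng(0)
w, h, n = 6.836, 12.154, 1_000_000
labels = rng.uniform([0, 0], [w, h], size=(n, 2))   # assumed uniform labels
guesses = rng.uniform([0, 0], [w, h], size=(n, 2))  # purely random guesses
print(np.linalg.norm(labels - guesses, axis=1).mean())  # roughly 5 cm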

5.2 Visualizations

5.2.1 Generating model attention heatmaps with CAM

Using the CAM technique 1 developed by Zhou et al. [10], the model's attention on a particular image could be visualized. The CAM technique draws a heatmap over an input image to illustrate which parts of the input image were most important to the model. Below is the full set of visualizations, along with their corresponding links (a sketch of the CAM computation follows the list):

1. Visualization 1: 100 full frame images taken randomly from the test set, along with their corresponding cropped faces 2

2. Visualization 2: The 100 cropped face images with the highest error in the test set, along with the corresponding full frame images 3

3. Visualization 3: The 100 cropped face images with the lowest error in the test set, along with the corresponding full frame images 4

4. Visualization 4: The 100 full frame images with the highest error in the test set, along with the corresponding cropped faces 5

5. Visualization 5: The 100 full frame images with the lowest error in the test set, along with the corresponding cropped faces 6

1 https://github.com/metalbubble/CAM
2 http://people.csail.mit.edu/hkannan/heatmap_random_100_frames/
3 http://people.csail.mit.edu/hkannan/heatmap_highest_100_faces/
4 http://people.csail.mit.edu/hkannan/heatmap_lowest_100_faces/
5 http://people.csail.mit.edu/hkannan/heatmap_highest_100_frames/
6 http://people.csail.mit.edu/hkannan/heatmap_lowest_100_frames/
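As background for how these heatmaps are produced, the CAM computation reduces to a weighted sum of the final convolutional feature maps. The numpy sketch below is a generic rendering of that idea, with the weights of one regression output standing in for the class weights; the variable names and shapes are illustrative.

import numpy as np

def cam(feature_maps, fc_weights, out_index=0):
    """feature_maps: (K, H, W) conv activations; fc_weights: (n_out, K)."""
    w = fc_weights[out_index]                        # weights for one output
    heatmap = np.tensordot(w, feature_maps, axes=1)  # sum_k w_k * f_k(x, y)
    heatmap -= heatmap.min()
    if heatmap.max() > 0:
        heatmap /= heatmap.max()                     # normalize to [0, 1]
    return heatmap  # upsample to the image size before overlaying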

5.2.2 Observations

Generating the attention heatmaps provided valuable data for making observations about the results discussed in the previous chapter. Below are some of these observations:

1. The heat maps for the full frame images look very different from the heat maps for the cropped face images. The model trained on the cropped face images is able to localize on the eyes much better than the model trained on the full frame images. Example 58 in Visualization 2, reproduced in Figure 5-1 and Figure 5-2, illustrates this. This is consistent with the hypothesis described in the previous chapter about the relative strengths of the full frame images and the cropped face images with respect to the ensemble model: the cropped face model is able to localize on fine-grained details like the eyes, while the full frame image provides valuable information about the head pose.

2. With regards to the cropped face model's ability to localize on the eyes, it appears that there could be some room for improvement. Example 54 in Visualization 2, reproduced in Figure 5-3, illustrates this. The model does not focus its attention on the eyes, and instead focuses its attention on the nose. This suggests that the model could have some trouble identifying the eyes, especially with blurry images. This observation could inform the next generation of models to improve the centimeter error, perhaps by identifying models that localize on the eyes better.

3. Both the cropped face model and the full frame model highlight the nose many times. Example 24 in Visualization 3, reproduced in Figure 5-4, illustrates this. This implies that the orientation of the nose acts as a key point that signals which direction the head is pointing.

Figure 5-1: Full frame, Example 58 in Visualization 2 (see previous footnotes for link).

Figure 5-2: Cropped face, Example 58 in Visualization 2 (see previous footnotes for link).

Figure 5-3: Example 54 in Visualization 2 (see previous footnotes for link).

Figure 5-4: Example 24 in Visualization 3 (see previous footnotes for link).

Chapter 6

Conclusion

6.1 Further work

While this thesis describes a new model with a 1.66 centimeter error, there is still room for further work to bring the centimeter error down even lower. Below are three ways that this further work could develop:

1. Collecting more data from a variety of subjects. As Krafka et al. showed [4], increasing the number of subjects brought down the centimeter error significantly.

2. Developing an accurate deep learning model that uses calibration. The effects of calibration were not explored in this thesis, since the scope was to create a calibration-free eye tracker, but calibration could be used in the future to create a much more accurate eye tracker.

3. Developing a deep learning model that localizes on the eyes more accurately. As the analysis from the visualizations in Chapter 5 showed, there seems to be room for better localization for the eyes. Finding ways to better localize on the eyes could help the model find minute details in the eyes better, and therefore lead to a better centimeter error.

6.2 Conclusion

This thesis discussed the development of a software-only eye tracker with an error of 1.66 centimeters, an 11% improvement over the baseline error of 1.86 centimeters. The first part of this thesis discussed the novel implementation of the original iTracker model on the iPhone GPU using DeepBeliefSDK, while the second part discussed the range of incremental improvements made to decrease the centimeter error. These improvements included changes on the data side (cropping on the fly, implementing rectangular crops in Caffe, larger image resolution) and on the model side (using AlexNet, then ResNet10 with batch normalization, and then an iterative ensemble learning model). The final model was an iterative model inspired by an ensemble learning approach that needed only a full frame image to make a final prediction, in contrast to the original iTracker model, which needed heavy preprocessing and four different inputs (cropped face, cropped left eye, cropped right eye, and facegrid). In conclusion, this thesis developed an accurate, calibration-free, software-only eye tracker that could eventually be used for a variety of applications, such as free assistive technology that could one day help people with paralysis control their phones with just their eyes.

Bibliography

[1] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.

[2] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Journal of Machine Learning Research (JMLR), 2015.

[3] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.

[4] Kyle Krafka*, Aditya Khosla*, Petr Kellnhofer, Harini Kannan, Suchendra Bhandarkar, Wojciech Matusik, and Antonio Torralba. Eye tracking for everyone. In Computer Vision and Pattern Recognition (CVPR), 2016.

[5] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NIPS), 2012.

[6] Yann LeCun, Léon Bottou, Genevieve B. Orr, and Klaus-Robert Müller. Efficient backprop. In Neural Networks: Tricks of the Trade, 1998.

[7] Adria Recasens*, Aditya Khosla*, Carl Vondrick, and Antonio Torralba. Where are they looking? In Advances in Neural Information Processing Systems (NIPS), 2015.

[8] Amir Rosenfeld and Shimon Ullman. Visual concept recognition and localization via iterative introspection. arXiv preprint arXiv:1603.04186v2, 2016.

[9] Simon Wiesler and Hermann Ney. A convergence analysis of log-linear training. In Advances in Neural Information Processing Systems (NIPS), 2011.

[10] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. Learning deep features for discriminative localization. In Computer Vision and Pattern Recognition (CVPR), 2016.
