

Article

Deep-Learning-Based Pupil Center Detection and Tracking Technology for Visible-Light Wearable Gaze Tracking Devices

Wei-Liang Ou, Tzu-Ling Kuo, Chin-Chieh Chang and Chih-Peng Fan *

Department of Electrical Engineering, National Chung Hsing University, 145 Xingda Rd., South Dist., Taichung City 402, Taiwan; [email protected] (W.-L.O.); [email protected] (T.-L.K.); [email protected] (C.-C.C.) * Correspondence: [email protected]

Abstract: In this study, for the application of visible-light wearable eye trackers, a pupil tracking methodology based on deep-learning technology is developed. By applying deep-learning object detection technology based on the You Only Look Once (YOLO) model, the proposed pupil tracking method can effectively estimate and predict the center of the pupil in the visible-light mode. When the developed YOLOv3-tiny-based model is used to test the pupil tracking performance, the detection accuracy is as high as 80%, and the recall rate is close to 83%. In addition, the average visible-light pupil tracking errors of the proposed YOLO-based deep-learning design are smaller than 2 pixels for the training mode and 5 pixels for the cross-person test, which are much smaller than those of the previous ellipse-fitting design without deep-learning technology under the same visible-light conditions. After combination with the calibration process, the average gaze tracking errors of the proposed YOLOv3-tiny-based pupil tracking models are smaller than 2.9 and 3.5 degrees in the training and testing modes, respectively, and the proposed visible-light wearable gaze tracking system performs at up to 20 frames per second (FPS) on the GPU-based software embedded platform.

Keywords: deep-learning; YOLOv3-tiny; visible-light; pupil tracking; gaze tracker; wearable eye tracker

1. Introduction

Recently, wearable gaze tracking devices (also called eye trackers) have become more widely used in human-computer interaction (HCI) applications. To evaluate people's attention deficit, vision, and cognitive information and processes, gaze tracking devices have been well-utilized in this field [1–6]. In [4], eye-tracking technology was applied as a methodology to examine hidden cognitive processes. The authors developed a program source code debugging procedure that was inspected by using the eye-tracking

method. The eye-tracking-based analysis was elaborated to test the debug procedure and record the major parameters. In [5], an eye-movement tracking device was used to observe and analyze a complex cognitive process for C# programming. By evaluating the knowledge level and eye movement parameters, the test subjects were involved in analyzing the readability and intelligibility of the query and method in the Language-Integrated Query (LINQ) declarative queries of the C# programming language. In [6], the forms and efficiency of the debugging parts of software development were observed by tracking eye movements with the participation of test subjects. The routes of gaze of the test subjects were observed continuously during the procedure, and the coherences were decided by the assessment of the recorded metrics.

Many up-to-date consumer products have applied embedded gaze tracking technology in their designs. Through the use of gaze tracking devices, the technology improves the functions and applications of intuitive human-computer interaction.


Gaze tracking designs can use an image sensor based on near-infrared (NIR), as was done in [7–16]. Some previous eye-tracking designs used high-contrast eye images recorded by active NIR image sensors. NIR image sensor-based technology can achieve high accuracy when used indoors. However, the NIR-based design is not suitable for use outdoors or in sunlight. Moreover, long-term operation may have the potential to hurt the user's eyes. When eye trackers are considered for general and long-term use, visible-light pupil tracking technology becomes a suitable solution. Due to the characteristics of the selected sensor module, the contrast of the visible-light eye image may be insufficient, and the nearfield eye image may contain some random types of noise. In general, visible-light eye images usually include insufficient contrast and large random noise. In order to overcome this inconvenience, some other methods [17–19] have been developed for visible-light-based gaze tracking design. For the previous wearable eye tracking designs in [15,19] with the visible-light mode, the estimation of the pupil position was easily disturbed by light and shadow occlusion, and the prediction error of the pupil position thus became serious. Figure 1 illustrates some of the visible-light pupil tracking results obtained by the previous ellipse fitting-based methods in [15,19]. A large pupil tracking error will lead to inaccuracy of the gaze tracking operation. In [20,21], for indoor applications, the designers developed eye tracking devices based on visible-light image sensors. A previous study [20] performed experiments indoors, but showed that the device could not work normally outdoors. For a wearable eye-tracking device to be used correctly in all indoor and outdoor environments, the method based on the visible-light image sensor is sufficient.

Figure 1. Visible-light pupil tracking results by the previous ellipse fitting-based method.

Related Works

In recent years, the progress of deep-learning technologies has become very important, especially in the fields of computer vision and pattern recognition. Through deep-learning technology, inference models can be trained to realize real-time object detection and recognition. For the deep-learning-based designs in [22–37], the eye tracking systems could detect the location of eyes or pupils, regardless of using visible-light or near-infrared image sensors. Table 1 gives an overview of the deep-learning-based gaze tracking designs. From the viewpoint of device setup, the deep-learning-based gaze tracking designs in [22–37] can be divided into two different styles: the non-wearable and the wearable types.

For the deep-learning-based non-wearable eye tracking designs, in [22], the authors proposed a deep-learning-based gaze estimation scheme to estimate the gaze direction from a single face image. The proposed gaze estimation design was based on applying multiple convolutional neural networks (CNN) to learn the model for gaze estimation from the eye images. The proposed design provided accurate gaze estimation for users with different head poses by including the head pose information into the gaze estimation framework. Compared with the previous methods, the developed gaze estimation design increased the accuracy of appearance-based gaze estimation under head pose variations.

In [23], the purpose of the work was to use CNNs to classify eye tracking data. Firstly, a CNN was used to classify two different web interfaces for browsing news data. Secondly, a CNN was utilized to classify the nationalities of users. In addition, data-preprocessing and feature-engineering techniques were applied. In [24], for emerging consumer electronic products, an accurate and efficient eye gaze estimation design was important, and

the products were required to remain reliable in difficult environments with low power consumption and low hardware cost. In that work, a new hardware-friendly CNN model with minimal computational requirements was developed and assessed for efficient appearance-based gaze estimation. The proposed CNN model was tested and compared against existing appearance-based CNN approaches, achieving better eye gaze accuracy with large reductions of computational complexity.

In [25], the iRiter technology was developed, which can assist paralyzed people to write on a screen by using only their eye movements. iRiter precisely detects the movement of the eye pupil by referring to the reflections of external near-infrared (IR) signals. Then, a deep multilayer perceptron model was used for the calibration process. Using the position vector of the pupil area, the coordinates of the pupil were mapped to the given gaze position on the screen.

In [26], the authors proposed a gaze zone estimation design by using deep-learning technology. Compared with traditional designs, the developed method did not need a calibration procedure. In the proposed method, a Kinect was used to capture the video of a computer user, a Haar cascade classifier was applied to detect the face and eye regions, and the data on the eye region were used to estimate the gaze zones on the monitor by the trained CNN. The proposed calibration-free method achieved high accuracy and could be applied to human-computer interaction.

In [27], which focused on healthcare applications, high accuracy in the semantic segmentation of medical images was important; a key issue for training the CNNs was obtaining large-scale and precisely annotated images, and the authors sought to address the lack of annotated data by using an eye tracking method. The hypothesis was that segmentation masks generated with the help of eye tracking would be very similar to those rendered by hand annotation, and the results demonstrated that the eye tracking method created segmentation masks which were suitable for deep-learning-based semantic segmentation.

In [29], for the conditions of unconstrained gaze estimation, most of the gaze datasets had been collected under laboratory conditions, and the previous methods were not evaluated across multiple datasets. In this work, the authors studied key challenges, including the target gaze range, illumination conditions, and facial appearance variation. The study showed that the image resolution and the use of both eyes affected gaze estimation performance. Moreover, the authors proposed the first deep appearance-based gaze estimation method, i.e., GazeNet, to improve the state of the art by 22% for the cross-dataset evaluation.

In [30], the estimation of the human gaze was influenced by subconscious responses related to human mental activity. The previous methodologies were only based on the use of gaze statistical modeling and classifiers for assessing images, without referring to a user's interests. In this work, the authors introduced a large-scale annotated gaze dataset, which was suitable for training deep-learning models. Then, a novel deep-learning-based method was used to capture gaze patterns for assessing image objects with respect to the user's preferences. The experiments demonstrated that the proposed method performed effectively by considering key factors related to human gaze behavior.

In [31], the authors proposed a two-step training network, named the Gaze Estimator, to raise the gaze estimation accuracy on mobile devices.
At the first step, an eye landmarks localization network was trained on the 300W-LP dataset, and the second step was to train a gaze estimation model on the GazeCapture dataset. In [28], the network localized the eye precisely on the image, and the CNN network was robust when facial expressions and occlusion happened; the inputs of the gaze estimation network were eye images and eye grids.

In [32], since many previous methods were based on a single camera and focused on either the gaze point estimation or the gaze direction estimation, the authors proposed an effective multitask method for the gaze point estimation by multi-view cameras. Specifically, the authors analyzed the close relationship between the gaze point estimation and gaze direction estimation, and used a partially shared convolutional neural network to estimate

the gaze direction and gaze point. Moreover, the authors introduced a new multi-view gaze tracking dataset consisting of multi-view eye images of different subjects. In [33], the U2Eyes dataset was introduced; U2Eyes is a binocular database of synthesized images regenerating real gaze tracking scripts with 2D/3D labels. The U2Eyes dataset could be used in machine- and deep-learning technologies for eye tracker designs.

In [34], when using appearance-based gaze estimation techniques, relying on only a single camera for the gaze estimation limits the application field to short distances. To overcome this issue, the authors developed a new long-distance gaze estimation design. In the training phase, the Learning-based Single Camera eye tracker (LSC eye tracker) acquired gaze data by a commercial eye tracker, the face appearance images were captured by a long-distance camera, and deep convolutional neural network models were used to learn the mapping from appearance images to gazes. In the application phase, the LSC eye tracker predicted gazes based on the appearance images acquired by the single camera and the trained CNN models with effective accuracy and performance.

In [35], the authors introduced an effective eye-tracking approach by using a deep-learning-based method. The location of the face relative to the computer was obtained by detecting the color from the infrared LED with OpenCV, and the user's gaze position was inferred by the YOLOv3 model. In [36], the authors developed a low-cost and accurate remote eye-tracking device which used an industrial prototype smartphone with integrated infrared illumination and camera. The proposed design used a 3D gaze-estimation model which enabled accurate point-of-gaze (PoG) estimation with free head and device motion. To accurately determine the input eye features, the design applied convolutional neural networks with a novel center-of-mass output layer. The hybrid method, which used artificial illumination, a 3D gaze-estimation model, and a CNN feature extractor, achieved better accuracy than the existing eye-tracking methods on smartphones.

On the other hand, for the deep-learning-based wearable eye tracking designs, in [28], the eye tracking method with video-oculography (VOG) images needed to provide an accurate localization of the pupil under artifacts and naturalistic low-light conditions. The authors proposed a fully convolutional neural network (FCNN) for the segmentation of the pupil area that was trained on 3946 hand-annotated VOG images. The FCNN output simultaneously performed pupil center localization, elliptical contour estimation, and blink detection with a single network on commercial workstations with GPU acceleration. Compared with existing methods, the developed FCNN-based pupil segmentation design was accurate and robust for new VOG datasets. In [34], to conquer the obstructions and noises in nearfield eye images, the nearfield visible-light eye tracker design utilized deep-learning-based gaze tracking technologies to solve the problem.

In order to make the pupil tracking estimation more accurate, this work uses the YOLOv3-tiny [38] network-based deep-learning design to detect and track the position of the pupil. The proposed pupil detector is a well-trained deep-learning network model.
Compared with the previous designs, the proposed method reduces the pupil tracking errors caused by light reflection and shadow interference in visible-light mode, and will also improve the accuracy of the calibration process of the entire eye tracking system. Therefore, the gaze tracking function of the wearable eye tracker will become powerful under visible-light conditions.

Table 1. Overview of the deep-learning-based gaze tracking designs.

| Deep-Learning-Based Methods | Dataset Used | Operational Mode/Setup | Eye/Pupil Tracking Method | Gaze Estimation and Calibration Scheme |
|---|---|---|---|---|
| Sun et al. [22] | UT Multiview | Visible-light mode / Non-wearable | Uses a facial landmarks-based design to locate eye regions | Multiple pose-based and VGG-like CNN models for gaze estimation |
| Lemley et al. [24] | MPII Gaze | Visible-light mode / Non-wearable | The CNN-based eye detection | Joint eye-gaze CNN architecture with both eyes for gaze estimation |
| Cha et al. [26] | Self-made dataset | Visible-light mode / Non-wearable | Uses the OpenCV method to extract the face region and eye region | GoogleNetV1-based gaze estimation |
| Zhang et al. [29] | MPII Gaze | Visible-light mode / Non-wearable | Uses face alignment and 3D face model fitting to find eye zones | VGGNet-16-based GazeNet for gaze estimation |
| Lian et al. [32] | ShanghaiTechGaze and MPII Gaze | Visible-light mode / Non-wearable | Uses the ResNet-34 model for eye feature extraction | CNN-based multi-view and multi-task gaze estimation |
| Li et al. [34] | Self-made dataset | Visible-light mode / Non-wearable | Uses the OpenFace CLM-framework/Face++/YOLOv3 to extract facial ROI and landmarks | Infrared-LED-based calibration |
| Rakhmatulin et al. [35] | Self-made dataset | Visible-light mode / Non-wearable | Uses YOLOv3 to detect the eyeball and eye's corners | Infrared-LED-based calibration |
| Brousseau et al. [36] | Self-made dataset | Infrared mode / Non-wearable | Eye-feature locator CNN model | 3D gaze-estimation model-based design |
| Yiu et al. [28] | German center for vertigo and balance disorders | Near-eye infrared mode / Wearable | Uses a fully convolutional neural network for pupil segmentation and uses ellipse fitting to locate the eye center | 3D eyeball model, marker, and projector-assisted-based calibration |
| Proposed design | Self-made dataset | Near-eye visible-light mode / Wearable | Uses the YOLOv3-tiny-based lite model to detect the pupil zone | Marker and affine transform-based calibration |

The main contribution of this study is that the pupil location in nearfield visible-light eye images is detected precisely by the YOLOv3-tiny-based deep-learning technology. To the best of our knowledge from surveying related studies, the novelty of this work is to provide the leading reference design of deep-learning-based pupil tracking technology using nearfield visible-light eye images for the application of wearable gaze tracking devices. Moreover, the other novelties and advantages of the proposed deep-learning-based pupil tracking technology are described as follows:

(1) By using a nearfield visible-light eye image dataset for training, the used deep-learning models achieved real-time and accurate detection of the pupil position in the visible-light mode.

(2) The proposed design detects the pupil object at any eyeball movement condition, which is more effective than the traditional image processing methods without deep-learning technologies in the near-eye visible-light mode.

(3) The proposed pupil tracking technology can efficiently overcome the light and shadow interferences in the near-eye visible-light mode, and the detection accuracy of a pupil's

location is higher than that of previous wearable gaze tracking designs without using deep-learning technologies.

To enhance detection accuracy, we used the YOLOv3-tiny-based deep-learning model to detect the pupil object in the visible-light near-eye images. We also compared its detection results to those of the other designs, which include the methods without deep-learning technologies. The pupil detection performance was evaluated in terms of precision, recall, and pupil tracking errors. By the proposed YOLOv3-tiny-based deep-learning method, the trained YOLO-based deep-learning models achieve enough precision with small average pupil tracking errors, which are less than 2 pixels for the training mode and 5 pixels for the cross-person test under the near-eye visible-light conditions.

2. Background of Wearable Eye Tracker Design

Figure 2 depicts the general two-stage operational flow of the wearable gaze tracker with the four-point calibration scheme. The wearable gaze tracking device can use the NIR or visible-light image sensor-based gaze tracking design, which can apply appearance-, model-, or learning-based technology. First, the gaze tracking device estimates and tracks the pupil's center to prepare for the calibration process. For the NIR image sensor-based design, for example, the first-stage computations involve the processes of pupil location, extraction of the pupil's region of interest (ROI), binarization with a two-stage scheme in the pupil's ROI, and a fast scheme of ellipse fitting with the binarized pupil's edge points [15]. Then, the pupil's center can be detected effectively. For the visible-light image sensor-based design, for example, by using the gradients-based iris center location technology [21] or the deep-learning-based methodology [37], the gaze tracking device realizes an effective iris center tracking capability. Next, at the second processing stage, the function and process for calibration and gaze prediction can be done. To increase the accuracy of gaze tracking, the head movement effect can be compensated by a motion measurement device [39]. In the following sections, for the proposed wearable gaze tracking design, the first-stage processing procedure uses the deep-learning-based pupil tracking technology with a visible-light nearfield eye image sensor. Next, the second-stage process achieves the function of calibration for the gaze tracking and prediction work.

Figure 2. The general processing flow of the wearable gaze tracker.
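For clarity, the first-stage pupil localization described above (binarization of the pupil ROI followed by ellipse fitting on the extracted edge points) can be sketched with standard OpenCV calls. This is only an illustrative approximation of the cited pipeline [15]; the fixed threshold and the morphological cleanup step are assumptions, not the original two-stage binarization scheme.

```python
import cv2

def ellipse_fit_pupil_center(eye_roi_gray, thresh_value=60):
    """Approximate baseline: binarize a pupil ROI and fit an ellipse to its contour.

    eye_roi_gray: grayscale nearfield eye ROI (uint8).
    thresh_value: assumed fixed threshold; the cited design uses a two-stage scheme.
    Returns the (x, y) ellipse center, or None if no usable contour is found.
    """
    # Dark pupil region becomes foreground after inverse binary thresholding
    _, binary = cv2.threshold(eye_roi_gray, thresh_value, 255, cv2.THRESH_BINARY_INV)
    # Light morphological cleanup to suppress eyelash/reflection noise
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    binary = cv2.morphologyEx(binary, cv2.MORPH_OPEN, kernel)

    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
    if not contours:
        return None
    largest = max(contours, key=cv2.contourArea)
    if len(largest) < 5:          # cv2.fitEllipse needs at least 5 contour points
        return None
    (cx, cy), _, _ = cv2.fitEllipse(largest)
    return (cx, cy)
```

As noted in the Introduction, this kind of ellipse-fitting baseline is sensitive to light reflections and shadow occlusion in the visible-light mode, which motivates the deep-learning-based detector used in this work.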

3. Proposed Methods

The proposed wearable eye tracking method is divided into several steps as follows: (1) Collecting the visible-light nearfield eye images into datasets and labeling these eye image datasets; (2) Choosing and designing the deep-learning network architecture for pupil object detection; (3) Training and testing the trained inference model; (4) Using the deep-learning model to infer and detect the pupil object, and estimating the center coordinate of the pupil box; (5) Following the calibration process and predicting the gaze points. Figure 3 depicts the computational process of the proposed visible-light gaze tracking design based on the YOLO deep-learning model. In the proposed design, a visible-light camera module is used

to capture nearfield eye images. The design uses a deep-learning network based on YOLO to train an inference model to detect the pupil object, and then the proposed system calculates the pupil center from the detected pupil box for the subsequent calibration and gaze tracking process.

Figure 3. The operational flow of the proposed visible-light-based wearable gaze tracker.

Figure 4 demonstrates the scheme of the self-made wearable eye tracking device. In the proposed wearable device, both the outward-view scene camera and the nearfield eye camera are required to provide the whole function of the wearable gaze tracker. The proposed design uses deep-learning-based technology to detect and track the pupil zone, estimate the pupil's center by using the detected pupil object, and then use the coordinate information to perform and predict the gaze points. Compared with the previous image processing methods of tracking the pupil in the visible-light mode, the proposed method can detect the pupil object at any eyeball position, and it is also not affected by light and shadow interferences in real applications.

Figure 4. The self-made visible-light wearable gaze tracking device.
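The five processing steps listed above can be summarized as a single runtime loop over the two cameras of the wearable device. The sketch below is only a minimal illustration of that loop; the function names detect_pupil_box and map_to_gaze are placeholders for the trained YOLOv3-tiny detector and the calibration mapping described in Sections 3.1 and 3.2, not APIs released with this work.

```python
import cv2

def run_gaze_tracking(eye_cam_id, scene_cam_id, detect_pupil_box, map_to_gaze):
    """Minimal runtime loop: nearfield eye frame -> pupil box -> pupil center -> gaze point."""
    eye_cam = cv2.VideoCapture(eye_cam_id)
    scene_cam = cv2.VideoCapture(scene_cam_id)
    while True:
        ok_eye, eye_frame = eye_cam.read()
        ok_scene, scene_frame = scene_cam.read()
        if not (ok_eye and ok_scene):
            break
        box = detect_pupil_box(eye_frame)                # YOLOv3-tiny-based pupil detector
        if box is not None:
            x, y, w, h = box                             # detected pupil bounding box
            pupil_center = (x + w / 2.0, y + h / 2.0)
            gaze_x, gaze_y = map_to_gaze(pupil_center)   # calibrated pseudo-affine mapping
            cv2.circle(scene_frame, (int(gaze_x), int(gaze_y)), 8, (0, 0, 255), 2)
        cv2.imshow("scene with predicted gaze", scene_frame)
        if cv2.waitKey(1) & 0xFF == ord('q'):
            break
    eye_cam.release()
    scene_cam.release()
    cv2.destroyAllWindows()
```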

3.1. Pre-Processing for Pupil Center Estimation and Tracking

In the pre-processing work, the visible-light nearfield eye images are recorded for the datasets, and the collected eye image datasets are labeled for training and testing the trained inference model. Moreover, we choose several YOLO-based deep-learning models for pupil object detection. By using the YOLO-based deep-learning model to detect the pupil object, the proposed system can track and estimate the center coordinate of the pupil box for calibration and gaze prediction.


3.1.1. Generation of Training and Validation Datasets

In order to collect nearfield eye images from eight different users for training and testing, each user wore the self-made wearable eye tracker to emulate the gaze tracking operations, and the users rotated their gaze around the screen so that the nearfield eye images were captured with various eyeball positions. The collected and recorded near-eye images cover all possible eyeball positions. Figure 5 illustrates the nearfield visible-light eye images collected for training and in-person tests. In Figure 5, the numbers of near-eye images from person 1, person 2, person 3, person 4, person 5, person 6, person 7, and person 8 are 849, 987, 734, 826, 977, 896, 886, and 939, respectively. Moreover, the near-eye images captured from person 9 to person 16 are used for the cross-person test. Thus, the nearfield eye images from person 1 to person 8 without closed eyes are selected, of which approximately 900 images per person are used for training and testing. The image datasets consist of a total of 7094 eye patterns. Next, these selected eye images go through a labeling process before training the inference model for pupil object detection and tracking. To increase the number of images in the training phase, the data augmentation process, which randomly changes the angle, saturation, exposure, and hue of the selected image dataset, is enabled by using the framework of YOLO. After the 160,000 iterations for training the YOLOv3-tiny-based models, the total number of images used for training is up to 10 million.

Figure 5. Collected datasets of nearfield visible-light eye images.
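The augmentation that the YOLO framework applies on the fly (random changes to rotation angle, saturation, exposure, and hue) can be emulated offline with a short OpenCV routine. The sketch below is only an illustrative stand-in for the framework's built-in augmentation; the jitter ranges are assumed values, not the settings used for training in this work.

```python
import cv2
import numpy as np

def augment_eye_image(img_bgr, max_angle=7.0, sat_jitter=1.5, exp_jitter=1.5, hue_shift=5):
    """Randomly rotate and jitter the saturation/exposure/hue of a nearfield eye image."""
    h, w = img_bgr.shape[:2]

    # Random rotation about the image center
    angle = np.random.uniform(-max_angle, max_angle)
    rot = cv2.getRotationMatrix2D((w / 2.0, h / 2.0), angle, 1.0)
    out = cv2.warpAffine(img_bgr, rot, (w, h), borderMode=cv2.BORDER_REPLICATE)

    # Random hue/saturation/value (exposure) jitter in HSV space
    hsv = cv2.cvtColor(out, cv2.COLOR_BGR2HSV).astype(np.float32)
    hsv[..., 0] = (hsv[..., 0] + np.random.uniform(-hue_shift, hue_shift)) % 180
    hsv[..., 1] *= np.random.uniform(1.0 / sat_jitter, sat_jitter)
    hsv[..., 2] *= np.random.uniform(1.0 / exp_jitter, exp_jitter)
    hsv = np.clip(hsv, 0, 255).astype(np.uint8)
    return cv2.cvtColor(hsv, cv2.COLOR_HSV2BGR)
```

Note that a geometric change such as rotation would also require transforming the labeled pupil box accordingly, which the YOLO framework handles internally during training.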

Before training the model, the designer must identify the correct image object (or answer) in the image datasets, to teach the model what to learn and where the correct object is located in the whole image; this is commonly known as the ground truth, so the labeling process must be done before the training process. The position of the pupil center varies in the different nearfield eye images. Figure 6 demonstrates the labeling process for the location of the pupil's ROI object. For the following learning process of the pupil zone detector, the size of the labeled box also affects the detection accuracy of the network architecture. In our experiments, firstly, a labeled box with a 10 × 10 pixel size was tested, and then a labeled box with a size ranging from 20 × 20 pixels to 30 × 30 pixels was adopted. The size of the labeled bounding box is closely related to the number of features analyzed by the deep-learning architecture. In the labeling process [40], in order to improve the

accuracy of the ground truth, the ellipse fitting method was applied for the pre-positioning of the pupil's center, after which we used a manual method to further correct the coordinates of the pupil center in each selected nearfield eye image. Before the training process, all labeled images (i.e., 7094 eye images) are randomly divided into a training part, verification part, and test part, with corresponding ratios of 8:1:1. Therefore, the number of labeled eye images used for training is 5676, and the numbers of labeled images used for verification and for testing are 709 each.
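The 8:1:1 split and the YOLO-style label files can be produced with a few lines of Python. The snippet below is a sketch under the assumption that each labeled pupil box is stored as pixel coordinates; Darknet-style YOLO labels expect the class index followed by the box center and size normalized to the image dimensions.

```python
import random

def to_yolo_label(box, img_w, img_h, class_id=0):
    """Convert a pixel box (x_min, y_min, w, h) to a Darknet-style label line:
    '<class> <cx> <cy> <w> <h>' with all values normalized to [0, 1]."""
    x, y, w, h = box
    cx = (x + w / 2.0) / img_w
    cy = (y + h / 2.0) / img_h
    return f"{class_id} {cx:.6f} {cy:.6f} {w / img_w:.6f} {h / img_h:.6f}"

def split_dataset(image_paths, ratios=(0.8, 0.1, 0.1), seed=42):
    """Randomly split labeled images into training/verification/test parts (8:1:1)."""
    paths = list(image_paths)
    random.Random(seed).shuffle(paths)
    n = len(paths)
    n_train = int(n * ratios[0])
    n_val = int(n * ratios[1])
    return paths[:n_train], paths[n_train:n_train + n_val], paths[n_train + n_val:]

# With the 7094 labeled eye images reported here, a plain int() split gives 5675/709/710;
# the exact 5676/709/709 partition depends on how the rounding is handled.
```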

Figure 6. The labeling process for the location of the pupil's ROI.

3.1.2. The Used YOLO-Based Models and Training/Validation Process

The proposed design uses the YOLOv3-tiny deep-learning model. The reasons for choosing a network model based on YOLOv3-tiny are: (1) The YOLOv3-tiny-based model achieves good detection and recognition performance at the same time; (2) The YOLOv3-tiny-based model is effective for small object detection, and the pupil in the nearfield eye image is also a very small object. In addition, because the number of convolutional layers in YOLOv3-tiny is small, the inference model based on YOLOv3-tiny can achieve real-time operations on the software-based embedded platform.

In the original YOLOv3-tiny model, as shown in Figure 7, each detection layer uses three anchor boxes to predict the bounding boxes in a grid cell. However, only one pupil object category needs to be detected in each nearfield eye image, so the number of required anchor boxes can be reduced. Figure 8 illustrates the network architecture of the YOLOv3-tiny model with only one anchor. Besides reducing the number of required anchor boxes, since the image features of the detected pupil object are not intricate, the number of convolutional layers can also be reduced. Figure 9 shows the network architecture of the YOLOv3-tiny model with one anchor and a one-way path. In Figure 9, since the number of convolutional layers is reduced, the computational complexity can also be reduced effectively.

Before the labeled eye images are fed into the YOLOv3-tiny-based deep-learning models, the parameters used for training the deep-learning networks must be set properly, e.g., batch, decay, subdivision, momentum, etc., after which the inference models for pupil center tracking will be achieved accurately. In our training design for the models, the parameters of Width/Height, Channels, Batch, Max_batches, Subdivision, Momentum, Decay, and Learning_rate are set to 416, 3, 64, 500200, 2, 0.9, 0.0005, and 0.001, respectively.
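Reducing the number of classes and anchors also changes the number of filters required in each YOLO detection layer. In the Darknet convention, that filter count is (number of classes + 5) multiplied by the number of anchors assigned to the detection layer; the short sketch below is an assumption-level illustration of how the pupil-only configuration shrinks the detection head, not code taken from this work.

```python
def yolo_detection_filters(num_classes: int, num_anchors: int) -> int:
    """Darknet-style filter count for a YOLO detection layer:
    each anchor predicts (tx, ty, tw, th, objectness) plus one score per class."""
    return (num_classes + 5) * num_anchors

# Default YOLOv3-tiny head (80 COCO classes, 3 anchors per detection layer)
print(yolo_detection_filters(80, 3))   # 255
# Pupil-only head with a single anchor, as in the modified models of Figures 8 and 9
print(yolo_detection_filters(1, 1))    # 6
```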
In the architecture of the YOLOv3-tiny model, 13 convolution layers with different filter numbers and pooling layers with different strides are established to allow the YOLOv3-tiny model to have a high enough learning capability for training from images. In addition, the YOLOv3-tiny-based model not only uses the convolutional neural network, but also adds upsampling and concatenation operations. During the learning process, the YOLO model uses the training dataset for training, uses the images in the validation dataset to test the model, observes whether the model is indeed learning correctly, uses back-propagation of the weights to make them gradually converge, and observes whether the weights converge by monitoring the loss function.

Figure 7. The network architecture of the original YOLOv3-tiny model.

Figure 8. The network architecture of the YOLOv3-tiny model with one anchor.

Figure 9. The network architecture of the YOLOv3-tiny model with one anchor and a one-way path.

Figure 10 depicts the training and testing processes by the YOLO-based models for tracking pupils. In the detection and classification process by the YOLO model, the grid cell, anchor box, and activation function are used to determine the position of the object bounding box in the image. Therefore, the well-trained model can detect the position of the pupil in the image with the bounding box. The detected pupil box can be compared with the ground truth to generate the intersection over union (IoU), precision, and recall values; we use these values to judge the quality of the model training.

Figure 10. The training and testing processes by the YOLO-based models for tracking pupils.

Table 2 presents the confusion matrix used for the performance evaluation of classification. In pattern recognition for information retrieval and classification, recall is referred to as the true positive rate or sensitivity, and precision is referred to as the positive predictive value.


Table 2. Confusion matrix for classification.

|  | Actual Positives | Actual Negatives |
|---|---|---|
| Predicted Positives | True Positives (TP) | False Positives (FP) |
| Predicted Negatives | False Negatives (FN) | True Negatives (TN) |

Both precision and recall are based on an understanding and measurement of relevance. Based on Table 2, the definition of precision is expressed as follows:

Precision = TP/(TP + FP), (1)

and recall is defined as follows:

Recall = TP/(TP + FN). (2)
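As a concrete illustration of how these quantities are computed from the detector output, the helpers below evaluate the IoU of a predicted pupil box against its ground-truth box and then accumulate precision and recall from the resulting TP/FP/FN counts. The IoU acceptance threshold of 0.5 and the one-pupil-per-frame counting convention are assumptions made for this sketch.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, width, height)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    ix1, iy1 = max(ax, bx), max(ay, by)
    ix2, iy2 = min(ax + aw, bx + bw), min(ay + ah, by + bh)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def precision_recall(detections, ground_truths, iou_threshold=0.5):
    """Apply Eqs. (1)-(2) over paired frames, assuming one labeled pupil per frame."""
    tp = fp = fn = 0
    for det, gt in zip(detections, ground_truths):
        if det is None:
            fn += 1                      # pupil present but not detected
        elif iou(det, gt) >= iou_threshold:
            tp += 1                      # detection overlaps the ground truth well enough
        else:
            fp += 1                      # mislocalized detection counted as a false positive
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall
```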

After using the YOLO-based deep-learning model to detect the pupil's ROI object, the midpoint coordinate of the pupil's bounding box can be found, and this midpoint coordinate value is the estimated pupil center point. The midpoint coordinates of the estimated pupil centers are evaluated with the Euclidean distance by comparing the estimated coordinates with the coordinates of the ground truth, and the tracking errors of the pupil's center are thereby obtained for the performance evaluation. Using the pupil's central coordinates for the coordinate conversion of the calibration, the proposed wearable device can correctly estimate and predict the outward gaze points.
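The pupil-center estimate and its pixel error are therefore simple functions of the detected box, as sketched below (a minimal illustration; the numbers in the example are hypothetical and only show the arithmetic).

```python
import math

def box_center(box):
    """Center of a detected box given as (x_min, y_min, width, height)."""
    x, y, w, h = box
    return (x + w / 2.0, y + h / 2.0)

def pupil_tracking_error(detected_box, gt_center):
    """Euclidean distance (in pixels) between the detected box center and the
    manually corrected ground-truth pupil center."""
    cx, cy = box_center(detected_box)
    gx, gy = gt_center
    return math.hypot(cx - gx, cy - gy)

# Hypothetical example: a 26 x 24 box detected at (150, 118) versus ground truth (162.0, 131.5)
print(pupil_tracking_error((150, 118, 26, 24), (162.0, 131.5)))  # ~1.8 pixels
```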

3.2. Calibration and Gaze Tracking Processes

Before using the wearable eye tracker to predict the gaze points, the gaze tracking device must undergo a calibration procedure. Because the appearance of each user's eyes and the application distance to the near-eye camera are different, the calibration must be done to confirm the coordinate mapping scheme between the pupil's center and the outward scene, after which the wearable gaze tracker can correctly predict the gaze points corresponding to the outward scene. Figure 11 depicts the calibration process used by the proposed wearable gaze tracker. The steps of the applied calibration process are described as follows:

Figure 11. The used calibration process of the proposed wearable gaze tracker.

At the first step, the system collects the correspondence between the center point coordinates of the pupils captured by the nearfield eye camera and the gaze points simulated by the outward camera. Before calibration, in addition to recording the positions of the center points of the pupil movement, the trajectory of the markers captured by the outward camera must also be recorded. Figure 12 shows the trajectory diagram of tracking the pupil's center with the nearfield eye image for calibration. In addition, Figure 13 reveals the corresponding trajectory diagram of tracking the four markers' center points with the outward image for calibration.

Figure 12. Diagram of tracking pupil’s center with the nearfield eye camera for calibration.

In Figure 12, the yellow line indicates the sequence of the movement path while gazing at the markers, and the red rectangle indicates the range of the eye movement. In Figure 13, when the calibration process is executed, the user gazes at the four markers sequentially from ID 1, ID 2, ID 3, to ID 4, and the system records the coordinates of the pupil movement path while watching the gazing coordinates of the outward scene with the markers. Then, the proposed system performs the coordinate conversion to achieve the calibration effect.

Figure 13. Diagram of tracking four markers' center points with the outward camera for calibration. ① When the user gazes at the marker ID1 for calibration. ② When the user gazes at the marker ID2 for calibration. ③ When the user gazes at the marker ID3 for calibration. ④ When the user gazes at the marker ID4 for calibration.

The pseudo-affine transform is used for the coordinate conversion between the pupil and the outward scene. The equation of the pseudo-affine conversion is expressed as follows:

$$\begin{bmatrix} X \\ Y \end{bmatrix} = \begin{bmatrix} a_1 & a_2 & a_3 & a_4 \\ a_5 & a_6 & a_7 & a_8 \end{bmatrix} \begin{bmatrix} X_p \\ Y_p \\ X_p Y_p \\ 1 \end{bmatrix}, \qquad (3)$$

where (X, Y) is the outward coordinate of the markers' center point coordinate system, and (Xp, Yp) is the nearfield eye coordinate of the pupils' center point coordinate system. In (3), the unknown transformation coefficients a1~a8 need to be solved from the coordinate values computed and estimated from (X, Y) and (Xp, Yp). In order to carry out the calculation of the transformation parameters a1~a8 for calibration, Equation (4) describes the pseudo-affine transformation with four-point calibration, which is used to solve the transformation coefficients of the coordinate conversion. The equation of the pseudo-affine conversion with four-point calibration is expressed as follows:

$$\begin{bmatrix} X_1 \\ Y_1 \\ X_2 \\ Y_2 \\ X_3 \\ Y_3 \\ X_4 \\ Y_4 \end{bmatrix} =
\begin{bmatrix}
X_{p1} & Y_{p1} & X_{p1}Y_{p1} & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & X_{p1} & Y_{p1} & X_{p1}Y_{p1} & 1 \\
X_{p2} & Y_{p2} & X_{p2}Y_{p2} & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & X_{p2} & Y_{p2} & X_{p2}Y_{p2} & 1 \\
X_{p3} & Y_{p3} & X_{p3}Y_{p3} & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & X_{p3} & Y_{p3} & X_{p3}Y_{p3} & 1 \\
X_{p4} & Y_{p4} & X_{p4}Y_{p4} & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & X_{p4} & Y_{p4} & X_{p4}Y_{p4} & 1
\end{bmatrix} \cdot
\begin{bmatrix} a_1 \\ a_2 \\ a_3 \\ a_4 \\ a_5 \\ a_6 \\ a_7 \\ a_8 \end{bmatrix}, \qquad (4)$$

where (X1, Y1), (X2, Y2), (X3, Y3), and (X4, Y4) are the outward coordinates of the four markers' center points, and (Xp1, Yp1), (Xp2, Yp2), (Xp3, Yp3), and (Xp4, Yp4) are the nearfield eye coordinates of the four pupil center points, respectively.
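To make the two equations concrete, the following NumPy sketch solves the 8 × 8 linear system of Equation (4) for the coefficients a1~a8 from four pupil/marker correspondences and then applies Equation (3) to map a new pupil center to outward-scene coordinates. The numeric calibration points are invented for illustration only; they are not measurements from the paper.

```python
import numpy as np

def solve_pseudo_affine(pupil_pts, marker_pts):
    """Solve Equation (4): build the 8x8 system from four pupil-center /
    marker-center correspondences and return the coefficients a1..a8."""
    A = np.zeros((8, 8))
    b = np.zeros(8)
    for i, ((xp, yp), (X, Y)) in enumerate(zip(pupil_pts, marker_pts)):
        A[2 * i] = [xp, yp, xp * yp, 1, 0, 0, 0, 0]
        A[2 * i + 1] = [0, 0, 0, 0, xp, yp, xp * yp, 1]
        b[2 * i], b[2 * i + 1] = X, Y
    return np.linalg.solve(A, b)

def map_gaze(a, xp, yp):
    """Apply Equation (3): map a pupil center (xp, yp) to the outward scene."""
    basis = np.array([xp, yp, xp * yp, 1.0])
    return a[:4] @ basis, a[4:] @ basis

# Hypothetical calibration data for markers ID 1-4 (pixels)
pupil = [(310, 240), (410, 238), (408, 300), (312, 303)]        # nearfield eye coords
marker = [(200, 150), (1400, 150), (1400, 1050), (200, 1050)]   # outward-scene coords
a = solve_pseudo_affine(pupil, marker)
print(map_gaze(a, 360, 270))   # predicted gaze point for a new pupil center
```

A least-squares solver could replace np.linalg.solve if more than four calibration points were collected.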


4. Experimental Results and Comparisons

In the experimental environment, the general illuminance value measured by the spectrometer is 203.68 lx. To verify the proposed pupil tracking design under a strong light environment, we use a desk lamp to create the effect, and the measured illuminance is 480.22 lx. On the contrary, for verifying the proposed pupil tracking design under a darker environment, the illuminance value of the experimental environment is set to 86.36 lx. In addition to achieving real-time operation effects, the used YOLOv3-tiny-based deep-learning models can solve the problem of light and shadow interferences that cannot be solved effectively by the traditional pupil tracking methods. In our experiments, a computing server with two NVIDIA GPU accelerator cards was used to train the pupil tracking models, and a personal computer with a 3.4 GHz CPU and an NVIDIA GPU card was used to test the YOLO-based inference models. The visible-light wearable gaze tracker utilized two camera modules for shooting: one is an outward scene camera module with a resolution of 1600 × 1200 pixels, and the other is a visible-light nearfield eye camera module, which provides a resolution of 1280 × 720 pixels and can reach 30 frames per second.

Through the proposed YOLOv3-tiny-based deep-learning methods to track the position of the pupil center, the size of the pupil's detection box is between 20 × 20 and 30 × 30 pixels; the pupil center at any position of the eyeball can be correctly inferred, and the inference results are shown in Figure 14. Table 3 lists the performance comparison of visible-light pupil object detection among the three YOLO-based models. In Table 3, the three YOLO-based deep-learning models achieve a pupil detection precision of up to 80%, and the recall rate is up to 83%.
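Although the exact post-processing of the detector output is not spelled out here, a natural way to obtain the pupil center from a YOLO-style detection is to take the center of the highest-confidence pupil bounding box, as in the hedged sketch below; the (confidence, x, y, w, h) tuple format and the numbers are assumptions for illustration.

```python
def pupil_center_from_detections(detections):
    """Pick the highest-confidence pupil box and return its center point.
    `detections` is assumed to be a list of (confidence, x, y, w, h) tuples in
    pixel coordinates; the deployed model's actual output format may differ."""
    if not detections:
        return None  # no pupil detected in this frame
    conf, x, y, w, h = max(detections, key=lambda d: d[0])
    return (x + w / 2.0, y + h / 2.0)  # box center used as the pupil center

# Example: two candidate boxes of roughly 20x20 to 30x30 pixels
print(pupil_center_from_detections([(0.62, 300, 250, 22, 21), (0.91, 305, 248, 26, 25)]))
```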

Figure 14. The inference results by using the YOLOv3-tiny-based model.

Table 3. Performance comparisons of visible-light pupil object detection among the three YOLO-based models.

Model | Precision (Validation) | Recall (Validation) | Precision (Test) | Recall (Test)
YOLOv3-tiny (original) | 81.40% | 83% | 78.43% | 81%
YOLOv3-tiny (one anchor) | 80.19% | 84% | 80.50% | 83%
YOLOv3-tiny (one anchor/one way) | 79.79% | 82% | 76.82% | 81%
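For reference, the precision and recall values in Table 3 follow the usual object-detection definitions; the snippet below computes them from true-positive, false-positive, and false-negative counts. The counts used in the example are placeholders, not the paper's data.

```python
def precision_recall(tp, fp, fn):
    """Detection precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Placeholder counts for illustration only
print(precision_recall(tp=814, fp=186, fn=167))  # roughly (0.814, 0.830)
```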


For the tested eye images, to evaluate the accuracy of the trained YOLOv3-tiny-based models, Table 4 lists the comparison results of the visible-light pupils' center tracking errors between the YOLOv3-tiny design and the previous designs without using deep-learning technologies. In Table 4, the applied YOLOv3-tiny-based design produces smaller pupil tracking errors than the previous designs in [15,21]. The average pupil tracking error of the proposed deep-learning-based model is less than 2 pixels, which is much smaller than the pupil tracking error of the previous ellipse fitting design in [15]. Compared with the previous designs in [15,21], which do not use deep-learning technology, the proposed deep-learning-based design has very small pupil tracking errors under various visible-light conditions.

Table 4. Comparisons of the visible-light pupils’ center tracking errors between the YOLOv3-tiny and the previous ellipse fitting based designs.

Methods | Ellipse Fitting Based Design [15] | Gradients-Based Design [21] | YOLOv3-Tiny (Original)
Average tracking errors (pixels) | Larger than 50 | 5.99 | 1.62
Standard deviation (pixels) | 63.1 | N/A | 2.03
Max/min errors (pixels) | 294/0 | 40/0 | 21/0

To verify the pupil tracking errors of the different YOLOv3-tiny-based deep-learning models, Table 5 lists the comparison results of the visible-light pupils' center tracking errors among the three YOLO-based models, where the datasets of person 1 to person 8 are used for the training mode. In Table 5, the average pupil tracking errors of the three YOLOv3-tiny-based deep-learning models are less than 2 pixels. Table 6 lists the comparison results of the visible-light pupils' center tracking errors among the three YOLO-based models for the cross-person testing mode. In Table 6, the average pupil tracking errors of the three YOLOv3-tiny-based deep-learning models are less than 5 pixels.
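The mean errors and variances reported in Tables 4–6 can be understood as simple statistics over per-frame Euclidean distances between predicted and labeled pupil centers; the sketch below shows one plausible way to aggregate them, under the assumption that ground-truth centers are available per frame (the authors' exact evaluation script is not shown).

```python
import numpy as np

def center_tracking_error(pred_centers, gt_centers):
    """Mean Euclidean pupil-center error (pixels) and its variance over frames."""
    pred = np.asarray(pred_centers, dtype=float)
    gt = np.asarray(gt_centers, dtype=float)
    errors = np.linalg.norm(pred - gt, axis=1)  # per-frame distance in pixels
    return errors.mean(), errors.var()

# Hypothetical predicted vs. labeled centers for a few frames
print(center_tracking_error([(101, 99), (120, 118)], [(100, 100), (121, 120)]))
```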

Table 5. Comparisons of the visible-light pupils’ center tracking errors among the three YOLO-based models by using the datasets of person 1 to person 8 for the training mode.

Model | Mean Errors (Pixels) | Variance (Pixels)
YOLOv3-tiny (original) | 1.62 | 2.03
YOLOv3-tiny (one anchor) | 1.58 | 1.87
YOLOv3-tiny (one anchor/one way) | 1.46 | 1.75

Table 6. Comparisons of the visible-light pupils’ center tracking errors among the three YOLO-based models by using the datasets of person 9 to person 16 for the cross-person testing mode.

Model | Mean Errors (Pixels) | Variance (Pixels)
YOLOv3-tiny (original) | 4.11 | 2.31
YOLOv3-tiny (one anchor) | 3.99 | 2.27
YOLOv3-tiny (one anchor/one way) | 4.93 | 2.31

Table 7 describes the comparison of computational complexities among the three YOLOv3-tiny-based models. In Table 7, "BFLOPS" means billion float operations per second. Using a personal computer with an NVIDIA GeForce RTX 2060 card, the processing frame rate of pupil tracking with the YOLOv3-tiny-based models is up to 60 FPS. Using a personal computer with an NVIDIA GeForce RTX 2080 card, the processing frame rate is up to 150 FPS. Moreover, when using the NVIDIA Xavier embedded platform [41], the processing frame rate of pupil tracking with the YOLOv3-tiny-based models is up to 88 FPS.
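As a rough way to reproduce such frame-rate numbers on other hardware, one can time repeated forward passes after a short warm-up, as in the sketch below; run_inference is a placeholder for whatever detection call the deployment actually uses, and the measured value will of course depend on the platform.

```python
import time

def measure_fps(run_inference, frames, warmup=10):
    """Rough FPS measurement: warm up, then time the remaining frames."""
    for frame in frames[:warmup]:
        run_inference(frame)  # warm up GPU kernels and caches
    start = time.perf_counter()
    for frame in frames[warmup:]:
        run_inference(frame)
    elapsed = time.perf_counter() - start
    return (len(frames) - warmup) / elapsed

# Usage (hypothetical): fps = measure_fps(model.predict, eye_frames)
```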

Table 7. Comparisons of computational complexities among the three YOLO-based models.

Model | BFLOPS
YOLOv3-tiny (original) | 5.448
YOLOv3-tiny (one anchor) | 5.441
YOLOv3-tiny (one anchor/one way) | 5.042

Figure 15 illustrates the calculation of the gaze tracking errors, and the expression of the calculation method is described in Equation (5) as follows:

$$\theta = \tan^{-1}\left(\frac{E}{D}\right) \qquad (5)$$

Figure 15. Diagram of estimating gaze tracking errors.

In Equation (5), the angle θ reflects the gaze tracking error (in degrees), E is the absolute error (in pixels) at the gaze plane, and D is the application distance between the user and the gaze plane, where D is set to 55 cm. For the evaluation of the gaze tracking errors, Figure 16a depicts the four target markers used for calibration in the training mode, and Figure 16b presents the five target points used for estimation in the testing mode. For the training mode, using the four target markers shown in Figure 16a, Table 8 lists the comparison of gaze tracking errors among the three applied YOLOv3-tiny-based models. After being joined with the calibration process, the average gaze tracking errors of the proposed YOLOv3-tiny-based pupil tracking models are less than 2.9 degrees for the training mode. Table 9 lists the comparison of the gaze tracking errors among the three YOLOv3-tiny-based models for the testing mode using the five target points shown in Figure 16b. In Table 9, the average gaze tracking errors of the proposed YOLOv3-tiny-based pupil tracking models are less than 3.5 degrees.
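Equation (5) can be evaluated directly once the absolute error on the gaze plane has been expressed in the same physical unit as D; the sketch below assumes the pixel error E has already been converted to centimeters on the gaze plane using a known scale, which is an assumption on our part rather than a step detailed in the paper.

```python
import math

def gaze_error_degrees(error_on_plane_cm, distance_cm=55.0):
    """Equation (5): angular gaze error from the absolute error on the gaze plane."""
    return math.degrees(math.atan(error_on_plane_cm / distance_cm))

print(gaze_error_degrees(2.5))  # a 2.5 cm offset at 55 cm is about 2.6 degrees
```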

Figure 16. (a) Four target markers used for calibration for the training mode. (b) Five target points used for estimation for the testing mode.


Table 8. Comparisons of gaze tracking errors among the three YOLOv3-tiny-based models for the training mode.

Model | Mean Gaze Tracking Errors, θ (Degrees) | Variance (Degrees)
YOLOv3-tiny (original) | 2.131 | 0.249
YOLOv3-tiny (one anchor) | 2.22 | 0.227
YOLOv3-tiny (one anchor/one way) | 2.959 | 0.384

Table 9. Comparisons of gaze tracking errors among the three YOLOv3-tiny-based models for the testing mode.

Model | Mean Gaze Tracking Errors, θ (Degrees) | Variance (Degrees)
YOLOv3-tiny (original) | 2.434 | 0.468
YOLOv3-tiny (one anchor) | 2.57 | 0.268
YOLOv3-tiny (one anchor/one way) | 3.523 | 0.359

Table 10 reveals the performance comparison among the different deep-learning-based gaze tracking designs. In general, the gaze tracking errors of a wearable eye tracker are less than those of non-wearable eye trackers. For the wearable eye tracker designs, since the infrared-like oculography images have a higher contrast and fewer light, noise, and shadow interferences than the near-eye visible-light images, the method in [28] achieves smaller pupil tracking and gaze estimation errors than our proposed design. However, when the comparison is focused on the setup with near-eye visible-light images, the proposed YOLOv3-tiny-based pupil tracking methods provide the best, state-of-the-art pupil tracking accuracy.

Table 10. Performance comparison of the deep-learning-based gaze tracking designs.

Deep-Learning-Based Methods | Operational Mode/Setup | Pupil Tracking Errors | Gaze Estimation Errors
Sun et al. [22] | Visible-light mode/Non-Wearable | N/A | Mean errors are less than 7.75 degrees
Lemley et al. [24] | Visible-light mode/Non-Wearable | N/A | Less than 3.64 degrees
Cha et al. [26] | Visible-light mode/Non-Wearable | N/A | Accuracy 92.4% by 9 gaze zones
Zhang et al. [29] | Visible-light mode/Non-Wearable | N/A | Mean errors are 10.8 degrees for cross-dataset / less than 5.5 degrees for cross-person
Lian et al. [32] | Visible-light mode/Non-Wearable | N/A | Less than 5 degrees
Li et al. [34] | Visible-light mode/Non-Wearable | N/A | Less than 2 degrees at training mode / less than 5 degrees at testing mode
Rakhmatulin et al. [35] | Visible-light mode/Non-Wearable | N/A | Less than 2 degrees
Brousseau et al. [36] | Infrared mode/Non-Wearable | Less than 6 pixels | Gaze-estimation bias is 0.72 degrees
Yiu et al. [28] | Near-eye infrared mode/Wearable | Median accuracy is 1.0 pixel | Less than 0.5 degrees
Proposed design | Near-eye visible-light mode/Wearable | Less than 2 pixels for the training mode / less than 5 pixels for the cross-person testing mode | Less than 2.9 degrees for the training mode / less than 3.5 degrees for the testing mode

Figure 17 demonstrates the implementation result of the proposed gaze tracking design on the NVIDIA Xavier embedded platform [41]. Using the embedded software implementation with Xavier, the proposed YOLOv3-tiny-based visible-light wearable gaze tracking design operates at up to 20 frames per second, which is suitable for practical consumer applications.

Figure 17. Implementation of the proposed gaze tracking design on the NVIDIA Xavier embedded platform.


5. Conclusions and Future Work

For the application of visible-light wearable gaze trackers, a YOLOv3-tiny-based deep-learning pupil tracking methodology is developed in this paper. By applying YOLO-based object detection technology, the proposed pupil tracking method effectively estimates and predicts the center of the pupil in the visible-light mode. By using the developed YOLO-based model to test the pupil tracking performance, the accuracy is up to 80%, and the recall rate is close to 83%. In addition, the average visible-light pupil tracking errors of the proposed YOLO-based deep-learning design are smaller than 2 pixels for the training mode and 5 pixels for the cross-person testing mode, which are much smaller than those of the previous designs in [15,21] that do not use deep-learning technology under the same visible-light conditions. After the calibration process, the average gaze tracking errors of the proposed YOLOv3-tiny-based pupil tracking models are smaller than 2.9 and 3.5 degrees for the training and testing modes, respectively. The proposed visible-light wearable gaze tracking design performs at up to 20 frames per second on the NVIDIA Xavier embedded platform.

In this design, the application distance between the user's head and the screen is fixed during the operation, and the direction of the head pose is kept as fixed as possible. In future works, a head movement compensation function will be added, so that the proposed wearable gaze tracker will be more convenient and friendly for practical use. To raise the high-precision recognition ability of pupil location and tracking, the deep-learning model will be updated with online training using big image databases in the cloud to fit the pupil position for different eye colors and eye textures.

Author Contributions: Conceptualization, W.-L.O. and T.-L.K.; methodology, T.-L.K. and C.-C.C.; software and validation, W.-L.O. and T.-L.K.; writing—original draft preparation, T.-L.K. and C.-P.F.; writing—review and editing, C.-P.F.; supervision, C.-P.F. All authors have read and agreed to the published version of the manuscript. Funding: This work was supported by the Ministry of Science and Technology, Taiwan, R.O.C. under Grant MOST 109-2218-E-005-008. Informed Consent Statement: Not applicable. Data Availability Statement: The data presented in this study are available on request from the corresponding author. Conflicts of Interest: The authors declare no conflict of interest.

References 1. Chennamma, H.R.; Yuan, X. A Survey on Eye-Gaze Tracking Techniques. Indian J. Comput. Sci. Eng. (IJCSE) 2013, 4, 388–393. 2. Li, D.; Babcock, J.; Parkhurst, D.J. OpenEyes: A Low-Cost Head-Mounted Eye-Tracking Solution. In Proceedings of the 2006 ACM Eye Tracking Research and Applications Symposium, San Diego, CA, USA, 27–29 March 2006; pp. 95–100. 3. Fan, C.P. Design and Implementation of Wearable Gaze Tracking Device with Near-Infrared and Visible-Light Image Sensors. In Advances in Modern Sensors-Physics, Design, and Application; Sinha, G.R., Ed.; IOP Publishing: Bristol, UK, 2020; pp. 12.1–12.14. 4. Katona, J.; Kovari, A.; Costescu, C.; Rosan, A.; Hathazi, A.; Heldal, I.; Helgesen, C.; Thill, S. The Examination Task of Source-code Debugging Using GP3 Eye Tracker. In Proceedings of the 10th International Conference on Cognitive Infocommunications (CogInfoCom), Naples, Italy, 23–25 October 2019; pp. 329–334. 5. Katona, J.; Kovari, A.; Heldal, I.; Costescu, C.; Rosan, A.; Demeter, R.; Thill, S.; Stefanut, T. Using Eye-Tracking to Examine Query Syntax and Method Syntax Comprehension in LINQ. In Proceedings of the 11th International Conference on Cognitive Infocommunications (CogInfoCom), Mariehamn, Finland, 23–25 September 2020; pp. 437–444. 6. Kővári, A.; Katona, J.; Costescu, C. Evaluation of Eye-Movement Metrics in a Software Debugging Task using GP3 Eye Tracker. Acta Polytech. Hung. 2020, 17, 57–76. [CrossRef] 7. Świrski, L.; Bulling, A.; Dodgson, N. Robust Real-Time Pupil Tracking in Highly Off-Axis Images. In Proceedings of the Symposium on Eye Tracking Research and Applications, Santa Barbara, CA, USA, 28–30 March 2012; pp. 173–176. 8. Chen, Y.; Su, J. Fast Eye Localization Based on a New Haar-like Feature. In Proceedings of the 10th World Congress on Intelligent Control and Automation (WCICA), Beijing, China, 6–8 July 2012; pp. 4825–4830.

9. Ohno, T.; Mukawa, N.; Yoshikawa, A. FreeGaze: A Gaze Tracking System for Everyday Gaze Interaction. In Proceedings of the Eye Tracking Research Applications Symposium, New Orleans, LA, USA, 25–27 March 2002; pp. 125–132. 10. Morimoto, C.H.; Mimica, M.R.M. Eye Gaze Tracking Techniques for Interactive applications. Comput. Vis. Image Underst. 2005, 98, 4–24. [CrossRef] 11. Li, D.; Winfield, D.; Parkhurst, D.J. Starburst: A Hybrid Algorithm for Video-Based Eye Tracking Combining Feature-Based and Model-Based Approaches. In Proceedings of the IEEE Computer Vision and Pattern Recognition–Workshops, San Diego, CA, USA, 21–23 September 2005; pp. 1–8. 12. Sugita, T.; Suzuki, S.; Kolodko, J.; Igarashi, H. Development of Head-Mounted Eye Tracking System achieving Environmental Recognition Ability. In Proceedings of the SICE Annual Conference, Takamatsu, Japan, 17–20 September 2007; pp. 1887–1891. 13. Chen, S.; Julien, J.E. Efficient and Robust Pupil Size and Blink Estimation from Near-Field Video Sequences for Human-Machine Interaction. IEEE Trans. Cybern. 2014, 44, 2356–2367. [CrossRef][PubMed] 14. Zhang, X.; Sugano, Y.; Fritz, M.; Bulling, A. Appearance-Based Gaze Estimation in the Wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1–10. 15. Cheng, C.W.; Ou, W.L.; Fan, C.P. Fast Ellipse Fitting Based Pupil Tracking Design for Human-Computer Interaction Applications. In Proceedings of the IEEE International Conference on Consumer Electronics, Las Vegas, NV, USA, 7–11 January 2016; pp. 445– 446. 16. Wu, J.H.; Ou, W.L.; Fan, C.P. NIR-based Gaze Tracking with Fast Pupil Ellipse Fitting for Real-Time Wearable Eye Trackers. In Proceedings of the IEEE Conference on Dependable and Secure Computing, Taipei, Taiwan, 7–10 August 2017; pp. 93–97. 17. Pires, B.R.; Hwangbo, M.; Devyver, M.; Kanade, T. Visible-Spectrum Gaze Tracking for Sports. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Portland, OR, USA, 23–28 June 2013; pp. 1005–1010. 18. Kao, W.C.; Chang, W.T.; Wu, S.J.; Liu, C.H.; Yin, S.Y. High Speed Gaze Tracking with Visible Light. In Proceedings of the International Conference on System Science and Engineering (ICSSE), Budapest, Hungary, 4–6 July 2013; pp. 285–288. 19. Wu, J.H.; Ou, W.L.; Fan, C.P. Fast Iris Ellipse Fitting Based Gaze Tracking with Visible Light for Real-Time Wearable Eye Trackers. In Proceedings of the 5th IEEE Global Conference on Consumer Electronics, Kyoto, Japan, 11–14 October 2016. 20. Liu, T.L.; Fan, C.P. Visible-Light Based Gaze Tracking with Image Enhancement Pre-processing for Wearable Eye Trackers. In Proceedings of the 6th IEEE Global Conference on Consumer Electronics, Nagoya, Japan, 24–27 October 2017. 21. Liu, T.L.; Fan, C.P. Visible-Light Wearable Eye Gaze Tracking by Gradients-Based Eye Center Location and Head Movement Compensation with IMU. In Proceedings of the International Conference on Consumer Electronics, Las Vegas, NV, USA, 12–14 January 2018; pp. 1–2. 22. Sun, H.P.; Yang, C.H.; Lai, S.H. A Approach to Appearance-Based Gaze Estimation under Head Pose Variations. In Proceedings of the 4th IAPR Asian Conference on Pattern Recognition, Nanjing, China, 26–29 November 2017; pp. 935–940. 23. Yin, Y.; Juan, C.; Chakraborty, J.; McGuire, M.P. Classification of Eye Tracking Data using a Convolutional Neural Network. In Proceedings of the 17th IEEE International Conference on and Applications, Orlando, FL, USA, 17–20 December 2018. 24. 
Lemley, J.; Kar, A.; Drimbarean, A.; Corcoran, P. Efficient CNN Implementation for Eye-Gaze Estimation on Low-Power/Low-Quality Consumer Imaging Systems. arXiv 2018, arXiv:1806.10890v1. 25. Ahmad, M.B.; Saifullah Raja, M.A.; Asif, M.W.; Khurshid, K. i-Riter: Machine Learning Based Novel Eye Tracking and Calibration. In Proceedings of the IEEE International Instrumentation and Measurement Technology Conference, Houston, TX, USA, 14–17 May 2018. 26. Cha, X.; Yang, X.; Feng, Z.; Xu, T.; Fan, X.; Tian, J. Calibration-free gaze zone estimation using convolutional neural network. In Proceedings of the International Conference on Security, Pattern Analysis, and Cybernetics, Jinan, China, 14–17 December 2018; pp. 481–484. 27. Stember, J.N.; Celik, H.; Krupinski, E.; Chang, P.D.; Mutasa, S.; Wood, B.J.; Lignelli, A.; Moonis, G.; Schwartz, L.H.; Jambawalikar, S.; et al. Eye Tracking for Deep Learning Segmentation Using Convolutional Neural Networks. J. Digit. Imaging 2019, 32, 597–604. [CrossRef][PubMed] 28. Yiu, Y.H.; Aboulatta, M.; Raiser, T.; Ophey, L.; Flanagin, V.L.; zu Eulenburg, P.; Ahmadi, S.A. DeepVOG: Open-Source Pupil Segmentation and Gaze Estimation in Neuroscience Using Deep Learning. J. Neurosci. Methods 2019, 324, 108307. [CrossRef][PubMed] 29. Zhang, X.; Sugano, Y.; Fritz, M.; Bulling, A. MPIIGaze: Real-World Dataset and Deep Appearance-Based Gaze Estimation. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 41, 162–175. [CrossRef][PubMed] 30. Stavridis, K.; Psaltis, A.; Dimou, A.; Papadopoulos, G.T.; Daras, P. Deep Spatio-Temporal Modeling for Object-Level Gaze-Based Relevance Assessment. In Proceedings of the 27th European Signal Processing Conference (EUSIPCO), A Coruña, Spain, 2–6 September 2019. 31. Liu, J.; Lee, B.S.F.; Rajan, D. Free-Head Appearance-Based Eye Gaze Estimation on Mobile Devices. In Proceedings of the International Conference on Artificial Intelligence in Information and Communication (ICAIIC), Okinawa, Japan, 11–13 February 2019; pp. 232–237. 32. Lian, D.; Hu, L.; Luo, W.; Xu, Y.; Duan, L.; Yu, J.; Gao, S. Multiview Multitask Gaze Estimation with Deep Convolutional Neural Networks. IEEE Trans. Neural Netw. Learn. Syst. 2019, 30, 3010–3023. [CrossRef]

33. Porta, S.; Bossavit, B.; Cabeza, R.; Larumbe-Bergera, A.; Garde, G.; Villanueva, A. U2Eyes: A binocular dataset for eye tracking and gaze estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshop, Seoul, Korea, 27–28 October 2019; pp. 3660–3664. 34. Li, W.Y.; Dong, Q.L.; Jia, H.; Zhao, S.; Wang, Y.C.; Xie, L.; Pan, Q.; Duan, F.; Liu, T.M. Training a Camera to Perform Long-Distance Eye Tracking by Another Eye-Tracker. IEEE Access 2019, 7, 155313–155324. [CrossRef] 35. Rakhmatulin, I.; Duchowski, A.T. Deep Neural Networks for Low-Cost Eye Tracking. In Proceedings of the 24th International Conference on Knowledge-Based and Intelligent Information & Engineering Systems, Procedia Computer Science, Verona, Italy, 16–18 September 2020; pp. 685–694. 36. Brousseau, B.; Rose, J.; Eizenman, M. Hybrid Eye-Tracking on a Smartphone with CNN Feature Extraction and an Infrared 3D Model. Sensors 2020, 20, 543. [CrossRef] 37. Kuo, T.L.; Fan, C.P. Design and Implementation of Deep Learning Based Pupil Tracking Technology for Application of Visible-Light Wearable Eye Tracker. In Proceedings of the IEEE International Conference on Consumer Electronics, Las Vegas, NV, USA, 4–6 January 2020; pp. 1–2. 38. Redmon, J.; Farhadi, A. YOLOv3: An incremental improvement. arXiv 2018, arXiv:1804.02767. 39. Kocejko, T.; Bujnowski, A.; Rumiński, J.; Bylinska, E.; Wtorek, J. Head Movement Compensation Algorithm in Multidisplay Communication by Gaze. In Proceedings of the 7th International Conference on Human System Interactions (HSI), Costa da Caparica, Portugal, 16–18 June 2014; pp. 88–94. 40. LabelImg: The graphical image annotation tool. Available online: https://github.com/tzutalin/labelImg (accessed on 26 November 2020). 41. NVIDIA Jetson AGX Xavier. Available online: https://www.nvidia.com (accessed on 26 November 2020).