
Faculty of Technology and Society
Computer Science

Bachelor’s thesis 15 Credits, undergraduate level

Eye Movement Analysis for Activity Recognition in Everyday Situations

Analys av Ögonrörelser för Aktivitetsigenkänning i Vardagliga Situationer

Anton Gustafsson

Degree: Bachelor, 180 Credits
Supervisor: Shahram Jalaliniya
Field of study: Computer Science
Examiner: Fahed Alkhabbas
Program: System Developer
Date of final seminar: 2018-05-30

Abstract

The increasing number of smart devices in our everyday environment has created new problems within human-computer interaction, such as how we humans are supposed to interact with these devices efficiently and with ease. Context-aware systems could be a possible candidate for solving this problem. If a system could automatically detect people's activities and intentions, it could act accordingly without any explicit input from the user. Eyes have previously been shown to be a rich source of information about a person's cognitive state and current activity. Because of this, eyes could be a viable input modality for extracting such information. In this thesis, we examine the possibility of detecting human activity by using a low-cost, home-built monocular eye tracker. An experiment was conducted where participants performed everyday activities in a kitchen to collect data. After conducting the experiment, the data was annotated, preprocessed and classified using multilayer perceptron and random forest classifiers. Even though the collected data set was small, the results showed a recognition rate of between 30% and 40% depending on the classifier used. This confirms previous work showing that activity recognition using eye movement data is possible, but that achieving high accuracy is challenging.

Keywords: activity recognition, artificial intelligence, data annotation, eye tracking, eye movement analysis, gaze tracking, human activity recognition, intention recognition,

Sammanfattning

Den ständigt ökande mängden av smarta enheter i vår vardag har lett till nya problem inom HCI såsom hur vi människor ska interagera med dessa enheter på ett effektivt och enkelt sätt. Än så länge har kontextuellt medvetna system visat sig kunna vara ett möjligt sätt att lösa detta problem. Om ett system hade kunnat automatiskt detektera personers aktiviteter och avsikter, kunde det agera utan någon explicit inmatning från användaren. Ögon har tidigare visat sig avslöja mycket information om en persons kognitiva tillstånd och skulle kunna vara en möjlig modalitet för att extrahera aktivitetsinformation ifrån. I denna avhandling har vi undersökt möjligheten att detektera aktiviteter genom att använda en billig, hemmabyggd ögonspårningsapparat. Ett experiment utfördes där deltagarna genomförde aktiviteter i ett kök för att samla in data om deras ögonrörelser. Efter att experimentet var färdigt annoterades, förbehandlades och klassificerades datan med hjälp av en multilayer perceptron- och en random forest-klassificerare. Trots att mängden data var relativt liten visade resultaten att igenkänningsgraden var mellan 30-40% beroende på vilken klassificerare som användes. Detta bekräftar tidigare forskning att aktivitetsigenkänning genom att analysera ögonrörelser är möjligt. Dock visar det även att det fortfarande är svårt att uppnå en hög igenkänningsgrad.

Glossary

Gaze To look at something steadily and with intent

HAR Human activity recognition, sometimes referred to as activity recognition (AR), is the process of interpreting and detecting what a person is doing.

HCI Human-computer interaction. The field of research about interfaces between humans and computers.

IoT Internet of Things, the concept of having many small connected computers inside common objects.

ML Machine learning, a collection of algorithms that learn from data in order to classify it.

MLP Multilayer perceptron, a type of machine learning algorithm implemented as a neural network.

Modality A single independent channel of input to a computer system.

RF Random forest, a type of machine learning algorithm implemented as an ensemble of decision trees.

Smart space An environment that is interwoven with sensors, devices and computers that are connected.

Contents

1 Introduction
  1.1 Background
  1.2 Research Questions
  1.3 Hypotheses
  1.4 Structure

2 Related work
  2.1 Systematic Literature Review
  2.2 Search Strategy
  2.3 Summary of Related Work
  2.4 Common Uses for Activity Recognition
  2.5 The Traditional Methods of Activity Recognition
  2.6 Eyes for Activity Recognition

3 Methodology
  3.1 Experiment
  3.2 Method
    3.2.1 Experiment Setup
    3.2.2 Apparatus
    3.2.3 Software Stack
    3.2.4 Tasks
    3.2.5 Participants
    3.2.6 Procedure
  3.3 Data Processing
    3.3.1 Features and Attributes
    3.3.2 Tasks
    3.3.3 Time Window
    3.3.4 Sampling
    3.3.5 Classifiers
    3.3.6 Metrics
    3.3.7 Pilot Experiment

4 Results

5 Discussion and Analysis
  5.1 Results Compared to Related Work

6 Conclusion
  6.1 Future Work

References

A Appendix
  A.1 Systematic Literature Study Search Keywords
  A.2 Systematic Literature Study Inclusion and Exclusion Criteria
  A.3 Applied Inclusion and Exclusion Criteria to the Search Results

1 Introduction

This section gives an introduction to this thesis. It includes the background, research questions, hypotheses and the structure of the rest of the thesis.

1.1 Background

The vision of smart spaces by Mark Weiser [1] is quickly becoming a reality with the emergence of IoT. It is estimated that we will have 50 billion connected devices by the year 2020 [2]. This future will bring us many benefits, some of which we can already see evidence of today. Systems that control the lights in our homes [3] by automatically turning them on and off when needed reduce the environmental impact and lead to more comfortable living. Smart home alarms [4] can increase fire safety and reduce the risk of break-ins. Samsung, one of the largest brands of home electronics in the world [5], has stated that all their products will be smart by 2020 [6].

This increase in devices around us creates new problems and questions, such as how we are going to interact with all these devices. Having a separate interface for each system or device, either physical or as part of some other device, would quickly become overwhelming. It could possibly make these systems more cumbersome to use than if we were without them. One solution to this problem could be to create context-aware systems. These systems are able to recognize and infer different situations in their environment and thereby do not require any explicit input from a user in order to react. Despite the massive amount of previous work in the area of context-aware systems, context recognition is still very challenging. Many sensors must work in conjunction, collecting large amounts of different data points and processing them, in order to recognize the current situation. If this problem can be solved, context-aware systems could play an essential part in realizing the vision of smart spaces and the adoption of the internet of things [7].

We have already seen applications for activity recognition using sensors in smartphones and other wearable sensors. There are applications that recognize sport activities [8], keep track of your health [9] and your movements in everyday life [10, 11]. However, inertial sensors such as accelerometers and gyroscopes are limited in the activities they are able to detect, mainly physical ones. Multiple sensors can be deployed to increase accuracy [12] and the number of possible activities to detect, but this is obtrusive and not particularly user friendly, especially if a person has to wear them on their body. Another common approach is borrowing techniques from the field of computer vision. Researchers have managed to achieve high accuracy when detecting common activities [13, 14] in videos. Aside from the privacy issues associated with using videos for activity recognition, this approach, just as with inertial sensors, can mostly detect physical activities.

Eyes have great potential to reveal information about non-physical states, such as high cognitive load [15, 16, 17], and activities such as reading [18, 19, 20, 21, 22, 23], and can even be used for recognizing emotions [24, 25]. But eye trackers have their own limitations: head-mounted eye trackers are still obtrusive and sensitive to body motion, while remote eye trackers such as the Tobii eye tracker [26] will not always have a clear line of sight to the eyes and are limited to a relatively short detection distance. However, as eye trackers continue to evolve and become more accurate, smaller and cheaper, it is likely that it will be possible to integrate eye trackers in smart glasses to make them completely unobtrusive in the future. We have already seen evidence of this in technologies such as [27, 28] and Tobii eye trackers. Today, eye trackers are a lot better than just ten years ago, and eye tracking technology is used not only in research projects [29, 30] but also in commercial applications such as driver assisting technologies [31, 32] and smartphone applications [33]. It is also one of the primary input modalities for disabled people [34, 35].

Almost all of the existing research regarding activity recognition with eye trackers has been conducted in laboratory environments where the subjects are stationary in front of a screen. That is, the research is about tracking eyes and detecting activity when the subject is looking at regular two-dimensional (2D) screens. Only a small number of researchers have examined activity recognition in three-dimensional (3D) environments situated in real life [30, 36, 37]. Although the previous research showed promising results, there is a clear gap in the current state of this research: only a small number of tasks, hardware platforms and software tools have previously been examined. Eye tracking in 3D needs to be further investigated in order to enable new applications for the future. Some of these applications could, for example, be different kinds of health monitoring applications. Eye-based activity recognition could be used for ensuring that dementia patients have performed necessary activities such as taking their medicine or turning off the stove after using it. Another possible use case for eye-based activity recognition is the future of human-computer interaction in smart spaces. If a system could predict a user's activity without any explicit input, the smart environment could adjust its configuration automatically to better support the user's activity and get us one step closer to the envisioned smart environments of Mark Weiser.

1.2 Research Questions

To investigate if eye movement data can reveal the activities of a user in a 3D environment, the following research questions need to be answered:

1. How accurately can everyday activities in natural environments be recognized by analyzing eye movements collected from a mobile eye tracker?

2. What type of everyday activities can be recognized more accurately by analyzing eye movements collected from a mobile eye tracker?

These questions do not specify any particular hardware, software, environment or group of people. This is mainly due to the novelty of eye tracking and activity recognition in 3D. We chose a more general approach because these other considerations can be future studies of their own.

1.3 Hypotheses

Eyes have been shown to reveal a lot of useful information in the field of activity recognition on 2D screens. In this thesis, we investigate the possibility of detecting activities also in 3D environments. 3D environments are more challenging because of the variation of physical activities and the noise that the real world introduces. Moreover, the accuracy of gaze tracking is lower in 3D environments due to the parallax error. The source of the parallax error is the fact that the gaze tracker is calibrated for a certain distance, but in 3D the gaze needs to be detected at different depths. The main goal of this research is to investigate the possibility of using eye movements for activity recognition in everyday situations such as kitchens, in order to determine the accuracy of activity recognition in 3D environments.

1.4 Structure

This thesis is organized into several sections. Section two introduces a systematic literature review (SLR) and a summary of work related to this thesis. Section three explains the methodology that was used as well as its implementation. Section four presents our results after performing the experiment. Section five presents our discussion of the results. Finally, in section six, we present our conclusions and identify some aspects that can be investigated in future research.

2 Related work

The following section briefly explains the process and purpose of a systematic literature review. It then presents a summary of all the previous work related to this thesis that we could find using our search method.

2.1 Systematic Literature Review

A systematic literature review is a process of identifying and evaluating all available research relevant to a research question, topic area or phenomenon of interest [38, 39]. The process includes the steps of gathering and filtering large amounts of literature in a procedural and systematic manner. It should be well documented and structured to show that the author has been thorough and to make the review reproducible. This ensures that the author is not accused of plagiarism or repeats something unnecessarily, and that the author is knowledgeable enough about the field to be able to contribute to it. It is also used to give the author and reader an overview of the current state of the chosen topic. The systematic review should have a protocol that specifies the research questions and the topic that is the starting point. It also needs a search strategy: a well-documented procedure for how the search process is conducted. Usually, this strategy has inclusion and exclusion criteria that are followed during the review. The review should also specify the information that is to be obtained from each study. All these steps [40, 41] are usually performed iteratively and are thus repeated and altered as many times as needed. The author usually discovers new keywords, phrases and databases while conducting the review, and so the process needs to be changed to encompass these discoveries. As we had limited resources when conducting this study, we could not conduct a full SLR. Therefore our work can be considered a semi-structured literature review (SSLR). An SSLR does not have an official definition other than being a less formal version of an SLR. In our case, we had to restrict the number of search result hits to a certain limit. The main cause of this change was the sheer number of hits that were generated, as described in the next section.

2.2 Search Strategy

The databases chosen for finding the literature for the SSLR were ACM, IEEE Xplore and Scopus. These are well known and recognized databases to use when doing research within computer science. Another database, Google Scholar, was mainly utilized when snowballing techniques were used to find where references were cited, and as a general tool when additional references were needed. To find relevant literature, the research question and the research field were divided into keywords that would be used in the search process (see appendix A.1), mainly activity recognition and eye tracking. Some keywords resulted in too many or irrelevant hits and were therefore either modified or discarded completely. As the research question has evolved over time, general keywords were chosen at the start and later became more specific as the research questions and thesis evolved. To limit the number of hits while still getting a good overview, constraints were defined (see appendix A.2). They were picked as described by [41] and by experimenting and comparing the number of hits in different databases. A maximum limit of 600 hits after applying the second criterion was set because of time and resource constraints for this thesis. If a search resulted in more than 600 hits, the keywords were altered so that the number of hits would fall below our maximum limit. It was mainly Scopus that came close to this limit; the other databases resulted in about 200-300 hits. This restriction is the reason why we call this review an SSLR: in a full SLR, all references would have been included in the final study.

Once the keywords were selected, the search process could commence. The first step was to use the keywords and apply the defined constraints to the references that were found. This meant searching the databases for abstracts containing the chosen keywords. Searching abstracts gave relevant literature while keeping the number of hits low. Another restriction was that only articles and journals were used, as these are recognized as good sources, which further reduced the number of hits. By reading the title and abstract of the resulting literature, it was determined whether the literature should be saved or not. If references were deemed relevant, they were saved with the help of the reference management system Zotero, which allows for organizing, exporting and backing up references. If a duplicate was found, it was discarded so that only the first occurrence of that reference was kept. After this, the conclusions of the previously saved literature were read and the references were either discarded as not relevant or kept. This procedure was then repeated for the full text of each reference as well. The literature that was left after this stage was the final literature deemed important and relevant for the SSLR and the research questions. Once this step was finished, one iteration of backwards snowballing [42] was conducted to find relevant literature cited by the included literature, in order to find more recognized works within the field. Scopus always resulted in a large number of hits, so an additional constraint of limiting the search to the Computer Science subject area was applied to reduce the number of hits further. A total of 54 references was the result of conducting the SSLR using this strategy.

2.3 Summary of Related Work

Activity recognition started to emerge as a field in the late '90s [43]. Since then, great progress has been made in recognizing activities using different types of sensors, either on their own or in combination. We can split the approaches into wearable sensors – something you have attached to your body – and remote sensors – sensors placed in the surrounding environment. The main challenges in activity recognition are the following [11, 43]:

1. Recognition Accuracy – Different approaches can detect different activities at various levels of granularity. This means that some sensors are good at recognizing only specific activities, which has led many researchers to try using a fusion of sensors to achieve better accuracy at the cost of increased complexity. Also, the sheer variance in how people perform different activities makes it difficult to develop a general system that works for all people in every situation.

2. Obtrusiveness – Most wearable apparatuses are obtrusive and have limited resources in terms of battery life and computational power. External sensors mitigate this problem but are limited to the space they are located in and can easily be obstructed by objects in front of them.

3. Privacy – Both wearable and external sensors can be very invasive. Effort has been made to develop less privacy-invasive technologies, but they are limited in what activities they are able to detect.

4. Flexibility – Systems have issues adapting to new environments and people. Most systems need a training phase to be able to recognize a specific activity, which hurts their ability to detect new and unfamiliar situations that the system has never seen before.

2.4 Common Uses for Activity Recognition

Today, many people track their sports activities with the help of their smartphones, counting steps taken or recording running performance and location data to improve their fitness and track their progress [8]. There are also numerous efforts put into developing better health care applications, mainly for elderly people still living at home [9, 44, 45], by using activity recognition for detecting, for example, if a person falls to the ground [13, 46]. Somewhat related to this, activity recognition can also be used when analyzing video surveillance footage. More areas than ever are under video surveillance these days, but we lack the workers to look through all this footage. In most cases, the footage captured is not viewed until a crime has already happened. This is where activity recognition comes in. If it were possible to have a computer analyze the vast amount of footage recorded, it could highlight certain activities such as crimes or abnormal behaviour. It could also categorize large amounts of video to make it searchable [47], thereby removing the need for a human operator to review all the footage, which can be time consuming. However, so far the systems proposed are still error prone and not efficient enough to be able to distinguish more than the basic types of physical activities.

2.5 The Traditional Methods of Activity Recognition

One of the most well-known approaches to recognizing activities today is video based, a huge topic in the field of computer vision. Can computers, just by analyzing images, detect different activities by identifying objects and their behaviour? Some of the most well-known approaches within video-based activity recognition involve using RGB and depth cameras [48], sometimes in combination. A popular device is the Microsoft Kinect [49]. These cameras go under the name RGB-D cameras, and they can provide synchronized colour and depth information at high frame rates. The approach is rather simple: video is annotated and fed into machine learning algorithms to find patterns that represent activities. The authors of [50] used an RGB-D camera and a machine learning algorithm to determine different tasks in a kitchen. They applied object recognition, both on the hands and on items in the kitchen, to be able to detect them and relate the objects to each other. This way, they could infer what the user was currently doing. However, their research was limited in the number of objects and actions and was mostly about detecting fine-grained tasks while following a recipe. The classification was also done offline, limiting the usability of the proof of concept they developed. Even though the field of computer vision has evolved a lot in recent years thanks to advances in machine learning and AI, most applications remain brittle and unreliable. Camera-based systems are limited in the activities they are able to detect, mainly physical ones such as sitting, walking and standing [13, 14]. Another problem is that they are very invasive and obtrusive, even though techniques such as silhouette extraction [46, 51, 52], which hides some details from the video, have been proposed. This has led researchers to start mounting cameras on the body instead. Some have done experiments using wrist-mounted cameras [53], though this is obtrusive as it is not very convenient to have a wrist-mounted camera on you all day. However, considering that cameras become smaller and better every day, it might be a feasible option in the future. Other wearable cameras attached to the head [54] have the same problems regarding obtrusiveness and privacy concerns, and they are also subject to head and body motion that can cause problems with the video being captured.

A less obtrusive approach suggested is using smartwatches. These wrist-mounted devices have become more commonly used and have shown to be a useful tool for detecting activities [55]. The use of the built-in accelerometers and gyroscopes in these devices, as well as in smartphones [56, 57, 58, 59] and dedicated devices [9], has spawned another category of activity recognition systems using inertial sensors. These inertial sensors have become ubiquitous today through the adoption of smartphones, and they offer a viable alternative to video-based systems. However, just as with camera-based options, these are also limited in the activities they can detect – mainly physical ones. Though some researchers have found that combining various sensors, such as cameras and inertial sensors, or multiple inertial sensors, can lead to more accurate recognition [10, 60, 61], these approaches tend to be more complex than using just one sensor. Multiple sensors have to be deployed and maintained, and aggregating the data is a difficult task. Another common approach is using radio-frequency identification (RFID) tags and sensors, which are cheap and easy to deploy but somewhat limited in what they can detect – mainly locations [12, 62]. Many small sensors can be deployed for detecting various states such as on, off, opened, closed etc. The authors of [12] examined this, and their results showed that the average accuracy was low and that some activities are easier to detect than others. A limiting factor is the huge number of activities that need to be labelled and detected. Also, filling up a house with these sensors might not be feasible due to the sheer amount of them, but maybe it will be possible in the future smart home. The authors stated that more training data than the two weeks they had could have improved the accuracy a fair amount.

Entirely different sensors have been suggested, such as microphones [63, 64] for sound recording. These techniques are quite complex but could improve the accuracy of activity recognition when combined with simpler sensors such as gyroscopes and cameras. Another novel approach included sensors that measured the size of the participant's lungs to determine the breathing pattern and infer activities from it [65]. The researchers managed to achieve good accuracy for some activities such as walking, talking and smoking, but had more difficulty with other tasks such as eating. Another approach is detecting radio frequency interference [66], which can act like a camera but provides stronger privacy protection for the users. However, all these novel methods need to be explored further before any conclusions about their usability and accuracy levels can be drawn.

2.6 Eyes for Activity Recognition

Recognizing activities and mental states with the help of our eyes is a well-studied field. Eyes have been shown to reveal a lot of information about non-physical states and activities. The activities recognized are usually very specific and limited to common everyday tasks such as reading different material [18, 19, 20, 21, 22, 23], looking at videos and images [67, 68], office activities [69] and socializing with people [30]. Many experiments include some or all of these in comparison [30, 36, 70, 71, 72] to detect different activities by just tracking the subject's eyes using varying technologies and scenarios. High pupil dilation has been linked to high cognitive load [15, 16, 17] and has been used to detect drivers' attention when driving a vehicle [32] and to detect when a desired object is found during a search task [73, 74]. Emotions can also be detected by looking at pupil dilation [24, 25]. This can be used in various applications such as research, education, advertisement and games. Almost all of this research has been conducted in laboratory environments where the subjects are stationary in front of a screen. That is, the research is about tracking eyes and detecting activity when the subject is looking at 2D environments.

Few researchers have used natural environments, but there are some exceptions. In [30] the researchers examined activity recognition using gaze data in real life. Using a video-based eye tracker, more than a total of 80 hours of everyday activities for ten participants was recorded. The data was annotated, analyzed and classified using various classifiers. The best recognition performance was achieved for mentally challenging workloads such as reading (74.8%), focused work (70%) and computer work (64.2%). The achieved mean accuracy was well over 50%, which suggests that there is a lot of information to be collected from just the eyes. The authors concluded that the more data points they had for each activity, the more accurate the recognition rate became. They noted that eye movements can differ between participants depending on the activity, making it necessary to optimize the recognition system for each participant. However, the activities they recognized were broad and the participants in the experiment did not have any instructions on how to perform them. Somewhat related to this is [36]. They used a smaller set of activity classes and a smaller data set, but concluded that more features equal better accuracy, suggesting that more data points are an important part of activity recognition with eye gaze data. In contrast to this, [37] used much more fine-grained activity classes, such as "take jam" and "open milk", in a kitchen environment. Their focus was on detecting the actual objects in the scene and correlating those objects to the participant's gaze. They managed to find evidence that gaze correlates with an action, suggesting that this approach is viable for activity recognition. However, this approach requires a lot of preparation in terms of mapping objects before they can be detected and might not be a feasible option for use in everyday life.

Intention recognition aims to detect what activity the user wants to do next and is similar to activity recognition. To our knowledge, all intention recognition research has been conducted in 2D environments [75, 76, 77, 78, 79] and no one has considered examining 3D environments. The authors of [79] used 2D images annotated with areas of interest to try to predict the participants' intentions. However, these intentions were not very fine grained but rather general, such as detecting whether the user intends to navigate or consume information. Not only is this research somewhat limited in the classes detected, but using only 2D images is not applicable to the real world.
In conclusion, great effort has been put into the field of activity recognition using inertial sensors and also eye trackers. Good accuracy can be achieved, but most of this research has been conducted in controlled lab environments with subjects looking at 2D screens. This makes it difficult to generalize these findings to the real world and to apply them to various applications.

3 Methodology

The following section motivates why conducting an experiment was chosen as the main research method and why other methods were disregarded. It also presents the experiment setup and procedure as well as how the collected data was processed.

3.1 Experiment

To put our hypothesis to the test, conducting an experiment was considered a valid method. The reason for choosing to perform an experiment is that it allows us to test a hypothesis in a controlled environment, producing cleaner data that can be analyzed and from which conclusions can be drawn [80, 81]. Experiments are often used when we need data from before and after a test, so that we can compare the data and see if we can find any patterns, or to examine the cause and effect of doing something. For this thesis, however, the purpose of the experiment is to collect data in a controlled environment. Conducting experiments is a well-recognized research method, commonly used for investigating novel approaches and techniques in order to explore the potentials and limitations of such technologies in a very controlled environment before studying the application in more realistic scenarios. Even if experiments can produce artificial results due to their controlled nature, being able to freeze the variables is important to avoid noise in the end results. There are many parameters in eye tracking that can introduce noise in the tracking procedure, such as insufficient calibration, misaligned cameras and movement of the eye tracker. These problems can lead to decreased accuracy when detecting and recording eye movement data. This is another reason for collecting data in a controlled setting, as this allows us to control some of these variables.

Other research methods such as case studies, surveys, design and creation, action research and ethnography were not deemed applicable to the research question, or not feasible given the time and resource constraints for this thesis. A case study would require hardware and software that would be unobtrusive, durable and easy to use for participants over a longer period of time. Even if this method had been chosen, it would generate massive amounts of data that would not be possible to analyze given the time limit for this research. The data generated by this experiment was recorded as an overt observation, in the sense that the participants were aware of the observation. This is because the head-mounted eye tracker is impossible to ignore during use. However, if a remote eye tracker had been available, it could have been hidden and used for covert observation instead. A few questions were also asked of the participants in order to get statistics on some of their preferences and attributes. We mainly used a quantitative method [81] to collect eye movement data from several participants in order to find patterns. A qualitative method would not be useful here, as the research questions are quantitative in the way they are written and qualitative data is not possible to use when classifying data with machine learning. In order to create a pipeline for evaluating the equipment, the data annotation and the analysis, a small pilot experiment was first conducted to prepare for a bigger one with more participants and a possibly altered process. The pilot experiment would also allow us to prepare the data analysis pipeline before the main experiment.

3.2 Method

3.2.1 Experiment Setup

A kitchen environment was chosen as a suitable controlled environment for recognizing activities. Kitchens are familiar to most people and provide an environment where many different tasks are performed, such as talking, eating, cooking, cleaning and reading. All tasks contain sub-tasks; for example, cooking different foods can contain many different steps. This way, the granularity could be altered in the annotation process if needed. Initially we wanted to use a real kitchen, but we were unable to obtain one at the university. A kitchen environment (see figure 1) was therefore set up in a room at Malmö University in order to simulate a real kitchen. This environment contained regular kitchenware such as bowls, knives, a peeler, plates, cups and more. This limited us to tasks that did not require a sink or heating. This indoor environment also allowed us to keep a constant level of light and temperature in the room. These variables were kept constant to reduce noise during the experiment; different light levels could, for example, affect pupil dilation.

Figure 1: Kitchen environment where the experiment was conducted

3.2.2 Apparatus

A head-mounted eye tracker was built using two generic web cameras fitted on an aluminum frame that consists of small adjustable pipes (see figure 2). This is a home-made monocular eye tracker in the sense that only one eye, the right one, is tracked. The eye camera has a resolution of 640x480 pixels and was fitted with a piece of camera film on top of it as well as two diodes. The film acts as a filter so that the camera can detect infrared light. The reflection of the diode on the eye, known as glint, is used as a reference point by the eye tracking software. To capture the front view of the person, a camera with a resolution of 1280x720 was attached to the middle of the frame facing forwards. This camera's purpose was mainly to make it possible to distinguish what the participant was actually doing during the experiment. The eye tracker was then connected via USB cables to a laptop that was put in a backpack worn by the participant to enable mobility.

Figure 2: Eye tracker hardware. a) Front-mounted camera capturing the scene. b) Camera capturing the eye movements. c) Strap to ensure that the headset could be fitted on different people's heads.

3.2.3 Software Stack

The laptop was running a modified version of the open source eye tracking software Pupil [82], which is developed in the programming languages Python and C++. Pupil is an eye tracking platform that offers several different applications, and we used two of them, Capture and Player. Capture contains the functionality for calibrating, tracking and recording the eyes using cameras. It has built-in tools for adjusting variables in the eye tracking algorithm to achieve the best possible tracking. UVC [83] is a standard for transferring video data via USB which our web cameras did not support; the modification we wrote for Pupil allowed regular non-UVC-compliant cameras to be used, which the software normally does not support. Player (see figure 3) does playback of video recorded with Capture. Player has functionality for applying different visual elements as overlays on the video, such as the location of the gaze and the pupil detection confidence. It also has tools for automatically detecting blinks, fixations and other features, as well as an annotation tool. All this data, including gaze and pupil positions, can then be exported to comma-separated value (CSV) files for further processing and analysis. As our camera operated at 30 Hz, one second of data resulted in 30 rows of eye movement data, with columns including pupil position, tracking confidence, pupil diameter and angle position. This data would then be used for further processing and analysis.
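As a concrete illustration of the export format described above, the sketch below loads such an export with pandas and checks the approximate sampling rate. The file name and column names are assumptions for illustration, based on the description in this section rather than on the exact Pupil export schema.

```python
# Minimal sketch: load an exported eye movement CSV and inspect the sampling rate.
# "pupil_positions.csv" and the column names are illustrative assumptions.
import pandas as pd

df = pd.read_csv("pupil_positions.csv")   # roughly 30 rows per second of recording

duration_s = df["timestamp"].max() - df["timestamp"].min()
print(f"{len(df)} rows, approx. {len(df) / duration_s:.1f} Hz")
print(df[["timestamp", "confidence", "diameter"]].head())
```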

Figure 3: The Pupil Player software visualizing eye movement data while a participant is performing a task in the kitchen

3.2.4 Tasks

Common kitchen tasks were chosen, since the participants have probably done them several times before. The tasks are part of preparing a dinner for two people and cleaning up afterwards. We chose simple and common tasks, as new or unfamiliar tasks might give completely different eye patterns. The tasks the participants performed were a mix of general and detailed ones, so that it would be possible to differentiate between them and answer our research questions. Directions were given to perform each task twice, preparing a dinner for two people, in order to record at least two repetitions of each task. The tasks are presented in table 1. A note with the tasks that the participant should perform was placed in the kitchen. This allowed us to also capture eye movements when the participant was reading. The note was comparable to a recipe, with step-by-step instructions for each of the tasks.

Table 1: Tasks and their description

Nbr  Task            Description
t0   Read recipe     Read the note of instructions in order to determine the next task to perform
t1   Make tea        Make two cups of tea by using teabags, cups and a kettle
t2   Peel potatoes   Peel two potatoes using a peeler
t3   Slice potatoes  Slice the previously peeled potatoes into thin chips
t4   Peel carrots    Peel two carrots using a peeler
t5   Tear carrots    Tear the previously peeled carrots into small bits
t6   Whisk water     Whisk a bowl of water for a whole minute
t7   Place food      Put the sliced potatoes, shredded carrots and some whisked water onto two plates
t8   Serve on table  Put the plates together with the tea and cutlery on the table
t9   Clear table     Take the plates together with the tea and cutlery back to the kitchen
t10  Throw food      Throw the leftover food into the garbage
t11  Clean dishes    Clean the plate by wiping it down with a cloth
t12  Clean table     Clean the table with a cloth and detergent spray

3.2.5 Participants

For the experiment, ten people were recruited among friends and students around the university, with a mean age of 27.3 years. None of the participants had any prior experience with eye tracking, and they all estimated that they perform 15-60 minutes of kitchen tasks every day. Participant 2 (P2) worked as a cook and therefore spent around eight hours a day in a kitchen. None of the participants wore prescription lenses or glasses during the experiment, but three of the participants acknowledged using glasses or lenses sometimes in everyday life. In addition to these ten participants, two people's data had to be discarded due to inaccurate tracking, caused either by features around their eyes or by misalignment of the eye tracker. We did not place any further restrictions on the participants, as we wanted a diverse set of ordinary people to perform the experiment.

3.2.6 Procedure

Each participant was first asked about their age, an estimate of how many minutes they spend doing tasks in a kitchen per day, and whether they normally wear prescription glasses or lenses. After this, the participant was given a brief introduction to the hardware, the kitchen environment and the tasks they should perform. The participant then got help mounting the eye tracker, and the parameters of the eye tracking algorithm were adjusted for proper detection of the pupil. A calibration procedure, required to ensure proper eye tracking, followed. According to [82], an accuracy between 1-2 degrees is enough for achieving correct gaze tracking during recording. The calibration is conducted by looking at graphical elements shown in each of the four corners of the laptop screen. After the calibration, accuracy was measured with the built-in accuracy test. If the accuracy test showed more than two degrees of discrepancy, new calibrations were performed until the accuracy fell below the two-degree limit. If the accuracy was still not below two degrees after several calibration attempts, the subject was excluded from the experiment. The participant was then told to try to act naturally and to ignore the eye tracker as much as possible. A last check was done to make sure that the cameras were properly aligned and calibrated, and participants were asked if they were comfortable. If the answer was yes, recording started as they performed the tasks, one at a time in sequence. We left the room as the participants started performing their tasks, as we believed our presence might annoy the participant and affect the results. After the tasks were complete, recording stopped and the video was transferred to an external hard drive for backup.

3.3 Data Processing

Collecting data is just the first part of this experiment. A plan for annotating and analyzing the data is equally important to enable a smooth and efficient process. The first step was to annotate the data, namely to mark where each task began and ended. The annotation process also removes noise from the data, such as when the participant is walking around or picking up objects in a way that is not related to the task we wanted to detect. With Pupil Player, eye movement data including pupil positions, fixations, blinks and pupil dilation was exported to CSV files. By studying the captured footage, an extra column was added manually to the CSV files in order to indicate which pupil positions belonged to which task. These files were then used in our pipeline developed in Python, mainly using the machine learning framework scikit-learn [84] and the data set balancing library imbalanced-learn [85]. Using Python, the data in these files was normalized, aggregated and used for calculating new attributes. A script was created in Python for iterating over and creating new CSV files containing these new features with various combinations of time windows, tasks and attributes.
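The manual annotation step could be sketched roughly as below: task start and end times are read off the scene video and used to add a task label column to the exported rows. The file, column and time-span values are illustrative assumptions, not the exact format used.

```python
# Rough sketch of the annotation step: label each exported row with the task
# performed at that time. Time spans and names are illustrative assumptions.
import pandas as pd

df = pd.read_csv("pupil_positions.csv")      # exported eye movement rows
task_spans = [                               # (task, start_s, end_s) read off the scene video
    ("t0", 0.0, 12.5),
    ("t1", 12.5, 95.0),
    ("t2", 95.0, 160.0),
]

df["task"] = None
for task, start, end in task_spans:
    in_span = (df["timestamp"] >= start) & (df["timestamp"] < end)
    df.loc[in_span, "task"] = task

df = df.dropna(subset=["task"])              # drop rows outside any task (noise)
df.to_csv("annotated.csv", index=False)
```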

3.3.1 Features and Attributes

The features we chose to extract are the most recognized features within the field of eye tracking, namely fixations, blinks and saccades [86], but also pupil dilation. Fixations occur when a person is looking at the same point for 200-400 ms [87]; therefore we can define a fixation as when the eyes have not moved within this time range. Blinks can be detected by identifying when the pupil confidence drops below a certain threshold for a short moment; in our case we used a threshold of 50% and a length of 200 ms. Pupil dilation was provided in the pupil positions exported with Capture, so we used it directly for calculating its various features. Saccades occur when the eyes move quickly between two or more points [74]. These were calculated using a velocity-based algorithm implemented according to [86], by calculating the distance and velocity between eye positions. These features were then used to calculate attributes (see table 2) that could be used with a classifier. All features except dispersion for blinks, saccades and pupil dilation increased the accuracy, so these attributes were disregarded.
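The blink and saccade rules described here could be sketched roughly as follows. The column names and the velocity threshold for saccade detection are assumptions chosen for illustration; the thesis follows [86] for the actual algorithm.

```python
# Rough sketch of the blink and saccade definitions above.
# Column names and the velocity threshold are illustrative assumptions.
import numpy as np

def detect_blinks(timestamps, confidence, conf_thresh=0.5, min_len_s=0.2):
    """Return (start, end) pairs where pupil confidence stays below the threshold."""
    blinks, start = [], None
    for t, c in zip(timestamps, confidence):
        if c < conf_thresh and start is None:
            start = t
        elif c >= conf_thresh and start is not None:
            if t - start >= min_len_s:
                blinks.append((start, t))
            start = None
    return blinks

def saccade_mask(timestamps, gaze_x, gaze_y, velocity_thresh=1.0):
    """Velocity-based saccade detection: flag samples whose gaze velocity exceeds a threshold."""
    dt = np.clip(np.diff(timestamps), 1e-6, None)
    velocity = np.hypot(np.diff(gaze_x), np.diff(gaze_y)) / dt
    return velocity > velocity_thresh
```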

3.3.2 Tasks

Using different sets of tasks, we also obtained different results. By aggregating t9 with t8 we were able to increase our accuracy by a small amount. This is most likely due to the fact that these two tasks are very similar, which made it difficult for our classifiers to differentiate between them. Removing and combining other tasks either decreased the accuracy or did not have any noteworthy impact on the result.

Table 2: Attributes calculated

Fixation        Blink           Saccade         Pupil dilation
Quantity        Quantity        Quantity        Min size
Min duration    Min duration    Min duration    Max size
Max duration    Max duration    Max duration    Mean size
Mean duration   Mean duration   Mean duration   Dispersion

3.3.3 Time Window

A time window defines how long a period of data each set of attributes belongs to. For example, a ten-minute video with a time window of three seconds would result in 200 sections of the video, i.e. 200 rows of the new attributes. A larger time window results in less data to classify, which could make it difficult to achieve good accuracy. This is why we resorted to using a sliding window, that is, a window that overlaps with the previous window. The final time window we settled on was five seconds long with 80% overlap, as this setting gave us better results.
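A five second window with 80% overlap corresponds to stepping the window forward by one second at a time. A minimal sketch of this segmentation, assuming an annotated export with a timestamp column, is shown below.

```python
# Sketch of the sliding time window: 5 s windows with 80% overlap (a 1 s step).
import pandas as pd

def sliding_windows(df, window_s=5.0, overlap=0.8):
    step = window_s * (1.0 - overlap)                  # 1 s step for a 5 s window
    t, end = df["timestamp"].min(), df["timestamp"].max()
    while t + window_s <= end:
        yield df[(df["timestamp"] >= t) & (df["timestamp"] < t + window_s)]
        t += step

df = pd.read_csv("annotated.csv")                      # task-labelled rows from section 3.3
print(sum(1 for _ in sliding_windows(df)), "windows")
```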

3.3.4 Sampling

The data also tended to be unbalanced, meaning that there was a much larger number of rows of data for some tasks than for others. To mitigate this problem, it is possible to apply sampling [88] to the data. Oversampling is the process of increasing the number of rows for the minority classes. So if we had 100 rows of t1 and 500 rows of t2, the rows of t1 would be copied or in some other manner increased to match t2. Undersampling is the opposite: it reduces the dominant class by removing rows in various ways. These two approaches were both examined to determine which gave the best accuracy. For our data, oversampling worked better than undersampling, so this was the option we settled on.
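A minimal sketch of the oversampling step with imbalanced-learn's RandomOverSampler is shown below; the attribute matrix and labels are synthetic stand-ins for the per-window data described above.

```python
# Sketch: balance the per-window attribute rows by random oversampling.
from collections import Counter

import numpy as np
from imblearn.over_sampling import RandomOverSampler

X = np.random.rand(120, 16)                  # stand-in for per-window attributes
y = np.array(["t1"] * 100 + ["t2"] * 20)     # unbalanced task labels

ros = RandomOverSampler(random_state=0)
X_bal, y_bal = ros.fit_resample(X, y)
print("before:", Counter(y), "after:", Counter(y_bal))   # t2 duplicated up to 100 rows
```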

3.3.5 Classifiers

Once the attributes and time windows were defined, the resulting files could easily be used within scikit-learn, where various classifiers are available. We settled on two commonly used classifiers, multilayer perceptron (MLP) and random forest (RF). MLP is a commonly used classifier for computer vision and various problems such as image recognition [89], where the input and output are numbers that span over a wide range. Random forests are a collection of decision trees [88] and are very efficient and easy to use, as they do not have many variables to tune. However, as with most problems that can be solved using machine learning, there is no silver bullet, no classifier that can solve all problems. We had to approach this in a greedy manner in order to find classifiers that gave us reasonable results. For splitting the data into training and testing sets, we used k-fold cross-validation, as it is a good approach when using limited data sets [88]. K-fold cross-validation works by dividing the data set into k subsets and, in turn, training the model on all subsets except one, which is held out for testing. The number of folds we settled on was five, as this yielded good accuracy while still being fast compared to using more folds.
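A minimal sketch of this setup with scikit-learn is shown below; the attribute matrix and labels are random stand-ins, and the hyperparameters are library defaults rather than the exact settings used in the thesis.

```python
# Sketch: compare MLP and RF with 5-fold cross-validation on per-window attributes.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X = np.random.rand(500, 16)                              # stand-in attribute rows
y = np.random.choice([f"t{i}" for i in range(12)], 500)  # stand-in task labels

classifiers = {
    "MLP": make_pipeline(StandardScaler(), MLPClassifier(max_iter=1000, random_state=0)),
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
}
for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y, cv=5)            # 5-fold cross-validation
    print(f"{name}: mean accuracy {scores.mean():.2f}")
```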

3.3.6 Metrics

Just as there is no perfect classifier, there is no single metric that can measure everything. Therefore three types of metrics (see figures 4-7) were used: the confusion matrix, accuracy and f-score [90]. The confusion matrix [88] depicts what the classifier predicted a certain instance to be and what it actually is. True positives (TP) are the instances that the classifier predicted as positive and that actually were positive. True negatives (TN) are the same as TP but for negatives. False positives (FP) are when the classifier predicted an instance to be true but it was actually false. False negatives (FN) are the opposite. Accuracy is the percentage of all instances that were classified correctly. An unbalanced data set could yield high accuracy for some tasks and thus make the model appear good at recognizing a specific data set, resulting in misleading accuracy. This is why we also calculate the f-score for each participant. This metric can be more relevant for imbalanced data sets, as it shows how precise the classifier is. It is the harmonic mean of precision and recall. Precision measures how many of the predicted positive instances were actually positive. Recall denotes how many of the actual positive instances our model captures by classifying them as positive.

$$\text{accuracy} = \frac{TP + TN}{\text{all instances}} \qquad\qquad \text{precision} = \frac{TP}{TP + FP}$$

Figure 4: Accuracy. Figure 5: Precision.

$$\text{recall} = \frac{TP}{TP + FN} \qquad\qquad \text{f-score} = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$

Figure 6: Recall. Figure 7: F-score.
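These metrics can be computed directly from true and predicted task labels, for example with scikit-learn as sketched below (using macro averaging for the f-score, since there are more than two classes; the labels are illustrative).

```python
# Sketch: compute the metrics of section 3.3.6 from true and predicted task labels.
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

y_true = ["t1", "t1", "t2", "t3", "t2", "t3"]
y_pred = ["t1", "t2", "t2", "t3", "t2", "t1"]

print(confusion_matrix(y_true, y_pred))                       # rows: actual, columns: predicted
print("accuracy:", accuracy_score(y_true, y_pred))
print("f-score:", f1_score(y_true, y_pred, average="macro"))  # macro average over task classes
```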

3.3.7 Pilot Experiment

A pilot experiment was conducted in order to evaluate our hardware, software and tasks. Some important feedback and considerations were brought into the main experiment. Three participants were recruited to perform five tasks while wearing the eye tracker. For the first participant, the pupil recognition algorithm in the Pupil software had trouble detecting the pupil due to the participant's choice of makeup. This made it very difficult to calibrate and track the gaze correctly, because the software could not distinguish the darker color of the makeup from the pupil. This discovery led us to ask participants in the main experiment not to wear makeup. For the same participant, the eye tracker was too big, and so it tended to be unstable and almost fall off, leading to this participant having to hold the glasses during the session and thus to more instability in the tracking. This led us to add a strap to the back of the glasses, allowing for a much better fit. The preliminary data generated in this pilot allowed us to get an understanding of how the data behaved and how we should proceed to analyze it. With this data, we could develop the machine learning pipeline that we then used for the data generated from the main experiment.

4 Results

For the main experiment, a total of 118.9 minutes of eye movement data was recorded from the start of the first task until the end of the last task for ten participants. The mean session duration was 11.9 minutes per participant, covering all 13 tasks. Figure 8 presents the accuracy of the multilayer perceptron and random forest classifiers using a sliding time window of five seconds with an overlap of 80%. All attributes except dispersion for blinks, saccades and pupil dilation were used, and t9 was merged into t8.

Figure 8: Bar diagram showing the achieved accuracy for each participant. The classifiers used were MLP and RF, using all features and a sliding time window of five seconds.

The mean accuracy across all participants using MLP was 31.6%, while RF resulted in 44.1%. However, these results do not show how the individual classes were labeled; the classifiers could have classified some tasks well and others not at all. Therefore, we also present the f-score, as seen in figure 9, which is a better metric for imbalanced data.

18 Figure 9: Bar diagram showing the achieved f-score for each participant

To get an even more detailed view, we present the results for recognizing each task across all participants with confusion matrices for MLP and RF. This is shown in figures 10 and 11.

Figure 10: Confusion matrix showing correct and incorrect classifications when data from all participants was aggregated using MLP classifier

19 Figure 11: Confusion matrix showing correct and incorrect classifications when data from all participants was aggregated using RF classifier

As seen in figures 10 and 11, MLP has a more even distribution of correct and incorrect classifications. RF tended to classify more instances into a smaller number of categories, mainly t2 and t3. However, RF still had a higher f-score, as shown in figure 9.

5 Discussion and Analysis

In this thesis, the possibility of detecting activities in a kitchen environment has been examined. In this section we present our discoveries and considerations regarding the results. Due to the novelty of this approach, it is difficult to compare the results directly with previous research. The real world is incredibly noisy; variables such as the environment, items, people, activities, hardware, feature calculation and so on will affect the results. Because of this, even comparing activity recognition research in 2D is difficult. Even so, we will try to identify some similarities between our results and previous works in order to see if we can find any correlations or dissimilarities between them. We will also try to answer our previously stated research questions.

By analyzing and classifying our data set, we managed to achieve an accuracy of around 30-40% depending on the classifier. This answers our first research question regarding how accurately activities can be detected using a mobile eye tracker. We also managed to answer our second research question on which activities can be detected more accurately: we discovered that more fine-grained tasks are easier to detect than more general tasks.

The home-built eye tracker was not the most robust solution. It was difficult to mount, unstable, and could move if the participant touched it or tried to adjust it during usage. This could interfere with the tracking capabilities of Pupil and introduce noise into the data set. Also, having a camera right in front of your eye is not the most comfortable situation. This, in combination with a simulated kitchen environment, could contribute to artificial results: the participants could not possibly ignore the eye tracker itself when performing the tasks, and the environment was not a real kitchen. Even so, the low cost and low performance of our hardware shows that reasonable activity recognition is possible even with such limited technology.

The overall recognition rate varied somewhat between the participants. This could be due to insufficient tracking from the eye tracker, the small data sample, or incorrect data annotation or processing. But it could also indicate that the real world is simply too noisy for the classifier to be able to distinguish between the data. Using different features and classifiers showed quite different accuracy rates. RF generally showed better results than MLP in both accuracy and f-score, indicating that it is better at handling imbalanced data. Both classifiers detected some tasks more easily than others, leading to increased accuracy for some while others were not detected at all. This further shows that the choice and configuration of classifiers is very important for achieving good results.

The accuracy and f-score for all participants were similar. This could indicate that different people performing the same tasks produce similar eye movements. P1 showed the lowest accuracy and P3 the highest. However, we could not find any specific patterns in their data as to why this was the case. It could be any of the previously mentioned variables, or something about these participants' eye movements that we did not discover. One of the participants, P2, worked as a cook, so we anticipated that her results would be somewhat different from the rest of the participants, as she estimated that she performs at least eight hours of kitchen tasks every day. Her experience as a cook likely contributed to her performing all tasks the fastest.
However, this did not seem to show in the results, as her accuracy was similar to that of the other participants.

When we compare the confusion matrices, we can see that MLP handled the imbalanced data slightly better than RF. Generally, fine-grained tasks such as peeling and slicing potatoes, peeling and tearing carrots, throwing food and wiping dishes yielded a better recognition rate than the others. These results indicate that some activities, namely the more fine-grained ones, may be easier to detect than others. These tasks most probably contained less noise and more prominent eye movements, as they were performed relatively consistently by all participants. This can be contrasted with making tea, which was done in different ways. Some participants started this task by placing two cups, putting in teabags and pouring water over them. Other participants started by placing one cup, pouring water in, putting in a teabag and then doing the same with the next cup. This could make it difficult to classify this data correctly, especially when it is aggregated over all participants. Using more fine-grained tasks, such as only looking at the movement of the teabag going up and down in the cup, might well have yielded better recognition. Giving clearer and stricter instructions to the participants could have resulted in more consistent data for this task, but it could also have led to participants merely following instructions rather than thinking and deciding what to do next, making the experiment more artificial.
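As a rough illustration of the classifier comparison discussed above, the sketch below shows how an RF and an MLP classifier could be evaluated side by side with scikit-learn, which we used for classification. The feature matrix X and the activity labels y are placeholders for the attributes extracted from the eye movement data, not our actual data set.

# Minimal sketch of an RF vs. MLP comparison; X and y are placeholders.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix

rng = np.random.RandomState(0)
X = rng.rand(200, 10)            # placeholder attribute vectors
y = rng.randint(0, 5, size=200)  # placeholder activity labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

classifiers = {
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
    "MLP": MLPClassifier(hidden_layer_sizes=(50,), max_iter=1000, random_state=0),
}

for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(name, "accuracy:", accuracy_score(y_test, y_pred),
          "macro f1:", f1_score(y_test, y_pred, average="macro"))
    # Rows are true activities, columns predicted ones; the off-diagonal
    # counts show which activities are confused with each other.
    print(confusion_matrix(y_test, y_pred))

The macro-averaged f-score weights all activities equally, which is why it is more informative than plain accuracy on an imbalanced data set such as ours.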

5.1 Results Compared to Related Work

Previous work by [30] showed somewhat similar results to ours. They achieved a higher accuracy, but the authors used the official Pupil hardware, which performs better than ours. They also had 80 hours of data, which strengthens our belief that a richer data set would most likely increase our accuracy. The authors looked both at general tasks such as traveling, which can encompass many sub-activities, and at more fine-grained tasks such as reading. Reading generally achieved a high accuracy of 74-75%, while traveling only reached 14-43%. This is in line with both our findings and those of [36]: a stationary, repeatable activity with low variance is easier to detect than a broad activity. However, we were not able to detect the reading task very well. This could be due to differences in the data set, the classifier we used, or the algorithm used for calculating saccades from the eye movement data. In our case, reading was only performed in very brief moments between the tasks. If we had defined a specific task in which the participant read continuously without interruption for a longer period, we might have detected the reading task better.

We used attributes similar to those of [30] but different classifiers. They used a Bag-of-Words (BOW) model, an approach commonly used within computer vision. It operates by encoding local areas of an image and grouping these encodings together for classification. Perhaps using this approach on our data set could have yielded different results. Similarly, [36] also used BOW in order to detect a smaller number of general and physical activities such as watching, writing, reading and walking. However, their data set was smaller and covered more general tasks. They tried a vision-based approach, a motion-based approach and a combination of the two, and achieved a high accuracy when combining them, indicating that a fusion-based approach seems to be a good option. Their research also indicates that BOW could be examined further, as they achieved good results with it. However, as their tasks were more general, it is not feasible to compare them directly with our study. Moreover, the fact that they performed their experiment on only a single person makes it difficult to draw any major conclusions from their results.
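As a generic illustration of the BOW idea, and not the exact pipeline of [30] or [36], local descriptors can be clustered into a codebook with k-means, after which each sample is represented as a histogram of codeword occurrences that can be fed to any classifier. The descriptor data below is randomly generated and purely hypothetical.

# Generic bag-of-words encoding sketch: cluster local descriptors into a
# codebook, then describe each sample as a histogram over the codewords.
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(pooled_descriptors, n_words=50, seed=0):
    """Cluster the pooled local descriptors into n_words 'visual words'."""
    return KMeans(n_clusters=n_words, random_state=seed).fit(pooled_descriptors)

def bow_histogram(descriptors, codebook):
    """Encode one sample (a set of descriptors) as a normalized word histogram."""
    words = codebook.predict(descriptors)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)

# Hypothetical data: 20 samples, each with 100 local descriptors of dimension 8.
rng = np.random.RandomState(0)
samples = [rng.rand(100, 8) for _ in range(20)]

codebook = build_codebook(np.vstack(samples), n_words=50)
features = np.array([bow_histogram(s, codebook) for s in samples])
print(features.shape)  # (20, 50): one fixed-length histogram per sample

The resulting fixed-length histograms could then be classified in the same way as any other attribute vectors.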

The research we could find that was most similar to ours is [37]. Just like us, they examined the possibility of detecting kitchen activities with an eye tracker. However, their approach focused on computer vision, mapping and detecting certain objects such as the participant's hands. Gaze was merely used to determine what the participant was looking at, and eye features such as saccades or fixations were thus not used in the classification. This makes their research not directly comparable with ours, but it could be interesting to combine their approach with data from eye movement analysis. The authors achieved high recognition rates for tasks such as "take carrot" but extremely low ones for "take milk". This suggests that detecting the small differences between fine-grained tasks is difficult, the same problem that we encountered.

To summarize, our results show that a decent accuracy of around 30-40% can be achieved in spite of our small test sample and low-performance hardware. Recognizing activities in 3D environments is possible, but further research is needed before anything can be generalized from our findings.

6 Conclusion

We have explored the possibility of recognizing everyday activities using eye movement data. There is a clear gap in previous research regarding this topic; few researchers have examined 3D environments when detecting activities with eye movement data. Using a home-built eye tracker, we conducted an experiment in which ten participants performed everyday activities in a kitchen environment in order to generate data. This data was then annotated, preprocessed and fed to classifiers in order to map attributes extracted from eye features to specific activities.

With our results, we managed to answer our research questions and hypotheses regarding what level of accuracy can be achieved and which activities are easier to detect. Our results showed that a moderate accuracy of 30-40% can be achieved for fine-grained tasks but that more general tasks tend to be difficult to detect.

In conclusion, our findings suggest that detecting activities in a 3D environment is possible, but that further research, above all with more data, is needed in order to generalize from our results. Our findings indicate that activity recognition in 3D environments remains difficult and complex. Many variables need to be examined further in detail in order to isolate which of them improve the results and which do not.

6.1 Future Work

For future work on this subject, there are a number of topics that can be explored further.

Firstly, having more varied tasks and environments, more participants and more repetitions of each task would probably have yielded quite different results compared to ours. We suggest using real environments; letting participants use the kitchen they use daily would give less artificial results. Outdoor environments are also an interesting area to explore, as they are even noisier than indoor ones.

Secondly, we suggest using higher-performance eye trackers, both head-mounted and remote, in order to compare the differences between them. Our hardware platform was obtrusive and not very reliable for data collection; better hardware would mitigate these problems.

Thirdly, the importance of each step in the machine learning pipeline should not be understated. More complex features could be extracted, such as smooth pursuits (smooth eye movements that follow a moving target), and features could be combined, for example a flag indicating that a blink occurred within a saccade. Such features could in turn be used to calculate more advanced attributes such as saccades per second or the range of a saccade (see the sketch below); these attributes could also have been preprocessed differently to reduce noise more efficiently. We also found it difficult to choose which classifiers to use. There are numerous classifiers with different settings that can be tuned for the specific problem at hand, and a study of which classifier works best for which type of eye movement data set would be a valuable contribution to this field of research.

Fourthly, combining eye tracking in 3D environments with object recognition or other approaches is an area of high interest. Previous research using a fusion of sensors, albeit in 2D, has been shown to yield good results. Therefore, using a fusion of sensors in 3D environments would be an interesting area of research.
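To make the feature ideas above concrete, the hypothetical sketch below computes a saccades-per-second attribute and a blink-within-saccade flag. It assumes that saccade and blink events have already been detected and are given as (start, end) timestamps in seconds; both the helper functions and the example events are illustrative only.

# Hypothetical sketch of the attribute ideas above; saccade and blink events
# are assumed to be (start, end) timestamps in seconds from earlier detection.
def saccades_per_second(saccades, window_start, window_end):
    """Number of saccades starting inside the window, divided by its length."""
    duration = window_end - window_start
    count = sum(1 for start, _ in saccades if window_start <= start < window_end)
    return count / duration if duration > 0 else 0.0

def blink_within_saccade_flags(saccades, blinks):
    """For each saccade, flag whether any blink overlaps it in time."""
    return [any(b_start < s_end and b_end > s_start for b_start, b_end in blinks)
            for s_start, s_end in saccades]

# Made-up example events.
saccades = [(0.10, 0.14), (0.80, 0.85), (1.30, 1.36)]
blinks = [(0.82, 0.95)]
print(saccades_per_second(saccades, 0.0, 2.0))        # 1.5
print(blink_within_saccade_flags(saccades, blinks))   # [False, True, False]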

References

[1] M. Weiser. “The Computer for the 21st Century”. In: Scientific American 265.3 (1991), pp. 94–105.
[2] D. Evans. “The Internet of Things: How the Next Evolution of the Internet is Changing Everything”. In: Cisco Internet Business Solutions Group (IBSG) 1 (2011), pp. 1–11. doi: 10.1109/IEEESTD.2007.373646.
[3] Phillips. url: https://www2.meethue.com/en-us (visited on 03/23/2018).
[4] Z-Wave. url: http://www.z-wave.com/ (visited on 03/23/2018).
[5] Fortune Global 500 List 2017: See Who Made It. url: http://fortune.com/global500/list/ (visited on 03/23/2018).
[6] Samsung. Samsung Delivers Vision for Open and Intelligent IoT Experiences to Simplify Everyday Life. url: https://news.samsung.com/uk/samsung-delivers-vision-for-open-and-intelligent-iot-experiences-to-simplify-everyday-life (visited on 03/23/2018).
[7] O. B. Sezer, E. Dogdu, and A. M. Ozbayoglu. “Context-Aware Computing, Learning, and Big Data in Internet of Things: A Survey”. In: IEEE Internet of Things Journal 5.1 (Feb. 2018), pp. 1–27. doi: 10.1109/JIOT.2017.2773600.
[8] Runtastic. Runtastic: Running, Cycling & Fitness GPS Tracker. url: https://www.runtastic.com/ (visited on 04/10/2018).
[9] T. Toutountzi et al. “EyeOn: An activity recognition system using MYO armband”. In: vol. 29-June-2016. 2016. doi: 10.1145/2910674.2910687.
[10] L. Bao and S. S. Intille. “Activity Recognition from User-Annotated Acceleration Data”. en. In: Pervasive Computing. Lecture Notes in Computer Science. Springer, Berlin, Heidelberg, Apr. 2004, pp. 1–17. doi: 10.1007/978-3-540-24646-6_1.
[11] A. Bulling, U. Blanke, and B. Schiele. “A Tutorial on Human Activity Recognition Using Body-worn Inertial Sensors”. In: ACM Comput. Surv. 46.3 (Jan. 2014), 33:1–33:33. doi: 10.1145/2499621.
[12] E. M. Tapia, S. S. Intille, and K. Larson. “Activity Recognition in the Home Using Simple and Ubiquitous Sensors”. en. In: Pervasive Computing. Lecture Notes in Computer Science. Springer, Berlin, Heidelberg, Apr. 2004, pp. 158–175. doi: 10.1007/978-3-540-24646-6_10.
[13] O. T. C. Chen et al. “Activity recognition using a panoramic camera for homecare”. In: 2017 14th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS). Aug. 2017, pp. 1–6. doi: 10.1109/AVSS.2017.8078546.
[14] K. Zhan, F. Ramos, and S. Faux. “Activity recognition from a wearable camera”. In: 2012 12th International Conference on Control Automation Robotics Vision (ICARCV). Dec. 2012, pp. 365–370. doi: 10.1109/ICARCV.2012.6485186.

[15] A. Barreto, Y. Gao, and M. Adjouadi. “Pupil Diameter Measurements: Untapped Potential to Enhance Computer Interaction for Eye Tracker Users?” In: Proceedings of the 10th International ACM SIGACCESS Conference on Computers and Accessibility. Assets ’08. New York, NY, USA: ACM, 2008, pp. 269–270. doi: 10.1145/1414471.1414532.
[16] S. Rafiqi et al. “PupilWare: Towards Pervasive Cognitive Load Measurement Using Commodity Devices”. In: Proceedings of the 8th ACM International Conference on PErvasive Technologies Related to Assistive Environments. PETRA ’15. New York, NY, USA: ACM, 2015, 42:1–42:8. doi: 10.1145/2769493.2769506.
[17] C. M. Privitera et al. “The pupil dilation response to visual detection”. In: Human Vision and Electronic Imaging XIII. Vol. 6806. International Society for Optics and Photonics, Feb. 2008, 68060T. doi: 10.1117/12.772844.
[18] C. S. Campbell and P. P. Maglio. “A Robust Algorithm for Reading Detection”. In: Proceedings of the 2001 Workshop on Perceptive User Interfaces. PUI ’01. New York, NY, USA: ACM, 2001, pp. 1–7. doi: 10.1145/971478.971503.
[19] Y.-C. Jian and H.-W. Ko. “Influences of text difficulty and reading ability on learning illustrated science texts for children: An eye movement study”. In: Computers and Education 113 (2017), pp. 263–279. doi: 10.1016/j.compedu.2017.06.002.
[20] A. Bulling et al. “Robust Recognition of Reading Activity in Transit Using Wearable Electrooculography”. en. In: Pervasive Computing. Lecture Notes in Computer Science. Springer, Berlin, Heidelberg, May 2008, pp. 19–37. doi: 10.1007/978-3-540-79576-6_2.
[21] K. Kunze et al. “I Know What You Are Reading: Recognition of Document Types Using Mobile Eye Tracking”. In: Proceedings of the 2013 International Symposium on Wearable Computers. ISWC ’13. New York, NY, USA: ACM, 2013, pp. 113–116. doi: 10.1145/2493988.2494354.
[22] A. Bulling, J. A. Ward, and H. Gellersen. “Multimodal Recognition of Reading Activity in Transit Using Body-worn Sensors”. In: ACM Trans. Appl. Percept. 9.1 (Mar. 2012), 2:1–2:21. doi: 10.1145/2134203.2134205.
[23] F. T. Keat, S. Ranganath, and Y. V. Venkatesh. “Eye gaze based reading detection”. In: Conference on Convergent Technologies for Asia-Pacific Region TENCON 2003. Vol. 2. Oct. 2003, 825–828 Vol. 2. doi: 10.1109/TENCON.2003.1273294.
[24] T. Partala and V. Surakka. “Pupil size variation as an indication of affective processing”. In: International Journal of Human-Computer Studies. Applications of Affective Computing in Human-Computer Interaction 59.1 (July 2003), pp. 185–198. doi: 10.1016/S1071-5819(03)00017-X.
[25] A. Babiker, I. Faye, and A. Malik. “Pupillary behavior in positive and negative emotions”. In: 2013 IEEE International Conference on Signal and Image Processing Applications. Oct. 2013, pp. 379–383. doi: 10.1109/ICSIPA.2013.6708037.
[26] Tobii. url: https://www.tobii.com (visited on 03/22/2018).

[27] S. Ishimaru et al. “In the Blink of an Eye: Combining Head Motion and Eye Blink Frequency for Activity Recognition with Google Glass”. In: Proceedings of the 5th Augmented Human International Conference. AH ’14. New York, NY, USA: ACM, 2014, 15:1–15:4. doi: 10.1145/2582051.2582066.
[28] S. Jalaliniya, D. Mardanbegi, and T. Pederson. “MAGIC Pointing for Eyewear Computers”. In: Proceedings of the 2015 ACM International Symposium on Wearable Computers. ISWC ’15. New York, NY, USA: ACM, 2015, pp. 155–158. doi: 10.1145/2802083.2802094.
[29] P. Majaranta and A. Bulling. “Eye Tracking and Eye-Based Human–Computer Interaction”. en. In: Advances in Physiological Computing. Ed. by S. H. Fairclough and K. Gilleade. London: Springer London, 2014, pp. 39–65. doi: 10.1007/978-1-4471-6392-3_3.
[30] J. Steil and A. Bulling. “Discovery of Everyday Human Activities from Long-term Visual Behaviour Using Topic Models”. In: Proceedings of the 2015 ACM International Joint Conference on Pervasive and Ubiquitous Computing. UbiComp ’15. New York, NY, USA: ACM, 2015, pp. 75–85. doi: 10.1145/2750858.2807520.
[31] F. You et al. “Using Eye-Tracking to Help Design HUD-Based Safety Indicators for Lane Changes”. In: Proceedings of the 9th International Conference on Automotive User Interfaces and Interactive Vehicular Applications Adjunct. AutomotiveUI ’17. New York, NY, USA: ACM, 2017, pp. 217–221. doi: 10.1145/3131726.3131757.
[32] O. Palinko et al. “Estimating Cognitive Load Using Remote Eye Tracking in a Driving Simulator”. In: Proceedings of the 2010 Symposium on Eye-Tracking Research & Applications. ETRA ’10. New York, NY, USA: ACM, 2010, pp. 141–144. doi: 10.1145/1743666.1743701.
[33] X. Zhang, H. Kulkarni, and M. R. Morris. “Smartphone-Based Gaze Gesture Communication for People with Motor Disabilities”. In: Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems. CHI ’17. New York, NY, USA: ACM, 2017, pp. 2878–2889. doi: 10.1145/3025453.3025790.
[34] C. Lankford. “Effective Eye-gaze Input into Windows”. In: Proceedings of the 2000 Symposium on Eye Tracking Research & Applications. ETRA ’00. New York, NY, USA: ACM, 2000, pp. 23–27. doi: 10.1145/355017.355021.
[35] U. Sambrekar and D. Ramdasi. “Human computer interaction for disabled using eye motion tracking”. In: 2015 International Conference on Information Processing (ICIP). Dec. 2015, pp. 745–750. doi: 10.1109/INFOP.2015.7489481.
[36] Y. Shiga et al. “Daily Activity Recognition Combining Gaze Motion and Visual Features”. In: Proceedings of the 2014 ACM International Joint Conference on Pervasive and Ubiquitous Computing: Adjunct Publication. UbiComp ’14 Adjunct. New York, NY, USA: ACM, 2014, pp. 1103–1111. doi: 10.1145/2638728.2641691.
[37] A. Fathi, Y. Li, and J. M. Rehg. “Learning to Recognize Daily Actions Using Gaze”. en. In: Computer Vision – ECCV 2012. Lecture Notes in Computer Science. Springer, Berlin, Heidelberg, Oct. 2012, pp. 314–327. doi: 10.1007/978-3-642-33718-5_23.

[38] B. Kitchenham. “Procedure for undertaking systematic reviews”. In: Computer Science Department, Keele University (TRISE-0401) and National ICT Australia Ltd (0400011T.1), Joint Technical Report (2004).
[39] M. Kuhrmann, D. M. Fernández, and M. Daneva. “On the Pragmatic Design of Literature Studies in Software Engineering: An Experience-based Guideline”. en. In: (Dec. 2016). arXiv: 1612.03583.
[40] D. Budgen and P. Brereton. “Performing Systematic Literature Reviews in Software Engineering”. In: Proceedings of the 28th International Conference on Software Engineering. ICSE ’06. New York, NY, USA: ACM, 2006, pp. 1051–1052. doi: 10.1145/1134285.1134500.
[41] A. M. O’Brien and C. Mc Guckin. The Systematic Literature Review Method: Trials and Tribulations of Electronic Database Searching at Doctoral Level. 1 Oliver’s Yard, 55 City Road, London EC1Y 1SP United Kingdom: SAGE Publications, Ltd., 2016. doi: 10.4135/978144627305015595381.
[42] C. Wohlin. “Guidelines for Snowballing in Systematic Literature Studies and a Replication in Software Engineering”. In: Proceedings of the 18th International Conference on Evaluation and Assessment in Software Engineering. EASE ’14. New York, NY, USA: ACM, 2014, 38:1–38:10. doi: 10.1145/2601248.2601268.
[43] O. D. Lara and M. A. Labrador. “A Survey on Human Activity Recognition using Wearable Sensors”. In: IEEE Communications Surveys Tutorials 15.3 (2013), pp. 1192–1209. doi: 10.1109/SURV.2012.110112.00192.
[44] T. Maekawa et al. “Activity Recognition with Hand-worn Magnetic Sensors”. In: Personal Ubiquitous Comput. 17.6 (Aug. 2013), pp. 1085–1094. doi: 10.1007/s00779-012-0556-8.
[45] K.-T. Song and W.-J. Chen. “Human activity recognition using a mobile camera”. In: 2011 8th International Conference on Ubiquitous Robots and Ambient Intelligence (URAI). Nov. 2011, pp. 3–8. doi: 10.1109/URAI.2011.6145923.
[46] Z. Zhou et al. “Video-based activity monitoring for indoor environments”. In: 2009 IEEE International Symposium on Circuits and Systems. May 2009, pp. 1449–1452. doi: 10.1109/ISCAS.2009.5118039.
[47] A. K. S. Kushwaha, M. Kolekar, and A. Khare. “Vision Based Method for Object Classification and Multiple Human Activity Recognition in Video Surveillance System”. In: Proceedings of the CUBE International Information Technology Conference. CUBE ’12. New York, NY, USA: ACM, 2012, pp. 47–52. doi: 10.1145/2381716.2381727.
[48] O. C. Ann and L. B. Theng. “Human activity recognition: A review”. In: 2014 IEEE International Conference on Control System, Computing and Engineering (ICCSCE 2014). Nov. 2014, pp. 389–393. doi: 10.1109/ICCSCE.2014.7072750.
[49] Z. Zhang. “Microsoft Kinect Sensor and Its Effect”. In: IEEE MultiMedia 19.2 (Feb. 2012), pp. 4–10. doi: 10.1109/MMUL.2012.24.

[50] J. Lei, X. Ren, and D. Fox. “Fine-grained Kitchen Activity Recognition Using RGB-D”. In: Proceedings of the 2012 ACM Conference on Ubiquitous Computing. UbiComp ’12. New York, NY, USA: ACM, 2012, pp. 208–211. doi: 10.1145/2370216.2370248.
[51] Z. Zhou et al. “Activity Analysis, Summarization, and Visualization for Indoor Human Activity Monitoring”. In: IEEE Transactions on Circuits and Systems for Video Technology 18.11 (Nov. 2008), pp. 1489–1498. doi: 10.1109/TCSVT.2008.2005612.
[52] L. Wang et al. “Silhouette analysis-based gait recognition for human identification”. In: IEEE Transactions on Pattern Analysis and Machine Intelligence 25.12 (Dec. 2003), pp. 1505–1518. doi: 10.1109/TPAMI.2003.1251144.
[53] K. Ohnishi et al. “Recognizing Activities of Daily Living with a Wrist-Mounted Camera”. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). June 2016, pp. 3103–3111. doi: 10.1109/CVPR.2016.338.
[54] K. Zhan, V. Guizilini, and F. Ramos. “Dense motion segmentation for first-person activity recognition”. In: 2014 13th International Conference on Control Automation Robotics Vision (ICARCV). Dec. 2014, pp. 123–128. doi: 10.1109/ICARCV.2014.7064291.
[55] F. Shahmohammadi et al. “Smartwatch Based Activity Recognition Using Active Learning”. In: Proceedings of the Second IEEE/ACM International Conference on Connected Health: Applications, Systems and Engineering Technologies. CHASE ’17. Piscataway, NJ, USA: IEEE Press, 2017, pp. 321–329. doi: 10.1109/CHASE.2017.115.
[56] S. A. Hoseini-Tabatabaei, A. Gluhak, and R. Tafazolli. “A Survey on Smartphone-based Systems for Opportunistic User Context Recognition”. In: ACM Comput. Surv. 45.3 (July 2013), 27:1–27:51. doi: 10.1145/2480741.2480744.
[57] D. Bobkov et al. “Activity recognition on handheld devices for pedestrian indoor navigation”. In: 2015 International Conference on Indoor Positioning and Indoor Navigation (IPIN). Oct. 2015, pp. 1–10. doi: 10.1109/IPIN.2015.7346945.
[58] S. Dernbach et al. “Simple and Complex Activity Recognition through Smart Phones”. In: 2012 Eighth International Conference on Intelligent Environments. June 2012, pp. 214–221. doi: 10.1109/IE.2012.39.
[59] J. R. Kwapisz, G. M. Weiss, and S. A. Moore. “Activity Recognition Using Cell Phone Accelerometers”. In: SIGKDD Explor. Newsl. 12.2 (Mar. 2011), pp. 74–82. doi: 10.1145/1964897.1964918.
[60] C. Zhu and W. Sheng. “Realtime human daily activity recognition through fusion of motion and location data”. In: The 2010 IEEE International Conference on Information and Automation. June 2010, pp. 846–851. doi: 10.1109/ICINFA.2010.5512451.
[61] V. Radu et al. “Towards Multimodal Deep Learning for Activity Recognition on Mobile Devices”. In: Proceedings of the 2016 ACM International Joint Conference on Pervasive and Ubiquitous Computing: Adjunct. UbiComp ’16. New York, NY, USA: ACM, 2016, pp. 185–188. doi: 10.1145/2968219.2971461.

[62] M. Buettner et al. “Recognizing Daily Activities with RFID-based Sensors”. In: Proceedings of the 11th International Conference on Ubiquitous Computing. UbiComp ’09. New York, NY, USA: ACM, 2009, pp. 51–60. doi: 10.1145/1620545.1620553.
[63] E. Garcia-Ceja, C. E. Galván-Tejada, and R. Brena. “Multi-view stacking for activity recognition with sound and accelerometer data”. In: Information Fusion 40 (Mar. 2018), pp. 45–56. doi: 10.1016/j.inffus.2017.06.004.
[64] C. Wojek, K. Nickel, and R. Stiefelhagen. “Activity Recognition and Room-Level Tracking in an Office Environment”. In: 2006 IEEE International Conference on Multisensor Fusion and Integration for Intelligent Systems. Sept. 2006, pp. 25–30. doi: 10.1109/MFI.2006.265608.
[65] R. I. Ramos-Garcia, S. Tiffany, and E. Sazonov. “Using respiratory signals for the recognition of human activities”. In: 2016 38th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC). Aug. 2016. doi: 10.1109/EMBC.2016.7590668.
[66] B. Wei et al. “Radio-based Device-free Activity Recognition with Radio Frequency Interference”. In: Proceedings of the 14th International Conference on Information Processing in Sensor Networks. IPSN ’15. New York, NY, USA: ACM, 2015, pp. 154–165. doi: 10.1145/2737095.2737117.
[67] W. J. Ryan et al. “Match-moving for Area-based Analysis of Eye Movements in Natural Tasks”. In: Proceedings of the 2010 Symposium on Eye-Tracking Research & Applications. ETRA ’10. New York, NY, USA: ACM, 2010, pp. 235–242. doi: 10.1145/1743666.1743722.
[68] D. Akkil and P. Isokoski. “Gaze Augmentation in Egocentric Video Improves Awareness of Intention”. In: Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems. CHI ’16. New York, NY, USA: ACM, 2016, pp. 1573–1584. doi: 10.1145/2858036.2858127.
[69] A. Bulling, D. Roggen, and G. Troester. “What’s in the Eyes for Context-Awareness?” In: IEEE Pervasive Computing 10.2 (Apr. 2011), pp. 48–57. doi: 10.1109/MPRV.2010.49.
[70] A. Bulling, C. Weichel, and H. Gellersen. “EyeContext: Recognition of High-level Contextual Cues from Human Visual Behaviour”. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. CHI ’13. New York, NY, USA: ACM, 2013, pp. 305–308. doi: 10.1145/2470654.2470697.
[71] A. Bulling et al. “Eye movement analysis for activity recognition”. In: 2009, pp. 41–50. doi: 10.1145/1620545.1620552.
[72] C. Meng and X. Zhao. “Webcam-Based Eye Movement Analysis Using CNN”. In: IEEE Access 5 (2017), pp. 19581–19587. doi: 10.1109/ACCESS.2017.2754299.
[73] Y. Morita, H. Takano, and K. Nakamura. “Characteristics of Pupil Diameter Variation during Paying Attention to a Specific Image”. In: 2015 IEEE International Conference on Systems, Man, and Cybernetics. Oct. 2015, pp. 2328–2332. doi: 10.1109/SMC.2015.407.

[74] S. P. Liversedge and J. M. Findlay. “Saccadic eye movements and cognition”. In: Trends in Cognitive Sciences 4.1 (Jan. 2000), pp. 6–14. doi: 10.1016/S1364-6613(99)01418-7.
[75] B. Hwang et al. “Probing of human implicit intent based on eye movement and pupillary analysis for augmented cognition”. In: International Journal of Imaging Systems and Technology 23.2 (2013), pp. 114–126. doi: 10.1002/ima.22046.
[76] H. Park et al. “Using eye movement data to infer human behavioral intentions”. In: Computers in Human Behavior 63 (2016), pp. 796–804. doi: 10.1016/j.chb.2016.06.016.
[77] Y.-M. Jang et al. “Recognition of Human’s Implicit Intention Based on an Eyeball Movement Pattern Analysis”. en. In: Neural Information Processing. Lecture Notes in Computer Science. Springer, Berlin, Heidelberg, Nov. 2011, pp. 138–145. doi: 10.1007/978-3-642-24955-6_17.
[78] D.-G. Lee, K.-H. Lee, and S.-Y. Lee. “Implicit shopping intention recognition with eye tracking data and response time”. In: 2015, pp. 295–298. doi: 10.1145/2814940.2815001.
[79] Y.-M. Jang et al. “Human intention recognition based on eyeball movement pattern and pupil size variation”. In: Neurocomputing 128 (2014), pp. 421–432. doi: 10.1016/j.neucom.2013.08.008.
[80] C. Louis and M. Lawrence. Research Methods in Education. 7 edition. London; New York: Routledge, Apr. 2011. Chap. 16.
[81] O. Briony. Researching Information Systems and Computing. English. 1 edition. London; Thousand Oaks, Calif: SAGE Publications Ltd, Nov. 2005.
[82] M. Kassner, W. Patera, and A. Bulling. “Pupil: An Open Source Platform for Pervasive Eye Tracking and Mobile Gaze-based Interaction”. In: Proceedings of the 2014 ACM International Joint Conference on Pervasive and Ubiquitous Computing: Adjunct Publication. UbiComp ’14 Adjunct. New York, NY, USA: ACM, 2014, pp. 1151–1160. doi: 10.1145/2638728.2641695.
[83] Microsoft. USB Video Class Driver Overview. url: https://docs.microsoft.com/en-us/windows-hardware/drivers/stream/usb-video-class-driver-overview (visited on 04/01/2018).
[84] F. Pedregosa et al. “Scikit-learn: Machine Learning in Python”. In: Journal of Machine Learning Research 12 (2011), pp. 2825–2830.
[85] G. Lemaitre, F. Nogueira, and C. K. Aridas. “Imbalanced-learn: A Python Toolbox to Tackle the Curse of Imbalanced Datasets in Machine Learning”. In: CoRR abs/1609.06570 (2016). arXiv: 1609.06570.
[86] P. Kiefer, I. Giannopoulos, and M. Raubal. “Using eye movements to recognize activities on cartographic maps”. en. In: ACM Press, 2013, pp. 488–491. doi: 10.1145/2525314.2525467.
[87] D. Salvucci and J. Goldberg. “Identifying fixations and saccades in eye-tracking protocols”. In: Proceedings of the Eye Tracking Research and Applications Symposium. Jan. 2000, pp. 71–78. doi: 10.1145/355017.355028.

[88] I. H. Witten, E. Frank, and M. A. Hall. Data mining: practical machine learning tools and techniques. en. 3rd ed. Morgan Kaufmann series in data management systems. OCLC: ocn262433473. Burlington, MA: Morgan Kaufmann, 2011.
[89] M. M. Graham et al. “CHAPTER 52 - Parametric Image Generation with Neural Networks”. In: Quantitative Functional Brain Imaging with Positron Emission Tomography. Academic Press, 1998, pp. 347–352. doi: 10.1016/B978-012161340-2/50054-8.
[90] K. M. Ting. “Precision and Recall”. In: Encyclopedia of Machine Learning and Data Mining. Ed. by C. Sammut and G. I. Webb. Boston, MA: Springer US, 2017, pp. 990–991. doi: 10.1007/978-1-4899-7687-1_659.

A Appendix

A.1 Systematic Literature Study Search Keywords

Table 1: Keywords that were used when searching in the databases

eye and gaze tracking
activity and intention recognition
“gaze tracking”
eye* & intention
eye & track* & recognition
eye* & “activity recognition”
“eye movement analysis”
gaze & “activity recognition”
gaze & intention
inertial & “activity recognition”
camera & “activity recognition”

A.2 Systematic Literature Study Inclusion and Exclusion Criteria

Table 2: The criteria that the literature is filtered through

Nbr | Inclusion/Exclusion | Description | Motivation
1 | Relevant databases, keywords | Only use specified databases, keywords in abstract | These are the sources that are relevant in the computer science field.
2 | Publication form | Only use proceedings, journals and articles | These formats are common and well regarded within computer science.
3 | Relevant abstracts and title | Only read title and abstracts | To determine if the literature is relevant or not.
4 | Relevant conclusion and summary | Only read conclusion and discussion | To further determine if the literature is relevant or not.
5 | Relevant full text content | Read full text | To even further determine if the literature is relevant or not.
6 | Backwards snowballing on chosen literature | Use one iteration of backwards snowballing on chosen literature | To get even more relevant references that are used in previous research and well used within the field. This criterion assumes that the previous criteria are already met.

A.3 Applied Inclusion and Exclusion Criteria to the Search Results

Table 3: Summary of hits after applying criteria

Criteria | Items left
1 | 6708
2 | 3784
3 | 89
4 | 66
5 | 36
6 | 54
