Experimental Test Report

Document information
Project Title: 6th Sense
Project Number: E.02.25
Project Manager: Fraunhofer Austria
Deliverable Name: Verification Report "Experimental Test Report"
Deliverable ID: Del 4.2
Edition: 00.01.01
Template Version: 03.00.00
Task contributors:

Fraunhofer Austria, FREQUENTIS AG, Fraunhofer FKIE, subcontracted by Fraunhofer Austria

Abstract
The project Sixth Sense postulates that the user's "body language" differs between "good" and "bad" decisions. The project follows the idea of using the whole body language of a user for communicating with a machine. In our case it is an Air Traffic Controller (ATCO) with an Air Traffic Tower CWP. Specifically, we intend to analyse the correlation of the change of the behaviour of an ATCO - expressed through his body language - with the quality of the decisions he is making. For that, an experiment was set up, and data about the user behaviour was collected, explored and analysed. This document is the test report of the proof of concept for the Sixth Sense prototype and its core components. Results of our work may be used for an early warning for "bad" situations about to occur or as decision aids for the ATCO.
Authoring & Approval

Prepared By - Authors of the document. (Name & Company, Position & Title, Date)
Volker Settgast / Fraunhofer Austria, Project Contributor, 22.06.2015
Nelson Silva / Fraunhofer Austria, Project Contributor, 01.07.2015
Carsten Winkelholz, Jessica Schwarz / Fraunhofer FKIE (subcontracted by Fraunhofer Austria), Project Contributors, 07.07.2015
Michael Poiger / Frequentis AG, Project Contributor, 08.07.2015
Florian Grill / Frequentis AG, Project Contributor, 08.07.2015

Reviewed By - Reviewers internal to the project. (Name & Company, Position & Title, Date)
Theodor Zeh / Frequentis, Technical Coordinator, 15.07.2015

Eva Eggeling / Fraunhofer Austria, Project Manager, 15.07.2015

Approved for submission to the SJU By - Representatives of the company involved in the project. (Name & Company, Position & Title, Date)
Theodor Zeh / Frequentis, Technical Coordinator, 30.07.2015

Eva Eggeling / Fraunhofer Austria, Project Manager, 30.07.2015

Rationale for rejection: None.

Document History

Edition, Date, Status, Author, Justification
00.00.01, 20/06/2015, Draft, Eva Eggeling, New Document
00.00.03, 07/07/2015, Update, Volker Settgast, merged version
00.00.04, 09/07/2015, Update, all, merged version
00.00.08, 15/07/2015, Update, all, merged version
00.01.00, 30/07/2015, Submission Version, Eva Eggeling, Merged Version
00.01.01, 15/09/2015, Review Version, Eva Eggeling/all, Resubmission

Intellectual Property Rights (foreground) This deliverable consists of SJU foreground.


Table of Contents
TABLE OF CONTENTS ...... 3
LIST OF TABLES ...... 4
LIST OF FIGURES ...... 4
EXECUTIVE SUMMARY ...... 6
1.1 PURPOSE OF THE DOCUMENT ...... 7
1.2 INTENDED READERSHIP ...... 7
1.3 ACRONYMS AND TERMINOLOGY ...... 7
2 THE EXPERIMENT ...... 9
2.1 EXPERIMENTAL SETUP ...... 9
2.2 OPERATIONAL SCENARIO ...... 10
2.3 ROLES AND RESPONSIBILITIES ...... 11
2.4 TECHNICAL SETUP OF THE EXPERIMENT ...... 12
2.5 AMQ BROKER ...... 13
2.6 THE HUMAN MACHINE INTERFACE (HMI) ...... 13
3 PERFORMING THE EXERCISES ...... 16
3.1 PROFILE OF PARTICIPANTS ...... 16
3.2 DATA ANALYSIS, EXPLORATION AND VISUALIZATION ...... 17
3.2.1 Heart Rate vs Observations List ...... 18
3.2.2 Eye-Tracker and Mouse Analysis ...... 19
3.2.3 Simple Metrics and Data Exploration ...... 20
4 RESULTS ...... 23
4.1 WORKLOAD ESTIMATES BASED ON QUESTIONNAIRES ...... 23
4.2 HINTS FOR H1 - EXPLORING THE SENSOR DATA ...... 28
4.2.1 Sixth Sense Prototype Framework for Data Exploration ...... 31
4.2.2 Categorization of Metrics regarding Mental Aspects ...... 34
4.2.3 Research Questions ...... 38
4.2.4 Data Exploration and Analysis ...... 40
4.3 HINTS FOR H2 - ANALYSIS OF THE ARRIVAL AND DEPARTURE WORKFLOWS ...... 54
4.3.1 Implementation ...... 54
4.3.2 Results of the Analysis of ATC Workflow Steps ...... 55
4.3.3 Experiments ...... 57
4.4 EVENT TRACE ANALYSIS ...... 61
4.4.1 Variable Length Markov Models (VLMM) ...... 61
4.4.2 Scatterplot Matrix for Measures ...... 64
4.4.3 Visualization of Sequential Patterns ...... 64
4.4.4 Insights regarding Interaction Sequences ...... 66
4.4.5 States corresponding to Outliers and Around ...... 75
4.5 CONCLUSION ...... 77
4.6 FUTURE WORK ...... 79
REFERENCES ...... 80
APPENDIX A ...... 82
A.1 TECHNICAL VERIFICATION DETAILS OF EXERCISE 1 AND 2 ...... 82
A.1.1 Kinect ...... 87
A.1.2 Speech Recognition ...... 92
APPENDIX B QUESTIONNAIRES ...... 94


List of tables
Table 1 - Description of the workflow steps ...... 9
Table 2 - Data collection and Quality Assessment for Different Data Sets and Sensors ...... 17
Table 3 - Resume of initial metrics to be visualized and explored ...... 21
Table 4 - Outliers in negative/positive answers ...... 28
Table 5 - Resume of most important metrics ...... 31
Table 6 - Classification of most important metrics into categories ...... 35
Table 7 - List of Main Research Questions ...... 39
Table 8 - Resume of AOI that received most interest time from each user ...... 45
Table 9 - Resume of parameters for the Kinect Head Pose ...... 48
Table 10 - Filter/Query to detect airplanes that are in the workflow step TAXI ...... 55
Table 11 - Most frequent state sequences for the eye data (top 5 for each user) ...... 67
Table 12 - Most frequent states of each user for the eye fixation sequences ...... 68
Table 13 - Most frequent state sequences for the mouse data (top 5 for each user) ...... 69
Table 14 - Most complex state sequences for the eye tracking data (top 5 for each user) ...... 70
Table 15 - Illustration of the most complex state sequences for the eye tracking data ...... 72
Table 16 - Most complex state sequences for the mouse data (top 5 for each user) ...... 73
Table 17 - Illustration of the most complex state sequences for the mouse data ...... 74
Table 18 - Examples of state sequences corresponding to outliers in the scatterplots ...... 76
Table 19 - Technical specifications of Kinect ...... 88
Table 20 - Kinect Results ...... 91

List of figures
Figure 1 - Update of the exercise plan as described in the experimental plan ...... 9
Figure 2 - Experimental Workflow ...... 10
Figure 3 - Hamburg Airport ...... 11
Figure 4 - Arrival workflow - responsibilities ...... 12
Figure 5 - Departure workflow - responsibilities ...... 12
Figure 6 - Setup working position ...... 13
Figure 7 - Components of the HMI screen ...... 14
Figure 8 - Departure Strips ...... 14
Figure 9 - Arrival Strips ...... 15
Figure 10 - Strip Bay Configuration / Button Bar ...... 15
Figure 11 - RMSSD Formula ...... 18
Figure 12 - Z-Score IBI vs negative observations through the total experiment time for user8 ...... 19
Figure 13 - Areas of interest of the ATC Simulator as defined in Ogama ...... 20
Figure 14 - Ranking of metrics and visualizations ...... 21
Figure 15 - Observation List ...... 22
Figure 16 - Mental Demand Results for all 8 users, 2 experiments ...... 24
Figure 17 - Physical Demand Results for all 8 users, 2 experiments ...... 24
Figure 18 - Temporal Demand Results for all 8 users, 2 experiments ...... 24
Figure 19 - Level of Effort Results for all 8 users, 2 experiments ...... 24
Figure 20 - Level of Frustration Results for all 8 users, 2 experiments ...... 25
Figure 21 - NASA-TLX Negative Results (not considering "Level of Performance" answers) ...... 25
Figure 22 - Level of Performance (for all users, 2 experiments) ...... 25
Figure 23 - NASA-TLX Correlation Matrix (taking all answers from all users) ...... 25
Figure 24 - SAGAT Based Questionnaire ...... 26
Figure 25 - SAGAT Correlated Answers ...... 26
Figure 26 - SASHA based Questionnaire ...... 27
Figure 27 - SASHA Questionnaires, correlated Plot ...... 27
Figure 28 - NASA-TLX and SASHA Correlation Matrix ...... 27
Figure 29 - Negative vs Positive Answers (based on all questionnaires) ...... 27
Figure 30 - Overview of the Sixth Sense Desktop Application Prototype ...... 31
Figure 31 - Screenshot of the Sixth Sense desktop application UIAction Pace Calculator ...... 33
Figure 32 - Sixth Sense desktop application UI Actions Types Monitor ...... 33
Figure 33 - Sixth Sense web based reports for supervisors data exploration, also printable ...... 34
Figure 34 - Distinction between task load and workload (Hilburn & Jorna, 2001) ...... 35
Figure 35 - Relationship between workload and performance (Veltman & Jansen, 2003) ...... 36
Figure 36 - Events from observation list with high impact on the performance of the users ...... 37
Figure 37 - Events from observation list with more impact on the performance of each user ...... 38
Figure 38 - Departures and arrivals (green area) vs number of negative observations (red) ...... 40
Figure 39 - Interdependence between arriving airplanes, departures and stress levels ...... 41
Figure 40 - Correlation between negative observations and HRV. HRV is a good indicator for periods of negative observations ...... 42
Figure 41 - Relation between mouse AOI frequencies, observation list and HRV ...... 43
Figure 42 - Mouse AOI of user7 that received most interest time during the experiment ...... 44
Figure 43 - Eye AOI of user7 that received most interest time during the experiment ...... 44
Figure 44 - standard deviation o(2 minutes) and number of errors, capturing very well periods with increased user errors ...... 46
Figure 45 - Relation between eye and mouse movements (AOI visits) and occurrence of errors ...... 47
Figure 46 - Kinect Head Pose Measurements Schema ...... 48
Figure 47 - Kinect Data Representation, visualizing Detected Head Pose vs Count of Negative/Positive Observations vs Type of Observation vs User in Range (or not in range) ...... 49
Figure 48 - Kinect Data after applying filters to include only the majority of negative observations (96%) ...... 50
Figure 49 - Correlation between total number of mouse clicks and negative observations ...... 51
Figure 50 - Correlation of negative observations and difference in number of words/mouse actions ...... 52
Figure 51 - Relationship between number of words spoken and negative observations ...... 53
Figure 52 - Example of using CEP to join two different events into one ...... 54
Figure 53 - The complete process of consuming, filtering and generating events ...... 55
Figure 54 - Analysis of the Processing Time (seconds) for arrivals (orange/brown) and departures (blue) for user8 ...... 56
Figure 55 - DM/ML/AI module with automatically calculated metrics for arrival flights capturing repeated workflow steps (e.g., number of taxi commands or cross runways for all flights) ...... 57
Figure 56 - Discovery of Association Rules using the algorithm fp-growth ...... 58
Figure 57 - Relation between the discovered association rules and different variables of the model ...... 58
Figure 58 - Outliers Discovery for negative observations in the new dataset with metrics counters (captured between successive negative observations) ...... 59
Figure 59 - Decision tree to depict reasons for increasing numbers of negative occurrences for different users ...... 60
Figure 60 - Polynomial regression analysis for creating a model to predict negative occurrences based on top most metrics (number of eye events or departure flights) ...... 61
Figure 61 - Relation of Probabilistic Suffix Tree (PST) and Automation PSA ...... 62
Figure 62 - Hypothetical distribution of event durations, if after one event (left) or a sequence of two events (right) a specific event is observed ...... 62
Figure 63 - Illustration of the complexity measure of Grassberger ...... 63
Figure 64 - Screenshot of the with displayed transition probabilities ...... 65
Figure 65 - Illustration of how probabilities for next events in a sequence are displayed ...... 65
Figure 66 - Illustration of the user interface combining states with displayed scatterplot matrix ...... 66
Figure 67 - Eye-Tracking ...... 82
Figure 68 - Test Setup - Eye-Tracking ...... 83
Figure 69 - Eye Tracking Data Analysis ...... 84
Figure 70 - Test Person 1 - Eye Tracking ...... 85
Figure 71 - Test Person 2 - Eye Tracking ...... 86
Figure 72 - Test Person 3 - Eye Tracking ...... 86
Figure 73 - Test Person 4 - Eye Tracking ...... 87
Figure 74 - Kinect sensor ...... 87
Figure 75 - Sensors included in the Kinect ...... 88
Figure 76 - Test Setup - Kinect ...... 89
Figure 77 - Evaluation of distances and angle ...... 90
Figure 78 - Test Setup - Speech Recognition ...... 92
Figure 79 - Callsign Recognition Rate - Speech Recognition ...... 93


Executive summary
The project Sixth Sense follows the idea of using the whole body language of a user for communicating with a machine. In our case it is an Air Traffic Controller (ATCO) with an Air Traffic Tower CWP. Specifically, we intend to analyse the correlation of the change of the behaviour of an ATCO - expressed through her/his body language - with the quality of the decisions she/he is making. Results of our work may be used for an early warning for "bad" situations about to occur or as decision aids for the ATCO. We used scenarios of Hamburg Airport since its layout has sufficient complexity to bring the test personnel into difficult situations, which are needed to test our hypothesis. Sensors for reading the body language were:
- Kinect for body movement
- Eye tracking for gaze detection
- Speech recognition
- Mouse position
- Room temperature
- Heartbeat of the user
- Expert observations
The sensors were recorded throughout each run together with the workflow and the tasks performed by the user. The workflow was retrospectively analysed by experts who marked bad decisions and/or bad situations arising. Combinations of sensor recordings and different visualisations therefrom were used to detect repetitive patterns of user behaviour correlating with good or bad decisions. Several test runs were performed in two batches to gain as much test data as possible in the available time frame to experiment with. Details are in the paragraphs below.

Key learnings of the work performed were:
- An analysis of decision quality through experts is difficult since the intention of the test person stays hidden. Additional self-assessment will add value in future tests.
- Analyses of sensor recordings offer infinite possibilities of combinations as well as visualisations therefrom. Further work on the existing data might produce even more significant findings.

Conclusion: our test setup and process proved right. The analytical tools and visualisations used are feasible, although there are numerous other possibilities which might be even better. Due to the nature of this kind of exploratory research project with restricted resources, no statistical significance of the found patterns can be claimed; the number of test persons was too low. However, the concrete patterns which have been found allow deriving early indications for good or bad decisions. There are good indications for positive results when more test data and more time are available for sensor permutation analysis.


1.1 Purpose of the document
This document provides a results report on the experiment in the Sixth Sense project. Chapter 2 describes the setup of the experiment and explains the exercises. The performance of the experiment is summed up in Chapter 3. In Chapter 4 we present the results: we give a classification of our most important metrics related to task load, mental workload, attention, behaviour and performance. Then we have a deeper look into related research questions and describe the complex data analysis. Relationships between the sensor data streams are discussed and conclusions of the analysis are described in detail. We explain the capabilities of our software framework and outline future directions of research, for example the use of the current results to create predictive models or improvements in the user interface to support the user in making more informed decisions.
1.2 Intended readership
This document might be of interest for:
- Sixth Sense project members, including the project manager and the core team members.
- Representatives of EUROCONTROL and SJU responsible for reviewing and advising the project.
- Other researchers working on related research projects, particularly researchers on error avoidance, new technologies and interaction methods.
- Personnel in air traffic management and other parts of the aviation sector.
1.3 Acronyms and Terminology

Term Definition

AI Artificial Intelligence

AMQ Active Message Queue

ARR Arrival

ATCO Air Traffic Control Officer

ATM Air Traffic Management

DEP Departure

DM Data Mining

IBI Inter Beat Interval

MFA Multilateral Framework Agreements

HRV heart rate variability

KPI Key Performance Indicator

ML Machine Learning

Negative error Negative situation that could not be resolved


Positive error Negative situation, which could be solved with effort of the user

SESAR Single European Sky ATM Research Programme

SESAR Programme The programme which defines the Research and Development activities and Projects for the SJU.

SJU SESAR Joint Undertaking (Agency of the European Commission)

SJU Work Programme The programme which addresses all activities of the SESAR Joint Undertaking Agency.

TWR Tower



2 The Experiment
This section provides general information on the final design of the experiment, the preparation of the exercises and their performance. In contrast to the original plan described in the experimental plan 4.1, due to limited resources we skipped the AI-module development and did not perform Exercise 3. A few tasks of Exercise 3 (first steps towards prediction) were handled by analysing the data collected in Exercise 2.

Figure 1 - Update of the exercise plan as described in the experimental plan.
The prediction and the test of the DM/ML/AI-module were part of Exercise 3 and could not be processed because of the limited amount of time, resources and data. First steps regarding predictions are described in Section 4.3.3.
2.1 Experimental Setup
The following steps have been conducted to execute the experiment, in which a participant performs a simulated 60-minute ground controller shift at a simulated controller working position.

Name: Description
Overall Experimental Briefing: Provision of an overall briefing, providing an overview of the used system and the operational scenario conducted during the exercise.
Start of Experiment: Reset of the operational scenario.
A_Pre-Questionnaire: Collecting information about the test person (working experience, etc.).
Recording of data: Start of the recording of data, to be collected into the database.
Run Exercise: Start of the operational scenario and conducting the exercise.
B_Supervisor Observation: During the exercise the observer took notes and collected the stress level.
C_Post-Questionnaire: Collection of subjective feelings (situational awareness, workload).
D_Debriefing: Collection of debriefing questionnaire answers of the test person.
Overall Experiment Debriefing: General debriefing to close the experiment session.
Table 1 - Description of the workflow steps

Every participant received a map of the airport (Hamburg) and was asked to take their place at the experimental working position. The participants were informed that they could ask the air traffic controller supervisor, who was present in the room, questions about the use of the simulator user interface. When all questions were answered, the experiment started, and the air traffic information was loaded into the simulator. Every 10 minutes, the supervisor asked for the current stress level experienced by the participant and noted his personal evaluation of the participant's current performance. The experiment lasted for 45 minutes, but it could run for 60 minutes maximum, depending on the current air traffic situation.

Figure 2 - Experimental Workflow

Table 1 and Figure 2 provide an overview of the different workflow steps within the experimental scenario. The Questionnaires A-D can be found in Appendix B.
2.2 Operational Scenario
The operational scenario was based on Hamburg Airport.


Figure 3 - Hamburg Airport
The following constraints have been used to prepare the scenario:
- Simulation prepared for approx. 60 min.
- Arrivals are automatically simulated until touchdown (no change of route).
- Departures are controlled until take-off.
- No runway change is foreseen within the simulation.
- Taxiway routes can be selected by the operator.

Configurations during the experiment:
- Arrival Runway: 23
- Departure Runway: 33
- Arrivals: 31 flights
- Departures: 27 flights
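For illustration only, this scenario configuration could be captured in a small machine-readable form. The dictionary below and its field names are our own sketch (including the assumed ICAO code for Hamburg Airport), not part of the simulator:

    # Hypothetical, machine-readable summary of the exercise scenario (names are ours).
    SCENARIO = {
        "airport": "EDDH",               # Hamburg Airport (ICAO code, assumed)
        "duration_min": 60,              # simulation prepared for approx. 60 min
        "arrival_runway": "23",
        "departure_runway": "33",
        "arrivals": 31,                  # number of arrival flights
        "departures": 27,                # number of departure flights
        "runway_change": False,          # no runway change foreseen
        "taxi_routes_selectable": True,  # taxiway routes can be selected by the operator
    }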

2.3 Roles and Responsibilities
The following roles were participating in the experiment:
- Ground Controller: Participant
- Runway Controller: Manually Simulated
- Pseudo Pilots: Manually Simulated
- Observer (supervisor)
- Observer (experiment leader)


The responsibilities within the workflow are displayed in the following Figures:

Figure 4 - Arrival workflow - responsibilities

Figure 5 - Departure workflow - responsibilities

2.4 Technical Setup of the Experiment
The setup is based on a single simulated controller working position. No 3D view is available in the experiment. The experiment concentrated on ground traffic management.

The following modules have been used during the experiment:
- Traffic Simulator
- CWP with EFS, Support Information
- AMQ Broker
- Eye-Tracker
- Mouse
- Keyboard
- Speech Recognition



Figure 6 - Setup working position
2.5 AMQ Broker
The broker is the central distribution system for all data communication between all components. The transport protocols used are STOMP and OpenWire. ActiveMQ (AMQ) allows a single point of data exchange between different systems, modules and functional blocks, through the usage of customized XML messages. It supports a variety of cross-language clients and protocols, e.g. Java, C, C++, C#, Python. For detailed information please refer to: http://activemq.apache.org/
2.6 The Human Machine Interface (HMI)
The HMI is split into several parts which are described in more detail in the following paragraphs. In general, the right side of the screen is reserved for a representation of the flights, the smartStrips. The middle part can contain a variety of different information including an overview of the airfield as shown in the picture below. The top contains an information bar and to the side there is a sidebar containing additional information, e.g. status of the system or wind data. Summarized, the screen can thus be separated into:
- Sidebar
- Button bar (EFS)
- Strips
- Page Selection (Main Area)
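Returning to the AMQ broker of Section 2.5, the sketch below illustrates how a component could publish a sensor event to the broker over STOMP. It assumes the third-party stomp.py client and a broker listening on ActiveMQ's default STOMP port 61613; the topic name and XML payload are invented for illustration and are not the project's actual message schema.

    import stomp  # third-party STOMP client (pip install stomp.py)

    # Connect to the ActiveMQ broker via STOMP (61613 is ActiveMQ's default STOMP port).
    conn = stomp.Connection([("localhost", 61613)])
    conn.connect("user", "password", wait=True)

    # Publish one sensor event as a customized XML message to a topic.
    # Topic name and payload structure are illustrative only.
    conn.send(destination="/topic/sixthsense.events",
              body="<MouseEvent t='2015-06-22T10:15:03' x='812' y='440' button='left'/>",
              content_type="application/xml")

    conn.disconnect()

Other components (the data logger, the DM/ML/AI module) would subscribe to the same topics, which is what makes the broker the single point of data exchange described above.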


Figure 7 - Components of the HMI screen
Figure 8 explains the different fields of the departure strip. Three different sizes are available: DEP MICRO, DEP MEDIUM, and DEP MACRO. The filled explanations mean that this field can be pressed on the strip. Figure 9 explains the fields of the arrival strip. Three sizes are available: ARR MICRO, ARR MEDIUM, and ARR MACRO. The filled explanations mean that this field can be pressed on the strip.

Figure 8 - Departure Strips


Figure 9 - Arrival Strips

Figure 10 - Strip Bay Configuration / Button Bar
In Figure 10 the information about the configured bays and the explanation of the button bar are shown.


3 Performing the Exercises
The first exercise of this experiment was used to assess the accuracy of the sensors integrated into the prototype, to ensure the necessary quality of the used technologies. The following sensors were tested in Exercise 1:
- Eye Tracker
- Kinect
- Speech Recognition
See Appendix A, Exercise 1 of the deliverable D4.2 (Verification Plan):
- Exercise ID/Title: EXE-E.02.25-VP-0001.0001 / Eye-Tracking adapter
- Exercise ID/Title: EXE-E.02.25-VP-0001.0002 / Kinect AMQ-adapter
- Exercise ID/Title: EXE-E.02.25-VP-0001.0003 / Leap Motion AMQ-adapter
The Leap Motion was found to be not useful in a seated, mouse-based working environment. Instead, the speech recognition module was evaluated in more detail. More technical details about Exercise 1 can be found in Appendix A of this document.

In Exercise 2 the participants followed the experimental workflow of Figure 2 and performed the simulated 60-minute shift of a ground controller. During the exercise, the supervisor took observation notes and asked the participant for her/his stress level (on a scale from 1 to 5) every ten minutes. The observation notes consist of a time stamp and a short description of the observation. In the second part of Exercise 2 (users 5-8) this process was already automated and stored using the software framework. In a later step, the recorded video capture of the exercise was reviewed by a domain expert to create the observer list. The observer list consists of selected events which are rated positive, neutral or negative. A positive event occurs when the participant can successfully resolve a negative event. The detailed description of Exercise 2 can be found in Appendix B, Exercise 2 of the deliverable D4.2 (Verification Plan): Exercise ID/Title: EXE-E.02.25-VP-0002.0001 / Collecting sensor data and expert reviews. The questionnaires before, during and after the exercise can be found in Appendix B. As mentioned in Section 2, in contrast to the original plan, we did not perform Exercise 3; we could only take the first steps regarding predictions in the data analysis of the Exercise 2 data.
3.1 Profile of Participants
All participants work in the field of air traffic control but at different levels of expertise: one as an en-route controller, two as ground controllers, and one trained as a ground controller who works only in simulation experiments.
Years of work experience: The participants had 2, 4, 14 and 20 years of professional experience respectively.
Gender: There were two male (50%) and two female (50%) participants.
Age: One participant was aged between 20 and 30, one was aged between 30 and 40 and two were aged between 40 and 50.
Language: Two of the participants had German as their mother language, one Romanian and the other Spanish. All communications between pilots and air traffic controllers were handled in English.


Due to the limited availability of test participants we had to reuse participants for the experiments. Learning effects caused by this reuse cannot be ruled out, but as we tried to measure behaviour for individual test runs, this effect can be neglected.
3.2 Data Analysis, Exploration and Visualization
After performing Exercise 2 and post-processing the data, a resume table (see Table 2) with the total number of usable events for each generated dataset (topic) was created.

Data Assessment

Total Topics User User User User User User User User Events Descriptio Variables (Datasets) 1 2 3 4 5 6 7 8 by n Topic Reports from Supervisor & Supervisors 3 51 65 13 123 91 91 107 152 693 Observer and Observers Stress Level 3 StressLevel 6 6 6 6 6 6 7 6 49 reports (from users) Flight 42 FlighObject 616 420 57 436 302 241 340 434 2846 Information Strips 6 Selections 420 302 17 211 197 149 209 268 1773 Selections 10 Eye 0 0 53 169 9097 61770 72534 68116 211739 Eye Tracker Mouse UI 4 GlobalMouse 0 72891 2929 79618 3844 12700 7588 8082 187652 Hook Mouse 7 Mouse 3046 1929 110 1290 916 1838 1266 1915 12310 Listner Kinect 23 Kinect 27351 0 0 0 7561 0 0 0 34912 Listner 12 Voice 1014 1160 36 1242 256 899 1126 1587 7320 Voice Listner Waspmote 4 Waspmote 0 0 0 0 1807 2754 3041 3376 10978 Listner Heart Rate Heart Rate 12 Measurement 0 2978 2274 5347 0 4512 0 5184 20295 Events s Collected Eye AOI Eye Tracking (Fixations, 13 0 0 0 0 715 1625 3396 3767 8788 Areas of Gazes, Interest Sacades) Mouse AOI Mouse (Fixations, Tracking 13 0 2217 0 2325 62 2729 1002 1083 9418 Gazes, Areas of Sacades) Interest Total Data <= Total Collected in Number of 32504 81968 5495 90767 24139 89314 90616 93970 152 508773 2 Variables experiments Number <= Total of Total Number of Number of Events Events 58 Airplanes Collecte d

27 <= Departures 31 <= Arrivals

Table 2- Data collection and Quality Assessment for Different Data Sets and Sensors

The first entry in Table 2 contains the notes and annotations from the supervisor and observer of the exercise (see also Section 3.2.3 "Observations"). These notes are short text notes, for example about observations of errors or suboptimal situations. When talking about observations of errors we distinguish between positive and negative, where positive means that a negative situation could be resolved by some effort of the user.

The second entry is the stress level. We acquired this information by asking the participant every ten minutes about the subjective stress level on a scale from 1 to 5.

During the exercises we encountered several hardware issues with the Kinect sensor. In favour of a higher eye tracking frequency it was decided to deactivate the Kinect for user 6-8.

3.2.1 Heart Rate vs Observations List
Not all users agreed to wear the heart rate monitor device (for different reasons: health, privacy). For user2, user3, user4, user6 and user8 we collected at least 3 baseline measurements at rest. In addition to the heart beats per minute we also measured the heart rate variability (HRV). The HRV indicates the fluctuations of the heart rate around an average heart rate. An average heart rate of 60 beats per minute (bpm) does not mean that the interval between successive heartbeats is exactly 1.0 sec; instead the interval may fluctuate/vary from 0.5 sec up to 2.0 sec. HRV is affected by aerobic fitness, and the HRV of a well-conditioned heart is generally large at rest. Other factors that affect HRV are age, genetics, body position, time of day, and health status. During exercise, HRV decreases as heart rate and exercise intensity increase. HRV also decreases during periods of mental stress. The HRV is regulated by the autonomic nervous system: parasympathetic activity decreases heart rate and increases HRV, whereas sympathetic activity increases heart rate and decreases variability. A low HRV indicates dominance of the sympathetic response, the fight-or-flight side of the nervous system associated with stress, overtraining, and inflammation. Therein lies the beauty of HRV: it offers a glimpse into the activity of the autonomic nervous system, an aspect of our physiology that is normally hard to observe. For the representation of the HRV we may use the time-domain method of the Root Mean Square of the Successive Differences (or RMSSD), as we can see in Figure 11:
- HR = heart rate in beats per minute (bpm)
- R-R interval = inter-beat interval (IBI) in msec
- N = number of R-R interval terms

$$\mathrm{RMSSD} = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N-1}\left(RR_{i+1}-RR_{i}\right)^{2}}$$

Figure 11 - RMSSD Formula.

Alternatively we also calculated the Z-Score and Z-Score IBI measure.

The inter-beat interval (IBI) is a scientific term used in reference to the time interval between individual beats of the heart. IBI is generally measured in units of milliseconds and it is measured automatically when recording with a Polar heart rate sensor. In normal heart function, each IBI value varies from beat to beat. This natural variation is known as HRV (see above). However, certain cardiac conditions may cause the individual IBI values to become nearly constant, resulting in the HRV being nearly zero. The Z-Score (HR) is the normalized value for the heart rate, obtained from a distribution by mean and standard deviation; here we take the average of the 3 heart rate measurements taken while the user was at rest (at the beginning of the experiment). The Z-Score IBI is the normalized value of the inter-beat interval, again from a distribution by mean and standard deviation; here we take the average of all measurements.

We use the Z-Score and Z-Score IBI measures to find out how far the current value is from the average baseline inter-beat interval measurements. This allows us to better check for changes over time in the users' HRV. In Figure 12 we can see an example of the Z-Score IBI plot for user8. In the upper part of the graph (count of positive/negative observations) the data is filtered to show only negative observations (experts in red, observers in orange). Below this chart, we can see the Avg Z-Score IBI, where in red we have negative values (decreases in the heart rate variation) and in green we see positive values (increases in the variation). The decrease in variation can be associated with periods of stress. We can observe that before periods of time with more negative observations there are clear indications of stress (decreases in HRV), followed by moments of relaxation when the user regains control.

Figure 12 - Z-Score IBI vs negative observations through the total experiment time for user8.
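A minimal Python sketch of the two measures used above is shown below. The function names and sample values are invented for illustration, and the baseline handling follows our reading of the text (a resting baseline series, against which a current value is normalized):

    import math

    def rmssd(rr_ms):
        # Root Mean Square of Successive Differences of the R-R (inter-beat) intervals.
        diffs = [b - a for a, b in zip(rr_ms, rr_ms[1:])]
        return math.sqrt(sum(d * d for d in diffs) / len(diffs))

    def z_score(value, baseline):
        # Normalize a value against the mean and standard deviation of a baseline series.
        mean = sum(baseline) / len(baseline)
        std = math.sqrt(sum((x - mean) ** 2 for x in baseline) / len(baseline))
        return (value - mean) / std

    # Invented IBI samples in milliseconds: a resting baseline and one current value.
    baseline_ibi = [980, 1010, 995, 1005, 990]
    current_ibi = 870
    print(rmssd(baseline_ibi))                 # HRV (RMSSD) of the resting baseline
    print(z_score(current_ibi, baseline_ibi))  # Z-Score IBI of the current inter-beat interval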

3.2.2 Eye-Tracker and Mouse Analysis
By monitoring the movement of the eyes and reconstructing the gaze point on the screen (eye tracking), the data stream contains many rapid position changes. Human visual perception needs a certain amount of time to consciously register an element on the screen. We are interested in gaze positions that are actively realized by the user. These positions are called fixations. The freely available eye tracking analysis software Ogama [1] was used to calculate fixations. The areas of interest (AOI) were defined within Ogama (see Figure 13). The calculation of fixations then automatically takes the AOIs into account and connects the results. For simplicity in the processing pipeline of the data, we used the same procedure for the mouse movements. Mouse fixations are positions on which the mouse cursor rested for a certain amount of time (in our case: delta t > 66 ms and delta d < 20 pixels). All the fixation results were exported to a comma separated value (CSV) file. To use the results in other software like Tableau we had to modify the CSV files: the time stamp had to be converted from seconds to a valid date-time format. For the eye tracking data we added an off-screen AOI for large time slots (> 500 ms) between fixations, because saccades between two fixations never take so much time, so it can be assumed that tracking had been lost and the operator had fixated something beside the screen, for example a map which lay in front of him.
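A sketch of the mouse fixation rule described above (the cursor dwells longer than 66 ms within a 20-pixel neighbourhood) is given below. This is our illustrative reimplementation, not Ogama's algorithm; the off-screen rule for eye data (gaps larger than 500 ms between fixations) could be handled analogously.

    # samples: list of (t_ms, x_px, y_px) tuples, ordered by time (format is ours).
    def mouse_fixations(samples, min_duration_ms=66, max_dispersion_px=20):
        fixations = []
        start = 0
        for i in range(1, len(samples) + 1):
            # The cursor "moved" once it leaves the neighbourhood of the candidate fixation.
            moved = (i < len(samples) and
                     max(abs(samples[i][1] - samples[start][1]),
                         abs(samples[i][2] - samples[start][2])) >= max_dispersion_px)
            if i == len(samples) or moved:
                t0, t1 = samples[start][0], samples[i - 1][0]
                if t1 - t0 > min_duration_ms:   # keep only dwells longer than 66 ms
                    fixations.append((t0, t1, samples[start][1], samples[start][2]))
                start = i
        return fixations

    samples = [(0, 100, 100), (40, 102, 101), (90, 103, 99), (140, 300, 240), (220, 301, 241)]
    print(mouse_fixations(samples))   # [(0, 90, 100, 100), (140, 220, 300, 240)]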

Figure 13 - Areas of interest of the ATC Simulator as defined in Ogama

3.2.3 Simple Metrics and Data Exploration
After pre-processing the data, we had to decide what data set (topic) and what metrics related to the topic should be visualized, explored and investigated first and in more detail. For this we have created a resume table (see Table 3) which lists the metrics we wanted to analyse, visualize and explore.

Metrics Visualization Initial Requirements (Topic/Dataset: Metric)
Mouse: Number of mouse clicks per time and per user
Observer notes: Errors, help requests and annotations noted by the observer
Selections: Number of strip updates per time and per user
Selections: Strip selection through time
Selections: Temporal evolution of the number of selections for the different users (animated)
Selections and StressLevel: Stress Level vs Number of Callsign Interactions
Supervisor Notes: errors
Voice: Number of calls per time (speech recognition)
Workload Metrics: When there was an error report, we show the workload metrics between errors
Eye: Number of fixations per time (eye)
Eye: Number of saccades per time (eye)
Eye: Number of transitions for AOI per time (eye)
Eye and Mouse: Correlation between transitions of eye and mouse
FlightObject: Number of flights in one workflow step per time and per user
GlobalMouseWatcher: Left clicks, mouse moves
Heart Rate: Average heart beat per time
Kinect: Head position, body posture
Mouse: Number of drag & drops per time and per user, mouse positions, AOIs
Voice and Selections: Number of speech recognition results vs strip updates
Waspmote Sensors (Temp and Light): Temperature and light values
Table 3 - Resume of initial metrics to be visualized and explored

After the creation of the initial visualizations, the project consortium discussed the ranking and importance of the different metrics and the suggested visualization types with respect to practical applicability and meaning for the overall goal of finding patterns for different types of behaviour. Therefore, ATC experts of the company partner got to vote (on a scale from 1 to 5) in order to specify the level of satisfaction with the current visualizations and metrics utilized. Figure 14 shows the template. This template also served as a reference list to discuss interesting findings or patterns that could be found in the analysed data.

Figure 14 - Ranking of metrics and visualizations

As a starting point, eight different topics were selected independently for visualization and exploration. The selected topics were:
- Mouse
- Observations
- Selections
- Selection and Stress level
- Voice
- Workload Metrics
- Eye Mouse AOI
- Heart Measurements


The decision was based on data availability, completeness (data available for the total simulation time), and data quality. Therefore, Kinect and Waspmote could unfortunately not be considered for the data exploration. As shown in Table 2 Kinect data could be recorded only for 2 out of 8 test runs, and for Waspmote in only half of the test runs.

Observations:

In order to identify specific situations in the simulation/exercise and to be able to compare similar situations within the data, different kinds of observations have been recorded and ranked by a supervisor:

- Observations during the test run
- Offline observations

These observations have been merged into a list with the following information:

- Timestamp
- Observation Type
- Message
- Positive / Neutral / Negative Observation
- Category
- User
- Synchronized Timestamp
- Ranking (-5 to 5)
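As an illustration, one entry of the merged observation list could be represented as follows. The class, field types and example values are ours; only the field names follow the list above:

    from dataclasses import dataclass
    from datetime import datetime

    @dataclass
    class Observation:
        timestamp: datetime        # original note timestamp
        observation_type: str      # e.g. live observation vs offline (video) review
        message: str               # free-text description of the event
        polarity: str              # "positive", "neutral" or "negative"
        category: str              # observation category (example value below is invented)
        user: str                  # e.g. "user8"
        sync_timestamp: datetime   # timestamp synchronized with the sensor streams
        ranking: int               # ranking from -5 to 5

    obs = Observation(datetime(2015, 6, 22, 10, 17, 3), "offline", "late taxi clearance",
                      "negative", "taxi", "user8", datetime(2015, 6, 22, 10, 17, 5), -3)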

Figure 15 - Observation List


4 Results
Recall of Hypotheses H1 and H2: As described in the Verification Plan D4.1, the performance of the final parameterized DM/ML/AI-algorithms should be tested by answering the following hypotheses:
- Hypothesis H1: the DM/ML/AI-module is able to detect situations in which the operator tends to make bad decisions by analysing user-input and user-tracking data
- Hypothesis H2: the DM/ML/AI-algorithms are able to identify good and bad workflow patterns
Workflow patterns in H2 are sub-sequences of actions the controller performs. These workflow patterns might vary between good and bad controllers. H1 refers to the decisions a controller makes in a single step of the workflow.

Both hypotheses refer to the evaluation of whether the developed DM/ML/AI-modules are able to assess the state of an operator, which was planned to be verified in a third exercise. Due to limited resources for further experiments and, as a consequence, limited data collection, it was not possible to perform the third exercise. Therefore, we present the analysis of the data in more detail with respect to the question: what kind of patterns have been detected and might be useful for the development of such a module?
In the first section of this chapter, high-level results are presented which describe the general performance of the users. We start in Section 4.1 with the workload estimates that we could generate from the questionnaires. In Section 4.2 we turn our attention to the measured sensor data. We start with simple visualizations and from that we create a list of useful and interesting metrics with different levels of complexity (combined sensor data from different sources). We classify the metrics into categories. With this background we created detailed and concrete research questions that guide our analysis, visualization and exploration of data towards good predictors of the users' behaviours (Section 4.2.3). Combined visualizations are used to show what metrics and what patterns could be found in the collected data. In Section 4.4 we show results of the event trace analysis. At the end of this chapter our conclusion (Section 4.5) and future work (Section 4.6) can be found. These findings can be used in the future for the development of models to predict the users' behaviours.
4.1 Workload Estimates based on Questionnaires
The NASA-TLX is a multi-dimensional scale designed to obtain workload estimates from one or more operators while they are performing a task or immediately afterwards. In our case the users filled out the NASA-TLX questionnaire after the experiment. The NASA-TLX has different rating scale descriptions. The first rating describes the mental workload of the users, i.e., the amount of mental and/or perceptual activity that was required (e.g., thinking, deciding, remembering, calculating, looking, searching). We observed that user1, user4, user6 and user8 reported a higher mental workload at the end of their experiments (Figure 16).


Figure 16 - Mental Demand Results for all 8 users, 2 experiments.
Figure 17 - Physical Demand Results for all 8 users, 2 experiments.

Next we analysed the reported physical demand. This rating refers to the amount of physical activity that was required (e.g., pushing, pulling, turning, controlling, activating). Here user5 and user8 reported a higher physical demand (Figure 17). Temporal demand refers to the amount of pressure that the user felt due to the rate at which the task elements occurred (e.g., was the task slow and leisurely or rapid and frantic?). User1, user6 and user7 reported a higher temporal demand (Figure 18).

Figure 18 - Temporal Demand Results for all 8 users, 2 experiments.
Figure 19 - Level of Effort Results for all 8 users, 2 experiments.
We checked the level of effort (Figure 19) reported by the users, i.e., how hard the user had to work mentally and physically to accomplish the level of performance. For user1, user4, user5 and user7 there was a higher level of effort. The level of frustration is related to how insecure, discouraged, irritated, stressed and annoyed versus secure, gratified, content, relaxed and complacent the user felt during the task. User1, user5 and user7 reported a higher level of frustration (Figure 20).


Figure 20 - Level of Frustration Results for all 8 users, 2 experiments.
Figure 21 - NASA-TLX Negative Results (not considering "Level of Performance" answers).
Considering that the maximum possible value is 50 for each user and that we consider only the 5 negative related ratings in NASA-TLX (mental, physical, temporal, level of effort and level of frustration), we can display a general overview of the ratings reported by our users. In Figure 21 we see that user1, user7 and user5 reported a higher workload, although the results for user4 and user6 are very similar to user5. In our examples we mostly use user4, user6 and user8 to better represent the results, because we had more data available for these users. We also show the level of performance reported by the users, i.e., how successful the user thinks he/she was in accomplishing the goals of the task set by the experimenter (Figure 22). User2, user3, user4 and user5 reported higher ratings, which is consistent with the other (negative) ratings.
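As a small worked example of the score plotted in Figure 21, under our reading of the text above (the five negative ratings are simply summed, each assumed to be on a 0-10 scale so that the maximum is 50; the numbers are invented):

    # Unweighted NASA-TLX "negative" workload score: sum of the five negative ratings
    # (mental, physical, temporal, effort, frustration), each assumed on a 0-10 scale.
    def tlx_negative_score(ratings):
        keys = ("mental", "physical", "temporal", "effort", "frustration")
        return sum(ratings[k] for k in keys)   # maximum possible value: 50

    example = {"mental": 8, "physical": 3, "temporal": 7, "effort": 6, "frustration": 5,
               "performance": 4}               # "performance" is reported but not summed
    print(tlx_negative_score(example))         # 29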

Figure 22 - Level of Performance (for all users, 2 experiments).
Figure 23 - NASA-TLX Correlation Matrix (taking all answers from all users).
We also used the data analysis tool R to create correlation matrix plots that show the possible relation between the different answers of the users (Figure 23). These correlation matrices help in the interpretation of our data and results. We can for instance observe in Figure 23 the negative correlation between Level of Performance and Temporal Demand (lower left corner) or the positive correlation between Physical Demand and Age. Please note that for Gender we use 2 (higher value) for female and 1 for male. Next we present the SAGAT-based questionnaire. As we can see (Figure 24), user8 had a positive report and user7, user4 and user6 had a higher negative rating. In Figure 25 we can observe the correlation matrix plot for all answers from all users to the SAGAT-based questionnaire.
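The correlation matrices were produced with R; a minimal equivalent sketch in Python/pandas is shown below, with invented values and illustrative column names:

    import pandas as pd

    # One row per user with questionnaire answers (values invented for illustration).
    answers = pd.DataFrame({
        "mental_demand":   [8, 4, 5, 7, 6, 7, 9, 6],
        "temporal_demand": [7, 3, 4, 6, 5, 8, 8, 5],
        "performance":     [4, 8, 7, 7, 6, 5, 3, 8],
        "age":             [28, 45, 41, 35, 33, 47, 29, 38],
    })

    # Pairwise Pearson correlation matrix, analogous to the plots in Figures 23, 25 and 27.
    print(answers.corr())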

Figure 24 - SAGAT Based Questionnaire.
Figure 25 - SAGAT Correlated Answers.
In the SASHA-based questionnaire (Figure 26), user8 and user2 had a higher positive rating, while user1, user7 and user6 had a more negative rating. In Figure 27 we can observe the correlation matrix plot for all answers from all users to the SASHA-based questionnaire.


Figure 26 - SASHA based Questionnaire.
Figure 27 - SASHA Questionnaires, correlated Plot.
Finally, we plotted the mixed correlation between the NASA-TLX questionnaires and the SASHA questionnaires (see Figure 28).

Figure 28 - NASA-TLX and SASHA Correlation Matrix.
Figure 29 - Negative vs Positive Answers (based on all questionnaires).

In the combination of all questionnaires and reports of the users we looked for evident signs of negativity or positivity in the users' answers. This allows a better evaluation of the reliability of the users' answers. In Figure 29 we present the results regarding this positivity and negativity analysis. User8 seems to be very positive in all the answers, while user7, user4 and user6 are much more negative in general. In Table 4 we see the negative/positive answer outliers. For this we analysed all questionnaires for consistent negative or positive answers. It can be seen that user7 was consistently negative in his answers.

Interesting findings: user1*, user4**, user7# / user7#, user7#, user6** / user3**, user8***, user1* / user2***, user1*, user2*** / user8***
Legend: * - considering the 2 top positive users with a negative score or the 2 top users with a positive score; ** - only one time positive or only one time negative; *** - 2 times positive; # - always negative. Note: user5 never appears in the top 2 negative or positive.
Table 4 - Outliers in negative/positive answers.
As a conclusion to the analysis of the questionnaires, the results can give hints for further data exploration. It would make sense to look at the users that reported the highest workload (here user1 and user7, see Figure 21) and check for correlations with their heart rate or number of errors, subject to the condition that these data are available for those users. But sometimes our decision for further analysis was based on other factors, like the data quantity. Including questionnaires in the data analysis is definitely a promising option. Due to limited resources and data quantity we could not exploit the maximum potential.
4.2 Hints for H1 - Exploring the Sensor Data
In the first stage after post-processing the measured data collection, we created simplified visualizations for initial discussion and exploration of the data. As a result we agreed on a reference list of metrics, charts and initial findings. Furthermore, we got an idea of more complex visualizations where we could combine several metrics and visualization types.

The initial list of ideas and questions to study more intensively was the following:
- Similar to the NASA-TLX grouping, maybe we should create visualizations that group the visualizations according to: workload (heart rate, fixation frequency, number of departures or arrivals), temporal demand, performance
- Make use of different performance measures
- Can we show how fast the utterances were spoken?
- Can we distinguish between sentences that have the same meaning but use fewer words?
- Is the user using shorter words when the stress levels are higher?
- We must focus on the Observations list


- We should combine stress levels, observations and heart measurements
- We should combine voice, selections, observations and heart measurements
- We should combine fixation frequency, stress levels, observations and heart measurements
- How many departures and arrivals per minute? Is this related to stress and changes in the heart rate variability?

From this starting point, we created a resume table with the most important metrics that could be combined in order to better represent the overall status of the users at each point in time (which is in our case, per minute).
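A sketch of how such a per-minute status table could be assembled from the time-stamped event streams is shown below. The use of pandas resampling and the stream and column names are our illustration, not the project's actual implementation:

    import pandas as pd

    # Time-stamped event streams for one user (invented data for illustration).
    mouse_clicks = pd.DataFrame({"timestamp": pd.to_datetime(
        ["2015-06-22 10:00:05", "2015-06-22 10:00:40", "2015-06-22 10:01:10"])})
    eye_fixations = pd.DataFrame({"timestamp": pd.to_datetime(
        ["2015-06-22 10:00:12", "2015-06-22 10:01:02", "2015-06-22 10:01:30"])})

    def per_minute_count(events, name):
        # Count how many events fall into each one-minute bin.
        return (events.set_index("timestamp")
                      .resample("1min").size().rename(name))

    # Combine the streams into one per-minute metrics table (one row per minute).
    metrics = pd.concat([per_minute_count(mouse_clicks, "mouse_clicks"),
                         per_minute_count(eye_fixations, "eye_fixations")],
                        axis=1).fillna(0)
    print(metrics)

Further per-minute columns (negative observations, HRV, stress level) could be joined onto the same minute index in the same way.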

ID | Metrics | Source | Name | Data Type
1 | Taxi time | Arrival Workflow | taxi time duration | time
2 | Take-off time | Departure Workflow | take-off time duration | time
3 | Number of Errors | Observation Errors (Experts Observations List) | negative errors count | count
4 | Number of Deviations or errors | Experts Observation + Supervisor + Observer Errors | number of total negative errors | count
5 | Standard Deviation of Eye Fixation Duration | Eye tracker | stdev eye fixation duration | stdev duration time
6 | Number of distinct changes in AOI | Eye tracker | number of distinct eye AOIs | count
7 | Eye Fixation Frequency (AOI Count) | Eye tracker | all eye AOIs count | count
8 | Eye Fixation duration time | Eye tracker | eye fixation duration | duration time
9 | Difference between Average Eye Count of AOI and current count of eye AOI (average variation in the AOI count) | Eye tracker | diff eye count | avg count
10 | Difference between Average Eye Fixation Time and current eye Fixation Time (average variation in the Fixation Time) | Eye tracker | diff eye duration | avg duration diff
11 | Difference between Average Count of mouse AOI and current count of mouse AOI (average variation in the AOI count) | Eye tracker | diff eye count | avg count
12 | How often has the ATC to interact with the Flight Plan, Manuals or Help system | FlightPlan | number of manual and help occurrences | count
13 | Heart Rate Z-Score | Heart Rate | heart z-score | score
14 | Heart rate RMSSD | Heart Rate | heart rmssd | measure
15 | Heart Rate interbeat interval | Heart Rate | heart ibi | measure
16 | Heart Rate Beats Per Minute | Heart Rate | heart bpm | measure
17 | Standard Deviation of Mouse Fixation Duration | Mouse | stdev mouse fixation duration | stdev duration time
18 | Number of Mouse clicks | Mouse | mouse left clicks | count
19 | Number of distinct changes in AOI | Mouse | number of distinct mouse AOIs | count
20 | Mouse Position Frequency (AOI Count) | Mouse | all mouse AOIs count | count
21 | Mouse Fixation duration time | Mouse | mouse fixation duration | duration time
22 | Number of Observed error Messages (observer list) | Observer | errors count | count
23 | Number of Callsigns | Selections | number of callsigns | count
24 | Subjective Stress Levels (every 10 minutes) | Stress | stress level | level
25 | Number of Words Used | Voice | number of words | count
26 | Number of communications | Voice | number of communications | count
27 | Number of Utterances (phrases) | Voice | number of utterances | count
28 | Reaction time to a System Warning | Workflow Task | reaction time to warning | duration time
29 | Number of features recalled by the users after the session | Workflow Task | features recalled count | count
30 | Time to complete a specific task (from workflow) | Workflow Task | time to complete task | duration time
31 | Total Time spent per task | Workflow Task | total time spent for each workflow task | duration time
32 | Time spent in recovering from errors | Workflow Task | time spent in error recovering | duration time
33 | Number of tasks the user has completed in a critical amount of time | Workflow Task | number of tasks during critical time | count
34 | Number of tasks the user could not complete in a critical amount of time | Workflow Task | number of tasks not completed during critical time | count
35 | Number of tasks performed vs tasks never performed | Workflow Task | ratio between tasks performed and tasks never performed | ratio
36 | Number of errors vs correct interactions | Workflow Task | ratio of correct vs incorrect interactions | ratio
37 | How many steps to complete task | Workflow Task Steps | steps to complete task | count
38 | Number of Departure Flights | Workflows | departures count | count
39 | Ratio between Time passed and number of actions performed by the user | Workflows | tasks pace of the user | ratio
40 | Number of Arrival Flights | Workflows | number of arrival flights | count
41 | How many loops | Workflows | number of loops per task | count
42 | How many different steps | Workflows | Nr. of different steps | count
43 | Number of switches between handling arrival and departure flights | Workflows | nr. of switches between handling arrival and departure flights | count

Table 5 - Resume of most important metrics

4.2.1 Sixth Sense Prototype Framework for Data Exploration

We tuned one of the Fraunhofer Austria desktop application prototypes towards the visualization and data exploration needs of the Sixth Sense project.

Figure 30 - Overview of the Sixth Sense Desktop Application Prototype

The main features of the Sixth Sense desktop application prototype framework are:
 Replay of all information topics (on-line data analysis and replay)
 Graph database storage (e.g., storage: body posture, work-flow step, AOI, action)
 Prediction engine training (e.g., experiment on giving recommendations about possible next handling steps for an air-plane)
 On-line complex event processing (CEP) and dynamic change of correlation filters
 Analysis and visualization of the air-planes arrival and departure work-flow tasks and times (showing any repetition loops)


 Real-time plot of eye-tracking and mouse current positions
 Visualization of interaction metrics, e.g., current user pace (current effort)
 Awareness dashboard with thresholds for cumulated departures or arrivals
 Web observation platform using Web-sockets and D3.js
 Real-time representation in a time line of the supervisors, observers and experts annotations (stress level report, negative and positive observations)
 Handling of voice recognition data from communications between pilots and air traffic controllers (similar to a “think aloud protocol”)
 Export of datasets for analysis in other tools

We also added to the desktop application prototype the capability of analysing and plotting in real time, both during the experiment and during data replay. We could therefore analyse in real time the current eye and mouse focus as well as the pace, type and categorization of the current user interactions. We are also able to register and monitor all decision activities in a graph database for posterior analysis. Figure 30 shows an overview of the Sixth Sense prototype application. To detect the demand on the users, we experimented with analysing the pace of the user (in terms of mouse and eye decisions) in order to automatically calculate the current pace of the user (see Figure 30). The test formula for this calculation uses the number of mouse movements, left clicks and eye-tracker fixations. In the future, with more experience in the contribution of each metric, we can use this interaction pace calculation together with the monitoring of the ATC workflow steps to detect moments of high workload and take the necessary preventive measures (display warnings or make recommendations).
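As an illustration of this idea, a minimal sketch of such a per-minute pace metric is given below. The event names and weights are hypothetical assumptions for illustration, not the exact formula used in the prototype:

from collections import Counter

def interaction_pace(events, w_move=1.0, w_click=2.0, w_fix=1.0):
    """Toy pace metric per one-minute bin.

    events: iterable of (minute, kind) tuples, kind in
            {"mouse_move", "mouse_left_click", "eye_fixation"}.
    The weights are illustrative assumptions, not the project's tuned values.
    """
    weights = {"mouse_move": w_move, "mouse_left_click": w_click, "eye_fixation": w_fix}
    pace = Counter()
    for minute, kind in events:
        pace[minute] += weights.get(kind, 0.0)
    return dict(sorted(pace.items()))

# Example usage with fabricated events
sample = [(0, "mouse_move"), (0, "eye_fixation"), (1, "mouse_left_click")]
print(interaction_pace(sample))  # {0: 2.0, 1: 2.0}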


Figure 31 - Screenshot of the Sixth Sense desktop application UI: Action Pace Calculator.
Figure 32 - Sixth Sense desktop application UI: Actions Types Monitor.

We also implemented the capability of automatically detecting the current top-most user actions in real time. We believe this can also be used to help infer the current load on the user (e.g., is the user moving the mouse too much? is the user moving many strips?). Figure 32 shows an example of this capability. Our software framework can also be used to extract fully interactive, HTML5-based behavioural reports that include explorative capabilities around different metrics. We can observe the relation in time between mouse clicks, voice call sign recognition and the number of user interaction events per minute. The supervisor can easily print the report in its current explorative state.
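A minimal sketch of how such a "top current actions" monitor could work is given below; the class, window size and action names are illustrative assumptions, not the prototype's actual implementation:

from collections import Counter, deque

class TopActionsMonitor:
    """Keeps the most frequent user actions within a sliding time window."""

    def __init__(self, window_seconds=60):
        self.window = window_seconds
        self.events = deque()  # (timestamp, action_type)

    def add(self, timestamp, action_type):
        self.events.append((timestamp, action_type))
        # Drop events that fell out of the time window
        while self.events and timestamp - self.events[0][0] > self.window:
            self.events.popleft()

    def top(self, n=3):
        return Counter(a for _, a in self.events).most_common(n)

# Example usage with fabricated events (timestamps in seconds)
mon = TopActionsMonitor(window_seconds=60)
for t, a in [(0, "mouse_move"), (5, "strip_move"), (6, "strip_move"), (70, "mouse_move")]:
    mon.add(t, a)
print(mon.top())  # most frequent actions observed in the last 60 seconds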


Figure 33 - Sixth Sense web-based reports for supervisors' data exploration, also printable.

The prototype is still a work in progress, but a number of its features could already be used for the exploration and visualization in Section 4.2.3 to answer some of the research questions or to support the initial findings.

4.2.2 Categorization of Metrics regarding mental Aspects

Literature suggests that a combination of different measures assessing the same mental aspect, e.g. workload, can lead to more robust results than considering each measure on its own ([2]; [3]). We therefore grouped the most important metrics into four categories that represent certain factors known to be related to operator performance: task load, mental workload, attention and behaviour (see Table 6). As these metrics may influence performance they can be regarded as independent variables, whereas the performance measures serve as dependent variables. The identification of relevant categories and the allocation of metrics to these categories was based on literature findings. Each category is described in more detail in the following sections.


Categories: task load | mental workload | attention | other metrics* | performance
nr. of arrival flights | fixation frequency / duration | nr. of changes in AOIs per time unit | nr. of mouse clicks | error messages
nr. of departure flights | heart rate measures | fixation duration on AOIs | nr. of callsigns / number of communications | nr. of tasks completed / not completed
nr. of task switches | subjective stress levels | standard deviation of fixation duration | nr. of words used | time of task completion
*) includes behavioural metrics not assigned to certain user states due to lack of literature findings
Table 6 - Classification of most important metrics into categories.

Task load

Studies have shown that task demand characteristics have an influence on workload and performance (e.g. [4]). The terms task load and workload are often used synonymously. However, as Rohmert [5] stated, individual characteristics (e.g. the operator's experience or ability) determine the degree to which task demands impact workload and performance. That is why the same task load does not necessarily result in the same level of workload for each individual (see Figure 34). This definition of task load and workload is also part of the ISO norm DIN EN 10075-1 (2000) [6].

Figure 34 - Distinction between task load and workload (Hilburn & Jorna, 2001)

According to the cognitive task load model of Neerincx [7], three dimensions of task load can be distinguished: time occupied, level of information processing and task-set switches. DeGreef & Arciszewski (2009) [8] describe that 'time occupied' can be reflected by the volume of information processing, which is likely to be proportional to the number of objects present. The level of information processing can be represented by the complexity of the situation, and task-set switching can be indicated by the number of different objects and tasks. Based on this classification, important metrics for task load in the experiment are the number of arrival and departure flights that have to be handled by the operator. Task-set switching can be extracted from the number of switches between handling arrival and departure flights.

Mental Workload

Research indicates that performance of the operator is likely to decrease if mental workload is either too high or too low (e.g., Hancock & Chignell, 1986 [9]; Veltman & Jansen, 2003 [10]). Thus, the relationship between workload and performance does not seem to be linear but rather resembles an inverted U-shape, as visualized in Figure 35.


Mental workload can be assessed empirically in several ways, including self-rating methods, physiological measures and behavioural measures. One method used in the experiment is the detection of workload by heart rate metrics, e.g. beats per minute, inter-beat interval, the RMSSD and the Z-Score. Literature suggests that heart rate is sensitive to different levels of workload (e.g. Roscoe, 1992 [11]; Veltman & Gaillard, 1998 [12]; Mulder et al., 2007 [13]). There are also studies in the domain of air traffic control indicating increases in heart rate with higher task demands (e.g. Costa, 1993; Rose & Fogg, 1993). However, heart rate can also be affected by other factors such as the emotional state (e.g. anxiety) or fatigue, which reduces its diagnosticity (Manzey, 1998 [14]). Therefore it seems reasonable to combine this measure with other indicators of workload. There are also metrics that can be extracted from the eye tracker data which can serve as indicators for workload. For example, studies of van Orden et al. (2001) [15] suggest that fixation duration and fixation frequency can be sensitive measures of (visual) workload. Besides the physiological assessment of workload, it was also assessed by a subjective rating every 10 minutes during the experiment.
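To make these heart rate metrics concrete, a minimal sketch of how RMSSD and a z-score could be derived from a series of inter-beat intervals (IBI, in milliseconds) is shown below. This is a generic textbook computation, not the exact pipeline of the heart-rate sensor used in the experiment, and the IBI values are fabricated:

import statistics

def rmssd(ibi_ms):
    """Root mean square of successive differences of inter-beat intervals."""
    diffs = [b - a for a, b in zip(ibi_ms, ibi_ms[1:])]
    return (sum(d * d for d in diffs) / len(diffs)) ** 0.5

def z_scores(values):
    """Standardize a series to zero mean and unit standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

ibi = [810, 795, 820, 780, 805, 790]   # fabricated IBI series in milliseconds
bpm = [60000 / v for v in ibi]         # beats per minute derived from the IBI
print(round(rmssd(ibi), 1))
print([round(z, 2) for z in z_scores(ibi)])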

Figure 35 - Relationship between workload and performance (Veltman & Jansen, 2003)

Attention

Performance can be decreased by a lack of attentional resources (Young & Stanton, 2002 [16]) and also by "inattentional blindness" (Mack & Rock, 1998 [17]). Inattentional blindness means that unexpected events are not noticed because attention is engaged in another task. Analysing what is fixated by the user has generally been considered a good way to determine attentional focus allocation. For example, Just & Carpenter (1980) [18] formulated the eye-mind hypothesis, assuming that what is being fixated is also what is being processed. This assumption has been criticized, as it is also possible to voluntarily divert attention elsewhere while fixating a specific area. Nonetheless, fixation analysis seems to be beneficial, as users usually direct their gaze where they can find the most useful pieces of information (Bellenkes, Wickens & Kramer, 1997 [19]). In our experiments, several Areas of Interest (AOI) were defined in order to analyse which part of the screen is fixated by the user at a specific time. Measures referring to the distribution of fixations on the AOI are, for example, the number of AOI fixated per time unit, the number of switches between AOI per time unit and the fixation duration on each AOI per time unit.
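A minimal sketch of how these AOI-based attention measures could be computed per one-minute bin from a fixation log is given below; the field names and sample values are illustrative assumptions, not the actual export format of the eye tracker:

import statistics
from collections import defaultdict

def aoi_metrics(fixations):
    """fixations: list of dicts with keys 'minute', 'aoi', 'duration_ms'."""
    per_minute = defaultdict(list)
    for f in fixations:
        per_minute[f["minute"]].append(f)

    metrics = {}
    for minute, fs in sorted(per_minute.items()):
        aois = [f["aoi"] for f in fs]
        durations = [f["duration_ms"] for f in fs]
        switches = sum(1 for a, b in zip(aois, aois[1:]) if a != b)
        metrics[minute] = {
            "fixation_count": len(fs),
            "distinct_aois": len(set(aois)),
            "aoi_switches": switches,
            "mean_fix_duration": statistics.mean(durations),
            "stdev_fix_duration": statistics.stdev(durations) if len(durations) > 1 else 0.0,
        }
    return metrics

# Example usage with a fabricated fixation log
log = [{"minute": 0, "aoi": "pendingdepartures", "duration_ms": 240},
       {"minute": 0, "aoi": "radarTR", "duration_ms": 180},
       {"minute": 0, "aoi": "pendingdepartures", "duration_ms": 300}]
print(aoi_metrics(log))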

Other Metrics

This category refers to metrics of behavioural responses or actions of the user that might be related to task load and user states. It includes metrics such as the number of mouse clicks, the number of communications or the number of words spoken. As their relationship to certain mental aspects has rarely been investigated in the literature, they are not assigned to one specific category. What the literature does suggest is that deviations from the normal behaviour of an individual can indicate situations of high workload or overload. Promising results could be found for scan patterns (Tole, 1983 [20]) and also for operating procedures in air traffic control tasks (Sperandio, 1978 [21]). Although these findings refer to rather complex behavioural patterns, it seems likely that conditions of

high task load are also linked to changes in more simple behavioural responses such as the number of mouse clicks or the number of communications. In order to investigate this assumption we also consider these metrics as potential user state indicators.

Performance

Performance can be assessed by measures of reaction time / time spent on task completion, accuracy or number of errors. Errors may be the most important measure of performance as they can be safety critical and cost intensive. In the experimental study, errors were detected by observations both during the experiment by the supervisor and the observer, and post hoc by a domain expert watching a scenario replay. We decided to merge all the observation lists into one unique observation list that combines the experts' video analysis (containing negative, positive and neutral observations) with the supervisor and observer reports taken during the experiments (marked with an extra X, e.g. NegativeX); a minimal merging sketch is given after Figure 37. This is the most plausible way to integrate the experts' knowledge with observational knowledge. All observations were classified (from -5 to 5) and rechecked by the ATC experts in the consortium. In Figure 36 we can see the types of events from the observation lists with the most impact on the overall performance of the users. The stress level report was done by asking the ATCO every 10 minutes and it was classified with an impact = 0. Please note that the observations list is not a direct result of errors solely made by the air traffic controller, but an inherent result of the ATCO's interactions with the systems. In some parts of the experiment, the pilots also forced the air traffic controller into higher levels of workload or stress, for example, by not complying immediately with instructions given by the air traffic controllers.

Figure 36 - Events from observation list with high impact on the performance of the users.


Figure 37 - Events from observation list with more impact on the performance of each user.
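As referenced above, a minimal sketch of how the different observation lists could be merged into one list with impact scores is given below. The record structure and field names are assumptions for illustration, not the exact format used in the experiment:

def merge_observations(expert_obs, live_obs):
    """Merge expert (video replay) and live (supervisor/observer) observations.

    Each observation is a dict with 'minute', 'type' in {'Negative', 'Positive',
    'Neutral'} and an 'impact' score in [-5, 5]. Live observations get an 'X'
    suffix on their type to mark their origin, as described in the text.
    """
    merged = list(expert_obs)
    for obs in live_obs:
        merged.append({**obs, "type": obs["type"] + "X"})
    return sorted(merged, key=lambda o: o["minute"])

# Example usage with fabricated observations
experts = [{"minute": 12, "type": "Negative", "impact": -4, "msg": "blocking situation"}]
live = [{"minute": 12, "type": "Negative", "impact": -3, "msg": "late clearance"}]
for o in merge_observations(experts, live):
    print(o["minute"], o["type"], o["impact"], o["msg"])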

4.2.3 Research Questions

With this background the consortium defined general research questions, such as:
 How to improve the user interfaces?
 How to detect the main causes that lead to mistakes (e.g., using air traffic info, eye tracker, mouse, heart rate data, body pose)?
 What are the hidden data signs that we can incorporate in an automated system to detect and predict the user's next actions, or to predict when a user is in a high workload situation or is about to make a mistake?
 What are the unknown factors that contribute to higher stress levels or to the lack of situational awareness?
 Can air traffic information be combined with sensor information to improve the detection and classification?

Based on these general research questions we created a table with more detailed and more concrete main research questions that allow us to guide our analysis, visualization and exploration of the data in search of good predictors of the users' behaviour. The answers to these questions will lead us towards the aim of the Sixth Sense project. The research questions are sorted according to the categories of mental aspects (see Section 4.2.2): task load, mental workload, attention, behaviour.


ID | Research Questions (RQ)

Relation between task load and workload/performance
1 | We believe the higher the task load, the higher the mental workload will be. Are the departures and arrivals per minute related to stress and changes in the heart rate variability?
2 | Does the number of taxi-in airplanes at a given time influence/increase the stress level?
3 | Does the occurrence of errors (negative observations) increase with higher task load?

Relation between workload and performance
4 | Does the occurrence of errors (negative observations) increase with higher workload?
5 | Can the heart rate variability be a good indicator for user mistakes?

Relation between attention and performance/behaviour
6 | Can an excessive demand be detected based on the number of (or time spent on) areas of interest?
7 | When there is an increase in the number of Eye AOI fixations, is there also an increase in the number of Mouse AOI fixations, because there is a relation between eye and mouse work?

Relation between behaviour and workload/performance
8 | When there is an error, does the user increase eye/mouse movements in order to scan the user interface? Are pauses in the mouse movement activity linked to high workload? Can we show this with our data?
9 | When the user is about to make an error, is there an increase in the mouse AOI fixation time?
10 | What are the eye and mouse scan path patterns of the users when they are about to make mistakes? Are these distinctive enough?
11 | Kinect Data – Is there the possibility of error detection due to the correlation between air traffic information and the body posture?
12 | Is the number of clicks, mouse movements or AOIs related to the occurrence of errors?
13 | Can we show how fast the utterances were spoken? Can this metric be utilized to detect periods with more negative observations?

Additional research questions
14 | Is there a possibility of error detection due to mismatches in the correlation between eye-tracking and voice (call signs) information? Is there a relation between occurrence of errors and an increase in the number of words used in the communications?
15 | Can we report what the users' most preferred eye and mouse scanning sequences are?

Table 7 - List of Main Research Questions


4.2.4 Data Exploration and Analysis

In this section we answer each research question (RQ) from the "Research Questions" resume table (see Table 7) by exploration and analysis of the data.

RQ 1: We believe the higher the task load, the higher the mental workload will be. Are the departures, and arrivals per minute related to stress and changes in the heart rate variability?

 To answer this question we used the resumed metrics for task load (number of departures and arrivals) and for mental workload (fixation frequencies, heart rate measures and reported stress levels) and correlated these data with negative error observations (a simple per-minute correlation sketch is given after Figure 38).
 Regarding the number of departures and arrivals vs observations: it appears that the users have more negative observations in the middle of the experiment, which may be related to the accumulated time performing the experiment. The data shows that the users sometimes have negative observations after a period of more intense handling of departures or arrivals.
 Many times, only when the arrivals/departures intensity decreases do the users start to have a higher count of negative observations. We do not yet know why, but it could be associated with the fact that the user had a high peak of workload and then starts to make more mistakes. These peaks of higher workload (more arrivals and departures to be handled) might be used as a good predictor for when the number of errors will increase.

Figure 38 - Departures and arrivals (green area) vs number of negative observations (red), per one-minute bin, one panel per user (user5, user6, user7, user8).
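A minimal sketch of the kind of per-minute comparison underlying Figure 38 is given below: counting handled flights and negative observations per one-minute bin and correlating the two series with a simple Pearson coefficient. The project's actual exploration was done visually; the timestamps below are fabricated:

import statistics
from collections import Counter

def per_minute_counts(timestamps_s):
    """Count events per one-minute bin from timestamps given in seconds."""
    return Counter(int(t // 60) for t in timestamps_s)

def pearson(xs, ys):
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Fabricated timestamps (seconds) for handled flights and negative observations
flights = per_minute_counts([30, 70, 80, 130, 140, 150, 200])
negatives = per_minute_counts([160, 210, 220])
minutes = range(0, 5)
flight_series = [flights.get(m, 0) for m in minutes]
negative_series = [negatives.get(m, 0) for m in minutes]
print(round(pearson(flight_series, negative_series), 2))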

RQ 2: Does the number of taxi-in airplanes at a given time influence/increase the stress level?

 We found indications for a relation between the number of taxi-in airplanes (arrivals in the strip bays in the user interface of our ATM tower simulator) and the report of higher stress levels by the users during the exercise. The same was observed in the data regarding the number of airplanes waiting for departure in the strip bays.


Figure 39 - Interdependence between arriving airplanes, departures and stress levels (reported stress level vs airplanes in taxi-in and airplanes pending departure over time, one panel per user).

 The stress levels (red) in Figure 39, reported every 10 minutes by the user, are consistently higher when there is an increase in arriving or departing airplanes. This interdependence may also be caused by other factors such as time accumulation, fatigue or the number of visual objects to be handled. For a more reliable statement we would need a different type of experiment, but it is an interesting and promising indicator.

RQ 3: Does the occurrence of errors (negative observations) increase with higher task load?
 The same conclusions as for RQ 2 are valid here.
 In our experiment arrivals came at a fairly constant rate of roughly one airplane per minute, so we focused instead on the total number of airplanes to be managed per minute (arrival and departure).
 We observed that the number of airplanes for departure influenced the attention of the users. The number of strips in each bay (and therefore the number of visual UI objects to be managed at a given time) certainly plays an important role in splitting the users' attention.
 Periods when the mouse stops also seem to coincide with negative observations. Mouse pause times could therefore be used as an additional indicator.

RQ 4: Does the occurrence of errors (negative observations) increase with higher workload?
 In the heart rate variability data (for user4, user6 and user8) we could observe that, if we cross-check the Z-Score IBI heart rate variability with the negative observations in the observations list, every time before an increase of severe negative observations there is a steep descent (lower heart rate variability) in the Z-Score IBI values.
 This is in line with the literature, which states that heart rate variability can be a good indicator of high stress.
 We think that the study of the angle of the line plot (steep descent or steep climb) could be used as a good predictor for moments of high stress and for the detection of intervals where negative observations are more prone to occur. When combined with the monitoring of the negative Z-Score IBI value, this can be used to help detect negative situations.


Figure 40 - Correlation between negative observations and HRV (Z-Score IBI per minute for user6, with reference lines at steep changes). HRV is a good indicator for periods of negative observations.

 In the figure above, the dotted vertical reference lines mark where the heart rate variability decreases (negative changes): usually this event is followed by an increase in negative observations.
 We could not confirm from the data that the occurrence of positive observations (annotated by the experts as efficient efforts to solve a negative situation) has a significant influence in lowering the occurrence of further negative observations. By this we mean that these positive events occur in parallel with other negative observations; although there is a distinct increase in heart rate variability, these positive peaks are normally also associated with negative events.
 It would be worth analysing how quickly this change occurs (the slope of the Z-Score IBI values) and whether it is still associated with the occurrence of negative observations when the user appears to be relaxed and not in a high-stress situation. A minimal sketch for flagging such steep descents is given below.
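The sketch below flags minutes where the per-minute Z-Score IBI series drops steeply from one minute to the next. The slope threshold is a hypothetical value chosen for illustration, not one derived from the experiment:

def steep_descents(zscore_per_minute, threshold=-0.3):
    """Return the minutes where the Z-Score IBI drops faster than `threshold`
    from one minute to the next (a candidate early-warning signal)."""
    flagged = []
    for minute in range(1, len(zscore_per_minute)):
        slope = zscore_per_minute[minute] - zscore_per_minute[minute - 1]
        if slope < threshold:
            flagged.append(minute)
    return flagged

# Example usage with a fabricated per-minute Z-Score IBI series
zscore_ibi = [0.2, 0.1, 0.15, -0.4, -0.5, 0.0, 0.6, 0.1]
print(steep_descents(zscore_ibi))  # minutes with a steep negative change: [3, 7]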


Figure 41 - Relation between mouse AOI frequencies, the observation list and HRV (per-minute mouse AOI counts and Z-Score IBI for user4, user6 and user8).

 We could use both negative and positive changes in the heart rate variability to create a model that detects periods of high stress associated with the occurrence of more negative observations.
 We also analysed the mental workload in terms of fixation frequency and fixation duration, correlated with the occurrence of negative observations. We analysed this correlation for both eye and mouse frequency (in terms of AOI visited in a certain period of time, in our case one-minute bins).
 We can observe when the user is "fighting" to solve a problem, and that before an error there is a moment of reduced activity in the mouse movements (Figure 41).

RQ 5: Can the heart rate variability be a good indicator for user mistakes?

 From the analysis and data exploration done so far, we believe that the heart rate, together with the reduction in mouse activity, the number of visual UI objects to be managed (e.g. flight strips) and the eye tracking AOI frequency and duration, provides very good clues for anticipating moments of stress, high workload and the occurrence of negative observations.
 However, the mouse data seem to be more distinctive than the eye tracking data for detecting periods of negative observations. The eye tracker data, on the other hand, gave us clues on the probabilities of the users' next AOI and sequence of actions.


RQ 6: Can an excessive demand be detected based on the number of (or time spent on) areas of interest?
 To answer this question we analysed, for each user, which areas of interest received the most attention. In Figure 42 and Figure 43 we can observe two examples (one for mouse AOI and another for eye AOI) that show the AOI that received the most interest from user7.

Figure 42 - Mouse AOI of user7 that received most interest time during the experiment (top AOIs: pendingdepartures, taxiin, radarTR, taxiout, radarBR, leftpanel, onblock).

Figure 43 - Eye AOI of user7 that received most interest time during the experiment.


 We created a resume table to show the AOI preferences of each user throughout the experiments (see Table 8). As we can observe, not all users handle the air traffic the same way (workflow steps, communications, and preferences in order of execution and dispatch) or behave in the same way.

AOI User2 User4 User5 User6 User7 User8

Mouse Handoverrunway x x

Startuppushback#taxiout x

taxiin#onblock x pendindepartures x x x taxiin x radarTR x x radarBR#radarBL x radarTL x taxiin x x taxiin#taxiout x taxiout x Startuppushback x

Eye pendingdepartures x x toppanel x

Startuppushback#taxiout x x

startuppushback x taxiin x radarTL#radarBL radarTL#radarTR x

pendindepartures#startuppushback x

HandoverRunway x taxiout x radarBL x Table 8 - Resume of AOI that received most interest time from each user.

 We could observe some AOI that received more attention from the users in general (e.g., the "pendingdepartures" bay strip for three users).
 We come back to RQ 6 with the methods of Section 4.4.


RQ 7: When there is an increase in the number of Eye AOI fixations, is there also an increase in the number of Mouse AOI fixations, because there is a relation between eye and mouse work?

 As we can see from the previous figure, most of the time it seems to be the other way around: if the eye AOI count increases, the mouse AOI count decreases. This fits well with the literature, which states that a decrease in the number of mouse movements is linked to higher workload.

Figure 44 - Window standard deviation (2 minutes) of eye and mouse AOI counts and number of errors (user8), capturing very well periods with increased user errors.

 However, it also seems evident that there are periods when the mouse activity follows the patterns of the eyes. This seems to be linked to the moments when the user makes more errors and tries to solve the problem. This is especially visible if we take into consideration the window standard deviation over the last 2 minutes, as we can observe in Figure 44 (a minimal computation sketch is given below).
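The sketch below computes the 2-minute window standard deviation used in Figure 44 over per-minute AOI counts. The window length matches the description in the text; the series values and names are fabricated for illustration:

import statistics

def rolling_std(series, window=2):
    """Standard deviation over the last `window` values of a per-minute series.

    Returns None for minutes where the window is not yet filled.
    """
    out = []
    for i in range(len(series)):
        if i + 1 < window:
            out.append(None)
        else:
            out.append(statistics.pstdev(series[i + 1 - window:i + 1]))
    return out

eye_aoi_counts = [12, 14, 40, 38, 10, 11]    # fabricated per-minute eye AOI counts
mouse_aoi_counts = [20, 18, 5, 4, 22, 21]    # fabricated per-minute mouse AOI counts
print(rolling_std(eye_aoi_counts))
print(rolling_std(mouse_aoi_counts))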

RQ 8: When there is an error, does the user increase eye/mouse movements in order to scan the user interface? Are pauses in the mouse movement activity linked to high workload? Can we show this with our data?

 We investigated the relation between the occurrence of errors and the increase of eye or mouse events. According to the literature, pauses in the mouse movement are known to be linked with high workload periods when working with user interfaces. The data indicates that there is a possible link between reductions in mouse movement and increases in the eye movements that are coincident with the occurrence of negative errors (errors indicated by the experts). If this is true, this could help us in the future to create an algorithm that is able to detect or predict error periods. In Figure 45 we visualized the different fixation frequency changes (represented by lines) and what is happening in the observation data (red bar plots).


Figure 45 - Relation between eye and mouse movements (AOI visits) and occurrence of errors, per user and per minute (annotated periods of increased/decreased eye and mouse movements).

RQ 9: When the user is about to make an error, is there an increase in the mouse AOI fixation time?

 The same conclusions as for RQ 8 apply here when we analyse the fixation time instead. We also created visualizations where we filtered the data to include only mouse move events, so that we could see whether or not the user was moving the mouse when negative errors occurred.

The conclusion is that the users were never moving the mouse at these negative moments; they really stopped moving the mouse, probably to analyse the current situation.

RQ 10: What are the eye and mouse scan path patterns of the users when they are about to make mistakes? Are these distinctive enough?
 To analyse the eye and mouse scan path patterns of the users we parameterized a stochastic state model (VLMM) as described in Section 4.4.
 By selecting a timeframe around an observed error, specific states could be identified. By analysing when these states occur elsewhere, we did not find that these states occur significantly more often near errors.
 We think the scanning paths need to be analysed in more detail by ATC experts to find additional measures that, combined with the state sequence, give an indication of errors.

RQ 11: Kinect Data – Is there the possibility of error detection due to the correlation between air traffic information and the body posture?

 The head pose provides information about the angle of a user's head. With these two values one can calculate how far or near a person's head is during the experiment. Furthermore, with the information of the head pose we know the tilt of the head at a specific point in time in the experiment.

Figure 46 - Kinect Head Pose Measurements Schema.

 “Head Coordinate State” in the Kinect data ‐ Head Coordinate State indicates the position of the persons head. (1 is max down and 9 is max up, 0 is not specified).

 Head Rotation State Left Right (0-9) ‐ The left-right value gives an indication of how much a person has turned his head left or right (1 is max left and 9 is max right, 0 is not specified).

 Head Rotation State Up Down (0-9) ‐ The up-down value gives an indication of how much a person has turned his head up or down (1 is max down and 9 is max up, 0 is not specified).

 Sound source angle and the Microphone beam angle ‐ Sound source: Gets the sound source angle (in degrees), which is the direction from where the sound is arriving (direction of a sound source). ‐ The beam angle: Gets the beam angle (in degrees), which is the direction the sensor is set for listening.

 What can we infer by using the Head Coordinate (x, y, and z)? Or Head Pose? ‐ The Head coordinate x, y and z give an overview of the person’s 3D position in a room (values are in meters).

Z > 0: distance of the user; Z = 0 (minimum): nearest point at the Kinect
X > 0: right; X < 0: left; X = 0: centre position
Y > 0: right; Y < 0: left; Y = 0: centre position
Table 9 - Resume of parameters for the Kinect Head Pose

 The calibration of the Kinect sensor for the detection of head and body postures was difficult. In our two experiments we could only collect Kinect data for user1, user3 and user5. We would therefore need more tests and more data to derive definitive conclusions.
 We collected data about Head Coordinate State, Head Coordinates, Head Pose Coordinates, Head Rotation State (left, right, up, down), Microphone Beam Angle, Sound Source Angle, User in Range and User Tracked.


Figure 47 - Kinect Data Representation, visualizing detected Head Pose vs count of negative/positive observations vs type of observation vs User in Range (or not in range), for user1, user3 and user5.

 However, even using only the currently available data, we found the Head Coordinate State and the User in Range variables very promising for implementing a future error-predictive system (see Figure 47).
 By considering only the variable Head Coordinate State = 0, 2, 4, 5, 6, 7 or 9 (we removed states 1, 3 and 8) and the variable Sound Source Angle (between -26.6 and 36), we could account for (include in the same time interval) at least 96% of all negative observations reported by the experts, as can be observed in Figure 48. What we mean by this is that we can envision using our CEP filtering mechanisms to greatly reduce the amount of data that needs to be processed (a minimal filtering sketch is given after Figure 48).
 This can be achieved by filtering out everything that might not be relevant, or at least filtering out data that can be processed at a later stage using more complex and slower methods, allowing the most interesting data to be immediately analysed using simpler and faster methods.


Figure 48 - Kinect Data after applying filters to include only the majority of negative observations (96%).
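The sketch below expresses the filtering rule described above (Head Coordinate State in {0, 2, 4, 5, 6, 7, 9} and Sound Source Angle between -26.6 and 36 degrees). In the prototype this would be realized as a CEP query; here it is shown as plain Python purely for illustration, with fabricated samples:

KEPT_HEAD_STATES = {0, 2, 4, 5, 6, 7, 9}   # states 1, 3 and 8 are filtered out
ANGLE_RANGE = (-26.6, 36.0)                # sound source angle in degrees

def keep_sample(head_coordinate_state, sound_source_angle):
    """Return True if a Kinect sample passes the coarse pre-filter."""
    return (head_coordinate_state in KEPT_HEAD_STATES
            and ANGLE_RANGE[0] <= sound_source_angle <= ANGLE_RANGE[1])

# Example usage with fabricated (state, angle) samples
samples = [(6, 10.0), (3, 5.0), (7, -30.0), (0, 20.05)]
print([keep_sample(s, a) for s, a in samples])  # [True, False, False, True]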

RQ 12: Is the number of clicks, mouse movements or AOIs related to the occurrence of errors?

 By plotting (Figure 49) the total number of mouse clicks (left clicks, mouse left pressed for drag and drop, and mouse right clicks) together with the total number of negative observations per minute, we could observe an evident relationship between the number of clicks and the number of errors per minute.
 It also seems apparent that a small increase in the number of clicks was followed by a large increase in negative observations, while a large increase in mouse click activity was followed by a substantial decrease in the number of negative observations (perhaps meaning that the users were trying to solve difficult situations).


Figure 49 - Correlation between total number of mouse clicks and negative observations (per-minute differences, all users).

 There was only one case where this relation appears to be delayed and does not occur within the same minute (see Figure 49, at the point marked with the word "increase"). Here, an increase in the number of mouse clicks was followed in the next minute by an increase in the number of negative observations. This might be due to normal delays while the users analyse the current situation, are occupied with other tasks or are distracted. Figure 50 is a general plot over the values of all users, but the same results hold if we plot every single user separately. A minimal sketch for checking such a one-minute lag is given below.
 We only focused on the overall patterns that might be used in the future to create a prediction system; in this case we therefore focused on the direct relation between mouse click activity and the occurrence of negative observations.
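The sketch below compares the per-minute click series against the negative-observation series shifted by a lag, using a simple lagged Pearson correlation. It is purely illustrative and uses fabricated series:

import statistics

def pearson(xs, ys):
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = (sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys)) ** 0.5
    return cov / den

def lagged_correlation(clicks, negatives, lag=1):
    """Correlate clicks at minute t with negative observations at minute t+lag."""
    return pearson(clicks[:-lag], negatives[lag:]) if lag else pearson(clicks, negatives)

clicks_per_min = [5, 7, 20, 6, 5, 4]        # fabricated per-minute mouse clicks
negatives_per_min = [0, 0, 1, 3, 0, 0]      # fabricated per-minute negative observations
print(round(lagged_correlation(clicks_per_min, negatives_per_min, lag=0), 2))
print(round(lagged_correlation(clicks_per_min, negatives_per_min, lag=1), 2))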

RQ 13: Can we show how fast the utterances were spoken? Can this be used to detect periods with more negative observations?

 By plotting the relationship between the number of spoken words per minute and the number of negative occurrences per minute, we could observe that the users spoke on average between 6 and 8 words per minute.
 As we can observe in Figure 50, a large increase in the number of words used by the ATCO seems to point to periods with a high concentration of negative observations, but this does not always happen.


RQ 14: Is there a possibility of error detection due to mismatches in the correlation between eye-tracking and voice (call signs) information? Is there a relation between occurrence of errors and an increase in the number of words used in the communications?

 It was not possible to check for mismatches between the eye-tracking (of callsigns) and the spoken callsigns. Neither could the simulator user interface (in the radar area) provide us with information about the callsign the user was looking at for a specific airplane, nor could the eye-tracker sensor track each small point (airplane) in the radar accurately without additional and improved selection strategies. For this reason it was not possible to cross-check whether the callsigns viewed by the user matched the callsigns spoken by the user (although the voice recognition system could recognize the spoken callsigns with a very high degree of accuracy).
 In the future we could improve the simulator to provide feedback information when a user is looking at an airplane in the radar area. We could achieve this by creating a selection circle (with a certain threshold area, instead of just a small eye cursor). This selection circle would capture any airplane inside the area of the circle, and the simulator would then provide information about the hovered airplanes (similar to normal mouse hovering or selection).
 However, we could observe a relation between an increase in the number of words used by the air traffic controllers and the occurrence of negative observations.
 This seems to always follow the same pattern: sometimes there is a clear decrease in the number of words used, followed by a significant increase in the number of words spoken by the air traffic controllers.
 Especially when looking at the negative observation descriptions, this seems to be correlated with the worst situations annotated by the experts (such as putting the same airplane on hold several times, resolving the crossing of runways, or having too many airplanes to be resolved in the taxi or departure strip bays).

Figure 50 - Correlation of negative observations and the difference in number of words/mouse actions (user6 and user8; annotated periods of increased/decreased mouse activity and spoken words).


RQ 15: Can we report what the users' most preferred eye and mouse scanning sequences are?

 We analysed the most probable eye and mouse scanning sequences per user (see Section 4.4.1.1).
 It seems that, for example, user6 shows more diversity in the eye scanning patterns used, but this result needs to be analysed in more detail.

Figure 51 - Relationship between number of words spoken and negative observations (per minute, for user4, user5, user6 and user8).


4.3 Hints for H2 - Analysis of the Arrival and Departure Workflows

In contrast to H1, which refers to decisions a controller makes in a single step of the workflow, in H2 we refer to workflow patterns as sub-sequences (steps) of actions the controller performs. These workflow patterns might vary between good and bad controllers. In order to detect good and bad workflows - and therefore to find hints for H2 - we took into consideration: on-block situations, repetition of certain workflow steps (number of cross-runway or on-hold situations) and the total processing time of each airplane.

4.3.1 Implementation

Not all the necessary workflow steps for this part of the analysis are delivered automatically by the simulation framework and the different data streams from the sensors or the ATM system. Therefore, we used our own automatic workflow detection and analysis component. Time-based, data-stream-oriented applications are used across several fields. One strategy for correlating and extracting information from these data streams is to employ Complex Event Processing (CEP) systems. CEP combines several events to generate a composite or derived event. These events contain new, meaningful information to study the underlying process. Furthermore, CEP allows a loose coupling between software components [22].

Figure 52 - Example of using CEP to join two different events into one.

In our software prototype, the detection of the workflow steps for every airplane is achieved by using a CEP server implementation called NEsper. NEsper allows (through a tailored Event Processing Language - EPL) the registration of queries into the NEsper server [23]. After the incoming events are separated from the message queue component (by replaying the experiment data automatically), the CEP server consumes the events and triggers, depending on the registered queries, new and more meaningful events. In our example of automatic workflow detection, the events from the message queue called FlightObject contain ATC information about departures and arrivals. These events are accepted and processed by the CEP server. The information is stored by the ATM systems in the form of XML messages. Beforehand, we register a specific query to automatically process and detect each workflow step. The CEP components detect the FlightObject messages and filter relevant information such as FOID, Callsign or AtcType to realize the workflow detection. After separating each workflow step in the air traffic message, an event is triggered, coded with the callsign of the airplane. The triggered event can be immediately consumed by our software prototype and we can visualize the current workflow step of each airplane and even show any step repetition (e.g., the same airplane put on hold more than once). An example of detecting the workflow step TAXI with an EPL query is given in Table 10. The first expression filters the relevant information, such as the Callsign. The second expression triggers an event if an airplane is in the TAXI workflow step.


// This sensor counts the number of aircraft in the workflow step TAXI
expr = "Insert into Atc \n" +
    "SELECT Identifier, \n" +
    "fligthObjectPublication.FO.id AS AtcFOID, \n" +
    "fligthObjectPublication.FO.flightPlan.flight_plan.aircraft_identification.identifier AS Callsign, \n" +
    "fligthObjectPublication.FO.departureInfo.runwayId AS AtcRunwayId, \n" +
    "fligthObjectPublication.FO.atcState.role AS AtcRole, \n" +
    "fligthObjectPublication.FO.atcState.type AS AtcType \n" +
    "FROM FlightObjectPublicationLocationSensor";
createStatement("AtcChange", expr);

expr = "create context sepWorkflowTaxi partition by AtcFOID FROM Atc";
createStatement("CEPVariable", expr);

expr = "context sepWorkflowTaxi select Identifier," +
    "AtcFOID, AtcType, count(AtcType) as countStep, AtcRole FROM Atc \n" +
    "WHERE AtcType = 'TAXI_WITH_TAXI'";
createStatement("WorkflowstepTaxi", expr);

Table 10 – Filter/Query to detect airplanes that are in the workflow step TAXI.

In Figure 53, the entire process of filtering and consuming CEP events that extract and correlate information about each workflow step is represented. Furthermore, the process of data visualization and generation of new data sets is also shown.

Figure 53 - The complete process of consuming, filtering and generating events.

4.3.2 Results of the Analysis of ATC Workflow Steps

The DM/ML/AI module is now able to distinguish the different workflow steps (please see Figure 5) performed by the ATCOs for each airplane landing or departure, as well as the number of times that a certain airplane is in one of these workflow steps. Additionally, we can now automatically analyse the processing time (the managing time spent by the air traffic controller) of each airplane. Next, we show examples of metrics that the DM/ML/AI module is able to extract. We start (Figure 54) by showing relations between the processing time (ProcessSeconds) spent handling arrival or departure flights and the occurrence of negative (high-impact) observations (classified according to the experts).


Figure 54 - Analysis of the processing time (seconds) for arrivals (orange/brown) and departures (blue) for user8. The annotated message "CSA543 is blocked by DLH3TP" marks the blocking situation referred to in the text.

As stated before, the DM/ML/AI module is also able to automatically filter, correlate and use metrics related to each step of the arrival and departure workflows (achieved through the use of CEP filters). These metrics are taken from the standard ATCO workflow descriptions (Figure 5). They allow us to discover the best and worst sequences in terms of processing time or repeated workflow steps, and they also tell us the exact current state of each airplane. We can use this information to predict next states and to prepare recommendations regarding the next best actions in advance. Next, we present screen captures, taken from user8, regarding the performance in handling arrival and departure flights. We can observe the total processing time for handling all tasks related to a specific aircraft and also the number of repetitions of each step of the corresponding workflow (arrival or departure). We used this information to correlate the stress level reports, the expert and supervisor observations and the sensor data. The data shows a clear relation between the different options taken by the ATCO, such as the number of times a flight is put on hold or the number of times an airplane has to cross runways, and the total processing time of a flight or the occurrence of negative observations. In general, the longer a flight takes to process, the higher the negative impact will be (taken from the negative observations). However, in some situations, such as a blocking situation, there is no obvious correlation with processing times or other workflow step metrics (see the blocking situation in Figure 54).


Figure 55 - DM/ML/AI module with automatically calculated metrics for arrival flights, capturing repeated workflow steps (e.g., number of taxi commands or cross-runway situations for all flights).

The analysis of correlations between workflow step values and all the other available metrics (performance, sensor data, workload, etc.) is a very challenging research question. In the future we would like to make use of graph database capabilities to capture and semantically annotate all the users' decisions (sensor data, steps, clicks) in order to follow all the user decisions step by step and to perform more exhaustive data analysis and correlation.

4.3.3 Machine Learning Experiments

As an extra step, we also tried to develop experimental models for the detection of outliers, the discovery of patterns and the creation of prediction models related to the detection of negative observations. We started by creating a special dataset that contained several metric counters computed in the intervals between errors. With this new dataset at hand we applied algorithms that return a set of association rules from a given set of frequent item sets. For this we focused on frequent item set mining, specifically the FP-Growth algorithm, which calculates all frequent item sets from the given example set using an FP-tree data structure (all attributes were converted to binomial). Frequent item sets are groups of items that often appear together in the data; the basics of market-basket analysis are helpful for understanding them. Association rules are if/then statements that help uncover relationships between seemingly unrelated data. An example of an association rule would be "If a customer buys eggs, he is 80% likely to also purchase milk." An association rule has two parts, an antecedent (if) and a consequent (then). An antecedent is an item (or item set) found in the data. A consequent is an item (or item set) that is found in combination with the antecedent. Association rules are created by analysing the data for frequent if/then patterns and using the criteria support and confidence to identify the most important relationships. Support is an indication of how frequently the items appear in the database. Confidence indicates how often the if/then statement has been found to be true. The frequent if/then patterns are mined using operators like the FP-Growth operator. A create-association-rules operator then takes these frequent item sets and generates the association rules. The algorithm tries to find at least the specified number of item sets with the highest support, taking the 'min support' into account, in our case 0.8.
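For illustration, the support/confidence computation behind such association rules can be sketched in a few lines of Python. This is a minimal brute-force sketch, not the FP-tree optimization itself, and the item names are hypothetical stand-ins for the binomial metric indicators in our dataset.

from itertools import combinations, permutations

# Each row stands for one interval between negative observations; the items are
# hypothetical binomial indicators such as "metric above its median".
rows = [
    {"many_eye_events", "many_departures", "long_mouse_pauses"},
    {"many_eye_events", "long_mouse_pauses"},
    {"many_eye_events", "many_departures", "long_mouse_pauses"},
    {"many_eye_events", "long_mouse_pauses"},
    {"many_eye_events", "many_departures"},
]

def support(itemset):
    # Fraction of rows that contain every item of the item set.
    return sum(itemset <= row for row in rows) / len(rows)

MIN_SUPPORT = 0.8  # the minimum support used in our experiments
items = sorted(set().union(*rows))
frequent = [frozenset(c) for n in (1, 2)
            for c in combinations(items, n) if support(set(c)) >= MIN_SUPPORT]
print("frequent item sets:", [set(f) for f in frequent])

# Confidence of an if/then rule: support(antecedent and consequent) / support(antecedent).
for itemset in frequent:
    if len(itemset) == 2:
        for antecedent, consequent in permutations(sorted(itemset)):
            confidence = support({antecedent, consequent}) / support({antecedent})
            print(f"if {antecedent} then {consequent}: confidence {confidence:.2f}")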


Figure 56 - Discovery of Association Rules using the algorithm fp-growth.

We can see in Figure 56 a summary of the metrics used in our model (they represent occurrence counts between successive negative observations) for one user (user8). In our case these association rules could help us understand the relation between the different metrics and the occurrence of errors. Next, in Figure 57, we can see an example of a graph visualization of the relation between the different metrics and the discovered association rules.

Figure 57 - Relation Between the discovered association rules and different variables of the model.


We also experimented with the k-NN Global Anomaly Score algorithm. This algorithm calculates the outlier score based on a k-nearest-neighbours implementation. The outlier score is by default the average of the distances to the nearest neighbours; by setting the corresponding parameter it can also be set to the distance to the kth nearest neighbour, which is similar to the algorithm proposed by Zengyou He et al. (2003) [24]. The higher the outlier score, the more anomalous the instance. The operator is also able to read and write a model containing the set of k nearest neighbours. Typically, 99% of the execution time is spent computing the neighbours, so it is a good idea to store the model, for example when looping over a parameter. The operator checks whether the model and the example set fit together. The model can be used for any of the nearest-neighbour based algorithms. The parameter k used to create the model needs to be the same as or larger than the parameter k specified in the operator; otherwise, the model is re-computed.
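The score computation itself can be sketched as follows, assuming a numeric matrix of metric counters per interval; the data is synthetic and the feature layout hypothetical, and the score of each point is the average distance to its k nearest neighbours.

import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))            # 200 intervals, 3 metric counters (stand-in data)
X = np.vstack([X, [[6.0, 6.0, 6.0]]])    # one deliberately anomalous interval

k = 10
# n_neighbors = k + 1 because the nearest neighbour of every point is the point itself.
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
distances, _ = nn.kneighbors(X)
scores = distances[:, 1:].mean(axis=1)   # average distance to the k nearest neighbours

print("indices with the highest anomaly scores:", np.argsort(scores)[-3:])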

Figure 58 - Outlier discovery for negative observations in the new dataset with metric counters (captured between successive negative observations).

Next, in Figure 59, we present decision trees to analyse the correlations between the different metrics and the occurrence of errors. We used the dataset that contains the metrics between successive intervals of negative observations. For example, for user7 the increase in negative occurrences appears to be linked to the number of eye movements, or to a combination of this with the number of departures and the number of arrivals.
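As an illustration of this type of analysis, the following sketch fits a small decision tree on synthetic stand-in data with hypothetical variable names (eye movements, departures, arrivals per interval) and prints its rules; it is not the tree shown in Figure 59.

import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(1)
n = 300
eye_movements = rng.poisson(40, n)
departures = rng.poisson(3, n)
arrivals = rng.poisson(3, n)
# Hypothetical relation: many eye movements and many departures lead to more negative observations.
negatives = (eye_movements > 50).astype(int) + (departures > 4).astype(int)

X = np.column_stack([eye_movements, departures, arrivals])
tree = DecisionTreeRegressor(max_depth=3).fit(X, negatives)
print(export_text(tree, feature_names=["eye_movements", "departures", "arrivals"]))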


Figure 59 - Decision tree to depict reasons for increasing numbers of negative occurrences for different users.

The regression test uses the variables that explain our data the most. Using the software GMDH (extended trial version) we were able to create a draft of a prediction model. The dataset is separated into a training part with 80% of the data and a test part with the remaining 20%, using a bootstrap. We used a polynomial approximation method to build the predictive model. These algorithms detect the structure of the data and create polynomial equations with weights for each of the variables, which are then used in the definition of the prediction model.
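The GMDH tool itself was used as a black box, but the basic idea of the draft prediction model can be sketched as follows, assuming synthetic stand-in metrics and a split into a known first 80% and an unknown last 20% of the time series.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(2)
t = np.arange(400)
eye_events = 40 + 5 * np.sin(t / 30) + rng.normal(0, 1, t.size)
departures = 3 + np.cos(t / 50) + rng.normal(0, 0.3, t.size)
# Hypothetical target: negative occurrences driven by the two top metrics plus noise.
negatives = 0.05 * eye_events + 0.5 * departures + rng.normal(0, 0.2, t.size)

X = np.column_stack([eye_events, departures])
split = int(0.8 * len(t))                                  # first 80% known, last 20% unseen
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(X[:split], negatives[:split])

predictions = model.predict(X[split:])
print("mean absolute error on the unseen 20%:", np.abs(predictions - negatives[split:]).mean())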

In Figure 60 we can observe the model being applied first to the 80% known part of the dataset (first part of the graph, in blue) and then to the unknown 20% part of the dataset (in red). The grey line that starts at the beginning of the chart and then morphs into a greyed-out red area over the red part of the graph shows how well the model was predicting and adapting to the data (known and unknown to the algorithms). The match was very accurate for this example, which was a first indication that these variables make sense for future developments regarding the creation of a prediction framework.


Figure 60 - Polynomial regression analysis for creating a model to predict negative occurrences based on top most metrics (number of eye events or departure flights).

4.4 Event Trace Analysis

We analysed the traces of discrete events, such as eye and mouse fixations, by parameterising a Variable Length Markov Model (VLMM). VLMMs provide an efficient method to learn a model of a discrete event system. In contrast to other models, such as Hidden Markov Models, which are able to describe more general processes, no previous knowledge about the process is needed. The states of a VLMM can easily be interpreted, since each state is labelled by a corresponding subsequence within the data. A state chart can be calculated from the VLMM and the most probable state sequences can be determined. The occurrence of a state can be associated with a timespan within the data. A state can have different attributes, e.g. its complexity or the entropy of the probability distribution of next events. These measures can additionally be used to look for patterns associated with the observations. This is accomplished with a tool of the Fraunhofer FKIE and will be described in more detail in the following.

4.4.1 Variable Length Markov Models (VLMM)

The parameterization of a VLMM is very intuitive. The algorithms grow a tree of sequences. Each node of the tree represents a subsequence. The root node is the empty sequence; the children of a node represent sequences which extend the sequence of their parent by one previous event. Every node within the tree is labelled by a unique sequence of events. Hence, the parent of a node represents a sequence that looks one step less into the past than its children. The algorithms grow the tree from the root and only include nodes which correspond to sequences that occur sufficiently often in the data and where the observation of the sequence contains more information about the event following next than if only a suffix of the sequence is considered (a suffix of a sequence is obtained by removing events from its beginning; for example, BC is a suffix of ABC). All algorithms are very similar and mainly differ in the criteria for when a node corresponding to a sequence is included in the tree. We use statistical criteria [25]. Each node is also associated with the observed empirical probability of which event follows next. Therefore, the tree can easily be used to make predictions: by looking up the leaf node that corresponds to a suffix of the observed sequence, its associated probability distribution of next events can be retrieved. The tree of sequences can be converted into a probabilistic state machine, also called a Probabilistic Suffix Automaton [26].
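To illustrate the core idea, a strongly simplified suffix-tree sketch in Python is given below: it counts next-event distributions conditioned on suffixes of the history and predicts from the longest sufficiently frequent suffix. The toy trace, the depth limit and the simple count threshold are hypothetical; the actual tool uses the statistical inclusion criteria of [25] and additionally requires that a child distribution differs from that of its parent.

from collections import Counter, defaultdict

trace = list("ABRRRABRRRRABRR")   # toy event trace (e.g., R could stand for a radar fixation)
MAX_DEPTH, MIN_COUNT = 3, 2

# Count, for every suffix up to MAX_DEPTH, how often each next event follows it.
next_counts = defaultdict(Counter)
for i in range(len(trace) - 1):
    for depth in range(MAX_DEPTH + 1):
        if i - depth + 1 < 0:
            break
        suffix = tuple(trace[i - depth + 1:i + 1]) if depth > 0 else ()
        next_counts[suffix][trace[i + 1]] += 1

# Keep only suffixes that occur sufficiently often; these are the nodes of the tree.
tree = {s: c for s, c in next_counts.items() if sum(c.values()) >= MIN_COUNT}

def predict(history):
    # Probability distribution of the next event for the longest known suffix of the history.
    for depth in range(MAX_DEPTH, -1, -1):
        suffix = tuple(history[-depth:]) if depth > 0 else ()
        if suffix in tree:
            counts = tree[suffix]
            total = sum(counts.values())
            return {event: round(n / total, 2) for event, n in counts.items()}

print(predict(list("ABR")))   # distribution of the event following ... A, B, R
print(predict(list("RRR")))   # distribution after a run of R events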


Figure 61 - Relation of the Probabilistic Suffix Tree (PST) and the Probabilistic Suffix Automaton (PSA).

To transform the tree (also called Probabilistic Suffix Tree, PST) into a PSA, the leaves are used as states and additional states are added to make sure that every state has a successor. The probabilistic state machine can be used for simulations or for visualizing typical event sequences in a state chart. Furthermore, additional attributes of the states can be calculated. The occurrence of states over time can also be determined and visualized this way.

4.4.1.1 Discrimination of states by event durations

For determining the user's state it is often important how much time has passed between different events, since this provides important information on whether, e.g., the user has read an entry. The idea is to check for each sequence not only whether a change in the prefix causes a change in the probability distribution, but also whether the durations of the events in the sequence contain information about the next event [27]. In order to decide this, the distribution of event durations within a sequence, in dependence on the observed next event, is analysed (Figure 62).

Figure 62 - Hypothetical distribution of event durations, if after one event (left) or a sequence of two events (right) a specific event is observed.

For a specific sequence, the possible next events are shown on the x-axis. The y-axis shows the durations of the events in the sequence, depending on which of the events indicated on the x-axis has been observed next. The left side of Figure 62 shows the case where a single previous event is considered. Depending on which event is observed next, the durations of that previous event differ; the mean event duration appears to be longer for one of the possible next events. Therefore, predictions can be improved if the durations of the previous events are considered, by including this sequence in the tree with additional information


about the event durations. The right side of Figure 62 shows an example where two previous events are considered. The patterns in the mean event durations contain information about the next event. To decide whether these differences are relevant, statistical tests can be applied, e.g. an analysis of variance (ANOVA). Because the durations are not normally distributed, we used the non-parametric Kolmogorov-Smirnov test.
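As an illustration, such a test on hypothetical fixation durations, split by the observed next event, could look as follows (scipy's two-sample Kolmogorov-Smirnov test; the duration values are synthetic).

import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(3)
# Hypothetical durations (ms) of the previous event, grouped by which event followed next.
durations_next_taxiin = rng.gamma(shape=2.0, scale=150.0, size=80)
durations_next_pushback = rng.gamma(shape=2.0, scale=220.0, size=80)

statistic, p_value = ks_2samp(durations_next_taxiin, durations_next_pushback)
print(f"KS statistic {statistic:.3f}, p-value {p_value:.4f}")
# A small p-value suggests that the durations of the previous events carry information
# about the next event, so the sequence is worth including in the tree with duration information.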

4.4.1.2 Complexity of States

To calculate the complexity of a stochastic process, the informational dimension of the block entropy is used to assess how much information is contained in previous events for the prediction of the next event, weighted by how far they lie in the past [28]. The more information about the further development of the system is contained in events far in the past, the higher the complexity of the stochastic process. As a measure of the information that the observation of a previous event sequence contains about the next event, the reduction of entropy is used. Entropy is a measure of the uncertainty with which an event occurs. When sequences s of length n are considered, the entropy of the probability distribution of the next event o can be calculated by

h_n = - Σ_{s, |s|=n} P(s) Σ_o P(o | s) log P(o | s)

The needed empirical probabilities can easily be obtained from the PST. For larger n, more of the past events are considered in predicting the next event and hence the entropy (the uncertainty about the next event) is reduced. If the memory of the process is limited, this value reaches a limit h = lim_{n→∞} h_n. Grassberger [29] used the area under the curve of h_n - h, i.e. Λ = Σ_n (h_n - h), as a measure of complexity for a stochastic process: C ≝ Λ. By this definition, information gain that comes from events farther in the past is weighted higher than information from events in the near past. This is illustrated in Figure 63.

Figure 63 - Illustration of the complexity measure of Grassberger.

In this way complexity is defined for the overall process. For an analysis, however, it is important to investigate the origin of the complexity. If the process is transformed into a PSA, the contribution Λ(q) of a single state q to the overall complexity can be determined [30]; it is based on the probabilities of next events given q, weighted by how far the information encoded in q reaches into the past and by how frequently q occurs.

There are two features of a state that determine its contribution to the overall complexity: first, how much information from far past events it contributes to the prediction of the next event, but also how frequently this state occurs in the process. If one wants to eliminate this correlation of state complexity with its frequency, it makes sense to divide the state complexity by its initial probability π(q), since the initial probability is defined to represent the expected relative frequency with which the state occurs in a sequence:

Λ̂(q) = Λ(q) / π(q)


It is natural to associate this complexity with workload, especially when the external factors can be isolated and only internal factors are considered in the generation of the interaction sequences [28]. Since the occurrence of a state is associated with time intervals, the complexity per state can be used to calculate a time series of this complexity measure.
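A small illustrative computation of these quantities on a toy event trace is given below; the trace and the depth limit are hypothetical, and the largest computed h_n is used as a crude stand-in for the limit h.

import math
from collections import Counter, defaultdict

trace = list("ABRRABRRRABRRABRRR")   # toy event trace
MAX_N = 4

def conditional_entropy(n):
    # h_n = - sum over contexts s of length n of P(s) * sum over next events o of P(o|s) log2 P(o|s)
    contexts = defaultdict(Counter)
    for i in range(n, len(trace)):
        contexts[tuple(trace[i - n:i])][trace[i]] += 1
    total = sum(sum(c.values()) for c in contexts.values())
    entropy = 0.0
    for counts in contexts.values():
        context_total = sum(counts.values())
        p_context = context_total / total
        for count in counts.values():
            p = count / context_total
            entropy -= p_context * p * math.log2(p)
    return entropy

h_values = [conditional_entropy(n) for n in range(MAX_N + 1)]
h_limit = h_values[-1]                                   # stand-in for lim h_n
complexity = sum(h_n - h_limit for h_n in h_values)      # Grassberger's area under (h_n - h)
print("h_n:", [round(h, 3) for h in h_values])
print("complexity:", round(complexity, 3))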

4.4.2 Scatterplot Matrix for Measures

The VLMM provides a good consolidated description of the occurrence of events and provided us with additional measures of the operator's state at a given time. To link these to other time series of measures we used a scatterplot matrix. A scatterplot matrix shows 2D scatterplots of all possible combinations of the considered measures in a table, where each column contains scatterplots with the same measure plotted on the x-axis and each row contains scatterplots with the same measure plotted on the y-axis. Beside the complexity measure derived from the VLMM, we also calculated the entropy for the diversity of next events in a timespan and the mean number of events within a timespan. Furthermore, we included the IBI of the heartbeat data as an additional workload measure and the mean number of flights as a task load measure in the scatterplot matrix. We normalized each time series by calculating, for each considered measure, a mean value for timespans of 10 seconds. In this way each point in a scatterplot is associated with a timespan and therefore there is a link between the dots in the scatterplot and the occurrence of states within this timespan. By selecting outliers within one of the scatterplots, the corresponding states can be identified and their occurrence on the timeline can be displayed in a histogram. By including markers for the observations in the histogram, correlations of states or possible outliers in the scatterplot with the observations can be identified. To improve this capability we coloured points red that correspond to time intervals shortly before the observations with the most negative impact (-5).
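A minimal sketch of the 10-second aggregation and the scatterplot matrix, with hypothetical column names standing in for the measures described above, could look as follows.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix

rng = np.random.default_rng(4)
timestamps = pd.date_range("2015-06-01 09:00", periods=3000, freq="s")
df = pd.DataFrame({
    "state_complexity": rng.gamma(2.0, 1.0, timestamps.size),
    "event_entropy": rng.uniform(0, 2, timestamps.size),
    "events_per_second": rng.poisson(3, timestamps.size),
    "ibi_ms": rng.normal(800, 60, timestamps.size),
    "flights": rng.integers(0, 8, timestamps.size),
}, index=timestamps)

# Mean value of every measure over non-overlapping 10-second windows.
windows = df.resample("10s").mean()

scatter_matrix(windows, figsize=(8, 8), alpha=0.4)
plt.show()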

4.4.3 Visualization of Sequential Patterns

In this section we shortly explain the visualizations we used to present and analyse the sequential patterns found by the VLMM algorithms. Figure 66 shows the user interface. On the right there is the state chart. On the left, the user to be analysed, whose event data should be displayed in the state chart, can be selected. The tree view for the PST is located in the centre. Nodes can be expanded for exploration and subsequences can be selected for display. At the bottom a timeline histogram is displayed that shows the distribution of the occurrence of a selected state or of corresponding dots in the scatterplots. The top left centre can be used to show different visualizations, which can be selected by the tabs on the top. Figure 64 displays the probability distributions of the selected state superimposed on a screenshot of the Sixth Sense simulator user interface. The colour scheme of the nodes in the state chart can be changed to frequency, complexity, or uni. The state chart contains only links between states with a sufficiently high frequency of events, except for links that are needed to make sure that every node in the state chart has a next and a preceding link.


Figure 64 - Screenshot of the user interface with displayed transition probabilities.

Figure 65 shows in more detail how the probabilities of next events associated with a state are visualized. Green circles: next events (eye/mouse); blue circle: prefix / starting point of the sequence; light red circles: middle part of the sequence. By mouse-over/click on the prefix, details of the next events are shown. Arrows indicate how the probabilities of next events change if the middle sequence is extended by the prefix. In this case the middle sequence is radar->pushback->radar and the prefix considered is pushback. The arrows show that the probability to return to pushback increases from 68% to 80% and to off-screen from 4.5% to 14.9%, whereas the probability to go to taxiin decreases from 9.1% to 1.1% and to pending departures from 13% to 1.1%. The width of the arrows scales with the overall probabilities, and the proportion of red and green colour with the extent of the change.

Figure 65 - Illustration of how probabilities for next events in a sequence are displayed.


Figure 66 shows how the scatterplot matrix has been used. Within each scatterplot a region can be selected by the mouse. The frequency of states occurring in corresponding time intervals to the selected points in the scatterplot will be reflected by an adapted colouring of the nodes in the state chart if the frequency colouring scheme is selected. The other way around, if states are selected in the state chart all points in the scatterplot that do not co-occur will be dimmed. In this way, the analyst can easily switch between the different views and drill down into patterns of interest.

Figure 66 - Illustration of the user interface combining states with displayed scatterplot matrix.

4.4.4 Insights regarding interaction sequences

In the following section we will present and discuss examples of eye and mouse state sequences that we have found. With respect to RQ10 and RQ15 we are interested in concise sequences that might be used as indicators and predictors of an operator's behaviour.

4.4.4.1 Most frequent States per User

In RQ15 we ask whether we are able to determine the most preferred mouse and eye sequences per user. For this we look for the most frequent state sequences in the data. In general it would not be easy to say at which length to cut such sequences, but choosing the state sequences of the VLMM is a plausible choice. Table 11 lists the top 5 most frequent eye state sequences for each user. In the tables in this section you will notice prefixes like Not(event1 || event2). This means that the state represents a sequence whose first element is not listed in the parentheses of the Not statement and whose remainder, without the first element, matches the rest of the sequence after the Not statement.


User | Eye state sequence | N | Complexity
user6 | Not(radar), taxiin, taxiin | 1050 | 5.13899
user6 | startuppushback, startuppushback | 511 | 5.41615
user6 | Not(taxiin), radar, radar, radar, radar, radar, radar | 525 | 10.4532
user6 | pendingdepartures, pendingdepartures | 415 | 5.27982
user6 | handoverrunway | 337 | 4.24181
user7 | Not(taxiin), radar, radar, radar, radar, radar, radar, radar, radar, radar, radar | 1406 | 10.6399
user7 | Not(startuppushback || offscreen || radar), taxiin | 646 | 3.57608
user7 | pendingdepartures | 340 | 3.76519
user7 | startuppushback, startuppushback | 266 | 5.70176
user7 | handoverrunway | 226 | 3.9879
user8 | Not(pendingdepartures), radar, radar, radar, radar | 1853 | 6.68078
user8 | taxiin, taxiin | 933 | 5.11017
user8 | pendingdepartures | 425 | 3.90836
user8 | offscreen | 317 | 3.95789
user8 | handoverrunway | 294 | 4.12408

Table 11 - Most frequent state sequences for the eye data (top 5 for each user).

The most frequent state sequence is very distinct for each user and occurs nearly twice as often as the second one in the ranking list. It is all the more remarkable that these distinct top states are similar for user 7 and user 8 and both contain longer sequences on radar, whereas for user 6 the state describes focused attention on taxiin. Looking at the number of fixations, this focused attention on the radar can also be seen: user 6: 35% radar, 26% taxiin; user 7: 64% radar, 13% taxiin; user 8: 53% radar, 20% taxiin. Surprisingly, though, for user 6 the overall number of fixations on radar is higher than the number of fixations on taxiin, while the state sequence containing radar is only the third most frequent state sequence for user 6. This apparent contradiction indicates that user 6 uses the radar in a much more flexible way, combining it with different AOIs. Table 12 illustrates some of the most frequent eye state sequences by displaying the delta in the transition probabilities.

User | Eye state sequence | N | Complexity
user6 | not(radar), taxiin, taxiin | 1050 | 5.1
user7 | not(taxiin), radar, radar, radar, radar, radar, radar, radar | 1400 | 10.3
user8 | radar, radar, radar, radar | 1834 | 6.68

Table 12 - Most frequent states of each user for the eye fixation sequences.

The same analysis can be applied to the mouse moves. Table 13 lists the top 5 most frequent mouse state sequences for each user.

User | Mouse state sequence | N | Complexity
user2 | radar | 537 | 3.9514
user2 | taxiin | 464 | 4.3576
user2 | startuppushback | 386 | 4.34803
user2 | Not(pendingdepartures), handoverrunway | 87 | 4.80334
user2 | nowhere, nowhere, nowhere, nowhere | 89 | 2.77347
user3 | nowhere, nowhere, nowhere | 100 | 2.00923
user3 | bottomcenter | 30 | 3.06534
user3 | radar | 30 | 3.48681
user3 | Not(nowhere), nowhere | 7 | 2.57605
user3 | toppanel | 6 | 3.99731
user4 | radar | 1291 | 2.67213
user4 | taxiin | 295 | 3.50671
user4 | handoverrunway | 86 | 3.69004
user4 | Not(startuppushback), startuppushback | 82 | 4.07647
user4 | startuppushback, startuppushback | 74 | -0.0556529


user5 | taxiin | 113 | 0.798235
user5 | startuppushback | 52 | 0.600654
user5 | handoverrunway | 35 | 0.734859
user5 | radar | 32 | 1.7067
user5 | pendingdepartures | 27 | 0.5925
user6 | radar, radar, radar, radar | 392 | 2.07238
user6 | taxiin, taxiin, taxiin, taxiin | 285 | 2.06089
user6 | startuppushback | 235 | 3.79028
user6 | taxiout | 96 | 4.26805
user6 | Not(taxiin), taxiin | 85 | 3.63707
user7 | startuppushback | 181 | 3.65007
user7 | taxiin, taxiin | 164 | 4.45134
user7 | radar, radar | 37 | 4.61829
user7 | taxiout | 32 | 4.01068
user7 | leftpanel | 23 | 2.64312
user8 | taxiin, taxiin | 229 | 1.3014
user8 | startuppushback, startuppushback | 94 | 1.35668
user8 | taxiout | 53 | 4.06747
user8 | Not(taxiin), taxiin | 50 | 3.53306
user8 | Not(startuppushback), startuppushback | 38 | 3.82724

Table 13 - Most frequent state sequences for the mouse data (top 5 for each user).

The detected mouse state sequences are significantly shorter than the eye state sequences; many state sequences contain only a single event. The reason for this is that far fewer mouse events were present, since the mouse was not moved as much as the eye: e.g. for user6, 5795 eye fixations were recorded but only 1503 mouse "fixations" (user 7: 529 mouse vs. 6963 eye; user 8: 562 mouse vs. 6387 eye). Most mouse events were registered for user 4 and user 2 (1977 and 1883), and the fewest mouse events were recorded for user3 and user5 (188 and 282). In this sense it is remarkable that user2 has no state sequence with radar that is longer than one event, which indicates that there really are no longer correlations to the past for mouse events associated with radar. In contrast, the most frequent state sequence of user6 contains four successive radar events. Also for user2 there are longer mouse fixation sequences with successive radar fixations. Considering only a sequence with a single radar fixation, the probability to stay on radar for user2 is only slightly lower than for user6 (80% vs. 88%). But for user6 the probability distribution of next events changes the more often radar is repeated in the sequence: for user6 the probability to stay on radar increases up to 92%, whereas for user2 it stays constant, so that there is no reason to consider longer state sequences repeating radar.

4.4.4.2 Most complex states per user

Another way to look at the state sequences is to focus on the most complex state sequences. Table 14 shows the top 5 complex states of the VLMM of the eye tracking data for each user. All complex state sequences contain many successive radar fixations preceded by another fixation, mostly taxiin. These states indicate that the probability to return to an AOI is increased after successive fixations on radar. This reflects that in the cognitive workflow the operator has to complete a task associated with the preceding AOI, e.g. taxiin, by collecting the necessary information on the radar, and then returns to the AOI where the information has to be put in. Besides taxiin, other AOIs preceding successive radar fixations are pendingdepartures and startuppushback. For user6 the probability that taxiin follows six successive radar fixations increases from 5% to 25% if taxiin also precedes them. For user6 the successive radar fixations are longer if taxiin precedes than if startuppushback precedes. This effect does not seem to be statistically significant, since the state sequence taxiin, radar, radar, radar (not listed in the table) occurs 95 times in the data, nearly as often as startuppushback, radar, radar, radar (83 times). This means that at least user6 needs more radar fixations when working on taxiin than when working on startuppushback.

User | Eye state sequence | N | Complexity
user6 | taxiin, radar, radar, radar, radar, radar | 57 | 10.5224
user6 | Not(taxiin), radar, radar, radar, radar, radar, radar | 525 | 10.4532
user6 | Not(taxiin || radar), radar, radar, radar, radar, radar | 88 | 9.22915
user6 | taxiin, radar, radar, radar, radar | 69 | 9.04634
user6 | startuppushback, radar, radar, radar | 83 | 8.17981
user7 | taxiin, radar, radar, radar, radar, radar, radar, radar, radar, radar | 51 | 11.7606
user7 | Not(taxiin), radar, radar, radar, radar, radar, radar, radar, radar, radar, radar | 1406 | 10.6399
user7 | taxiin, radar, radar, radar, radar, radar, radar, radar, radar | 61 | 10.1068
user7 | Not(taxiin || radar), radar, radar, radar, radar, radar, radar, radar, radar, radar | 109 | 9.91921
user7 | offscreen, radar, radar, radar, radar, radar, radar, radar | 57 | 9.64865
user8 | pendingdepartures, radar, radar, radar, radar | 31 | 8.02889
user8 | startuppushback, radar, radar, radar | 77 | 6.82221
user8 | pendingdepartures, radar, radar, radar | 41 | 6.70428
user8 | Not(pendingdepartures), radar, radar, radar, radar | 1853 | 6.68078
user8 | offscreen, radar, radar | 129 | 5.84961

Table 14 - Most complex state sequences for the eye tracking data (top 5 for each user).

User 7 also shows these long successive radar sequences preceded by taxiin, whereas the model for user 8 mainly contains long radar sequences preceded by pendingdepartures and startuppushback. Although the sequence taxiin, radar, radar, radar, radar is present 124 times in the data of user 8, compared to 69 times for user 6, it is not included in the model, since the probability to return to taxiin is not increased. Table 15 illustrates the most complex eye state sequences and their occurrences on the timeline. The pictures show the effect discussed so far: on long radar sequences the probability to stay on radar decreases and the probability to go back to the fixation before the long radar sequence increases. Analysing the histogram of the occurrence of these complex events does not show any correlations with the negative observations.


User | Eye state sequence | N | Complexity
user6 | taxiin, radar, radar, radar, radar, radar | 52 | 10.52
user7 | taxiin, radar, radar, radar, radar, radar, radar, radar, radar, radar | 51 | 11.75
user8 | pendingdeparture, radar, radar, radar, radar | 31 | 8.2

Table 15 - Illustration of the most complex state sequences for the eye tracking data.

Looking at the most complex state sequences for the mouse data (Table 16), there are no such complex state sequences as for the eye tracking; we already discussed this above. However, for user2, user4, user6 and user7 there is at least one state sequence containing more than one event. They are shown in Table 17.

User | Mouse state sequence | N | Complexity
user2 | pendingdepartures, handoverrunway | 31 | 6.537
user2 | Not(nowhere), nowhere, nowhere, nowhere | 15 | 6.12915
user2 | pendingdepartures, pendingdepartures | 71 | 5.56096
user2 | Not(nowhere), nowhere, nowhere | 24 | 5.08845
user2 | Not(pendingdepartures), handoverrunway | 87 | 4.80334
user3 | toppanel | 6 | 3.99731
user3 | radar | 30 | 3.48681
user3 | leftpanel | 5 | 3.46379
user3 | taxiin | 2 | 3.10449
user3 | bottomcenter | 30 | 3.06534
user4 | Not(startuppushback), startuppushback | 82 | 4.07647
user4 | taxiout | 21 | 4.04965
user4 | nowhere | 17 | 3.96738
user4 | onblock | 16 | 3.79514
user4 | handoverrunway | 86 | 3.69004
user5 | startuppushback#taxiin | 1 | 2.4141
user5 | taxiout | 1 | 2.4141
user5 | taxiin#handoverrunway | 2 | 2.4141
user5 | onblock | 4 | 2.4141
user5 | bottomcenter | 1 | 2.41407
user6 | Not(taxiin), taxiin, taxiin, taxiin | 48 | 5.81504
user6 | Not(taxiin), taxiin, taxiin | 61 | 4.74621
user6 | Not(radar), radar, radar, radar | 35 | 4.67024
user6 | onblock | 12 | 4.61976
user6 | taxiout | 96 | 4.26805
user7 | startuppushback, taxiin | 30 | 4.89652
user7 | radar, radar | 37 | 4.61829
user7 | taxiin, taxiin | 164 | 4.45134
user7 | taxiout | 32 | 4.01068
user7 | nowhere | 6 | 3.82285
user8 | radar | 19 | 4.57338
user8 | startuppushback#taxiin | 4 | 4.40386
user8 | pendingdepartures | 14 | 4.38562
user8 | taxiout | 53 | 4.06747
user8 | handoverrunway | 15 | 3.96427

Table 16 - Most complex state sequences for the mouse data (top 5 for each user).

For user2 the most complex state sequence means that if the user moved the mouse from pendingdepartures to handoverrunway, he will next move it with increased probability to startuppushback and not to taxiin, as he otherwise would if not starting from pendingdepartures. For user4 the most complex state sequence tells us that two mouse fixations on startuppushback in succession increase the probability to move the mouse to taxiin. For user6 the visualization of the displayed state sequence shows that if there has been a mouse fixation outside taxiin before a succession of three mouse fixations within taxiin, the next mouse fixation will most probably stay in taxiin. For user7 a startuppushback mouse fixation preceding a taxiin mouse fixation increases the probability of the next mouse fixation being on startuppushback again. By the numbers, the mouse fixations of user2 are more complex than those of the other users.

User | Mouse state sequence | N | Complexity
user2 | pendingdeparture, handoverrunway | 31 | 6.5
user4 | Not(startuppushback), startuppushback | 82 | 4.07
user6 | Not(taxiin), taxiin, taxiin, taxiin | 48 | 5.81504
user7 | startuppushback, taxiin | 30 | 4.89652

Table 17 - Illustration of the most complex state sequences for the mouse data.


For the eye tracking data no obvious correlation of negative states with negative observations can be read from the histograms.

4.4.5 States corresponding to outliers and around

Additionally, we looked for interesting states by associating outliers in the scatterplots with co-occurring states. Table 18 illustrates some findings. Overall, a strict application of this method was not possible, since for no user were all sensor data available: for users 6-8 there were no heartbeat data available, whereas for users 3-5 there were heartbeat data but no eye movement data. For convenience, and only to demonstrate the method, we only use examples from the scatterplot correlating the entropy (diversity) of mouse and eye fixations. Overall, both measures do not seem to be correlated. Looking for outliers in scatterplots where both measures are in general expected to be correlated is more promising, since the usual correlation can easily be identified and outliers stand out more; one such pair of measures might have been the heartbeat data and the number of eye fixations. However, outliers in uncorrelated measures may also reveal some variation from the usual behaviour.

User | Most frequent eye sequence co-occurring with outliers in the eye entropy vs. mouse entropy scatterplot
User 6 | pendingdeparture, pendingdeparture
User 7 | Not(radar || taxiin || startpushup), taxiin
User 8 | offscreen

Table 18 - Examples of state sequences corresponding to outliers in the scatterplots.

The procedure for creating Table 18 was as follows. We marked a cluster of outliers in the scatterplot shown on the left. The histogram below the scatterplot displays the corresponding time intervals with co-occurring states. We then looked for the most frequent state within these time intervals and selected it to display its histogram and see where this state occurs elsewhere. As can be seen in these examples, the outliers, at least for user 6 and user 7, lie in the vicinity of a negative observation, but the corresponding state sequences are not distinctive enough and occur in many other time intervals which cannot be associated with negative observations.


4.5 Conclusion

In order to collect meaningful data for the Sixth Sense project, we prepared and performed two experiments with ATCOs. This required intensive research and integration work, including a review of the state of the art in air traffic control, data mining and machine learning, decision making, psychology and other important topics. We designed and implemented a software framework that allows collecting data in real time about the users' behaviours. For this we had to integrate different systems, sensors and sources of information. We started by unifying all the air traffic data sources with all the sensor technologies used (e.g., heart rate and heart rate variability, Kinect and body/head posture, eye tracker and areas of interest, user interaction information, environmental sensors, air traffic controller workflow step analysis, real-time data filtering, the capability of replaying complete experiments and other functionalities). We do not yet use all the capabilities of our framework, namely the graph database registration and graph analysis or the prediction engine component, because these components were not the main scope of this project.

In the first ATC experiment, exercise 1, we collected enough data to test and confirm the practical utility of our data collection approach and sensor integration. This first exercise also helped to clarify our research questions, allowing us to create a common and clearer picture of our final goals. In the second experiment, exercise 2, we collected much more data and automated the manual data collection steps. We integrated the supervisor, observer and ATCO stress level reports into our software framework to allow us to automatically analyse and treat all the aspects related to the "think aloud" and observational protocols. All eight users had to answer several questionnaires, from which we extracted valuable information about the different preferences and user experiences when handling air traffic. We used the questionnaire outcomes to search for answers regarding the difficulty of the experiment, the usability of the system, workload, situational awareness, performance and other important measures. Including questionnaires in the data analysis is definitely a promising option, but due to limited resources and data quantity we could not exploit its full potential.

From the two exercises we collected at least 600.000 events distributed among several datasets. Handling the complexity and amount of data (many events, different datasets, multiple variables, time series, and behavioural sensor data) required multiple strategies for pre-processing, analysis, discussion sessions, exploration and visualization. The obtained results are presented in this report.

In Sixth Sense we are interested in looking for patterns or hidden signs in the data that allow us to detect moments of bad and good decisions and that could be incorporated in an automated system in order to detect and predict the users' next actions. Based on psychological findings, the metrics obtained from the experiments were aggregated into task load, mental workload, attention, behaviour, and performance categories. This categorization established a ground truth of possible useful predictors to detect moments of high workload, high stress, and loss of situational awareness. Guided by these findings, 15 research questions were established and addressed during the data analysis.
This includes the exploration of:
- the number of arrivals and departures per minute in relation to errors,
- increases in eye movements during periods of high workload and their relation to the occurrence of negative observations,
- the relation between mouse pauses and increases in eye fixation times, the number of areas of interest visited per minute, lower heart rate variability, and how the voice communications (number and speed of words spoken) are related to negative observations,
- the most preferred areas of interest of the users,
- how we might use the Kinect head pose and sound source angle variables to detect problematic time periods, which might allow us to reduce the amount of data that needs to be analysed in real time.


We gathered promising evidence that the strategies employed and the predictors found will be very useful in the design of a new automated cyber-physical system that is able to detect unusual behavioural situations in the field of air traffic control. The most promising metrics, and as a consequence the most promising hints, were found in relations between different data streams. One is the link between reductions in mouse movement and increases in eye movements, coincident with the occurrence of negative observations. The heart rate variability, together with the reduction in mouse activity, the number of visual UI objects to be managed and the eye tracking AOI frequency and duration, provides very good clues for anticipating moments of stress and high workload. There are direct relations between an increase in the number of words used by the air traffic controllers and the occurrence of negative observations. We also found a correlation between the users' head position and negative observations that indicates promising possibilities for creating prediction models.

The presented results show how important the incorporation of behavioural analysis is for the design of automated systems that are able to analyse, detect and predict unsafe situations, and of systems that are even able to react or advise for better and safer actions. Our results can also be applied to the improvement of existing systems and user interfaces. Taking into consideration that we were mainly interested in behavioural indicators, for which the incorporation of new sensors and methods like a voice recognition system, an eye tracker, etc. was essential, we believe that we identified behavioural causes that play an important role in the reporting of higher stress levels and high workload, or even in the loss of situational awareness. These behavioural causes are, for example, the number of visual objects to be handled (arrivals and departures per minute), the number of areas to be monitored, delays and problems in the communication with the pilot, time accumulation and also emotional factors.


4.6 Future Work

New experiments to collect more data would be the next step. To improve data quality and quantity per experiment, the sensor output could be distributed to multiple machines in order to reduce the data load on a single machine; this load was the reason for the lack of Kinect data. In order to answer specific questions about situational awareness or task completion times we would need to create specific and shorter experiments focused on smaller tasks. This would also make it simpler to measure time or speed.

We envision the use of graph models and prediction engines applied to behaviour analysis and to the prediction of the next user actions or next best suggestions. Deep learning and agent-based models are also important components for building more intelligent systems, especially to incorporate cognitive features that better map the users' behaviours and decision-making processes. Here the inclusion of cognitive architectures could also be beneficial.

The collection of real data is a challenge due to the extensive preparation and system integration work. A plug-and-play solution for connecting our components and sensors to an existing standardized simulation and sensing platform would be desirable. This would save precious time and the results would be more comparable to other studies (e.g. [31], [32]).

We would like to see developments in the creation of real-time behavioural monitoring dashboards that can account for the calculation of different types of costs. Specifically in the case of Sixth Sense, we would like to improve the implementation of costs related to interaction and user behaviour (e.g., costs related to having to focus attention on less meaningful areas of interest, or costs related to useless movement of the eyes, mouse or body). This would allow us to quantify specific decisions and to analyse the impact of those decisions not only in terms of time or effort but also in terms of financial effort.

We expect that more and more gesture, voice and natural language based user interface capabilities will be used by ATCOs. Therefore, there will be the need to perform new, similar studies that take into account the usage of more assistive technologies in the workplace. We also envision the incorporation of emotional costs, e.g. analysing the costs of frustration in the communications between ATCOs and pilots, or of periods of inactivity, extreme effort or dislike. The system would gain from the use of additional emotion-related sensing technologies. Regular manual reports about the stress level or the current status of the environment perceived by the users, such as room temperature, air humidity or noise, would also be a beneficial contribution. However, what we are aiming at is the inclusion of sensors that can sense most of these environmental or psychological factors automatically. By incorporating this data we can create even more innovative automated cognitive systems.

We made the first steps in this direction by calculating complex interactivity metrics like the number of AOIs visited, the interaction effort in terms of the number of words used, the areas of interest visited, the number of visual objects to be handled or the "UI interaction pace" measure, and by analysing the processing time of airplanes globally and at each workflow step, taking the ATCOs' standardized workflow processes for handling airplane departures and arrivals as the basis for our analysis. However, this definitely requires further specific and complex research. It would also require the inclusion of more sensing technologies for detecting emotions, like electroencephalography and camera-based approaches. It is also desirable to have better real-time body tracking capabilities to increase the awareness of the automated system with respect to the users' position and posture. However, in the scope of Sixth Sense we tried to make only minimal changes to the working environment of the ATCO. Finally, it would be very interesting to quantify the factors mentioned above in terms of real financial impact and not only in terms of ergonomic or time constraints.


References

[1] A. Vosskühler, V. Nordmeier, L. Kuchinke and A. M. Jacobs, "OGAMA (Open Gaze and Mouse Analyzer): open-source software designed to analyze eye and mouse movements in slideshow study designs," Behaviour Research Methods, pp. 1150-1162, 2008.
[2] A. Haarmann, W. Boucsein and F. Schaefer, "Combining electrodermal responses and cardiovascular measures for probing adaptive automation during simulated flight," Applied Ergonomics, 40(6), 2009.
[3] K. F. Van Orden, T. P. Jung and S. Makeig, "Combined eye activity measures accurately estimate changes in sustained visual task performance," Biological Psychology, 52, pp. 221-240, 2000.
[4] P. A. Hancock, G. Williams and C. Manning, "Influence of task demand characteristics on workload and performance," The International Journal of Aviation Psychology, Special Issue on Pilot Workload: Contemporary Issues, 5(1), pp. 63-86, 1995.
[5] W. Rohmert, "Das Belastungs-Beanspruchungskonzept," Zeitschrift für Arbeitswissenschaft, pp. 193-200, 1984.
[6] "DIN EN 10 075-1," Ergonomische Grundlagen bezüglich psychischer Arbeitsbelastung. Teil 1: Allgemeines und Begriffe, 2000.
[7] M. A. Neerincx, "Cognitive task load design: Model, methods and examples," Handbook of Cognitive Task Design, pp. 283-305, 2003.
[8] T. E. de Greef and H. F. R. Arciszewski, "Triggering Adaptive Automation in Naval Command and Control," Frontiers in Adaptive Control, pp. 165-188, 2009.
[9] P. A. Hancock and M. H. Chignell, "Input information requirements for an adaptive human-machine system," Proc. of the Tenth Department of Def. Conf. Psych., vol. 10, pp. 493-498, 1986.
[10] J. A. Veltman and C. Jansen, "Differentiation of Mental Effort measures: Consequences for Adaptive Automation," Operator Functional State, pp. 249-259, 2003.
[11] A. H. Roscoe, "Assessing pilot workload. Why measure heart rate, HRV and respiration?," Biological Psychology, 34, pp. 259-287, 1992.
[12] J. A. Veltman and A. W. K. Gaillard, "Physiological indices of workload in a simulated flight task," Biological Psychology, 42(3), pp. 323-342, 1996.
[13] B. Mulder, H. Rusthoven, M. Kuperus, M. de Rivecourt and D. de Waard, "Short-term heart rate measures as indices of momentary changes in invested mental effort," Human Factors Issues, 2007.
[14] D. Manzey, "Psychophysiologie mentaler Beanspruchung," Ergebnisse und Anwendungen der Psychophysiologie (Enzyklopädie der Psychologie, C, Serie L, Bd. 5), pp. 799-864, 1998.
[15] K. F. Van Orden, W. Limbert, S. Makeig and T. P. Jung, "Eye activity correlates of workload during a visuospatial memory task," Human Factors, 43(1), pp. 111-121, 2001.
[16] M. S. Young and N. A. Stanton, "Attention and automation: New perspectives on mental underload and performance," Theoretical Issues in Ergonomics Science, 3, pp. 178-194, 2002.
[17] A. Mack and I. Rock, "Inattentional Blindness," Cambridge, MA: MIT Press, 1998.
[18] M. A. Just and P. A. Carpenter, "A theory of reading: From eye fixations to comprehension," Psychological Review, 87(4), pp. 329-354, 1980.
[19] A. H. Bellenkes, C. D. Wickens and A. F. Kramer, "Visual scanning and pilot expertise: The role of attentional flexibility and mental model development," Aviation, Space, and Environmental Medicine, pp. 569-579, 1997.
[20] J. R. Tole, A. T. Stephens, M. Vivaudou, A. R. Ephrath and L. R. Young, "Visual scanning behavior and pilot workload," NASA Contractor Report No. 3717, 1983.
[21] J. C. Sperandio, "The regulation of working methods as a function of workload among air traffic controllers," Ergonomics, 21, pp. 195-202, 1978.
[22] S. Lehmann, R. Dörner, U. Schwanecke, N. Haubner and J. Luderschmidt, "UTIL: Complex, Post-WIMP Human Interaction with Complex Event Processing Methods," Workshop "Virtuelle und Erweiterte Realität", pp. 109-120, 2013.
[23] "EsperTech: Event Series Intelligence," EsperTech Inc., 2015. [Online]. Available: http://www.espertech.com/esper/nesper.php. [Accessed 7 7 2015].
[24] Z. He, X. Xu and S. Deng, "Discovering Cluster Based Local Outliers," Pattern Recognition Letters, pp. 9-10, 2003.
[25] C. Winkelholz and C. M. Schlick, "Statistical Variable Length Markov Chains for the Parameterization of Stochastic User Models from Sparse Data," in IEEE International Conference on Systems, Man, and Cybernetics, The Hague, 2004.
[26] D. Ron, Y. Singer and N. Tishby, "The Power of Amnesia: Learning Probabilistic Automata with Variable Memory Length," Machine Learning, vol. 25, no. 2/3, pp. 117-149, 1996.
[27] F. Kruger, C. Winkelholz and C. M. Schlick, "System for a model based analysis of user interaction patterns within web-applications," in IEEE International Conference on Systems, Man, and Cybernetics, Anchorage, Alaska, 2011.
[28] C. M. Schlick, C. Winkelholz, F. Motz and H. Luczak, "Self-Generated Complexity and Human-Machine Interaction," IEEE Transactions on Systems, Man, and Cybernetics, Part A, vol. 36, no. 1, pp. 220-232, 2006.
[29] P. Grassberger, "Towards a quantitative theory of self-generated complexity," International Journal of Theoretical Physics, vol. 25, no. 9, pp. 907-938, 1986.
[30] C. Winkelholz and F. Kruger, "Anwendung des EMS-Werkzeugkastens zur Analyse von Mensch-Technik-Interaktion im militärischen Kontext," Fraunhofer FKIE, Wachtberg, 2012.
[31] A. Isaac, O. Straeter and D. Van Damme, "A Method for Predicting Human Error in ATM (HERA-PREDICT)," HRS/HSP-002-REP-07, Bretigny-Sur-Orge, France: EUROCONTROL, 2004.
[32] S. Loft, S. Sanderson, A. Neal and M. Mooij, "Modeling and Predicting Mental Workload in En Route Air Traffic Control: Critical Review and Broader Implications," Human Factors: The Journal of the Human Factors and Ergonomics Society, pp. 376-399, 2007.
[33] G. Costa, "Evaluation of workload in air traffic controllers," Ergonomics, 36(9), pp. 1111-1120, 1993.
[34] G. Cugola and A. Margara, "Processing Flows of Information: From Data Stream to Complex Event Processing," ACM Computing Surveys, 44(3), pp. 15:1-15:62, 2012.
[35] B. Hilburn and P. G. Jorna, "Workload and air traffic control," Stress, Workload and Fatigue, 2001.
[36] R. M. Rose and L. F. Fogg, "Definition of a responder. Analysis of behavioral, cardiovascular and endocrine response to varied workload in air traffic controllers," Psychosomatic Medicine, 55, pp. 325-338, 1993.

81 of 98

©SESAR JOINT UNDERTAKING, 2011. Created by Fraunhofer Austria, FREQUENTIS AG for the SESAR Joint Undertaking within the frame of the SESAR Programme co-financed by the EU and EUROCONTROL. Reprint with approval of publisher and the source properly acknowledged.

Appendix A

A.1 Technical Verification Details of Exercise 1 and 2

During exercise 1, the accuracy of the sensor systems was evaluated to determine their suitability for use within the experiment.

Technology:

The Tobii REX Developer Edition used in the Sixth Sense project is a gaze interaction device available to developers of interactive applications. It is mounted on a monitor or laptop screen and measures the user's gaze position on the screen in real time.

Technical Specification:

• Sampling rate: 30 Hz (std. dev. approx. 3 Hz)
• Freedom of head movement, width x height at 70 cm: 50 x 36 cm (20 x 14 inch)
• Operating distance (eye tracker to subject): 40 – 90 cm
• System latency: 48 – 67 msec
• Mounting alternatives: adhesive mounting brackets for monitors, laptops and tablets; desk stands for tripods and desks
• Windows Operating Systems 7 and 8, both 32-bit and 64-bit

Figure 67 - Eye-Tracking

Setup:
• 24" wide screen (Full HD)
• 75 cm table height
• 65 cm distance to eyes
• 18 cm screen height
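To put the pixel deviations reported below into perspective, the setup geometry above can be used to convert on-screen offsets into degrees of visual angle. The following minimal sketch in Python assumes a 16:9 panel with roughly 53.1 cm of visible width (the exact panel dimensions were not recorded here) and the 65 cm viewing distance listed above:

    import math

    # Assumed geometry from the setup above: 24" 16:9 screen (approx. 53.1 cm visible width)
    # at Full HD resolution, viewed from 65 cm.
    SCREEN_WIDTH_CM = 53.1       # assumption: 16:9 panel, 24" diagonal
    SCREEN_WIDTH_PX = 1920
    VIEWING_DISTANCE_CM = 65.0

    CM_PER_PX = SCREEN_WIDTH_CM / SCREEN_WIDTH_PX

    def px_to_visual_angle_deg(offset_px: float) -> float:
        """Convert an on-screen offset in pixels to degrees of visual angle."""
        offset_cm = offset_px * CM_PER_PX
        return math.degrees(math.atan2(offset_cm, VIEWING_DISTANCE_CM))

    # Example: a 41 px deviation corresponds to roughly 1 degree of visual angle.
    print(round(px_to_visual_angle_deg(41), 2))

Under these assumptions, deviations of a few tens of pixels therefore stay well below one degree of visual angle.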


Figure 68 - Test Setup - Eye-Tracking

Evaluation:
The evaluation was performed by recording and analysing the eye-tracking data of four people, based on a five-point calibration. 300 data sets were recorded for each person and each point.
• Test Person 1: Female, no glasses, no contact lenses
• Test Person 2: Male, no glasses, no contact lenses
• Test Person 3: Male, contact lenses
• Test Person 4: Male, glasses

Based on the collected data, the following metrics have been calculated for each calibration point (a computation sketch is given below). Additionally, a visualisation of the measured area in relation to the reference points has been generated. This information has been used to analyse the quality of the eye-tracking data.
• Average coordinates
• Average coordinates – Reference Point
• Standard Deviation
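As an illustration of how these three metrics can be derived from the raw recordings, the following sketch (Python with pandas) computes them per test person and calibration point. It is not the analysis tool actually used; the file name gaze_samples.csv and its columns person, point, x and y are hypothetical, while the reference coordinates are those listed in Figure 69.

    import pandas as pd

    # Hypothetical input: one row per gaze sample with columns person, point, x, y.
    samples = pd.read_csv("gaze_samples.csv")

    # Reference coordinates of the five calibration points on the Full HD screen (Figure 69).
    reference = {1: (100, 100), 2: (1820, 100), 3: (100, 980), 4: (1820, 980), 5: (960, 540)}

    rows = []
    for (person, point), grp in samples.groupby(["person", "point"]):
        ref_x, ref_y = reference[point]
        rows.append({
            "person": person,
            "point": point,
            "avg_x": grp["x"].mean(),
            "avg_y": grp["y"].mean(),
            "avg_minus_ref_x": grp["x"].mean() - ref_x,
            "avg_minus_ref_y": grp["y"].mean() - ref_y,
            "std_x": grp["x"].std(),   # sample standard deviation (ddof=1)
            "std_y": grp["y"].std(),
        })

    metrics = pd.DataFrame(rows)
    print(metrics)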

Evaluation Result:
Based on the collected data, the availability and quality of the data have been analysed. The measured points of each test person in relation to the reference points are listed below.


Reference points (X, Y): Point 1 (100, 100); Point 2 (1820, 100); Point 3 (100, 980); Point 4 (1820, 980); Point 5 (960, 540)

Test Person 1 (X / Y)    Point 1            Point 2            Point 3            Point 4             Point 5
Average                  120.63 / 83.12     1815.48 / 99.85    142.92 / 879.26    1790.90 / 906.50    962.86 / 505.29
Average - Reference      20.63 / -16.88     -4.52 / -0.15      42.92 / -100.74    -29.10 / -73.50     2.86 / -34.71
Standard Deviation       12.63 / 43.75      14.12 / 46.06      25.78 / 292.32     16.24 / 257.76      13.91 / 180.38

Test Person 2 (X / Y)    Point 1            Point 2            Point 3            Point 4             Point 5
Average                  177.78 / 100.13    1804.95 / 132.14   211.22 / 898.91    1765.94 / 910.33    969.00 / 527.95
Average - Reference      77.78 / 0.13       -15.05 / 32.14     111.22 / -81.09    -54.06 / -69.67     9.00 / -12.05
Standard Deviation       17.16 / 59.12      15.89 / 71.14      28.71 / 232.64     49.56 / 240.15      23.26 / 176.42

Test Person 3 (X / Y)    Point 1            Point 2            Point 3            Point 4             Point 5
Average                  181.38 / 61.71     1725.56 / 105.34   241.81 / 771.84    1620.09 / 808.41    978.54 / 495.10
Average - Reference      81.38 / -38.29     -94.44 / 5.34      141.81 / -208.16   -199.91 / -171.59   18.54 / -44.90
Standard Deviation       123.44 / 65.37     158.21 / 131.73    68.62 / 250.20     151.33 / 244.87     92.41 / 132.99

Test Person 4 (X / Y)    Point 1            Point 2            Point 3            Point 4             Point 5
Average                  76.86 / 46.46      1448.61 / 59.12    125.25 / 774.51    1436.80 / 707.70    755.38 / 437.14
Average - Reference      -23.14 / -53.54    -371.39 / -40.88   25.25 / -205.49    -383.20 / -272.30   -204.62 / -102.86
Standard Deviation       43.80 / 71.43      50.44 / 58.15      62.57 / 217.82     94.94 / 272.98      39.20 / 190.69

Figure 69 - Eye Tracking Data Analysis

Based on these findings, the following constraints were identified for use within the experiment.

Name                   Description

Side Areas             The quality of the data is more accurate in the centre of the screen than in the side areas. (Please note that there was a software update after the test which should improve the quality in the side areas.)

Identification Area    The minimum area to be detected shall not be less than 80 x 80 pixels.

Outliers               Within the experiment, outliers shall be identified and filtered (see the sketch below).

Contact Lenses         There could be a loss of quality for people with hard contact lenses (to be checked in the calibration phase of the experiment).

Glasses                There could be a loss of quality for people wearing glasses (to be checked in the calibration phase of the experiment).
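One simple way to implement the outlier filtering mentioned above is to discard gaze samples that lie far from the local median position. The following sketch (Python with numpy) is only illustrative; the robust median-absolute-deviation criterion and the threshold of 5 MADs are assumptions, not project-defined values.

    import numpy as np

    def filter_gaze_outliers(x, y, n_mads=5.0):
        """Keep gaze samples whose distance to the median position stays within
        n_mads median absolute deviations of the median distance (robust outlier test)."""
        x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
        dist = np.hypot(x - np.median(x), y - np.median(y))
        mad = np.median(np.abs(dist - np.median(dist)))
        keep = dist <= np.median(dist) + n_mads * (mad if mad > 0 else 1.0)
        return x[keep], y[keep]

    # Example: the implausible sample at (4000, -300) is dropped, the rest are kept.
    xs = [118, 121, 119, 4000, 120, 122]
    ys = [84, 82, 85, -300, 83, 81]
    print(filter_gaze_outliers(xs, ys))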

Taking the points mentioned above into account, the eye-tracking system provides sufficiently accurate information. It is therefore recommended for use within the experiment.


Visualisation of the eye-tracking information for each test person: The following graphs show the measured gaze points for each test person. The test persons were selected to cover different factors (female/male, contact lenses, glasses) in order to identify factor-specific issues.

Test Person 1: Female, no glasses, no contact lenses

Figure 70 - Test Person 1 – Eye Tracking

Test Person 2: Male, no glasses, no contact lenses


Figure 71 - Test Person 2 – Eye Tracking

Test Person 3: Male, contact lenses

Figure 72 - Test Person 3 – Eye Tracking

Test Person 4: Male, glasses


Figure 73 - Test Person 4 – Eye Tracking

A.1.1 Kinect

Technology:

The Kinect sensor (also called a Kinect) is a physical device that contains cameras, a microphone array, and an accelerometer as well as a software pipeline that processes colour, depth, and skeleton data.

Figure 74 - Kinect sensor

Inside the sensor case, a Kinect for Windows sensor contains:


Figure 75 - Sensors included in the Kinect

Kinect Array specifications

Viewing angle                           43° vertical by 57° horizontal field of view
Vertical tilt range                     ±27°
Frame rate (depth and colour stream)    30 frames per second (FPS)
Audio format                            16-kHz, 24-bit mono pulse code modulation (PCM)
Audio input characteristics             A four-microphone array with 24-bit analogue-to-digital converter (ADC) and Kinect-resident signal processing including acoustic echo cancellation and noise suppression
Accelerometer characteristics           A 2G/4G/8G accelerometer configured for the 2G range, with a 1° accuracy upper limit

Table 19 - Technical specifications of the Kinect

Setup:
• Kinect mounted above the screen
• 24" wide screen (Full HD)
• 75 cm table height
• 100 cm average distance to the Kinect
• 18 cm screen height


Figure 76 - Test Setup - Kinect

Evaluation:
The evaluation was performed by measuring the real angles and distances and comparing them to the Kinect measurements for four people. As a starting point, the distances were measured to identify the point at which the head pose is lost and the point at which the Kinect sensor loses tracking entirely, starting at 150 cm. The angles were measured at a distance of 100 cm (based on the minimum distances measured during the test).


(The figure illustrates the measurement setup: the head-pose angles α and β relative to the Kinect, and the distances at which measuring starts, the head pose is lost and tracking is lost.)

Figure 77 - Evaluation of distances and angle

Based on the collected data, the following information can be summarised:
• Deviation of the 0° head pose
• Deviation of the 45° head pose
• Deviation of the -45° head pose
• Average angle of all test persons
• Standard deviation of all test persons
• Average angle – Reference point
• Distance at which the head pose is lost
• Distance at which tracking is lost


Evaluation Result:
The measured values are summarised in Table 20 below. Based on these findings, the following constraints were identified for use within the experiment.

                       Losing Head Pose   Losing Tracking   0°       α = 45°   β = -45°
                       cm                 cm                °        °         °
Test Person 1          100                50                0.6      42        37
Test Person 2          74                 62                2        41        41
Test Person 3          75                 47                0.3      38        40
Test Person 4          70                 48                -1       44        43
Average                79.75              51.75             0.97     41.25     40.25
Average - Reference    N/A                N/A               -0.97    3.75      4.75
Standard Deviation     13.67              6.95              0.91     2.50      2.50

Table 20 - Kinect Results
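As a worked check of the summary rows, the following sketch (Python with numpy; an illustration only, not the original analysis script) recomputes the average, the magnitude of the deviation from the 45° reference, and the sample standard deviation for the two angle columns of Table 20:

    import numpy as np

    # Measured head-turn angles of the four test persons (Table 20), as magnitudes.
    alpha_45 = np.array([42.0, 41.0, 38.0, 44.0])   # α = 45° head pose
    beta_45  = np.array([37.0, 41.0, 40.0, 43.0])   # β = -45° head pose (magnitudes)

    for name, values in [("alpha", alpha_45), ("beta", beta_45)]:
        avg = values.mean()
        dev_from_ref = abs(45.0 - avg)       # reported as a magnitude in Table 20
        std = values.std(ddof=1)             # sample standard deviation
        print(f"{name}: average={avg:.2f}, |average - reference|={dev_from_ref:.2f}, std={std:.2f}")

    # Output: averages 41.25 and 40.25, deviations 3.75 and 4.75, standard deviations 2.50 and 2.50,
    # matching the corresponding summary rows of Table 20.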

Name                    Description

Maximum angle           The angle of the head (alpha or beta) should not exceed 60°. Beyond this angle the Kinect is not able to track the head of the person (see the check sketched below).

Closest position        The minimum distance of the user should always be at least 80 cm. Below this distance the Kinect is not able to track a person's head.

Tracking environment    To get the best tracking results, additional people in the background of the tracked person should be avoided. Furthermore, the room should provide good lighting conditions.

Tilt angle              The tilt angle of the Kinect should be adjusted individually for each test person to get the best tracking results.
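The two numeric constraints above can be checked automatically before a head-pose sample is used. A minimal sketch in Python follows; the HeadSample container and the field names are illustrative assumptions and not part of the prototype's actual interfaces, only the thresholds come from the constraints listed above.

    from dataclasses import dataclass

    MAX_HEAD_ANGLE_DEG = 60.0        # beyond this, the head pose is unreliable
    MIN_DISTANCE_CM = 80.0           # closer than this, head tracking is lost
    RECOMMENDED_DISTANCE_CM = 100.0  # recommended minimum distance to the sensor

    @dataclass
    class HeadSample:
        """Illustrative container for one Kinect head-pose sample."""
        yaw_deg: float        # alpha/beta head rotation
        distance_cm: float    # distance of the head to the sensor

    def is_usable(sample: HeadSample) -> bool:
        """True if the sample satisfies the tracking constraints identified above."""
        return (abs(sample.yaw_deg) <= MAX_HEAD_ANGLE_DEG
                and sample.distance_cm >= MIN_DISTANCE_CM)

    print(is_usable(HeadSample(yaw_deg=42.0, distance_cm=105.0)))   # True
    print(is_usable(HeadSample(yaw_deg=70.0, distance_cm=105.0)))   # False: angle too large
    print(is_usable(HeadSample(yaw_deg=10.0, distance_cm=70.0)))    # False: too close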

The Kinect component shows a standard deviation of 2.5 degrees for the head-pose angles. A minimum distance of 1 m to the equipment should be maintained to ensure the best results. Taking the points mentioned above into account, the Kinect can be used within the experiment.


A.1.2 Speech Recognition

Technology:
In computer science, speech recognition is the translation of spoken words into text. Speech recognition only implies that the computer can take dictation, not that it understands what is being said. This process is important in the Sixth Sense context because it provides a fairly natural and intuitive way of controlling the ATMS while allowing the user's hands to remain free. The difficulty in using voice as an input method lies in the fundamental differences between human speech and the more traditional forms of computer input.

Setup:
• USB headset with Push-to-Talk (PTT)
• Separate Speech Recognition component

(The figure shows the test setup: the experimental CWP with the simulator and ground positions, the PTT headset and the separate speech recognition component.)

Figure 78 - Test Setup – Speech Recognition

Evaluation:
The evaluation was carried out by simulating ATM commands and observing the recognised results. It was performed with 10 users in order to cover a variety of accents, using a setup that included only the speech recognition component and the ground position. During the evaluation, only the callsign identification was taken into account. The following information was recorded:
• Callsign Recognised
• Callsign Not Recognised
• Callsign Wrongly Recognised

Overall: 775 ATM commands including a callsign have been observed.


Based on the collected data, the recognition rate of the speech recognition component has been determined as follows.

Evaluation Result: Please find below the summary of the collected information.

Callsign Recognised    Callsign Not Recognised    Callsign Wrongly Recognised    TOTAL

747                    28                         0                              775

96%                    4%                         0%                             100%


Figure 79 - Callsign Recognition Rate – Speech Recognition
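For completeness, the recognition rate above can be reproduced from the per-command outcomes with a few lines of code. The sketch below (Python; the outcomes list is a hypothetical stand-in for the 775 logged commands, populated with the counts reported above) illustrates the calculation:

    from collections import Counter

    # Hypothetical outcome log: one entry per observed ATM command containing a callsign.
    # In the experiment, 747 commands were recognised, 28 were not, and 0 were wrongly recognised.
    outcomes = ["recognised"] * 747 + ["not_recognised"] * 28 + ["wrongly_recognised"] * 0

    counts = Counter(outcomes)
    total = sum(counts.values())     # 775 commands in total
    for label in ("recognised", "not_recognised", "wrongly_recognised"):
        share = counts[label] / total
        print(f"{label}: {counts[label]} ({share:.0%})")
    # recognised: 747 (96%), not_recognised: 28 (4%), wrongly_recognised: 0 (0%)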

Based on these findings, the following constraints were identified for use within the experiment.

Name                 Description

Microphone Volume    It is important that the microphone volume is correctly adjusted, as otherwise the recognition rate could be affected.

PTT                  During the exercise it is important that the test person uses the PTT button as in real operation, because pressing it triggers the speech recognition.

Taking the points mentioned above and the analysed findings into account, the speech recognition system provides accurate information (96% recognition rate) for callsign identification. It is therefore recommended for use within the experiment.


Appendix B Questionnaires


- END OF DOCUMENT -
