
Automatic Detection from Face, Physiology, Task Performance and Fusion during Affective Interference

M. Sazzad Hussain a,b,1, Rafael A. Calvo b, Fang Chen a

a National ICT Australia (NICTA), Australian Technology Park, Eveleigh, NSW 1430, Australia
b School of Electrical and Information Engineering, University of Sydney, NSW 2006, Australia

Abstract.

Cognitive load is experienced during critical tasks; at the same time, emotional states may be induced either by the task itself or by extraneous experiences. Emotions irrelevant to the task representation may interfere with the processing of the relevant task and can influence task performance and behavior, making the accurate detection of cognitive load from nonverbal information challenging. This paper investigates automatic cognitive load detection from facial features, physiology and task performance under affective interference. Data was collected from participants (n=20) solving mental arithmetic tasks with emotional stimuli in the background, and a combined classifier was used for detecting cognitive load levels. Results indicate that the face modality was more accurate for cognitive load detection under affective interference, whereas physiology and task performance were more accurate without affective interference. Multimodal fusion improved detection accuracies, but was less accurate under affective interference; more specifically, accuracy decreased with increasing intensity of emotional arousal.

Keywords: Cognitive load measurement, affect, machine learning, multimodality, data fusion, HCI.

Acknowledgement

M. Sazzad Hussain was supported by Australia Award and National ICT Australia (NICTA). NICTA is funded by the Australian Government as represented by the Department of Broadband, Communications and the Digital Economy and the Australian Research Council through the ICT Centre of Excellence program.

1 Corresponding author. Tel: 61435009691. Email address: [email protected]

1. Introduction

Understanding the various aspects of human psychology, along with the theories of computer science, is a burgeoning concern in the field of human-computer interaction (HCI) (Card, Moran, & Newell, 1983). New sensor technologies, such as low-cost physiological sensors, webcams, GPS, Internet tracking, etc., can be used for gathering large volumes of multimodal data. Machine learning techniques are applied to analyze these data, find patterns and make predictions related to user experience. The information obtained is then employed for designing human-centered systems that can adapt to users' psychological states, generate responses and provide assistance accordingly. Researchers in the area (Duric et al., 2002; Hollender, Hofmann, Deneke, & Schmitz, 2010; Pantic & Rothkrantz, 2003; Schultz, Peter, Blech, Voskamp, & Urban, 2007; Sharma, Pavlovic, & Huang, 1998) are investigating behavioral, perceptual, attentional, cognitive and affective aspects of human psychology as well as their interdependencies, with the goal of improving usability, user experience, performance and learning. Psychological phenomena such as cognition and affect, in particular, are becoming increasingly important topics of research, as they are needed for making systems more aware of the user's inner states of mind (Calvo & D'Mello, 2010; F. Chen et al., 2012).

Scenarios such as air traffic control rooms, traffic management centers and customer call centers involve complicated tasks for operators to manage and solve. For instance, air traffic control has been recognized as a complex job in which excessive workload can overwhelm controllers and compromise the safety of air travel (Sperandio, 1978). Cognitive load (CL) influences task performance; therefore, the ability to measure it in real time can support users affected by cognitive overload through personalized adaptive systems (F. Chen et al., 2012; de Tjerk, Henryk, & Neerincx, 2010). Such systems can help users not only achieve high productivity but also avoid task-related stress and frustration that could lead to errors and hazardous situations.

Complicated tasks can induce high cognitive load and may also introduce other psychological factors such as emotions. Conversely, personal feelings during critical task activities may influence the experience of CL and introduce overload. The interference of such factors makes CL detection more challenging (S. Chen & Epps, 2012). This paper investigates the automatic detection of CL levels from multimodal information under the influence of affective components (e.g. arousal).

1.1. Cognitive Load Detection: Theory and Techniques

The term 'cognitive load' is used to refer to the load caused by the executive control of working memory (F. Paas, Tuovinen, Tabbers, & Van Gerven, 2003). The limited capacity and duration of working memory available during task activities results in the experience of CL (F. Chen et al., 2012; S. Chen & Epps, 2012; Huang & Tettegah, 2010; F. Paas et al., 2003). The finite working memory can become either underloaded or overloaded depending on the amount of information processed simultaneously (Cowan, 2001). CL is experienced all the time, but its impact is more severe under critical conditions; for instance, higher CL may be experienced if the task is complex. Research in Cognitive Load Theory (CLT) has focused on identifying instructional designs that create unnecessary working memory load, such as extraneous load, in order to provide more effective alternatives (Ayres & Gog, 2009; Kirschner, Ayres, & Chandler, 2011). Sustaining an optimal CL is important for minimizing errors during critical tasks (e.g. driving, air traffic control) as well as for maintaining performance (e.g. customer service, call center operation).

Analytical and empirical methods can be used for assessing CL in general (F. Paas et al., 2003). CL is a demand that users experience, while mental effort is what users actually exert in response (Jong, 2010). Self-rating techniques used to report the experienced mental effort have been shown to be quite reliable based on numeric indicators (Gopher & Braune, 1984; Odonnell & Eggemeier, 1986). For realistic situations, however, this approach may be impractical, because questionnaires can interrupt the flow of tasks and even add more tasks for the potentially overloaded user. Techniques employing task performance include the concurrent measurement of primary and secondary task performance (Odonnell & Eggemeier, 1986). Hockey (2003) proposed a model of the relationship between performance and workload: higher performance can still be achieved within the 'effort' region, but high CL results in a greater chance of errors, with performance beginning to decline. Indicators of performance include reaction time, accuracy and error rate.

Besides these techniques, some studies have explored acoustic and prosodic patterns in speech (e.g. pitch variation) as cues reflecting CL (Le, Ambikairajah, Epps, Sethu, & Choi, 2011; Le, Epps, Choi, & Ambikairajah, 2010; Yin, Chen, Ruiz, & Ambikairajah, 2008); the rate of pitch peaks and pauses has proven to be a good indicator of CL. Behavioral measures, such as eye-gaze tracking (Gütl et al., 2005), mouse (Ark, Dryer, & Lu, 1999) and keyboard inputs, digital pens, etc., have also been explored to find patterns indicative of CL. Changes in cognitive functioning are also reflected in physiological variables, based on the assumption that any change in cognitive functioning is mirrored in human physiology. Physiological measures such as heart rate variability, skin response, brain signals and pupil dilation have been explored and reported in the past (Beatty & Lucero-Wagoner, 2000; Berka et al., 2007; Kramer, 1991; F. G. W. C. Paas & Van Merriënboer, 1994; Shi, Ruiz, Taib, Choi, & Chen, 2007; Stanners, Coulter, Sweet, & Murphy, 1979). These studies have demonstrated that physiological data can correlate (more or less, depending on the type of signal) with the level of stimulation experienced and can also represent various levels of mental effort.

1.2. Interdependencies between Cognition and Affect

Behavioral and physiological expressions of feelings, along with factors such as moods, attitudes, affective styles, and temperament, have been described as affective phenomena (Calvo & D'Mello, 2010). In the context of this study, affect (i.e. emotion) refers to the feeling of positive or negative pleasantness (i.e. valence) and its intensity (i.e. arousal). Cognition and affect are regarded as two interrelated aspects of human functioning that cannot be seen as completely separate phenomena (Huang & Tettegah, 2010; Immordino-Yang & Damasio, 2007). Immordino-Yang and Damasio (2007) provided a theoretical framework showing the interdependencies between cognition and emotions, and there is also a close relationship between the operation of working memory and affective components (Kalyuga, 2011). Studies have further suggested that the interdependencies between cognition, affective states and other aspects of human functioning (e.g. behavior, perception) are essential considerations for building adaptive intelligent HCI systems (Duric et al., 2002).

A number of techniques have previously been investigated for affect detection (Calvo & D'Mello, 2010), while research in the area of CL measurement and detection has progressed separately (F. Paas et al., 2003). Both communities have used self-reports as well as behavioral, physiological and multimodal measures for finding evidence of affect and CL separately. Despite these distinct research efforts and progress in automatic affect and CL detection, some studies have emphasized the importance of both, depending on the application, especially in HCI. Schultz et al. (2007) described the importance of CL and emotion detection for usability studies and proposed a framework for investigating ongoing emotions and high CL. The framework, called RealEYES, is intended to record and analyze multimodal information such as video and audio streams, gaze, mouse, keyboard, event data, and physiological data, and also integrates machine learning techniques over the resulting dataset. More recently, Huang and Tettegah (2010) emphasized the importance of CL and emotions in designing serious games and proposed a conceptual framework. Their study investigated both the presence and absence of empathy and their impact on CL, with the ultimate goal of enhancing learning by focusing on how players can reduce CL based on its interaction with empathy.

1.3. Cognitive Load and Affective Interference

Although positive affect related to the task activity has been reported to be useful, for instance during learning (Kalyuga, 2011), artifacts (e.g. arousal) that are irrelevant to the working memory representations can interfere and conflict with the processing of the relevant primary task (D'Esposito, Postle, Jonides, & Smith, 1999; Levens & Phelps, 2008). The presence of such artifacts alongside working memory demands has been seen as a fundamental problem in CL detection, especially when using nonverbal information such as behavioral and physiological modalities (S. Chen & Epps, 2012). For example, as part of human nature, affective components may arise during tasks due to the nature of the task or to personal feelings. Complicated tasks will drive high CL and also introduce affects (e.g. frustration, confusion). The Yerkes-Dodson law (Yerkes & Dodson, 1908) provides an empirical relationship between arousal and performance, where performance increases with arousal up to a certain point and then decreases. Moreover, affects related to personal feelings (e.g. anger, anxiety) during a task will also impact the experience of CL. It has been reported that affective factors are likely to have an effect on task performance (Blair et al., 2007; Erthal et al., 2005), and even to influence modalities such as physiological responses (Bradley, Miccoli, Escrig, & Lang, 2008; S. Hussain, Chen, Calvo, & Chen, 2011; Picard, Vyzas, & Healey, 2001). This introduces the necessity of investigating nonverbal modalities and features that are more sensitive to CL than to affect (and vice versa), evaluating appropriate machine learning techniques, and identifying instructional designs that can not only reduce unnecessary working memory load but also eliminate affective factors.

If affective and CL aspects of information are processed in parallel, creating simultaneous demands on working memory resources from both the affect and the task, then information-processing biases could occur as a function of attending to the affective or the CL aspects of the information. Attending to information with associated affect could interfere with the task because of the limited working memory available for processing information. This phenomenon is referred to as 'affective interference' (Siegle, Ingram, & Matt, 2002). The impact of affective components during tasks and how they influence performance has been investigated in some studies (Blair et al., 2007; Erthal et al., 2005). However, research evaluating the interference of affective components on other modalities (e.g. physiological) during CL tasks is scarce. Some exceptions include studies (Bradley et al., 2008; S. Chen & Epps, 2012; Stanners et al., 1979) that have used pupil and eye features as indicators of cognition with affect.

1.4. Cognitive Load Detection under Affective Interference

Hussain et al. (2011) investigated the automatic classification of two levels of CL from task performance and physiological responses, and their fusion, during affective interference. This paper expands that study to further explore CL detection from multimodal features. Multimodal data was collected in a systematic experiment while participants performed mental arithmetic tasks under affective interference stimulated by photos. The difficulty of the task was varied so as to impose a changing load on the limited working memory. The stimuli (i.e. photos) were simultaneously presented to induce varying intensities (i.e. arousal) of positive and negative affect. In the context of this paper, the affective states induced were irrelevant to the task activity, in order to investigate the affective interference. Features from the face (color and movement), physiological channels (heart activity, skin response, breathing pattern, and eye activity), and task performance (reaction time and accuracy) are used with machine learning methods for automatically detecting CL levels during affective interference. Feature selection (correlation based) and classification (combined classifier) techniques are applied for evaluating automatic CL detection from the multimodal information. The following questions are investigated and could be useful for implementing more intelligent HCI systems:

a. What is the accuracy of CL detection from face, multichannel physiology, task performance and their fusion? As a way to study the relationship between CL detection and affect we add interference, so what is the accuracy under affective interference?

b. What is the accuracy for detecting two (low, high), three (low, medium, high), and five (very low, low, medium, high, very high) levels of CL respectively?

c. Features are automatically selected to ensure that they are the most useful for the CL classification algorithms. How do these features change when we add the affective interference? Are there any features that are common between the two conditions?

d. What is the accuracy of CL detection during discrete intensity levels of emotional arousal (e.g. high arousal)?

Sections 2 and 3 explain the experiment for collecting the data and the computational framework respectively. The results are presented in section 4, followed by discussion and conclusions in section 5.

2. Study and Data Collection

The purpose of this study was to collect physiological signals and facial videos in response to CL tasks under the interference of emotional stimuli. The study was approved by the ethics committee of the University of New South Wales prior to the data collection. The data was recorded while participants performed arithmetic tasks under affective interference stimulated by photos from the International Affective Picture System (IAPS) (Lang, Bradley, & Cuthbert, 2005). IAPS contains a collection of color photographs with normative ratings of emotion (valence, arousal, dominance), providing a set of stimuli for studying emotion and attention. Using the Self-Assessment Manikin (SAM) (Lang et al., 2005), subjects rated the pictures on the dimensions of valence, arousal, and dominance; these ratings were used to establish the normative ratings in the collection. Figure 1 shows a sample of the experimental setup.

2.1 Participants

The study was advertised to approximately 300 students and staff at the National ICT Australia (NICTA) and the first 20 respondents (11 males and 9 females, whose age ranged from 22 to 48) were selected to take part. Participants were healthy and were instructed not to take any drugs and to avoid caffeine prior to the experiment. They signed an informed consent prior to the experiment.

Figure 1. Sample experimental setup and sensors

2.2 Sensors and Recording

A Logitech Webcam Pro 9000 placed on top of the screen, facing the participant, was used to record their faces. Videos were recorded with the CaptureFlux software in color (24-bit RGB with 3 channels, 8 bits/channel) at 15 frames per second (fps) with a resolution of 640×480 pixels. Electrocardiogram (ECG), skin conductivity (SC), and respiration (RESP) sensors were placed on participants, and a BIOPAC MP150 system with AcqKnowledge software was used to acquire the physiological signals at a sampling rate of 1000 Hz for all channels. Two electrodes were placed on the wrists for collecting ECG. SC was recorded from the index and middle fingers of the left hand, and a respiration band was strapped around the chest. Eye activity was recorded using a faceLAB (Seeing Machines) remote eye tracker with a sampling rate of 60 Hz. Participants were free to move their heads but were instructed to keep their sight within the screen display area.

2.3 Protocol

2 CaptureFlux: http://paul.glagla.free.fr/captureflux_en.htm
3 BIOPAC: http://www.biopac.com/data-acquisition-analysis-system-mp150-system-windows
4 FaceLab: http://www.seeingmachines.com

Each trial took approximately 15 minutes of preparation (consent forms, sensor setup and explanation of the experiment protocol) prior to the experiment. Physiological signals, eye activity, facial video, and task performance were recorded during the entire duration of the trials. A short training was presented prior to the experiment, and participants were given a chance to practice in a demo session. The study lasted approximately 35 minutes, during which participants completed mental arithmetic tasks with emotionally stimulating images in the background. As part of the activity, participants were asked to sum four numbers that were sequentially displayed on screen and then to select the correct answer from 10 choices. Each task was presented for 12 seconds before the answer choices were prompted. Participants were instructed to be accurate and to respond as fast as possible. A self-report was collected after each task on a 9-point scale, from 1 (easy) to 9 (difficult). This was followed by a 5-second pause showing a blank screen with a fixation cross just before the next task. Figure 2 presents the arithmetic task presentation protocol for each activity.

Figure 2. Arithmetic task presentation protocol. The gray background was replaced with IAPS images to induce the intensities of arousal during task activity.

The number of digits of the addends and the number of carries produced during addition determined the level of task difficulty. Levels 1, 2 and 3 each used one-digit addends, producing no carry, one carry and two carries during the addition process respectively; levels 4 and 5 used two-digit addition, with one carry produced only at the lower digit in level 4 and carries generated at both digits in level 5. Each participant completed a total of 7 sessions with a short break in between. In every session they were presented with the 5 difficulty levels twice, and the sequence of the tasks was randomized (Klingner, Tversky, & Hanrahan, 2010). The first session used a plain grey background (task without affective interference), while the other 6 sessions used different images from 6 categories (positive and negative valence with three degrees of arousal) in the background. The images were selected from IAPS according to the normative ratings of valence and arousal to induce emotions, and they were presented based on the affective circumplex model (Russell, 1980). Table 1 shows examples of the images used in each affect category for elicitation. Each participant completed a total of 70 tasks (5 difficulty levels × 2 × 7 sessions).

Table 1: Examples of images from IAPS used as stimuli for affect elicitation.

Affect Category (Arousal/Valence)   Image Examples
High/Positive                       Erotic couple, fireworks, adventure
Medium/Positive                     Cute animals, baby, children
Low/Positive                        Animals, landscape, flower
High/Negative                       Snake, spider, surgery, death
Medium/Negative                     Guns, vomit, roach, crying people
Low/Negative                        Rocks, dustpan, garbage
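The difficulty manipulation described above (digit count and number of carries) can be illustrated with a small stimulus-generation sketch. This is not the authors' task-generation code: the function name, the use of rejection sampling, and the reading of 'carries' as the total number of column carries over the running sum are all illustrative assumptions; the two-digit levels (4 and 5) would additionally need to constrain the column in which each carry occurs.

```matlab
function addends = generateTask(numDigits, targetCarries)
% GENERATETASK  Draw four addends with the given digit count such that the
% sequential addition produces the requested number of column carries.
% Illustrative sketch only (rejection sampling; may loop if the target is
% unreachable for the given digit count).
    lo = 10^(numDigits - 1);                 % 1 for one-digit, 10 for two-digit
    hi = 10^numDigits - 1;                   % 9 or 99
    while true
        addends = randi([lo hi], 1, 4);      % four addends shown sequentially
        if countCarries(addends) == targetCarries
            return;
        end
    end
end

function c = countCarries(addends)
% Count the column carries produced while summing the addends one at a time.
    c = 0;
    total = addends(1);
    for k = 2:numel(addends)
        a = total; b = addends(k); carry = 0;
        while a > 0 || b > 0
            s = mod(a, 10) + mod(b, 10) + carry;
            carry = s >= 10;
            c = c + carry;
            a = floor(a / 10); b = floor(b / 10);
        end
        total = total + addends(k);
    end
end
```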

3. Computational Framework

A stimulus and task presentation program was implemented in MATLAB, which was used during the experiment. The computational framework (also in MATLAB) for feature extraction, feature selection, and classification was implemented with the support of in-house and third party codes and toolboxes (Figure 3). The Raw Data Inspector & Computational Interface allows synchronization of the physiological channels, video, and reports for data visualization and annotation. A brief description of the three main modules as part of the framework is provided in the following subsections.

Figure 3. MATLAB based computational framework for feature extraction, feature selection and classification.

3.1 Feature Extraction

Feature vectors were calculated and extracted from the time window corresponding to the duration of the task presentation (12 seconds). Initially, each feature vector was labeled with the corresponding level of task difficulty (1 to 5). Two further sets of feature vectors were then replicated from this one to obtain two and three CL levels respectively. Based on the number of digits of the addends presented as part of the task difficulty, levels 1, 2, 3 (one-digit addends) and levels 4, 5 (two-digit addends) were grouped and relabeled as low and high respectively to obtain two CL levels. To obtain three CL levels, the two extremes (levels 1, 2 and levels 4, 5) were grouped and relabeled as low and high respectively, and level 3 was relabeled as medium. For five CL levels, levels 1, 2, 3, 4, and 5 were labeled as very low, low, medium, high, and very high respectively. A total of 298 features were extracted from the face video, physiological channels and task performance.

(i) Face Features. Two types of image-based features (geometric and chromatic) were extracted from the face video (Tian, Kanade, & Cohn, 2002) using MATLAB and the Open Computer Vision library (OpenCV). An extended boosting cascade classifier (Haar classifier) (Viola & Jones, 2001) was used for tracking the face and extracting the related geometric features. The geometric and chromatic features (Table 2) capture the position and color of the face in each frame, which vary with head movement and illumination changes. Dynamic features were then calculated from these image-based features: statistical features such as mean, median, standard deviation, minima and maxima were computed over the window, and actions of the user's head, such as tilting, nodding and leaning forward or backward, were detected from the difference between the values of the first and last frames of the time window. A total of 115 unique features were extracted from the video (59 geometric and 56 chromatic).

Table 2. Description of facial channels and features.

Name        Description
color       Chromatic channel
move        Geometric channel
R, G, B     Red, Green, Blue
WB          White balance
X, Y        X-, Y-coordinate of detected face
W, A        Width, area of detected face
Movement    Distance between the positions of the face in frames
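As a rough illustration of how the window-level statistical and movement features are derived from the per-frame measurements in Table 2, consider the sketch below. The variable frames, its column layout, and the last-minus-first movement convention are assumptions for illustration, not the authors' implementation.

```matlab
% frames: [nFrames x nChannels] matrix for one 12-second task window,
% e.g. columns = R, G, B, WB, X, Y, W, A (layout assumed for illustration).
stats   = [mean(frames); median(frames); std(frames); min(frames); max(frames)];
featVec = stats(:)';                       % flatten into a single feature vector

% Head-movement style features: change over the window, approximated here by
% the difference between the last and first frames (sign convention illustrative).
delta   = frames(end, :) - frames(1, :);
featVec = [featVec, delta];
```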

(ii) Physiological Features. Pupil and eye movement features (e.g. pupil size, fixation, saccade) (Bradley et al., 2008) were extracted from the data collected with the eye tracker, and blink features (number of blinks, blink duration) were extracted from the video recorded with the webcam. The ECG, SC, and RESP signals were downsampled to 200 Hz to reduce computation. The Augsburg Biosignal Toolbox (AuBT) was used for extracting statistical features from the ECG, SC and RESP channels. Features common to all channels, such as mean, median, standard deviation, frequency spectrum range, ratio, minimum and maximum, as well as features related to the signal type (e.g. heart rate variability, respiration pulse, frequency), were calculated and extracted. Table 3 presents a description of the physiological features. A total of 181 unique features were extracted from the five physiological channels (84 ECG, 21 SC, 67 RESP, and 9 eye).

Table 3. Description of physiological channels and features.

Name            Description
pupil           Pupil dilation
blk             Blinks
ecg             Electrocardiography
sc              Skin conductivity
rsp             Respiration
1Diff           Approximation of first derivative
2Diff           Approximation of second derivative
pulse           Pulse signal
amlp            Amplitude signal
P, Q, R, S, T   p-, q-, r-, s-, t-peaks (ecg)
hrv             Heart rate variability
Fix             Fixation
Sacs            Saccades

5 OpenCV: opencv.willowgarage.com/wiki/
6 AuBT: informatik.uni-augsburg.de/de/lehrstuehle/hcm/projects/tools/aubt/

(iii) Performance Features. Two performance features were extracted from the experiment logs: reaction time and accuracy score. Reaction time (reacTime) is the time taken to select an answer while the multiple-choice answers were displayed on the screen. Accuracy score (accScore) indicates whether the answer was right or wrong.
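The relabeling of the five task-difficulty levels into the 2L, 3L and 5L label sets described at the start of this section can be expressed as a simple lookup; the sketch below assumes difficulty is a vector holding one level (1-5) per task instance.

```matlab
% Map each task's difficulty level (1..5) onto the three label schemes.
map2L = {'low', 'low', 'low', 'high', 'high'};            % levels 1-3 vs. 4-5
map3L = {'low', 'low', 'medium', 'high', 'high'};         % extremes grouped
map5L = {'veryLow', 'low', 'medium', 'high', 'veryHigh'}; % one label per level

labels2L = map2L(difficulty);   % cell arrays of class labels, one per instance
labels3L = map3L(difficulty);
labels5L = map5L(difficulty);
```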

3.2 Fusion and Normalization

The effect of multichannel or multimodal fusion can be evaluated by integrating features from the channels or modalities. This was done simply by merging (concatenating) the features, an approach known as feature-level fusion. All features from the chromatic and geometric channels were considered the face modality, all features from the physiological channels the physio modality, and the performance features the perf modality. The fusion model contained the features from all three modalities. To address individual variations, z-score standardization (equation 1) was applied to all features over all subjects:

$$z_i = \frac{x_i - \mu}{\sigma}, \qquad i = 1, \dots, n \qquad (1)$$

where $\mu$ is the mean and $\sigma$ the standard deviation over the $n$ instances.
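A minimal sketch of the feature-level fusion and the standardization of equation (1) is given below; Xface, Xphysio and Xperf are assumed placeholder names for the per-modality feature matrices ([instances × features], rows aligned across modalities).

```matlab
% Feature-level fusion: simply concatenate the modality feature matrices.
Xfusion = [Xface, Xphysio, Xperf];

% z-score standardization (equation 1), applied per feature over all instances.
mu    = mean(Xfusion, 1);
sigma = std(Xfusion, 0, 1);
Z     = (Xfusion - repmat(mu, size(Xfusion, 1), 1)) ./ ...
        repmat(sigma, size(Xfusion, 1), 1);
% Recent MATLAB releases can use zscore(Xfusion) or implicit expansion instead.
```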

3.3 Feature Selection

Feature selection techniques are used for choosing the best subset of features, i.e. those most strongly correlated with the class information. The process removes redundant and noisy features and reduces the dimensionality of the feature space. Feature selection was implemented in MATLAB using the DMML wrapper for Weka (M. Hall et al., 2009). Correlation-based feature selection (CFS) was used for choosing the best subset of features, the same technique applied in our previous paper (S. Hussain et al., 2011). Feature selection was performed separately for each individual modality and for the fusion. The CFS technique evaluates the merit of a subset of features by considering the predictive ability of each individual feature along with the level of redundancy between them (M. A. Hall, 1999). Equation (2) defines the merit of a feature subset S consisting of k features:

$$\mathrm{Merit}_S = \frac{k\,\overline{r_{cf}}}{\sqrt{k + k(k-1)\,\overline{r_{ff}}}} \qquad (2)$$

where $\overline{r_{cf}}$ is the average feature-class correlation and $\overline{r_{ff}}$ is the average feature-feature correlation. The subset with the highest merit, as measured by equation (2), found during the search is used to reduce the dimensionality of both the original training data and the testing data. The CFS criterion is defined by equation (3), where $r_{cf_i}$ and $r_{f_i f_j}$ are the feature-class and feature-feature correlation variables:

$$\mathrm{CFS} = \max_{S_k} \frac{r_{cf_1} + r_{cf_2} + \dots + r_{cf_k}}{\sqrt{k + 2\,(r_{f_1 f_2} + \dots + r_{f_i f_j} + \dots + r_{f_k f_1})}} \qquad (3)$$

7 DMML: featureselection.asu.edu/software.php
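As an illustration of the criterion in equation (2), the sketch below computes the merit of one candidate feature subset. It is not the Weka/DMML implementation: Pearson correlation (the Statistics Toolbox corr function) stands in for the feature-class and feature-feature correlation measure, and the heuristic search over subsets is omitted.

```matlab
function m = cfsMerit(X, y, subset)
% CFSMERIT  Merit of a candidate feature subset as in equation (2).
% X: [instances x features] matrix, y: numeric class vector,
% subset: indices of the k candidate features. Illustrative only.
    Xs  = X(:, subset);
    k   = numel(subset);
    rcf = mean(abs(corr(Xs, y)));              % mean feature-class correlation
    C   = abs(corr(Xs));                       % feature-feature correlation matrix
    if k > 1
        rff = (sum(C(:)) - k) / (k * (k - 1)); % mean off-diagonal correlation
    else
        rff = 0;
    end
    m = (k * rcf) / sqrt(k + k * (k - 1) * rff);   % equation (2)
end
```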

3.4 Classification

Combining classifiers aims to provide a more accurate and efficient classification decision. An individual classifier may do well on a certain modality, but it is challenging to generalize one classifier across multiple channels or modalities. Instead of just one classifier, a set of classifiers (base classifiers) can be considered along with the best subset of features. Combining classifiers is suitable in CL studies where features come from multiple modalities and individuals in varying environmental setups (M. S. Hussain, Monkaresi, & Calvo, 2012). The advantage is that the accuracy of the combined classifier will on average be better than that of the base classifiers (Kuncheva, 2004; Utthara, Suranjana, Sukhendu, & Pinaki, 2010). The Vote classifier (Kuncheva, 2004) is used for classification in this study, considering three types of base classifiers: lazy, function, and tree. First, the three base classifiers are evaluated: k-nearest neighbor (KNN), support vector machine (SVM), and decision trees (J48). SVM, KNN, and J48 are particularly popular because of their performance and compatibility with a variety of applications (Nguyen, Li, Bass, & Sethi, 2005). These supervised learning algorithms are simple to implement and span a variety of machine learning paradigms (function, lazy, trees), making them appropriate base classifiers to address the diversity of features and variety of subjects. The final classification results are achieved with the Vote classifier. This meta-classifier uses the average probability rule to combine the probability distributions of the base classifiers, in this case SVM, KNN, and J48. The Vote classifier determines the class probability distribution by computing the mean probability distribution of the N base classifiers as follows (Seewald, 2003):

$$\mathrm{pred} = \frac{1}{N}\sum_{i=1}^{N} P_i \qquad (4)$$

where $P_i$ is the probability distribution given by classifier $i$. For majority voting, the prediction over the $j$ classes is obtained by using $P'_i$ instead of $P_i$ in equation (4), where $P'_i$ is the vector of $P'_{ij}$ for all $j$ and

$$P'_{ij} = \begin{cases} 1, & \text{if } j = \arg\max_j P_{ij} \text{ for the given } i \\ 0, & \text{otherwise} \end{cases} \qquad (5)$$
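Equations (4) and (5) amount to the following computation for a single test instance; the matrix probs and its layout are illustrative assumptions, not the MATLABArsenal/Weka interface.

```matlab
% Sketch of the average-probability rule (eq. 4) and the hard-vote mapping
% (eq. 5). probs is assumed to be an [N x J] matrix in which row i holds
% classifier i's probability distribution over the J classes for one instance.
predDist = mean(probs, 1);                % equation (4): mean class distribution
[~, predClass] = max(predDist);           % final prediction = argmax over classes

% Majority-vote variant: replace each P_i by the 0/1 indicator vector of eq. (5)
hard = zeros(size(probs));
[~, winners] = max(probs, [], 2);         % each base classifier's own argmax
hard(sub2ind(size(probs), (1:size(probs, 1))', winners)) = 1;
votedDist = mean(hard, 1);                % then combine as in equation (4)
```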

The classifiers were implemented using MATLABArsenal, a wrapper for the classifiers in Weka (M. Hall et al., 2009). CVParameterSelection, a meta-classifier in Weka, was used for estimating the parameter values of the classifiers; it performs the parameter estimation using cross-validation. In this study, K values of 1 and 3 were used for KNN classification. For SVM, an exponent of 1.0 (linear kernel) and a complexity factor of 1.0 with epsilon 1.0E-12 were used. The C4.5 decision tree (J48) was used with the confidence factor set to 0.25 and with subtree raising considered when pruning. The average probability rule was used for the Vote classifier.

8 MATLABArsenal: cs.siu.edu/~qcheng/featureselection/index.html

3.5 Training, Testing and Evaluation

For each analysis, the normalized datasets from all subjects were first merged into one larger dataset, generating the data for a combined model. All datasets were then shuffled (randomized). Training and testing were performed with 10-fold cross-validation (5-fold cross-validation was also evaluated and gave very similar, slightly lower, results). In 10-fold (k-fold) cross-validation, the dataset is randomly partitioned into 10 subsamples; a single subsample is retained as the validation data for testing the model, and the remaining 9 (k-1) subsamples are used as training data. The cross-validation process is then repeated 10 times (the folds), with each of the 10 subsamples used exactly once as the validation data, and the 10 fold results are averaged to produce a single estimate. The accuracy score is used for reporting the overall classification performance. The ZeroR classifier was used for calculating the baseline classification accuracy; any classification accuracy above this baseline indicates above-chance performance for the classifier.
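The evaluation loop described above can be sketched as follows; trainAndTest is a hypothetical placeholder for the Vote-classifier wrapper (not a built-in or MATLABArsenal function), and the fold assignment shown is just one simple way to realize the partitioning.

```matlab
% Sketch of 10-fold cross-validation over the merged, shuffled dataset.
% X: [instances x features], labels: cell array of class names per instance.
K   = 10;
n   = size(X, 1);
idx = randperm(n);                        % shuffle the instances
foldId = mod(0:n-1, K) + 1;               % assign each (shuffled) instance to a fold
acc = zeros(K, 1);
for f = 1:K
    testIdx  = idx(foldId == f);
    trainIdx = idx(foldId ~= f);
    % trainAndTest: hypothetical wrapper that trains the Vote classifier and
    % returns predicted labels for the test instances.
    pred   = trainAndTest(X(trainIdx, :), labels(trainIdx), X(testIdx, :));
    acc(f) = mean(strcmp(pred, labels(testIdx)));   % fold accuracy
end
meanAccuracy = mean(acc);                 % averaged over the 10 folds
```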

4. Results and Discussions

This section presents classification results for detecting two levels (2L), three levels (3L) and five levels (5L) of CL under the interference of arousal. Discrete arousal intensities, namely high arousal (highAr), medium arousal (medAr), low arousal (lowAr), and no arousal (noAr), are investigated. Initially, the self-reports were validated against the presented task difficulty. The average ratings across the 20 subjects for the five difficulty levels are: M=1.8 (SD=0.6) at level 1, M=2.3 (SD=0.8) at level 2, M=3.0 (SD=0.8) at level 3, M=4.1 (SD=1.0) at level 4, and M=6.2 (SD=1.5) at level 5. A Friedman test revealed a significant effect of task difficulty on self-reports, χ²(4) = 77.6, p < 0.01. This result validates our design of CL levels using task difficulty. Section 4.1 discusses and evaluates the features selected using the CFS technique for classification. This is followed, in section 4.2, by the classification results of the individual modalities and their fusion for detecting CL levels with and without affective interference. CL classification results are also presented for the various arousal intensity levels. Because eye calibration failed for 5 subjects, results are presented for the remaining 15 subjects.
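For reference, the Friedman test reported above can be reproduced with the Statistics Toolbox friedman function, assuming the self-reports are arranged as a subjects-by-difficulty matrix (the variable name ratings is an assumption for illustration).

```matlab
% ratings: [20 x 5] matrix, one row per subject and one column per task
% difficulty level, each entry the subject's mean self-reported rating.
[p, tbl, stats] = friedman(ratings, 1, 'off');   % 1 observation per cell, no plot
chi2 = tbl{2, 5};   % chi-square statistic (table layout per the Statistics Toolbox)
% A significant p (< 0.01 here) indicates an effect of difficulty on the ratings.
```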

4.1 Feature Evaluation

The fusion model has features from all the individual modalities; it is therefore important to evaluate which features are most useful for CL detection. Figure 4 shows the names of the features selected by the CFS technique (described in section 3.3) for the various CL level types (2L, 3L, 5L), with and without affective interference. The circle represents features that were common to the three CL level types both with and without affective interference, including pupil and performance related features. The oval shapes represent features that were common among the three CL level types. The features in the dotted rectangles are those common between two CL level types, and the features inside the solid rectangles are those unique to a particular level type. Features from all modalities contributed to CL detection in the presence of affect, i.e. under affective interference. However, face features made no contribution to CL detection in the absence of affect (no affective interference). Furthermore, features from all physiological channels and from perf contributed to the two- and three-level CL schemes, while only features from SC, pupil and perf were present for five levels of CL.

Figure 4. Features selected (CFS based technique) from different CL level types, with and without affective interferences

4.2 Classification

The classification task (i.e. detecting CL levels or classes) produced different baseline accuracies for the different CL level types (2L, 3L, 5L), based on the class distributions (i.e. labels) in the training sample. The baseline classification accuracy (ZeroR, which always predicts the majority class and therefore equals the largest class proportion) was 60%, 40% and 20% for 2L, 3L, and 5L of CL respectively; for example, three of the five equally frequent difficulty levels map to the low class in the 2L scheme, giving a 60% baseline. The error bars in Figures 5, 6, and 8 represent the baseline accuracies. The classification results are presented for the individual modalities (face, physio, perf) and their fusion (fusion) using the Vote classifier described in section 3.4. The features presented in Figure 4 were used for classification to evaluate CL detection with and without the interference of affect.

Figure 5 shows the classification accuracy of the individual modalities and their fusion for detecting 2L, 3L, and 5L of CL under affective interference. The individual modalities and their fusion all achieved accuracies above the baseline. Fusion showed an improvement over the individual modalities for 2L (77%) and 3L (59%) but slightly lower accuracy than face for 5L (36%). For 2L and 3L, perf exhibited the highest accuracy among the individual modalities, 76% and 58% respectively. However, for 5L, face showed the highest accuracy (38%).

Figure 5. Classification accuracy and baseline accuracy (error bar) of individual modalities and their fusion for detecting various levels of CL under affective interferences.

Figure 6 shows the classification accuracy of the individual modalities and their fusion for detecting 2L, 3L, and 5L of CL without any affective interference. Fusion improved accuracy over all individual modalities for the three CL level types (90%, 67%, and 37% respectively). The absence of affect thus yielded 13% and 8% higher fusion accuracy for 2L and 3L respectively. The accuracy of physio was the highest among the individual modalities for 2L (83%) and 3L (63%) but slightly lower than perf for 5L (31%). The accuracy of face was the lowest, and even lower than the baseline, in this scenario.

Figure 6. Classification accuracy and baseline accuracy (error bar) of individual modalities and their fusion for detecting various levels of CL without any affective interference

The analysis from Figures 5 and 6 demonstrates that physio, perf, and fusion exhibit higher CL detection accuracies for 2L and 3L when there is no affective interference. However, the accuracy of face falls drastically in the absence of affective components. The geometric and chromatic features used in this analysis have been reported to be well suited for affect detection (Monkaresi, Calvo, & Hussain, 2012; Tian et al., 2002), which may explain why face performed better under affective interference. Our previous study (S. Hussain et al., 2011) also demonstrated that physiology and task performance have similar accuracy for detecting two CL levels. Physiology proved to be a good modality for detecting CL when there was no affective interference. The analysis can be taken further by looking at classification accuracies during the discrete arousal intensities (Figures 7 and 8). Figure 7 illustrates the accuracy of detecting 2L, 3L, and 5L of CL from the individual modalities during discrete intensities of arousal (highAr, medAr, lowAr, noAr). Physio and perf showed a similar trend in accuracy from lowAr to highAr and exhibited their highest accuracy for noAr. Face showed more variation, with accuracy increasing from highAr to lowAr; however, its accuracy for noAr was lower than the baseline.

Figure 7. Classification accuracy of detecting 2L, 3L, and 5L of CL from face, physio and perf during discrete degrees of arousal (highAr, medAr, lowAr, noAr).

Figure 8 presents the accuracy of detecting 2L, 3L, and 5L of CL from fusion during discrete intensities of arousal (highAr, medAr, lowAr, noAr). The lower bound of the error bar represents the baseline accuracy. The accuracy increased almost linearly from highAr to noAr for 2L. This trend was also observed for 3L and 5L, but only from highAr to lowAr: the accuracy for noAr was slightly higher than for highAr but lower than for medAr and lowAr for 3L, and was the lowest for 5L. Comparing the results in Figures 7 and 8, it is clear that fusion exhibited higher accuracy than the individual modalities for the various intensity levels of arousal (Table 4). Despite its poor accuracy on its own, face contributed to higher accuracies in fusion with the other modalities for detecting CL under affective influence. Physio and perf were the most suitable modalities for detecting CL levels when there was no affective influence.

Figure 8. Classification accuracy and baseline accuracy (error bar) of detecting 2L, 3L, and 5L of CL from fusion during discrete degrees of arousal (highAr, medAr, lowAr, noAr).

Table 4. Percentage improvement by fusion over the best individual modality for the various intensities of arousal.

CL / Ar     highAr        medAr        lowAr        noAr
2 Levels    perf (-3%)    perf (0%)    face (6%)    physio (7%)
3 Levels    face (6%)     face (6%)    face (8%)    physio (4%)
5 Levels    face (3%)     face (9%)    face (3%)    physio (3%)

5. Conclusion

Self-reports are accurate measures of CL, but they are less practical in naturalistic HCI. Performance is also a good indicator of CL, but it is application dependent. Detecting CL from behavioral and physiological modalities is extremely challenging and still controversial; results may therefore vary from one study to another depending on the nature of the experiments, the modalities used for collecting data, the number of classes, and the computational techniques applied. In this study, we have considered a variety of modalities and their fusion to investigate automatic detection of different CL level types. The study has identified the strengths and weaknesses of the modalities in various scenarios, with and without the interference of affective components, and shown how multimodal fusion can enhance performance.

The study showed that CL detection becomes more challenging and less accurate under affective interference. Furthermore, automatic CL detection was less accurate during higher intensities of arousal. Features from face video, physiology, task performance and their fusion performed above the baseline during affective interference, with up to 77%, 59% and 38% accuracy for detecting two, three, and five levels of CL respectively. Higher detection accuracies were achieved when there was no affective interference during the task (except for face, which performed below baseline for all CL level types), with up to 90% and 67% accuracy for 2L and 3L respectively (achieved by fusion). The face features, which had not previously been explored in CL research, proved to be very useful for automatic CL detection under affective interference. Pupil features and reaction time were common and robust in CL classification both with and without affective interference and for the three types of CL levels. The face modality showed variation across the different intensity levels of arousal, with CL detection accuracy increasing from high to low arousal. Physiology and task performance showed less variation from high to low arousal and performed lower than face; however, they both outperformed face when there was no affective interference. Fusion showed improvement over the individual modalities during the various intensity levels of arousal; however, increasing arousal intensity decreased CL detection accuracy.

Although the analysis in this study is based on a systematic experimental setup (task and stimulus presentation), experiments still have to be conducted for realistic scenarios; for now, the work is limited to a controlled setup as an effort to evaluate and investigate CL detection under affective interference. Real-world applications, which intend to achieve optimal human and machine performance concurrently in order to avoid errors and hazards during critical tasks, need to understand the modalities, the computational techniques, and the human psychological factors that may influence the accuracy of automatic CL detection. This paper, with the support of some important modalities and popular machine learning methods, identifies sensors, modalities, and features that may be suitable for building systems for automatic CL detection.

References

Ark, W. S., Dryer, D. C., & Lu, D. J. (1999, Aug. 22-26). The emotion mouse. Paper presented at the 8th International Conference on Human-Computer Interaction: Ergonomics and User Interfaces, Munich, Germany. Ayres, P., & Gog, T. (2009). Editorial: State of the art research into Cognitive Load Theory. Computers in Human Behavior, 25(2), 253-257. Beatty, J., & Lucero-Wagoner, B. (2000). The pupillary system. Handbook of psychophysiology (2nd ed.), Cambridge University Press, xiii, 1039, 142-162. Berka, C., Levendowski, D. J., Lumicao, M. N., Yau, A., Davis, G., Zivkovic, V. T., . . . Craven, P. L. (2007). EEG correlates of task engagement and mental workload in vigilance, learning, and memory tasks. Aviation, Space, and Environmental Medicine, 78(5), B231-B244. Blair, K. S., Smith, B. W., Mitchell, D. G. V., Morton, J., Vythilingam, M., Pessoa, L., . . . others. (2007). Modulation of emotion by cognition and cognition by emotion. Neuroimage, 35(1), 430-440. Bradley, M. M., Miccoli, L., Escrig, M. A., & Lang, P. J. (2008). The pupil as a measure of emotional arousal and autonomic activation. Psychophysiology, 45(4), 602-607. Calvo, R. A., & D'Mello, S. (2010). Affect detection: An interdisciplinary review of models, methods, and their applications. IEEE Transactions on Affective Computing, 1(1), 18-37. Card, S. K., Moran, T. P., & Newell, A. (1983). The psychology of human-computer interaction. Hillsdale, NJ: Lawrence Erlbaum Associates. Chen, F., Ruiz, N., Choi, E., Epps, J., Khawaja, M. A., Taib, R., . . . Wang, Y. (2012). Multimodal behaviour and interaction as indicators of cognitive load. ACM Transactions on Interactive Intelligent Systems, 2(4). Chen, S., & Epps, J. (2012). Automatic classification of eye activity for cognitive load measurement with emotion interference. Computer Methods and Programs in Biomedicine, pii: S0169- 2607(12)00283-0. doi: 10.1016/j.cmpb.2012.10.021. Cowan, N. (2001). The magical number 4 in short-term memory: A reconsideration of mental storage capacity. Behavioral and Brain Sciences, 24(1), 87-114. D’Esposito, M., Postle, B. R., Jonides, J., & Smith, E. E. (1999). The neural substrate and temporal dynamics of interference effects in working memory as revealed by event-related functional MRI. Proceedings of the National Academy of Sciences, 96(13), 7514-7519. de Tjerk, E. G., Henryk, F. R. A., & Neerincx, M. A. (2010). Adaptive automation based on an object- oriented task model: Implementation and evaluation in a realistic C2 environment. Journal of Cognitive Engineering and Decision Making, 4(2), 152-182. Duric, Z., Gray, W. D., Heishman, R., Li, F., Rosenfeld, A., Schoelles, M. J., . . . Wechsler, H. (2002). Integrating perceptual and cognitive modeling for adaptive and intelligent human-computer interaction. Proceedings of the IEEE, 90(7), 1272-1289. Erthal, F. S., de Oliveira, L., Mocaiber, I., Pereira, M. G., Machado-Pinheiro, W., Volchan, E., & Pessoa, L. (2005). Load-dependent modulation of affective picture processing. Cognitive, Affective, & Behavioral Neuroscience, 5(4), 388-395. Gopher, D., & Braune, R. (1984). On the psychophysics of workload: Why bother with subjective measures? Human Factors: The Journal of the Human Factors and Ergonomics Society, 26(5), 519- 532. Gütl, C., Pivec, M., Trummer, C., García-Barrios, V. M., Mödritscher, F., Pripfl, J., & Umgeher, M. (2005). Adele (adaptive e-learning with eye-tracking): Theoretical background, system architecture and application scenarios. European Journal of Open, Distance and E-Learning, 2. 
Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., & Witten, I. H. (2009). The WEKA data mining software: An update. ACM SIGKDD Explorations Newsletter, 11(1), 10-18. Hall, M. A. (1999). Correlation-based feature selection for machine learning. The University of Waikato. Hockey, G. R. J. (2003). Operator Functional State as a Framework for the Assessment of Performance Degradation Operator Functional State: The Assessment and Prediction of Human Performance Degradation in Complex Tasks (pp. 8-23). Amsterdam: IOS. Hollender, N., Hofmann, C., Deneke, M., & Schmitz, B. (2010). Integrating cognitive load theory and concepts of human-computer interaction. Computers in Human Behavior, 26(6), 1278-1288. Huang, W.-H. D., & Tettegah, S. (2010). Cognitive load and empathy in serious games: A conceptual framework. Gaming and Cognition: Theories and Practice from the Learning Sciences, 137-151. Hussain, M. S., Monkaresi, H., & Calvo, R. A. (2012, Dec. 5-7). Combining classifiers in multimodal affect detection. Paper presented at the The Australian Data Mining Conference (AusDM 2012), Sydney, Australia Hussain, S., Chen, S., Calvo, R. A., & Chen, F. (2011, Nov. 17). Classification of cognitive load from task performance & multichannel physiology during affective changes. Paper presented at the MMCogEmS: Inferring Cognitive and Emotional States from Multimodal Measures, ICMI 2011 Workshop, Alicante, Spain. Immordino-Yang, M. H., & Damasio, A. (2007). We feel, therefore we learn: The relevance of affective and social neuroscience to education. Mind, Brain, and Education, 1(1), 3-10. Jong, T. (2010). Cognitive load theory, educational research, and : some food for thought. Instructional Science, 38(2), 105-134. Kalyuga, S. (2011). Cognitive load in adaptive multimedia learning. In R. A. Calvo & S. K. D'Mello (Eds.), New Perspectives on Affect and Learning Technologies (Vol. 3, pp. 203-215): Springer New York. Kirschner, P. A., Ayres, P., & Chandler, P. (2011). Contemporary cognitive load theory research: The good, the bad and the ugly. Computers in Human Behavior, 27(1), 99-105. Klingner, J., Tversky, B., & Hanrahan, P. (2010). Effects of visual and verbal presentation on cognitive load in vigilance, memory, and arithmetic tasks. Psychophysiology, 48(3), 323-332. Kramer, A. F. (1991). Physiological metrics of mental workload: A review of recent progress (pp. 279- 328). San Diego, CA, USA: Navy Personnel Research and Development Center. Kuncheva, L. I. (2004). Combining pattern classifiers: Methods and algorithms: Wiley-Interscience. Lang, P. J., Bradley, M. M., & Cuthbert, B. N. (2005). International affective picture system (IAPS): Technical manual and affective ratings Technical Report A-6: University of Florida, Gainesville, FL. Le, P. N., Ambikairajah, E., Epps, J., Sethu, V., & Choi, E. H. C. (2011). Investigation of spectral centroid features for cognitive load classification. Speech Communication, 53(4), 540-551. Le, P. N., Epps, J., Choi, E. H. C., & Ambikairajah, E. (2010, Aug. 23-26). A study of voice source and vocal tract filter based features in cognitive load classification. Paper presented at the International Conference on Pattern Recognition (ICPR’10), Istanbul, Turkey. Levens, S. M., & Phelps, E. A. (2008). Emotion processing effects on interference resolution in working memory. Emotion, 8(2), 267-280. Monkaresi, H., Calvo, R. A., & Hussain, M. S. (2012, May 21-25). Automatic natural expression recognition using head movement and skin color features. 
Paper presented at the International Conference on Advanced Visual Interfaces, AVI 2012, Capri Island (Naples), Italy. Nguyen, T., Li, M., Bass, I., & Sethi, I. K. (2005, Dec. 12-14 ). Investigation of combining SVM and Decision Tree for emotion classification. Paper presented at the Seventh IEEE International Symposium on Multimedia, Irvine, California, USA. Odonnell, R., & Eggemeier, F. (1986). Workload assessment methodology. Handbook of Perception and Human Performance, 2, 42-41. Paas, F., Tuovinen, J. E., Tabbers, H., & Van Gerven, P. W. M. (2003). Cognitive load measurement as a means to advance cognitive load theory. Educational Psychologist, 38(1), 63-71. Paas, F. G. W. C., & Van Merriënboer, J. J. G. (1994). Variability of worked examples and transfer of geometrical problem-solving skills: A cognitive-load approach. Journal of , 86(1), 122-133. Pantic, M., & Rothkrantz, L. J. M. (2003). Toward an affect-sensitive multimodal human-computer interaction. Proceedings of the IEEE, 91(9), 1370-1390. Picard, R. W., Vyzas, E., & Healey, J. (2001). Toward machine emotional intelligence: Analysis of affective physiological state. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(10), 1175-1191. Russell, J. A. (1980). A circumplex model of affect. Journal of Personality and Social Psychology, 39(6), 1161–1178. Schultz, R., Peter, C., Blech, M., Voskamp, J., & Urban, B. (2007). Towards detecting cognitive load and emotions in usability studies using the RealEYES framework. Usability and Internationalization. HCI and Culture, 4559, 412-421. Seewald, A. K. (2003, Aug. 09-15). Towards a theoretical framework for ensemble classification. Paper presented at the 18th International Joint Conference on Artificial Intelligence, Acapulco, Mexico. Sharma, R., Pavlovic, V. I., & Huang, T. S. (1998). Toward multimodal human-computer interface. Proceedings of the IEEE, 86(5), 853-869. Shi, Y., Ruiz, N., Taib, R., Choi, E., & Chen, F. (2007, 28 April - 3 May). Galvanic skin response (GSR) as an index of cognitive load. Paper presented at the CHI'07 Extended Abstracts on Human Factors in Computing Systems, San Jose, California, USA. Siegle, G. J., Ingram, R. E., & Matt, G. E. (2002). Affective interference: An explanation for negative attention biases in dysphoria? Cognitive Therapy and Research, 26(1), 73-87. Sperandio, J. C. (1978). The regulation of working methods as a function of work-load among air traffic controllers. Ergonomics, 21(3), 195-202. Stanners, R. F., Coulter, M., Sweet, A. W., & Murphy, P. (1979). The pupillary response as an indicator of arousal and cognition. Motivation and Emotion, 3(4), 319-340. Tian, Y., Kanade, T., & Cohn, J. F. (2002, May 19-20). Evaluation of Gabor-wavelet-based facial action unit recognition in image sequences of increasing complexity. Paper presented at the 5th IEEE International Conference on Automatic Face and Gesture Recognition, Washington D.C., USA. Utthara, M., Suranjana, S., Sukhendu, D., & Pinaki, C. (2010). A survey of decision fusion and feature fusion strategies for pattern classification. IETE Technical Review, 27(4), 293-307. Viola, P., & Jones, M. (2001, Dec. 8-14). Rapid object detection using a boosted cascade of simple features. Paper presented at the Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2001), Kauai, HI, USA. Yerkes, R. M., & Dodson, J. D. (1908). The relation of strength of stimulus to rapidity of habit- formation. Journal of Comparative Neurology and Psychology, 18(5), 459-482. 
Yin, B., Chen, F., Ruiz, N., & Ambikairajah, E. (2008, 31 March - 04 April). Speech-based cognitive load monitoring system. Paper presented at the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2008) Las Vegas, Nevada, USA.