
Tracking of Mental Workload with a Mobile EEG Sensor

Ekaterina Kutafina 1,2,*, Anne Heiligers 3, Radomir Popovic 1, Alexander Brenner 4, Bernd Hankammer 1, Stephan M. Jonas 5, Klaus Mathiak 3 and Jana Zweerings 3

1 Institute of Medical Informatics, Medical Faculty, RWTH Aachen University, 52074 Aachen, Germany; [email protected] (R.P.); [email protected] (B.H.)
2 Faculty of Applied Mathematics, AGH University of Science and Technology, 30-059 Krakow, Poland
3 Department of Psychiatry, Psychotherapy and Psychosomatics, School of Medicine, RWTH Aachen University, 52074 Aachen, Germany; [email protected] (A.H.); [email protected] (K.M.); [email protected] (J.Z.)
4 Institute of Medical Informatics, University of Münster, 48149 Münster, Germany; [email protected]
5 Department of Informatics, Technical University of Munich, 85748 Garching, Germany; [email protected]
* Correspondence: ekutafi[email protected]

Abstract: The aim of the present investigation was to assess if a mobile electroencephalography (EEG) setup can be used to track mental workload, which is an important aspect of learning performance and motivation and may thus represent a valuable source of information in the evaluation of cognitive training approaches. Twenty-five healthy subjects performed a three-level N-back test using a fully mobile setup including tablet-based presentation of the task and EEG data collection with a self-mounted mobile EEG device at two assessment time points. A two-fold analysis approach was chosen, including a standard analysis of variance and an artificial neural network to distinguish the levels of cognitive load. Our findings indicate that the setup is feasible for detecting changes in cognitive load, as reflected by alterations across lobes in different frequency bands. In particular, we observed a decrease of occipital alpha and an increase in frontal, parietal and occipital theta with increasing cognitive load. The most distinct levels of cognitive load could be discriminated by the integrated machine learning models with an accuracy of 86%.

Keywords: mHealth; EEG; N-back; cognitive effort; wearable

1. Introduction
Research and technological advances of the last several decades have led to an increased awareness of the importance of personalized approaches to human medical care and wellbeing. In the domain of mental health, data-driven individualized feedback is cautiously being integrated to support diagnosis and the development of personal treatment schemes. Mental health apps are a promising approach to collecting data at several time points with minimal distortion of the everyday life routines of the participant. However, empirical support is often lacking (see review [1]) as many applications are limited to self-reported data. While such data is an important source of information, it may be distorted by interoceptive ability and social desirability biases. In addition, many psychiatric disorders are underpinned by alterations on the neural level that cannot be captured by self-reporting measures but may represent a valuable source of information for the development and evaluation of treatment schemes. For example, electrophysiological data such as electroencephalography (EEG) and electrocardiography (ECG) have been used to generate information on the emotional state [2] or the stress level of an individual [3] as well as for diagnostic purposes and the evaluation of treatment progression [4].


In particular, EEG measurements are a promising indicator of mental workload [5], a general term describing the relationship between the task demands and the mental resources that need to be employed to complete a task. The actual workload is related to several aspects including task demands and working memory capacity. In an attempt to classify workload, Hogervorst and colleagues [5] compared the performance of several physiological markers, including skin conductance, respiration, ECG, pupil size, eye blinks, and EEG while inducing different levels of workload with an N-back task. According to the authors, classification of workload was best for the EEG data. Importantly, the performance of the classification did not significantly improve upon combining different features.

In previous studies, cognitive load has been associated with variations in different EEG frequency bands. Most consistently, changes in the alpha and theta frequency bands have been reported in response to different levels of workload [6–8]. For example, alpha power has been shown to increase during rest, i.e., when mental load is low [6]. Upon intensification of mental workload, alpha power has been shown to decrease [9]. The opposite response pattern can be observed for theta frequency bands, which show an increase with rising mental effort [10]. Thus, EEG-informed assessment of mental workload may provide a feasible solution for the objective evaluation of task difficulty. This is highly relevant to many practical applications, including safety assessment in the context of driver or pilot fatigue [11,12] and working efficiency or learning [13].

In a clinical context, objective measures of workload could assess the success of cognitive trainings, which are promising for the betterment of brain injury or stroke survivors [14,15], as well as for patients with mental disorders [16]. An objective and easy to measure physiological marker for brain function improvement could support clinical decision making by providing additional information on the efficacy of treatment options and hence, information that is important in the context of the start or termination of the therapy. In addition, deriving objective markers for cognitive load could be used to develop adaptive training protocols, thereby increasing training efficiency and creating individually tailored training protocols without the necessity for human monitoring. This is important since two essential problems are associated with self-monitoring of cognitive load. Firstly, the user needs to have an adequate interoceptive awareness of the mental effort needed for the task. Secondly, the monitoring necessitates a dual-task performance (i.e., completing the task and monitoring cognitive workload) that will require additional cognitive resources and hence may reduce task performance and learning [13]. Flexible adjustment of the training based on an objective marker of mental effort may enhance training performance by preventing excessive demands on the patients that may decrease motivation.

So far, research has mostly implemented clinical or research-grade EEG devices to measure brain function in healthy individuals and patients. While these devices offer advantages in terms of high signal quality, their setup is time consuming and cannot be used in home-care scenarios. To realize an implementation of personalized care in daily medical routines that extends beyond health care in hospitals, ease of use of the devices, an optimal cost-benefit association, and time-efficiency are essential.
The development of mobile sensors provides the possibility to perform measurements quickly, easily, and without the necessity of visiting a medical facility, thus promoting the expansion of health-care options in the natural settings of the patients. For example, Hussain and Park [17] proposed a set-up with a mobile EEG built into a sleeping mask for the monitoring of ischemic stroke patients, employing machine learning algorithms for group classification. In another work, Murphy et al. [18] used machine learning on dry-electrode EEG data to track cognitive aging.

In this project, we aimed to determine the potential of a mobile EEG device to detect changes in electrical brain activity based on cognitive load. In order to achieve this goal, we implemented a fully mobile setup consisting of a mobile EEG device connected to a tablet to execute the cognitive performance task. In line with previous investigations, we expected to observe frequency band changes, particularly a reduction in alpha and an increase in theta band power depending on cognitive load, as measured by a modified version of the N-back task. To create a comprehensive display of the data, we combined a statistical approach with a machine learning-based condition classification. Furthermore, we wanted

to assess the usability of the setup by the participants to evaluate if its self-responsible use in daily life is feasible. The current investigation contributes to our understanding of the applicability of mobile EEG devices in the context of the assessment of cognitive load. This aspect is of high importance since it may serve as a starting point for various applications in daily life such as the development of an assistive tool for cognitive trainings in patients to monitor and flexibly adapt the difficulty level of the training. In the following, we will describe the experimental set-up (tasks, hard- and software), the two-fold analysis approach (standard analysis of variance and the artificial neural network), and the findings on the feasibility of our set-up and user experience.

2. Materials and Methods
2.1. Participants
Twenty-five healthy volunteers participated in this monocentric prospective study. One participant was excluded from data analysis based on the results of the clinical interview, resulting in a final sample for data analysis of 24 (age in years: 24.4 ± 3.3; 17 female; see Table 1). All participants were recruited by advertisement at the RWTH Aachen University Hospital, were native German speakers, and were right-handed according to the Edinburgh Handedness Inventory, with the exception of one individual. All participants had normal or corrected-to-normal vision. Inclusion criteria were age between 18 and 65 years, voluntary participation, and ability to consent; severe psychiatric disorders and epilepsy served as exclusion criteria. To objectify the criterion “severe psychiatric disorder”, the screening of the German version of the Structured Clinical Interview for assessment of DSM-IV-TR criteria [19] was performed by medically trained personnel. According to the interview, participants had no history of or current psychiatric disorder. All participants gave their written informed consent prior to inclusion in the study and received 20 € as compensation for their participation. The study was submitted to and approved by the local ethics committee of the Faculty of Medicine of the RWTH Aachen University (EK 284/18). The pre-registration at the German clinical trial register (Deutsches Register klinischer Studien, DRKS00016150) comprised the hypothesized theta and alpha effects as primary outcome. Analysis of the P300 component will be reported at a later time point.

Table 1. Descriptive statistics.

                                 M       SD      Min     Max     Range
Education                        15.8    2.6     13      21      8
Education father *               15.2    3.3     9       18      9
Education mother *               14.7    3.4     9       21      12
Digit span forwards              8.2     1.9     5       12      7
Digit span backwards             6.8     1.4     5       11      6
Verbal intelligence (WST)        102.5   8.1     89      118     29
System Usability Scale (SUS)     79.6    11.2    57.5    95.0    37.5

Notes. N = 24. * one missing data point, n = 23.

Prior to the start of the main study, the technical setup was tested on the researchers to prevent unexpected failure of hardware or software components as well as to check the overall experimental flow.

2.2. Experimental Setup
2.2.1. Experimental Procedure
Participants were invited to the lab on two days (see Figure 1). The two measurement days were separated by at least one week, but not more than two weeks. Measurements took place at the same time of day for each participant individually; however, measurement times between participants varied. On day 1, the diagnostic screening took place and sociodemographic data was acquired. All participants completed two neuropsychological measures: a verbal intelligence test (Wortschatztest—WST, [20]) and a working memory test (digit-span task). Furthermore, the German version of the Positive and Negative Affect Scale (PANAS, [21]) was administered to assess the affective state of the participants. In order to investigate the usability of the device, participants had to complete the Technology Usage Inventory (TUI-II, [22]). On day 1, two subscales of this inventory were selected: (a) technology anxiety and (b) curiosity. After completion of the questionnaires, all participants received a verbal explanation on the setup of the mobile EEG device. Subsequently, subjects were asked to independently use the device under the guidance of a simplistic user manual with illustrations of the application of the device. At this stage of the experiment, participants received help from the investigator if they were not able to place the electrodes adequately themselves to ensure a sufficient signal-to-noise ratio.

Figure 1. Study protocol. Participants completed two lab visits. On day 1 they received a diagnostic screening and neuropsychological assessment, and had to fill in the pre-assessment of the usability questionnaire. On day 1 and day 2, participants completed the PANAS to assess the affective state prior to and after the EEG recordings. Prior to the start of the cognitive tasks, participants received standardized instructions for the setup of the mobile EEG device. After successful application, a fixation cross was displayed and participants were instructed to keep eyes open, followed by two minutes of rest with eyes closed. Afterwards, participants completed the N-back task with three difficulty levels (presented in randomized order) and the Go/No-go task. On day 2 only, participants filled in the TUI-II-post questionnaire and the System Usability Scale after completion of the tasks.

The EEG measurements started with a baseline recording which included fixating on a cross for 2 min, resting with closed eyes for 2 min, and tapping on the screen ten times. After the baseline measurement, the N-back task was performed, followed by a Go/No-go task. Analysis of the data from the Go/No-go task was not in the focus of the current manuscript. Prior to starting the Go/No-go task, the investigator ensured that the electrode setup was still functional. If a malfunction was detected, the EEG setup was corrected. After the EEG measurement, the PANAS was administered a second time to evaluate changes in the affective state. In line with the work of Brouwer and colleagues [7], the current investigation focuses on data of the N-back task as a measure of cognitive load and data quality only.

Day 1 and day 2 were constructed in exactly the same way with the exception of the questionnaires. On day 2, only the PANAS questionnaire was administered prior to and after the measurement, as well as the post evaluation of the TUI-II including the subscales ‘skepticism’, ‘interest’, ‘accessibility’, ‘user-friendliness’, and ‘usefulness’. In addition, all subjects completed the System Usability Scale (SUS, [23]).

2.2.2. Hardware
In the present study, we used a consumer grade EEG device, namely the Emotiv Epoc+ with a sampling rate of 128 Hz (EMOTIV Inc., San Francisco, CA, USA). One of the reasons to use this device is its relatively large brain area coverage (14 electrodes, located bilaterally over the occipital lobe (2), temporal lobes (2), parietal lobe (2), and frontal lobe (8); see Figure 2A). It is important to note that the rigid plastic frame determines the position of the individual electrodes. This is compensated by the fact that the saline solution-based electrodes and wireless construction allow for a very fast and easy mounting, which can be performed independently by participants who are naive to the method. Importantly, the Emotiv Epoc has been validated for a variety of research purposes [24], including an investigation on simultaneous collection of mobile EEG and clinical EEG in patients with epilepsy. Results of those studies indicate a lower signal-to-noise ratio in the data from the mobile device compared to clinical-grade devices, and limited brain coverage [25]. However, the data from the Emotiv Epoc recordings had adequate quality to track epileptiform discharges [26], specific ERP (event-related potential) changes [24,25], and frequency domain fluctuations [27], rendering it a promising candidate for the current investigation.

Figure 2. (A) Electrode placement. The figure depicts electrodes that are covered by the Emotiv Epoc mobile EEG device. Frontal electrodes are depicted in blue (n = 8), temporal electrodes in green (n = 2), parietal electrodes in yellow (n = 2) and occipital electrodes in red (n = 2). (B) Mobile setup. Display of the mobile EEG device and the tablet that was used for stimulus presentation. The figure illustrates an example of the N-back task. Participants were instructed to press the ‘button’ (visible on the lower part of the tablet screen) upon detection of a target stimulus. The rectangular button covered an area of 172 mm × 20 mm.

The mobile EEG device was connected to an Android tablet (Sony Xperia Z2) via Bluetooth. Since the primary goal was to create a fully mobile setup, we decided to use a tablet as a display medium. In addition, the tablet is beneficial in terms of the facilitation of use. We implemented simplistic graphic design choices that allow for easy navigation through the experiments and control via tapping on the screen. To ensure similar distances to the display screen across participants and fast reactions, the tablet was placed on a bracket. The screen of the tablet had a size of 295.8 cm² with a resolution of 1920 × 1200 pixels (16:10 ratio, ~224 ppi density).

2.2.3. Software
A custom application developed for Android 6.0 API 23 (Marshmallow) was used for the connection to the mobile EEG. Data recording was performed in the background while the experiment interface was presented. Experiment data, such as stimuli types and task descriptions, were read from an XML format. During experiments, tasks were presented visually and button presses were recorded. The area that was sensitive to the ‘button press’ of the participants was defined as a rectangular Android button of size 172 mm × 20 mm. Visualization of the “pressed button” served as feedback to the subject to avoid multiple responses to a single stimulus. Event information in discrete time steps was stored in the European Data Format (EDF) with a resolution of 128 Hz (one sample equals 7.81 ms). With a stimulus onset interval of 1600 ms, the actual measured time between stimuli was on average 1599.36 ms with a standard deviation of 9.74 ms. Response times were recorded with the timestamp provided by the Android system’s clock. Impedance values were regularly collected during the recording for each channel and added to the EDF data files.
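For illustration, the sample-to-time conversion implied by the 128 Hz EDF resolution can be written out as follows (a minimal sketch in MATLAB; the sample index is an arbitrary example, not a value from the study data):

fs = 128;                                    % EDF time resolution in Hz
msPerSample = 1000 / fs;                     % = 7.8125 ms, reported above as 7.81 ms
sampleIdx = 205;                             % illustrative event sample index (assumption)
eventTimeMs = (sampleIdx - 1) * msPerSample; % event time relative to recording onset, in ms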

2.2.4. N-Back Task
An N-back task with three difficulty levels but identical stimuli and target rates was implemented to manipulate cognitive load (see Figure 3). The task design was adapted from Mathiak and colleagues [28]: A sequence of 150 black digits (font: Roboto, height: 16.2 mm) was presented to the participants in each condition on a grey background in full screen mode (Figure 2B). One third of these digits (n = 50) were targets in each condition. The presented digits (1 through 8) were shown for 1600 ms each, and immediately followed by the next digit, resulting in a total duration of 4 min for each condition. Participants were instructed to respond as fast and accurately as possible. In the first condition ‘zero-back’ (difficulty level 1, L1) participants had to press a button each time the digit “1” was shown. In the second condition ‘three-even’ (difficulty level 2, L2) participants were instructed to press a button whenever three consecutive digits were even. In the third condition ‘two-back’ (difficulty level 3, L3) the participants were asked to press a button whenever the presented digit was identical to the second to last digit.
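The three target rules can be expressed compactly in MATLAB (a minimal sketch; the digit sequence below is an arbitrary illustration, not one of the study's stimulus lists):

% Marking target positions in a digit sequence 'seq' (values 1-8) for each condition.
seq = [1 4 2 6 8 3 1 5 2 5];                 % toy example sequence (assumption)

% L1 'zero-back': target whenever the digit 1 is shown
isTargetL1 = (seq == 1);

% L2 'three-even': target whenever three consecutive digits are even
even = (mod(seq, 2) == 0);
isTargetL2 = [false false, even(3:end) & even(2:end-1) & even(1:end-2)];

% L3 'two-back': target whenever the digit equals the second-to-last digit
isTargetL3 = [false false, seq(3:end) == seq(1:end-2)];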

Figure 3. Illustration of the three difficulty levels of the N-back task. Participants were instructed to press a button as fast as possible whenever a target appeared. Each condition contained 150 stimuli (50 targets) that appeared on the screen for 1600 ms each. (A) Participants were instructed to press a button whenever the digit ‘1’ appeared on the screen. (B) Participants had to press a button when three digits in a row were even. (C) Participants were asked to press a button when the digit was identical to the second to last digit.

Two independently pseudo-randomized versions of each condition were created (version A & B). The versions were constructed in a parallel way, i.e., with the same number of stimuli and targets. However, the randomization of digits differed between conditions. Randomization of the digits was performed with the constraint that the same number is not shown twice in a row and with a defined target rate of 33%. Half of the participants completed version A on day 1 and version B on day 2, and the other half completed both versions in reversed order. Furthermore, the order in which the three conditions were presented was randomized for each participant. At the beginning of each condition the participants were asked to focus on a fixation cross presented in the middle of the screen for 1 min. This period formed the baseline for EEG analyses (L0).
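The randomization constraints can be illustrated with the following MATLAB sketch for the two-back condition (an approximation only: the seed, the rejection of accidental two-back repeats in non-target positions, and the absence of an exact 50-target balancing step are our assumptions, not the authors' generator):

% 150 digits drawn from 1-8, no digit shown twice in a row, target rate near 33%.
% Note: the study used exactly 50 targets per condition; this naive draw only
% approximates that rate and would need an additional balancing step.
rng(42);                                     % fixed seed, chosen arbitrarily
nStim = 150;
seq = zeros(1, nStim);
seq(1) = randi(8);
c = setdiff(1:8, seq(1));
seq(2) = c(randi(numel(c)));
for k = 3:nStim
    if rand < 1/3
        seq(k) = seq(k-2);                   % two-back target; never an immediate repeat,
                                             % since seq(k-1) ~= seq(k-2) by construction
    else
        c = setdiff(1:8, [seq(k-1) seq(k-2)]);   % avoid repeats and accidental targets
        seq(k) = c(randi(numel(c)));
    end
end
isTarget = [false false, seq(3:end) == seq(1:end-2)];
fprintf('Two-back target rate: %.2f\n', mean(isTarget));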

2.3. Data Analysis
2.3.1. Software
Data preprocessing and feature extraction were performed using MATLAB 2019a (The MathWorks, Natick, MA, USA) and EEGLAB 2019 [29]. Statistical analysis was conducted using the Statistical Package for the Social Sciences (SPSS) software, version 25 (IBM, Armonk, NY, USA).

2.3.2. Behavioral Data
To validate the difficulty levels of the N-back task in the present sample, average response times for each condition across days were calculated and subjected to a repeated measures ANOVA with the within-subject factor condition [L1, L2, L3]. Paired sample t-tests were carried out for post hoc evaluations. To further quantify the performance and minimize biases due to non-adherence to task instructions (e.g., randomly pressing the button), we calculated the ‘fraction correct’ as a measure of accuracy, as proposed by Brouwer and colleagues [7], i.e., the number of correct responses divided by the total number of stimuli. In parallel to the reaction time data, the ‘fraction correct’ data was subjected to a repeated-measures ANOVA with the within-subjects factor condition [L1, L2, L3].

To assess usability of the setup, this study applied the SUS; the composite score was established according to [23] by calculating the sum of the scores for each individual item and multiplying the sum by 2.5. Items were rated on a five-point Likert scale. This procedure results in a composite score ranging between 0 and 100. In addition, the composite score of the TUI-II was used to evaluate usability of the device [22]. Items were rated on a seven-point Likert scale.

2.3.3. EEG Data Preprocessing
The implemented preprocessing pipeline can be separated into six consecutive steps (see Figure 4): (1) band pass filtering, (2) automatic channel removal, (3) epoch extraction, (4) automatic epoch removal, (5) extraction of scaled power for each frequency band, and (6) rejection of outliers in band power. Mobile EEG has been shown to have a lower signal-to-noise ratio compared to clinical grade devices [26], and therefore it is particularly important to track the amount of the removed data. Accordingly, we collected the respective statistics on each preprocessing step.

Figure 4. The continuous EEG data were preprocessed (band pass filtered at 1–40 Hz, channel rejection) and split into event-related epochs of 1500 ms each. After epoch rejection, powers of the individual frequency bands (delta-beta) were extracted.

1. Filtering. Previous research [27] established optimized frequency filtering for the Emotiv Epoc at a 128 Hz sampling rate, including a high-pass filter at 1 Hz and a low-pass filter at 40 Hz. Implementing high-pass filtering as a first step in the preprocessing pipeline is of particular importance, as the unfiltered slow drift is heavily represented in the data. The chosen level of the low-pass filter ensures the removal of the power line noise.
2. Corrupted channels removal. The standard EEGLAB function (pop_rejchan) was used to remove completely corrupted channels based on their normalized kurtosis.
3. Epoch extraction. The filtered data was divided into epochs, covering a 1500 ms period after the stimulus presentation [30]. The resting state data, recorded during a fixation cross presentation on the screen while participants were assigned no specific task, were cut into 1500 ms segments and served as a baseline condition. The following information was additionally stored for each epoch: (1) target or non-target stimulus and (2) response exhibition prior to the next stimulus display.
4. Removal of abnormal epochs. The EEGLAB kurtosis-based deviation detection function was used to remove epochs with artefacts. Epochs with an absolute amplitude above 1500 or impedance below 100 (internal device-specific units) were subsequently removed.
5. Relative band powers extraction. Power measures for the EEG bands were extracted for each epoch using the following boundaries [31]: delta (1–3.5 Hz), theta (3.5–7.5 Hz), alpha (7.5–12.5 Hz), and beta (12.5–30 Hz). The resulting values were further logarithmically scaled [32] to improve the power distributions across the bands and normalized by the similarly scaled and averaged full band (1–40 Hz) power during the baseline condition L0. Normalization was performed separately for each participant and each day (see the sketch after this list).
6. Band power threshold. Epochs with any relative frequency band value exceeding the threshold of 1.3 were removed.

In this work, we did not specifically handle EMG (electromyography), ECG (electrocardiography), or EOG (electrooculography) artifacts. The most prominent contaminations caused by those artifacts are removed through frequency filtering, removal of abnormal epochs, or by the power band threshold.
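A condensed MATLAB/EEGLAB sketch of steps 1–3 and 5 is given below (the file name, event label, and channel-rejection threshold are illustrative assumptions; reading the EDF via pop_biosig requires EEGLAB's BIOSIG extension, and the epoch removal of step 4, the outlier threshold of step 6, and the baseline normalization are only indicated in comments):

EEG = pop_biosig('sub01_day1_nback.edf');            % load the recorded EDF (assumed file name)
EEG = pop_eegfiltnew(EEG, 1, 40);                    % (1) 1-40 Hz band pass filter
EEG = pop_rejchan(EEG, 'elec', 1:EEG.nbchan, ...     % (2) kurtosis-based channel removal
    'threshold', 5, 'norm', 'on', 'measure', 'kurt');
EEG = pop_epoch(EEG, {'stimulus'}, [0 1.5]);         % (3) 1500 ms post-stimulus epochs

% (5) log-scaled band power per epoch, channel, and frequency band
bands = [1 3.5; 3.5 7.5; 7.5 12.5; 12.5 30];         % delta, theta, alpha, beta
[nCh, ~, nEp] = size(EEG.data);
bp = zeros(nEp, nCh, size(bands, 1));
for e = 1:nEp
    for ch = 1:nCh
        x = double(EEG.data(ch, :, e));
        for b = 1:size(bands, 1)
            bp(e, ch, b) = log(bandpower(x, EEG.srate, bands(b, :)));
        end
    end
end
% The values would then be normalized by the log-scaled, averaged full-band
% (1-40 Hz) power of the baseline (L0) segments, per participant and day,
% before removing epochs with any relative band value above 1.3.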

2.3.4. Statistical Analysis
The statistical analysis was performed to investigate differences in the logarithmic band power between conditions, taking into account the inter-personal and inter-session variations. The preprocessed data was averaged across records to prevent artificially improved results due to the high number of epochs within the same (subject, day, condition, electrode) combination, which cannot be treated as independent entries. In addition, the data was averaged across the brain lobes (see Figure 2A). The EEG data was analyzed using a repeated measures ANOVA with the within-subject factors Condition [L0, L1, L2, L3], where L0 reflects the baseline, Frequency Band [delta, theta, alpha, beta], Lobe [Frontal, Parietal, Temporal, Occipital], and Day [Day 1, Day 2]. Our results are obtained on data that includes all epochs, without separation into target versus non-target stimuli or by response correctness.
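Although the statistical analysis was run in SPSS, the core repeated measures model can be illustrated in MATLAB for a single band and lobe (a sketch only; the matrix pw of per-subject averaged log band power is a hypothetical input, and the full four-factor design with Band, Lobe, and Day is omitted):

% pw is assumed to be a 24-by-4 matrix (subjects x conditions L0-L3).
tbl    = array2table(pw, 'VariableNames', {'L0', 'L1', 'L2', 'L3'});
within = table(categorical({'L0'; 'L1'; 'L2'; 'L3'}), 'VariableNames', {'Condition'});
rm     = fitrm(tbl, 'L0-L3 ~ 1', 'WithinDesign', within);
ranovatbl = ranova(rm);                      % within-subject effect of Condition
disp(ranovatbl)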

2.3.5. Machine Learning
In this paper, we applied artificial neural networks (ANN) to classify the experimental conditions. The ANN approach has proved to be efficient in EEG applications, in particular when combined with spectral features such as frequency bands or wavelets [33]. Standard feed-forward shallow networks, as implemented in MATLAB 2019a, were used. The classification solution can be presented as three steps: (1) data division, (2) optimization of the ANN architecture, and (3) averaging across randomizations to improve model robustness. This separation should highlight the high number of degrees of freedom which needed to be addressed before the final parameter choices were made.
1. Train/test data division. We expected high data variation in this data set. Therefore, we tested four different approaches regarding the data division into test and training sets. Firstly, participants were randomly divided into three groups of 16, 4, and 4 individuals each and the corresponding data was used as training, validation, and test sets, respectively. The validation set served parameter adjustment purposes, ensuring that the parameters were not optimized to perform best on the test set only. Subject-wise data division is the most desirable for the potential applications as it allows us to work with the new participants’ data without personal calibration. Secondly, all data from day 1 was used as a training set, the first half of the day 2 data as a validation set, and the second half of day 2 as a test set. This approach reflects a situation in which personal calibration is required, but further new input from the same person can be analyzed directly. Next, all data from day 1 and the first half of the day 2 data served as a training set. The last part of day 2 was divided into two equal parts for validation and test. Lastly, a leave-one-out approach was explored once the parameter optimization was finished. This allowed us to use more data for the training of the model and to investigate how promising it would be to skip personal calibration if more training data becomes available.
2. Optimization of the ANN architecture (see Figure 5). Our initial input vector size equals 4 (number of the used frequency bands) × 14 (number of channels) = 56. We performed a consecutive search through the combinations of 1 to 3 layers as well as 1, 5, 10, 15, 20, 30, 40, 50, 60, 70, 80, and 90 nodes (same for each layer). The network output was a vector of size 2, which assigned the input data to one of the two classes. By default those classes were difficulty levels 0 (L0) and 3 (L3). The network architecture with the best accuracy in differentiation between those two classes was chosen for further experiments. It is important to note that the data was permuted before entering the ANN structure in order to optimize the training process.
3. Averaging across multiple randomizations. To control for the effect of the initial weight randomization of ANNs, training in every architecture setup step was repeated 30 times and the results were averaged. After the optimal network size was fixed, 10 instances of the ANN were trained on the train set to create a voting system. This way, potentially large fluctuations of the results due to the initial weight randomization were smoothed to obtain more robust results.
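The search and voting procedure can be sketched with MATLAB's patternnet as follows (the variable names Xtrain/Ttrain, Xval/Tval, and Xtest/Ttest, the disabled internal data division, and the silent training option are our assumptions rather than the authors' exact script; X* are 56-by-N feature matrices and T* are 2-by-N one-hot label matrices for L0 and L3):

layerSizes = [1 5 10 15 20 30 40 50 60 70 80 90];
bestAcc  = 0;
bestArch = 1;
for nLayers = 1:3
    for s = layerSizes
        accs = zeros(1, 30);
        for rep = 1:30                               % average over 30 weight initializations
            net = patternnet(repmat(s, 1, nLayers));
            net.divideFcn = 'dividetrain';           % use all supplied samples for training
            net.trainParam.showWindow = false;
            net = train(net, Xtrain, Ttrain);
            [~, pred]  = max(net(Xval), [], 1);
            [~, truth] = max(Tval, [], 1);
            accs(rep) = mean(pred == truth);
        end
        if mean(accs) > bestAcc
            bestAcc  = mean(accs);
            bestArch = repmat(s, 1, nLayers);
        end
    end
end

% Voting: train 10 networks with the selected architecture and sum their outputs.
votes = zeros(2, size(Xtest, 2));
for k = 1:10
    net = patternnet(bestArch);
    net.divideFcn = 'dividetrain';
    net.trainParam.showWindow = false;
    net = train(net, Xtrain, Ttrain);
    votes = votes + net(Xtest);
end
[~, finalPred] = max(votes, [], 1);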

Figure 5. After preprocessing, the EEG data vector (input) has length 56 (4 bands × 14 channels). Correspondingly, the number of ANN input nodes is 56; the output layer during the optimization process has 2 nodes (we classify only L0 and L3 difficulty levels). The number of hidden layers varies between 1 and 3, and the number of hidden nodes is kept the same for all hidden layers and takes values 1, 5, 10, 15, 20, 30, 40, 50, 60, 70, 80, 90.

3. Results
3.1. Behavioral Data
3.1.1. Reaction Time Data
The repeated measures ANOVA with the factor Condition [L1, L2, L3] revealed a significant main effect (F(2,46) = 12.78, p < 0.001, ηp2 = 0.357; see Figure 6). Post hoc paired t-tests for each pair of conditions (i.e., L1 vs. L2, L1 vs. L3, L2 vs. L3) were calculated. Results indicate a significant difference between L1 and L2 (t(23) = −3.95, p = 0.001) and L1 and L3 (t(23) = −5.22, p < 0.001). No significant difference was detected between L2 and L3 (t(23) = −1.01, p > 0.2). On average, reaction times were 596.9 ms (±59.4 ms) for condition 1, 656.9 ms (±95.6 ms) for condition 2, and 675.4 ms (±59.4 ms) for condition 3.

Figure 6. The behavioral analysis confirmed the experimental conditions as reflected by increasing task demand with slower responses and reduced accuracy across the three conditions (corresponding to difficulty levels L1, L2, L3). (A) Mean reaction times across days for each condition. A significant main effect of condition emerged. Post hoc tests indicate a significant difference between L1 and L2 and L1 and L3. *** p = 0.001. (B) Mean fraction of correct responses across days for each condition. A significant main effect of condition emerged. Post hoc tests indicate a significant difference between all conditions. *** p = 0.001, ** p = 0.005.

3.1.2. Accuracy of Responses
The ‘fraction correct’ (FC; i.e., the number of correct responses in comparison to all responses) was calculated across days and entered into a repeated measures ANOVA with the factor condition. The main effect was significant (F(2,46) = 41.26, p < 0.001, ηp2 = 0.642). Post hoc paired samples t-tests revealed a significant difference between all conditions (L1 and L2: t(23) = 3.23, p < 0.005; L1 and L3: t(23) = 8.97, p < 0.001; L2 and L3: t(23) = 5.31, p < 0.001). FC decreased across conditions, and hence with an increase in cognitive load, with a mean FC of 0.998 (±0.003) in condition 1 (L1), 0.98 (±0.03) in condition 2 (L2) and 0.93 (±0.03) in condition 3 (L3). Accordingly, all participants performed above chance level with FC scores of 0.88 and higher. The distribution of the data was skewed to the left, in particular for L1. According to the Shapiro-Wilk test, the assumption of normality was violated. However, based on a sufficiently large sample size and the relative robustness of the repeated measures ANOVA and t-statistics against non-normality if the other assumptions are met, we decided to report the results [34–36]. Four data points can be considered outliers in L2 based on a criterion of ±3SD. These data points represented individuals with a higher number of mistakes across conditions. Since the FC scores of the respective individuals were still clearly above chance, we did not exclude them from analysis. In addition, test-statistics were not heavily influenced by these values.


3.1.3. Usability
In the current investigation, participants indicated on average a SUS composite score of 79.6 (±11.2), indicating subjectively perceived good usability [37]. Notably, more than 70% of the participants had a score higher than 80% (n = 17).

According to the TUI-II, participants had low levels of anxiety with regard to testing new technology (technology anxiety: 1.7 ± 0.63) and were curious to test it (curiosity: 4.2 ± 1.3) prior to the training. Upon completion of the training, participants indicated that they had little skepticism with regard to the setup (2.1 ± 0.7) and had medium levels of interest (5.2 ± 0.9), accessibility (4.5 ± 0.9), and usefulness (4.5 ± 1.0). The highest ratings were obtained for user-friendliness (5.9 ± 0.9), indicating that participants experienced the use of the mobile EEG setup as convenient.

3.2. EEG Data
3.2.1. Statistical Results
Results of the repeated measures ANOVA were organized by lobes to increase comprehensibility of the data. This approach was chosen due to high intra-lobe consistency of the data. Importantly, eight electrodes represent the frontal lobes, whereas the parietal, temporal, and occipital lobes are represented by two electrodes each. The repeated measures ANOVA revealed a significant three-way interaction between Lobe, Band, and Condition (F(27,594) = 14.42, p < 0.001) indicating band-specific effects. In addition, significant main effects of Band (F(3,66) = 19.28, p < 0.001), Lobe (F(3,66) = 17.59, p < 0.001) and Condition (F(3,66) = 5.15, p < 0.005) emerged. No significant effect for Day was found (F(1,22) = 0.49, p > 0.4). To increase comprehensibility of the results, we grouped the findings by frequency bands (see Figure 7 and Table A1 in the Appendix A).

Figure 7. Marginal means with confidence intervals in the three-factor ANOVA of power measures. The panels illustrate the logarithmically scaled power measure in the different EEG frequency bands ((A): theta, (B): delta, (C): alpha, (D): beta). On the x-axis, the different conditions are mapped with increasing cognitive load (L0: baseline; L1: detection; L2: one-back; L3: 2-back). Data are averaged for electrodes over the four cerebral lobes (red: frontal; yellow: temporal; cyan: occipital; orange: parietal). Stars in the same color highlight significant post hoc comparisons across conditions for the respective lobe (e.g., a red star on the line between L1 and L2 indicates a significant difference between these conditions for the frontal lobe with * p < 0.01). The robust replication of the primary hypotheses using the mobile technology can be observed: frontal theta (red curve and stars in Panel A) increased with increasing cognitive load (from L0 to L3), whereas occipital alpha (cyan in Panel C) decreased with higher cognitive load. Further, a similar pattern emerges for the delta band and potentially the beta band as well. * p < 0.01, a.u.: arbitrary unit.

Post hoc t-tests confirmed an increase of frontal theta relative band power for all comparisons between conditions apart from L1–L2 (p < 0.01). For occipital and parietal theta, comparisons between baseline L0 and L2/L3 revealed significant differences (p < 0.01). In addition, occipital alpha relative band power was significantly reduced for the comparisons between L1 and L2/L3 (p < 0.01). Furthermore, we explored the relative power in the beta and delta frequency bands. Significant differences emerged for frontal delta (all comparisons apart from L2–L3; p < 0.01) and occipital delta (all comparisons apart from L1–L2 and L2–L3; p < 0.01). For parietal delta, only the comparison between baseline L0 and L2/L3 revealed significant differences (p < 0.01). Overall, relative delta frequency band power increased with task difficulty. For parietal beta, significant differences were found for L0–L3 and L1–L3 (p < 0.01; see Table A1 in the Appendix A), reflecting an increase of relative band power with task difficulty. In summary, we could confirm the pre-registered primary endpoints in terms of a reduction of occipital alpha and an increase of frontal theta with increasing cognitive load. Effects similar to the theta band emerged in the delta and, to a lesser extent, the beta frequency domains.
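A minimal sketch of such post hoc paired t-tests, here for a single lobe/band combination across the four conditions, could look as follows. The placeholder matrix is random and only stands in for the per-subject band power values; it is not the study's data.

```python
from itertools import combinations

import numpy as np
from scipy.stats import ttest_rel

# Placeholder matrix: one row per subject, one column per condition (L0-L3),
# holding, e.g., frontal theta relative band power on a logarithmic scale.
rng = np.random.default_rng(0)
frontal_theta = rng.normal(size=(23, 4))

labels = ["L0", "L1", "L2", "L3"]
for i, j in combinations(range(4), 2):
    # Paired (within-subject) t-test between two conditions.
    t, p = ttest_rel(frontal_theta[:, i], frontal_theta[:, j])
    print(f"{labels[i]}-{labels[j]}: t = {t:.1f}, p = {p:.3f}")
```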

3.2.2. Classification of the Conditions with Machine Learning
ANN Optimization for Different Data Division Setups
To investigate the fitting properties of the ANN, we tested different network architectures (number of layers and nodes) for 56 features (4 bands, 14 electrodes) and the 3 different options of data division (random participant assignment, the first day only chosen as the training set, half of the second day added to the training set). Only very small differences emerged between the different architectures, which may be attributed to efficient internal MATLAB optimization algorithms. The differences between the versions of train and test data separation were more substantial and in line with our expectations. Namely, the best accuracy (86.24%) was achieved using 1.5 days of the experiment as the training set and the remaining 0.5 day as the test set.
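The network optimization itself was carried out in MATLAB. Purely as an illustration, a comparable shallow network on the 56 band-power features could be set up in Python with scikit-learn as sketched below. The placeholder arrays, the single hidden layer of 20 nodes, and the label coding are assumptions of this sketch, not the study's configuration.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Placeholder feature matrices standing in for the real data: one row per epoch,
# 56 columns (relative power of 4 bands at 14 electrodes), labels 0-3 for L0-L3.
# In the study, "train" would hold day 1 plus the first half of day 2 and
# "test" the remaining half of day 2 (the 1.5/0.5 day division).
rng = np.random.default_rng(0)
X_train, y_train = rng.random((1200, 56)), rng.integers(0, 4, 1200)
X_test, y_test = rng.random((400, 56)), rng.integers(0, 4, 400)

# A small fully connected network as an illustrative architecture choice.
clf = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(20,), max_iter=2000, random_state=0),
)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```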

Performance in Different Recognized Classes
Applying the optimal network architecture for each train/test separation, we investigated the pairwise separation of the conditions: L0 vs. L1, L0 vs. L2, L0 vs. L3, L1 vs. L2, and L2 vs. L3. Additionally, we increased the number of ANN output classes to four (the rest of the architecture remains the same) and classified all four conditions simultaneously. The average accuracy is displayed in Figure 8 in reference to chance level (orange lines at 25% for all four and 50% for two conditions). The highest accuracy was obtained for L0 and L3 after training with the 1.5 day data (86.24%). For better illustration, the receiver operating characteristic (ROC) curves for the train and test data are presented in Figure 9. Sensitivity and specificity on the test data are well balanced at 0.83 and 0.86, respectively, with a precision of 0.78 and a recall of 0.83. This result confirmed that the most discriminative workload levels led to the highest difference in the separating features. The lowest accuracy of 40% emerged in the classification of all four workload levels with randomized individuals, still remaining above chance level. The same four-condition discrimination accuracy improved to 45% when training data came from the same individuals but another day, and to 53% using within-session calibration (1.5–0.5 day data split).
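The reported two-class metrics can be derived from a classifier's predictions as in the following self-contained sketch. The synthetic data and the classifier settings are illustrative assumptions, so the printed numbers will not reproduce the values reported above.

```python
import numpy as np
from sklearn.metrics import (confusion_matrix, precision_score, recall_score,
                             roc_auc_score, roc_curve)
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the binary L0 vs. L3 problem: 56 band-power features per epoch.
rng = np.random.default_rng(0)
X = rng.random((400, 56))
y = rng.integers(0, 2, 400)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

clf = make_pipeline(StandardScaler(),
                    MLPClassifier(hidden_layer_sizes=(20,), max_iter=2000, random_state=0))
clf.fit(X_train, y_train)
y_prob = clf.predict_proba(X_test)[:, 1]   # predicted probability of the "L3" class
y_pred = clf.predict(X_test)

fpr, tpr, _ = roc_curve(y_test, y_prob)    # points of the ROC curve
auc = roc_auc_score(y_test, y_prob)
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
sensitivity = tp / (tp + fn)               # identical to recall of the positive class
specificity = tn / (tn + fp)
print(f"AUC {auc:.2f} | sensitivity {sensitivity:.2f} | specificity {specificity:.2f} | "
      f"precision {precision_score(y_test, y_pred):.2f} | recall {recall_score(y_test, y_pred):.2f}")
```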


Figure 8. Accuracy of the ANN classifier for different train/test data divisions (training data from 1.5 days, training data from 1 day, training and validation data from separate subjects). The accuracy is shown for the separation of all four conditions (25% chance level, orange bar) and between each pair of conditions (50%). The classification accuracy profits from using data from the same day (1.5 day condition) and from the same subject (1 day condition) as well as from a stronger difference between the cognitive load conditions (e.g., L0–L3 versus L1–L2). This pattern suggests that inter- and intra-subject variance affect the performance, but also that the network seems to rely on relevant cues, since the separation improves with the load difference.

Figure 9. ROC (receiver operating characteristic) curves for the two-class discrimination for the train (blue) and test (red) data. The AUC (area under the curve) is 0.98 and 0.92 for the train and test data, respectively.


The confusion matrix for the four-condition classification reveals that the errors are rather evenly distributed and that the closer together the conditions are, the higher the chance of confusion (Figure 10).


Figure 10. Confusion matrix for the four-condition discrimination (classes 0, 1, 2, 3). Overall, an accuracy of 48.4% almost doubles the random guess baseline (25%), and the confusion between neighboring classes is higher than between more distinct ones.

Data Subsets: Target/Non-Target, Correct/Incorrect
It can be expected that the EEG data differentiates depending on the presented stimulus being a target or non-target, and possibly based on the correctness of the subject's response. Therefore, we tested different data subsets including epochs with the target stimulus (no assumption on the response), epochs with the target stimulus and a correct response, epochs with a non-target stimulus (no assumption on the response), and epochs with a correct response (no assumption on the stimulus type). Our data indicate that performance is the worst for target stimuli and correct responses combined, and the best results were obtained for all available epochs. The classification is performed for conditions L1 and L3, since L0 neither includes the presentation of stimuli nor the recording of responses. Suspecting that those results are due to the data quantity and not its intrinsic properties, we randomly subsampled the larger data subsets to the size of the subset with target stimuli and correct responses. However, the tendency remains the same and overall accuracy drops because of the reduced size of the training set.
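The size-matched subsampling described above can be sketched as follows; the placeholder subsets, their sizes, and the variable names are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder epoch subsets standing in for the real feature matrices
# (all epochs vs. target-stimulus/correct-response epochs only).
X_all, y_all = rng.random((600, 56)), rng.integers(0, 2, 600)
X_target_correct, y_target_correct = rng.random((180, 56)), rng.integers(0, 2, 180)

def subsample_to(n_target, X, y, rng):
    """Randomly draw n_target epochs (without replacement) from a larger subset."""
    idx = rng.choice(len(X), size=n_target, replace=False)
    return X[idx], y[idx]

# Match the larger subset to the size of the smallest one before training,
# so that accuracy differences cannot be explained by data quantity alone.
X_all_sub, y_all_sub = subsample_to(len(X_target_correct), X_all, y_all, rng)
print(X_all_sub.shape)  # (180, 56)
```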

Leave-One-Out
For the optimized size of the neural network, we applied a leave-one-out approach. While this approach has limited suitability for the optimization procedure, it can be successfully used for further investigations. In particular, it shows the potential of collecting more data and allows a better understanding of the interpersonal data variability. Figure 11 presents the results for leaving out subject x (x = 1, 3, ..., 25). The resulting average is 81.12%, which is substantially higher than the 74.31% on the test set in the 16-4-4 subject-wise data division. The highest results in Figure 11 (green) go as high as 96%. In contrast, the lowest result, for subject 22 (red), is 61%. The variation of the results appears very large.
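At the subject level, this corresponds to a leave-one-group-out scheme. A hedged sketch using scikit-learn is given below; the random placeholder data and the small network only mirror the procedure and are not the study's MATLAB implementation.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Placeholder data: epochs x 56 band-power features, binary labels (L0 vs. L3),
# and a subject ID per epoch so that whole subjects are held out.
rng = np.random.default_rng(0)
X = rng.random((960, 56))
y = rng.integers(0, 2, 960)
subjects = np.repeat(np.arange(24), 40)  # 24 subjects, 40 epochs each

logo = LeaveOneGroupOut()
scores = []
for train_idx, test_idx in logo.split(X, y, groups=subjects):
    clf = make_pipeline(StandardScaler(),
                        MLPClassifier(hidden_layer_sizes=(20,), max_iter=2000, random_state=0))
    clf.fit(X[train_idx], y[train_idx])
    scores.append(clf.score(X[test_idx], y[test_idx]))  # accuracy on the held-out subject

print(f"mean accuracy over held-out subjects: {np.mean(scores):.2%}")
```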



Figure 11. The results of cross-validation with a leave-one-out approach. The classifier separates L0 and L3 in the 1.5 days of training data condition. Subject 2 was removed during the initial data preparation stage due to insufficient data quality. The average score was 81.1% and all subjects achieved accuracy above chance level.

4. Discussion
Technological advances in the last decade have led to considerable changes in the health-care system, offering novel approaches to assessment and treatment. The aim of the present investigation was threefold: we wanted to assess (1) if a mobile EEG setup can be used to track cognitive load in healthy participants, (2) if the different levels of effort can be classified based on machine learning approaches, and (3) the ease of use perceived by the participants. The mobile EEG solution with tablet-based interaction can detect changes in cognitive load as reflected by alterations in different frequency bands. Importantly, an increase in frontal theta and a decrease in occipital alpha was confirmed, as robustly observed by clinical and research EEG devices. In addition, the integrated machine learning models could discriminate above chance level and between the most distinct conditions with an accuracy of 86%. A majority of participants indicated good usability and the data quality was at least in an acceptable range. Accordingly, the set-up may be a feasible method to track distinct levels of mental effort or cognitive load.

4.1. Behavioral Data
First of all, we investigated the feasibility of the implemented task to induce different levels of mental effort as reflected by the behavioral data. We expected significant differences in the fraction of correct responses as well as mean reaction times reflecting the increasing difficulty level across conditions. Behavioral outcomes largely confirmed different difficulty levels of conditions 1–3. Accordingly, reaction times increased across conditions; however, only the comparisons of condition 1 (L1) with conditions 2 (L2) and 3 (L3) were significant. As a second indicator of task difficulty, the fraction of correct responses (FC) as a measure of accuracy was calculated. All comparisons between conditions were significant. While these results are in line with our expectations, larger differences between conditions (i.e., more distinct difficulty levels) may have facilitated data interpretation of the EEG signal. This is an important aspect that should be considered in upcoming investigations. Based on previous research, we did not integrate a more difficult condition such as a 3-back trial to prevent feelings of frustration and drop-out [7,38]. Instead, we chose the current design as a trade-off between distinct difficulty levels and feasibility for participants.




4.2. Cognitive Load & EEG Spectral Power
EEG spectral power can be used as a proxy of cognitive load. In particular, it has been shown that alpha and theta spectral power are meaningful indicators of mental effort [7]. Across lobes and conditions, we observed a drop in the alpha spectral power. However, this drop was only significant for the occipital lobe. In previous studies, decreased occipital alpha activity has been associated with higher cognitive effort [30], as indicated by an increased level of visual scanning [39]. In addition, Gundel and Wilson [39] detected an increase in frontal theta activity that was associated with mental effort. Our findings are in line with this observation: we observed an increase in the theta and beta bands across conditions. Previously, an increase in both bands has been associated with higher cognitive load (for review see [40]). In particular, theta spectral power is indicative of the allocation of cognitive resources, rendering it a useful predictor of cognitive load [30,40]. In line with our observations, this holds particularly true for frontal regions [41]. Beta has been linked to states of increased alertness and engagement [42]. While it is often located most prominently at occipital and temporal sites, significant effects in our investigation emerged in the parietal lobe. Differences in the observed data pattern may result from specific aspects of the task design. Following this line of interpretation, it has been shown that different tasks used to induce workload result in differences in the data structure [30]. For delta, we observed a continuous increase in the spectral power with higher cognitive load. This is in line with observations linking delta band spectral power to tasks necessitating internal concentration [43]. Accordingly, our data largely confirm the results of previous investigations, and thus support the feasibility of mobile EEG setups to track mental effort.

4.3. Classification with Machine Learning
The applied machine learning models showed promise in the context of potential practical applications to discriminate between cognitive load levels. The obtained results reached an accuracy level of 86% for the discrimination between the most distant conditions. This is in line with the observations by Hogervorst and colleagues [5]. In parallel to our findings, the authors observed the best performance for the most distinct conditions reflecting rest and the highest task load. Exploring different train/test data divisions showed that personal calibration improves the results substantially; however, adding session calibration is the most promising setup. On the other hand, the leave-one-out approach revealed that increasing the training data set further improves the accuracy from 74% (16-4-4 division) to 81% on average. In addition, we observed very high variance in the leave-one-out setup, ranging from 61% to 96%. One possible explanation is that the low classification accuracy in some cases results from the higher difficulty of condition 3 (L3) of the N-back task. Our results are consistent with those reported by Brouwer and colleagues [7]. Their machine learning models were designed to distinguish the 0-back from 2-back conditions. The best performing models resulted in the 80–90% range of accuracy, similarly revealing large personal variation and reduced accuracy when conditions were closer together. Despite the design differences, our results are in line with previous investigations and thus support the potential of mobile EEG devices to discriminate cognitive states associated with different levels of cognitive load. Interestingly, filtering the data by target or non-target marking and by response correctness reduced the performance of the classifier. The differences may be related to the high variation due to the data set size and should be further investigated. While our research was focused on the application of mobile EEG sensors for workload quantification and the comparison of the results to preceding investigations with similar task designs, it is important to track the development of the more general machine-learning methods applied to EEG-based workload classification. Our results are consistent with earlier projects such as [44] that used similar methods (e.g., shallow neural networks).

Importantly, the accuracy reported for cross-task classification is rather low, which is in line with the high data variability we observed. More recent publications suggest that larger data sets in combination with more complex deep-learning algorithms, such as recurrent and convolutional networks or even their combinations [45,46], may be promising alternatives that need further investigation. Moreover, deep learning algorithms are a promising candidate in terms of handling cross-task classification [47]. In this context, EEG data collected during workload-related tasks have become openly available [48,49] and hence pave the way for investigations on a larger scale that can be used to evaluate the reproducibility of our findings.

4.4. Data Quality
The analysis was conducted using a fully automatic pre-processing pipeline. Following Kutafina et al. [27], as an overall index of data quality, the amount of data rejected at each step of the analysis was recorded. The raw (only filtered and epoched) data set contains 90 epochs of 1500 ms for the baseline measurement (L0) and 150 epochs of the same duration for each task condition (L1–L3). After additional processing steps, approximately 14% of the data acquired on Day 1 and 10% of the data acquired on Day 2 were rejected. The amplitude threshold accounted for approximately two thirds of the rejected data. Our results indicate that the data rejection rate is within an acceptable range.
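As an illustration of the amplitude-based rejection step, a minimal sketch is given below. The threshold of 100 µV, the placeholder epoch array, and the sampling assumptions are illustrative only and do not reflect the exact pipeline parameters, which are available in the linked repository.

```python
import numpy as np

def reject_by_amplitude(epochs, threshold_uv=100.0):
    """Drop epochs in which any channel exceeds the absolute amplitude threshold.

    `epochs` has shape (n_epochs, n_channels, n_samples) in microvolts.
    Returns the retained epochs and the rejection rate.
    """
    peak = np.abs(epochs).max(axis=(1, 2))   # worst-case amplitude per epoch
    keep = peak <= threshold_uv
    return epochs[keep], 1.0 - keep.mean()

# Placeholder data: 150 epochs, 14 channels, 192 samples per 1500 ms epoch (assumed rate).
rng = np.random.default_rng(0)
demo = rng.normal(0.0, 30.0, size=(150, 14, 192))
clean, rejection_rate = reject_by_amplitude(demo)
print(f"rejected {rejection_rate:.1%} of epochs")
```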

4.5. Usability
Participants indicated on average good usability according to the SUS. In addition, according to the TUI-II, participants showed low levels of anxiety to test the setup and were curious about the technology. While levels of skepticism were low after completion of the cognitive tasks, participants perceived the setup as user-friendly (i.e., easy to understand and use), indicated that they are interested in exploring new technologies, found the setup accessible, and expected a benefit from using it for certain tasks. Apart from the quantitative assessment, we also evaluated the device handling from the experimenter perspective. Participants were able to attach the electrodes independently, used the saline solution, and mounted the device correctly. The feedback from the recording software regarding electrode impedance facilitates the process. However, the participants expressed concerns about correct electrode placement, despite the rigid EEG frame. Therefore, attaching the electrodes is a possible drawback of the setup. It requires good manual precision and can be difficult for EEG-naive participants. This obstacle may be particularly prominent in patients who are elderly or lack digital literacy. We observed that during the first session, presence of and support by the experimenter was beneficial to facilitate handling of the mobile setup. Therefore, a supervised training session should be considered when planning home-based use of the device. Of note, while the rigid plastic construction has many advantages, such as structured electrode positioning, it caused some discomfort in individual participants. This aspect has already been discussed by Ekandem and colleagues [50], who stress the importance of considering the ergonomics of modern low-cost brain-computer interfaces, in particular for prolonged use. Our qualitative assessment of user comfort highlights the need to investigate these aspects also for short-duration usage (i.e., ~1 h). Navigation of the tablet, on the other hand, was convenient due to the simplistic design and instructions.

4.6. Limitations
Generalizability of the data is limited based on the participant selection. All participants were university students; hence, it is likely that they are exposed to technology on a daily basis. Results with regard to usability, but also neural activation patterns, may differ for a more diverse audience including older or digitally illiterate individuals. While the classification approach was able to discriminate with an accuracy of 86% between the most distinct classes, performance may be limited due to the sample size. It has to be determined if, with a larger training set and added personal calibration, the accuracy of discrimination

becomes sufficient to provide, for instance, objective assessments of home-based cognitive computer training.

5. Conclusions
In the current investigation, we tested a fully mobile setup for cognitive training tasks combined with EEG recording to detect cognitive load. The setup was based on a mobile EEG device combined with a tablet to display the cognitive task, allowing for easy mounting and self-testing. Analysis of the behavioral data confirmed differences in task performance depending on task difficulty. Furthermore, we found a decrease of occipital alpha frequency band power and an increase of frontal theta frequency band power upon higher difficulty of the task, confirming our main hypothesis. In addition, automatic cognitive effort classification revealed that the machine learning approach discriminates between the most distinct levels of cognitive load with an accuracy of 86%. Our findings suggest feasibility of the fully mobile setup to detect distinct levels of cognitive load as reflected by band power changes. In addition, subjectively rated usability is adequate, given an initial in-person training session on electrode mounting. Future investigations are needed to evaluate results in more diverse samples including a wider age range and patient groups.

Author Contributions: J.Z., E.K., S.M.J., K.M. and A.H. designed the experiment; J.Z., E.K., S.M.J. and K.M. supervised the work. A.B. adjusted the data collection platform. A.H. recruited the participants, collected and annotated the data. J.Z., E.K., R.P., A.B., A.H. and B.H. analyzed the data; E.K. and J.Z. wrote the manuscript. All authors have read and agreed to the published version of the manuscript.

Funding: This work was funded by the Excellence Initiative of the German federal and state governments (G:(DE-82)ZUK2-SF-OPSF424), by the Faculty of Applied Mathematics AGH UST statutory tasks within subsidy of the Polish Ministry of Science and Higher Education, the German Research Foundation (Deutsche Forschungsgemeinschaft—DFG, 269953372/GRK2150) and the German Ministry for Education and Research (BMBF; APIC: 01EE1405A-C, 01DN18026). The authors report no further financial disclosures or potential conflicts of interest.

Institutional Review Board Statement: The study was conducted according to the guidelines of the Declaration of Helsinki, and approved by the Independent Ethics Committee of the RWTH Aachen Faculty of Medicine (EK 284/18, date of approval: 26 November 2018).

Informed Consent Statement: Informed consent was obtained from all subjects involved in the study.

Data Availability Statement: The data presented in this study are available on request from the corresponding author. The data are not publicly available due to restrictions based on the informed consent statement. The code for EEG data processing and machine learning is available from https://github.com/rwth-imi/mEEG_cognitive_effort_tracking (accessed on 29 July 2021).

Conflicts of Interest: The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

Appendix A

Table A1. Statistics on power measures in the frequency bands.

Baseline comparisons: L0–L1, L0–L2, L0–L3; active comparisons: L1–L2, L2–L3, L1–L3. Rows are grouped by frequency band and lobe.

Theta
Frontal: L0–L1: t = −5.8, p < 0.001; L0–L2: t = −14.3, p < 0.001; L0–L3: t = −8.4, p < 0.001; L1–L2: t = −2.5, p = 0.02; L2–L3: t = −3.0, p = 0.007; L1–L3: t = −4.7, p < 0.001
Temporal: L0–L1: t = −1.8, p > 0.05; L0–L2: t = −1.7, p > 0.05; L0–L3: t = −2.3, p = 0.03; L1–L2: t = 0.4, p > 0.1; L2–L3: t = −1.1, p > 0.1; L1–L3: t = −0.8, p > 0.1
Occipital: L0–L1: t = −2.5, p = 0.02; L0–L2: t = −3.2, p = 0.004; L0–L3: t = −3.2, p = 0.004; L1–L2: t = 0.5, p > 0.1; L2–L3: t = −1.4, p > 0.1; L1–L3: t = −1.4, p > 0.1
Parietal: L0–L1: t = −2.0, p > 0.05; L0–L2: t = −3.8, p = 0.001; L0–L3: t = −3.1, p = 0.005; L1–L2: t = −1.5, p > 0.1; L2–L3: t = −0.7, p > 0.1; L1–L3: t = −1.9, p > 0.05

Delta
Frontal: L0–L1: t = −5.5, p < 0.001; L0–L2: t = −10.4, p < 0.001; L0–L3: t = −8.1, p < 0.001; L1–L2: t = −3.1, p = 0.005; L2–L3: t = −2.5, p = 0.02; L1–L3: t = −4.7, p < 0.001
Temporal: L0–L1: t = 0.7, p > 0.1; L0–L2: t = 0.6, p > 0.1; L0–L3: t = −2.3, p = 0.03; L1–L2: t = −0.1, p < 0.1; L2–L3: t = −2.6, p = 0.02; L1–L3: t = −2.3, p = 0.03
Occipital: L0–L1: t = −3.1, p = 0.005; L0–L2: t = −3.9, p = 0.001; L0–L3: t = −5.9, p < 0.001; L1–L2: t = −0.7, p > 0.1; L2–L3: t = −2.3, p = 0.03; L1–L3: t = −3.0, p = 0.006
Parietal: L0–L1: t = −1.6, p > 0.1; L0–L2: t = −3.5, p = 0.002; L0–L3: t = −4.9, p < 0.001; L1–L2: t = −1.6, p > 0.1; L2–L3: t = −1.5, p > 0.1; L1–L3: t = −2.9, p = 0.01

Alpha
Frontal: L0–L1: t = 0.8, p > 0.1; L0–L2: t = 1.4, p > 0.1; L0–L3: t = 1.4, p > 0.1; L1–L2: t = 0.5, p > 0.1; L2–L3: t = 0.8, p > 0.1; L1–L3: t = 1.1, p > 0.1
Temporal: L0–L1: t = 1.8, p > 0.05; L0–L2: t = 2.2, p = 0.04; L0–L3: t = 1.6, p > 0.1; L1–L2: t = 0.8, p > 0.1; L2–L3: t = 0.1, p > 0.1; L1–L3: t = 0.6, p > 0.1
Occipital: L0–L1: t = 0.8, p > 0.1; L0–L2: t = 2.7, p = 0.01; L0–L3: t = 2.5, p = 0.02; L1–L2: t = 3.1, p = 0.005; L2–L3: t = 1.5, p > 0.1; L1–L3: t = 3.0, p = 0.006
Parietal: L0–L1: t = 0.3, p > 0.1; L0–L2: t = 0.5, p > 0.1; L0–L3: t = 1.1, p > 0.1; L1–L2: t = 0.2, p > 0.1; L2–L3: t = 1.0, p > 0.1; L1–L3: t = 1.1, p > 0.1

Beta
Frontal: L0–L1: t = −0.7, p > 0.1; L0–L2: t = −0.9, p > 0.1; L0–L3: t = −1.2, p > 0.1; L1–L2: t = −0.4, p > 0.1; L2–L3: t = −0.5, p > 0.1; L1–L3: t = −0.9, p > 0.1
Temporal: L0–L1: t = 0.1, p > 0.1; L0–L2: t = −0.3, p > 0.1; L0–L3: t = −2.3, p = 0.03; L1–L2: t = −0.3, p > 0.1; L2–L3: t = −2.1, p = 0.05; L1–L3: t = −2.1, p = 0.05
Occipital: L0–L1: t = −0.5, p > 0.1; L0–L2: t = −0.4, p > 0.1; L0–L3: t = −2.6, p = 0.02; L1–L2: t = −0.0, p > 0.1; L2–L3: t = −1.9, p > 0.05; L1–L3: t = −2.3, p = 0.03
Parietal: L0–L1: t = 0.6, p > 0.1; L0–L2: t = −1.5, p > 0.1; L0–L3: t = −4.9, p < 0.001; L1–L2: t = −1.8, p > 0.1; L2–L3: t = −2.7, p = 0.01; L1–L3: t = −4.2, p < 0.001

References
1. Watson, H.A.; Tribe, R.M.; Shennan, A.H. The role of medical apps in clinical decision-support: A literature review. Artif. Intell. Med. 2019, 100, 101707. [CrossRef]
2. Lee, Y.-Y.; Hsieh, S. Classifying Different Emotional States by Means of EEG-Based Functional Connectivity Patterns. PLoS ONE 2014, 9, e95415. [CrossRef]
3. Saeed, S.M.U.; Anwar, S.M.; Khalid, H.; Majid, M.; Bagci, U. EEG Based Classification of Long-Term Stress Using Psychological Labeling. Sensors 2020, 20, 1886. [CrossRef][PubMed]
4. Cassani, R.; Estarellas, M.; San-Martin, R.; Fraga, F.J.; Falk, T.H. Systematic Review on Resting-State EEG for Alzheimer's Disease Diagnosis and Progression Assessment. Dis. Markers 2018, 2018, 5174815. Available online: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6200063/ (accessed on 14 April 2021). [CrossRef][PubMed]
5. Hogervorst, M.A.; Brouwer, A.-M.; van Erp, J.B.F. Combining and comparing EEG, peripheral physiology and eye-related measures for the assessment of mental workload. Front. Neurosci. 2014, 8, 322. Available online: https://www.frontiersin.org/articles/10.3389/fnins.2014.00322/full (accessed on 14 April 2021). [CrossRef]
6. Jann, K.; Dierks, T.; Boesch, C.; Kottlow, M.; Strik, W.; Koenig, T. BOLD correlates of EEG alpha phase-locking and the fMRI default mode network. NeuroImage 2009, 45, 903–916. [CrossRef]
7. Brouwer, A.-M.; Hogervorst, M.A.; van Erp, J.B.F.; Heffelaar, T.; Zimmerman, P.H.; Oostenveld, R. Estimating workload using EEG spectral power and ERPs in the n-back task. J. Neural Eng. 2012, 9, 045008. [CrossRef]
8. Ismail, L.E.; Karwowski, W. Applications of EEG indices for the quantification of human cognitive performance: A systematic review and bibliometric analysis. PLoS ONE 2020, 15, e0242857. [CrossRef][PubMed]
9. Ryu, K.; Myung, R. Evaluation of mental workload with a combined measure based on physiological indices during a dual task of tracking and mental arithmetic. Int. J. Ind. Ergon. 2005, 35, 991–1009. [CrossRef]
10. Smith, M.E.; Gevins, A.; Brown, H.; Karnik, A.; Du, R. Monitoring Task Loading with Multivariate EEG Measures during Complex Forms of Human-Computer Interaction. Hum. Factors 2001, 43, 366–380. [CrossRef][PubMed]
11. Almahasneh, H.; Chooi, W.-T.; Kamel, N.; Malik, A.S. Deep in thought while driving: An EEG study on drivers' cognitive distraction. Transp. Res. Part F Traffic Psychol. Behav. 2014, 26, 218–226. [CrossRef]
12. Dehais, F.; Dupres, A.; Flumeri, G.D.; Verdiere, K.; Borghini, G.; Babiloni, F.; Roy, R. Monitoring Pilot's Cognitive Fatigue with Engagement Features in Simulated and Actual Flight Conditions Using an Hybrid fNIRS-EEG Passive BCI. In Proceedings of the 2018 IEEE International Conference on Systems, Man, and Cybernetics (SMC), Miyazaki, Japan, 7–10 October 2018; pp. 544–549.
13. Mills, C.; Fridman, I.; Soussou, W.; Waghray, D.; Olney, A.M.; D'Mello, S.K. Put your thinking cap on: Detecting cognitive load using EEG during learning. In Proceedings of the Seventh International Learning Analytics & Knowledge Conference, Vancouver, BC, Canada, 13–17 March 2017; Association for Computing Machinery: New York, NY, USA, 2017; pp. 80–89. Available online: https://doi.org/10.1145/3027385.3027431 (accessed on 14 April 2021).
14. van de Ven, R.M.; Murre, J.M.J.; Veltman, D.J.; Schmand, B.A. Computer-Based Cognitive Training for Executive Functions after Stroke: A Systematic Review. Front. Hum. Neurosci. 2016, 10, 150. Available online: https://www.frontiersin.org/articles/10.3389/fnhum.2016.00150/full?#B14 (accessed on 14 April 2021). [CrossRef][PubMed]
15. Twamley, E.W.; Thomas, K.R.; Gregory, A.M.; Jak, A.J.; Bondi, M.W.; Delis, D.C.; Lohr, J.B. CogSMART Compensatory Cognitive Training for Traumatic Brain Injury: Effects Over 1 Year. J. Head Trauma Rehabil. 2015, 30, 391–401. [CrossRef][PubMed]
16. Subramaniam, K.; Luks, T.L.; Fisher, M.; Simpson, G.V.; Nagarajan, S.; Vinogradov, S. Computerized Cognitive Training Restores Neural Activity within the Reality Monitoring Network in Schizophrenia. Neuron 2012, 73, 842–853. [CrossRef]
17. Hussain, I.; Park, S.J. HealthSOS: Real-Time Health Monitoring System for Stroke Prognostics. IEEE Access 2020, 8, 213574–213586. [CrossRef]
18. Murphy, M.M. Telehealth Factors for Predicting Hospital Length of Stay. J. Gerontol. Nurs. 2018, 44, 16–20. [CrossRef]
19. Gorgens, K.A. Structured Clinical Interview for DSM-IV (SCID-I/SCID-II). In Encyclopedia of Clinical Neuropsychology; Kreutzer, J.S., DeLuca, J., Caplan, B., Eds.; Springer: New York, NY, USA, 2011; pp. 2410–2417. Available online: https://doi.org/10.1007/978-0-387-79948-3_2011 (accessed on 7 July 2021).
20. Schmidt, K.-H.; Metzler, P. Wortschatztest: WST; Beltz: Weinheim, Germany, 1992.
21. Watson, D.; Clark, L.A.; Tellegen, A. Development and validation of brief measures of positive and negative affect: The PANAS scales. J. Pers. Soc. Psychol. 1988, 54, 1063–1070. [CrossRef]
22. Kothgasser, O.D.; Felnhofer, A.; Hauk, N.; Kastenhofer, E.; Gomm, J.; Kryspin-Exner, I. Technology Usage Inventory—Manual; FFG: Wien, Austria, 2012.
23. Brooke, J. SUS—A Quick and Dirty Usability Scale. 1996. Available online: http://hell.meiert.org/core/pdf/sus.pdf (accessed on 29 July 2021).
24. Badcock, N.A.; Mousikou, P.; Mahajan, Y.; de Lissa, P.; Thie, J.; McArthur, G. Validation of the Emotiv EPOC® EEG gaming system for measuring research quality auditory ERPs. PeerJ 2013, 1, e38. [CrossRef]
25. Melnik, A.; Legkov, P.; Izdebski, K.; Kärcher, S.M.; Hairston, W.D.; Ferris, D.P.; König, P. Systems, Subjects, Sessions: To What Extent Do These Factors Influence EEG Data? Front. Hum. Neurosci. 2017, 11, 150. Available online: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5371608/ (accessed on 29 July 2021). [CrossRef]

26. Titgemeyer, Y.; Surges, R.; Altenmüller, D.-M.; Fauser, S.; Kunze, A.; Lanz, M.; Malter, M.P.; Nass, R.D.; von Podewils, F.; Remi, J.; et al. Can commercially available wearable EEG devices be used for diagnostic purposes? An explorative pilot study. Epilepsy Behav. 2020, 103, 106507. [CrossRef]
27. Kutafina, E.; Brenner, A.; Titgemeyer, Y.; Surges, R.; Jonas, S. Comparison of mobile and clinical EEG sensors through resting state simultaneous data collection. PeerJ 2020, 8, e8969. [CrossRef]
28. Mathiak, K.; Hertrich, I.; Zvyagintsev, M.; Lutzenberger, W.; Ackermann, H. Selective influences of cross-modal spatial-cues on preattentive auditory processing: A whole-head magnetoencephalography study. NeuroImage 2005, 28, 627–634. [CrossRef]
29. Delorme, A.; Makeig, S. EEGLAB: An open source toolbox for analysis of single-trial EEG dynamics including independent component analysis. J. Neurosci. Methods 2004, 134, 9–21. [CrossRef][PubMed]
30. Scharinger, C.; Soutschek, A.; Schubert, T.; Gerjets, P. Comparison of the Working Memory Load in N-Back and Working Memory Span Tasks by Means of EEG Frequency Band Power and P300 Amplitude. Front. Hum. Neurosci. 2017, 11, 6. Available online: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5263141/ (accessed on 5 September 2018). [CrossRef]
31. Zschocke, S.; Hansen, H.-C. Klinische Elektroenzephalographie; Springer: Berlin/Heidelberg, Germany, 2011; p. 657.
32. Smulders, F.T.Y.; Oever, S.T.; Donkers, F.C.L.; Quaedflieg, C.W.E.M.; van de Ven, V. Single-trial log transformation is optimal in frequency analysis of resting EEG alpha. Eur. J. Neurosci. 2018, 48, 2585–2598. [CrossRef][PubMed]
33. Brenner, A.; Kutafina, E.; Jonas, S.M. Automatic Recognition of Epileptiform EEG Abnormalities. Stud. Health Technol. Inform. 2018, 247, 171–175. [PubMed]
34. Berkovits, I.; Hancock, G.R.; Nevitt, J. Bootstrap Resampling Approaches for Repeated Measure Designs: Relative Robustness to Sphericity and Normality Violations. Educ. Psychol. Meas. 2000, 60, 877–892. [CrossRef]
35. Blanca, M.J.; Alarcón, R.; Arnau, J.; Bono, R.; Bendayan, R. Non-normal data: Is ANOVA still a valid option? Psicothema 2017, 29, 552–557.
36. Sullivan, L.M.; D'Agostino, R.B. Robustness of the t test applied to data distorted from normality by floor effects. J. Dent. Res. 1992, 71, 1938–1943. [CrossRef]
37. Bangor, A.; Kortum, P.; Miller, J. Determining what individual SUS scores mean: Adding an adjective rating scale. J. Usability Stud. 2009, 4, 114–123.
38. Ayaz, H.; Willems, B.; Bunce, B.; Shewokis, P.; Izzetoglu, K.; Hah, S.; Deshmukh, A.; Onaral, B. Cognitive Workload Assessment of Air Traffic Controllers Using Optical Brain Imaging Sensors. Adv. Underst. Hum. Perform. Neuroergonom. Hum. Factors Des. Spec. Popul. 2010, 21, 21–31.
39. Gundel, A.; Wilson, G.F. Topographical changes in the ongoing EEG related to the difficulty of mental tasks. Brain Topogr. 1992, 5, 17–25. [CrossRef][PubMed]
40. Klimesch, W. EEG-alpha rhythms and memory processes. Int. J. Psychophysiol. 1997, 26, 319–340. [CrossRef]
41. Esposito, F.; Aragri, A.; Piccoli, T.; Tedeschi, G.; Goebel, R.; Di Salle, F. Distributed analysis of simultaneous EEG-fMRI time-series: Modeling and interpretation issues. Magn. Reson. Imaging 2009, 27, 1120–1130. [CrossRef]
42. Kamiński, J.; Brzezicka, A.; Gola, M.; Wróbel, A. Beta band oscillations engagement in human alertness process. Int. J. Psychophysiol. 2012, 85, 125–128. [CrossRef]
43. Harmony, T. The functional significance of delta oscillations in cognitive processing. Front. Integr. Neurosci. 2013, 7, 83. [CrossRef]
44. Baldwin, C.L.; Penaranda, B.N. Adaptive training using an artificial neural network and EEG metrics for within- and cross-task workload classification. NeuroImage 2012, 59, 48–56. [CrossRef][PubMed]
45. Kwak, Y.; Kong, K.; Song, W.-J.; Min, B.-K.; Kim, S.-E. Multilevel Feature Fusion With 3D Convolutional Neural Network for EEG-Based Workload Estimation. IEEE Access 2020, 8, 16009–16021. [CrossRef]
46. Saadati, M.; Nelson, J.; Ayaz, H. Convolutional Neural Network for Hybrid fNIRS-EEG Mental Workload Classification. In Advances in Neuroergonomics and Cognitive Engineering; Ayaz, H., Ed.; Advances in Intelligent Systems and Computing; Springer International Publishing: Cham, Switzerland, 2020; pp. 221–232.
47. Zhang, P.; Wang, X.; Zhang, W.; Chen, J. Learning Spatial-Spectral-Temporal EEG Features With Recurrent 3D Convolutional Neural Networks for Cross-Task Mental Workload Assessment. IEEE Trans. Neural Syst. Rehabil. Eng. 2019, 27, 31–42. [CrossRef]
48. Shin, J.; von Lühmann, A.; Kim, D.-W.; Mehnert, J.; Hwang, H.-J.; Müller, K.-R. Simultaneous acquisition of EEG and NIRS during cognitive tasks for an open access dataset. Sci. Data 2018, 5, 180003. [CrossRef][PubMed]
49. Lim, W.L.; Sourina, O.; Wang, L.P. STEW: Simultaneous Task EEG Workload Data Set. IEEE Trans. Neural Syst. Rehabil. Eng. 2018, 26, 2106–2114. [CrossRef][PubMed]
50. Ekandem, J.I.; Davis, T.A.; Alvarez, I.; James, M.T.; Gilbert, J.E. Evaluating the ergonomics of BCI devices for research and experimentation. Ergonomics 2012, 55, 592–598. [CrossRef][PubMed]