Affective Computing: Overview of Theory, Techniques and Applications
Context Aware Systems — http://www.cs.unibo.it/difelice/
Prof. Marco Di Felice, Department of Computer Science and Engineering, University of Bologna

Affective Computing
q Affective computing is an emerging field of research that aims to enable intelligent systems to recognize, feel, infer and interpret human emotions (emotion-aware computing).
² Interdisciplinary research area: computer science (AI, NLP, ...), psychology, cognitive science, social science.
² Some sub-areas are not covered in this presentation; the focus here is on mobile applications.

https://www.affectiva.com/product/affectiva-automotive-ai/
Affective Computing
q USE CASES: SOCIAL MONITORING (I)
Driver state monitoring
² Monitor levels of driver fatigue and distraction to enable appropriate alerts and interventions that correct dangerous driving.
² Monitor driver emotions to enable interventions or route alternatives that help avoid road rage.
² Address the handoff challenge between driver and car in semi-autonomous vehicles.

Occupant Experience Monitoring
² Personalize content recommendations
² Adapt environmental conditions

[1] M. Chen, Y. Zhang, Y. Li, S. Mao, V. C. M. Leung, "EMC: Emotion-aware mobile cloud computing in 5G", IEEE Network 29(2): 32-38 (2015)
[2] Y. Qian et al., "EARS: Emotion-aware recommender system based on hybrid information fusion", Information Fusion, 46, March 2019, 141-146

Affective Computing
q USE CASES: RECOMMENDER SYSTEMS (II)

[The slide shows two system architectures side by side:]
² Architecture proposed in [1] — "Figure 1. The proposed EMC architecture": a mobile user, a local cloudlet and a remote cloud; data collected from wearable devices and the smartphone is preprocessed at the local cloudlet, while emotion recognition and emotion-aware service push run on remote cloud systems reached over the 5G network.
² Architecture proposed in [2] — "Fig. 1. System architecture of EARS": data collection (historical data, data from social networks and multimedia emotional data), data preprocessing and information fusion, followed by an emotion-aware recommender.

[1] Chih-Hung Wu, Yueh-Min Huang and Jan-Pan Hwang, "Review of affective computing in education/learning: Trends and challenges", British Journal of Educational Technology, Vol 47 No 6, 2016, 1304-1323

Affective Computing

q USE CASES: SMART EDUCATION (III)
INSTRUMENTS USED in AFFECTIVE COMPUTING STUDIES [1]

Figure 3 (from [1]): Proportion of physiological signals and instruments used in affective computing (AC) studies. EEG, electroencephalography; EMG, electromyography; fMRI, functional magnetic resonance imaging.
https://www.technologyreview.com/s/516606/facial-analysis-software-spots-struggling-students/

(Excerpt from [1]) The growing trend of instruments in AC in education/learning: Figure 3 shows various instruments used in AC in education/learning studies. The conventional questionnaire was still the most widely used in past studies (26%). Additionally, skin conductance response (16%), facial expression (11%), heartbeat (9%), electroencephalography (EEG) (6%), verbalization (6%), electromyography (EMG) (4%) and speech (3%) have also become popular. Other instruments used were the data recorded by the system, classroom observations, face-to-face conversations, interactive software agents and photoplethysmography (PPG). Figure 4 and Table 5 provide a trend of the top seven instruments used in AC in education/learning. Except for the conventional questionnaire, physiological signals (especially heartbeat and skin conductance response) and facial expressions have steadily prevailed since 2011, showing that various physiological signals and sensing technologies (e.g. facial expression) have been widely adopted. Among physiological signals, the most popular is skin conductance response (16%). Broken down by research domain (science, social science, language & art, engineering & computer science, others and non-specified), questionnaires were widely used everywhere, and skin conductance response is a popular physiological signal in science and social science education.

[1] https://robots.ieee.org/robots/octavia/

Affective Computing
q USE CASES: SOCIAL ROBOTS (IV)

“Octavia is a social humanoid robot with an expressive face and dexterous hands. It's designed to understand how humans perceive, think, and act, using her knowledge to interact naturally with people.”[1]

CREATOR Naval Research Laboratory & Xitome Design

Eugenia Politou, Efthimios Alepis, Constantinos Patsakis, "A survey on mobile affective computing", Computer Science Review, 25, 2017, 79-100

Affective Computing
q How to describe the emotions
q How to model affective data
q Which emotions can be recognized
q How to perform emotion detection
q Research issues
q Demo: Affective Tool

Soujanya Poria, Erik Cambria, Rajiv Bajpai, Amir Hussain, "A review of affective computing: From unimodal analysis to multimodal fusion", Information Fusion (37), 2017

Affective Computing
q Terminology

² Core affect à "Most elementary consciously accessible affective feelings that need not be directed at anything" (Russell and Barrett).
² Emotions à elicited by something, reactions to something.
² Moods à last longer than emotions, are less specific, less intense, less likely to be triggered by a particular event.
² Sentiments à thoughts, opinions or ideas based on feelings about a situation, or a way of thinking about something.

Affective Computing
q PROBLEM 1: How to describe the emotions?
² Categorical model à Six basic emotions (culturally universal), proposed by Ekman

https://www.paulekman.com
² Facial Action Coding System (FACS) à anatomically based system for describing all observable facial movement for every emotion.

http://emotiondevelopmentlab.weebly.com/circumplex-model-of-affect.html
Affective Computing
q PROBLEM 1: How to describe the emotions?
² Dimensional model à Circumplex Model of Affect (proposed by Russell)
All affective states arise from two fundamental neurophysiological systems:

² Valence (pleasure-displeasure), x axis
² Arousal (alertness), y axis

Each state is a linear combination of the two basic dimensions with different intensities (degrees).
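As an illustration of the dimensional model, the short sketch below (not from the slides; labels and thresholds are illustrative) maps a (valence, arousal) pair in [-1, 1] onto one of the four circumplex quadrants and an intensity value.

```python
import math

# Illustrative quadrant labels for the circumplex model (Russell).
QUADRANTS = {
    (True, True): "excited / elated",      # positive valence, high arousal
    (False, True): "tense / angry",        # negative valence, high arousal
    (False, False): "sad / depressed",     # negative valence, low arousal
    (True, False): "calm / relaxed",       # positive valence, low arousal
}

def classify_affect(valence: float, arousal: float) -> tuple[str, float]:
    """Map a point of the valence-arousal plane to a quadrant label
    and an intensity (distance from the neutral origin)."""
    label = QUADRANTS[(valence >= 0, arousal >= 0)]
    intensity = math.hypot(valence, arousal)  # 0 = neutral, sqrt(2) = extreme
    return label, intensity

print(classify_affect(0.8, 0.6))    # ('excited / elated', 1.0)
print(classify_affect(-0.3, -0.7))  # ('sad / depressed', ~0.76)
```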

Affective Computing
q PROBLEM 1: How to describe the emotions?
² Personality model à Big Five model (proposed by Goldberg)
² Hierarchical model of personality traits.
² Five broad categories represent personality at the highest level of abstraction.
² Each dimension summarises a large number of specific personality characteristics.
² Rating tools available (e.g. IPIP, NEO-FFI); see the scoring sketch below.
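To make the "rating tools" bullet concrete, here is a minimal, hypothetical sketch of how a short Big Five questionnaire is typically scored: items are grouped by trait, reverse-keyed items are flipped, and each trait score is the mean of its items. The item keys below are invented for illustration and are not taken from IPIP or NEO-FFI.

```python
# Hypothetical 1-5 Likert items; "+" = keyed positively, "-" = reverse-keyed.
ITEM_KEYS = {
    "extraversion":      [("E1", "+"), ("E2", "-")],
    "neuroticism":       [("N1", "+"), ("N2", "+")],
    "agreeableness":     [("A1", "+"), ("A2", "-")],
    "conscientiousness": [("C1", "+"), ("C2", "+")],
    "openness":          [("O1", "+"), ("O2", "-")],
}

def score_big_five(answers: dict[str, int], scale_max: int = 5) -> dict[str, float]:
    """Return one mean score per trait, reversing negatively keyed items."""
    scores = {}
    for trait, items in ITEM_KEYS.items():
        values = []
        for item, key in items:
            raw = answers[item]
            values.append(raw if key == "+" else scale_max + 1 - raw)
        scores[trait] = sum(values) / len(values)
    return scores

answers = {"E1": 4, "E2": 2, "N1": 1, "N2": 2, "A1": 5, "A2": 1,
           "C1": 4, "C2": 5, "O1": 3, "O2": 3}
print(score_big_five(answers))
```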

Affective Computing
q How to describe the emotions
q How to model affective data
q Which emotions can be recognized
q How to perform emotion detection
q Research issues
q Demo: Affective Tool

Affective Computing
q PROBLEM 2: How to model the emotions?
EmotionML (https://www.w3.org/TR/emotionml/#s1.1)
Emotion Markup Language, proposed by the W3C working group

² Manual annotation of material involving emotionality, such as annotation of videos, of speech recordings, of faces, of texts, etc.
² Automatic recognition of emotions from sensors, including physiological sensors, speech recordings, facial expressions.
² Generation of emotion-related system responses, which may involve reasoning about the emotional implications of events.

Affective Computing
q PROBLEM 2: How to model the emotions?
EmotionML (https://www.w3.org/TR/emotionml/#s1.1)
Main elements:
² <emotionml> à root of the XML document
² <emotion> à single emotion annotation
² <category> à description of an emotion or a related state according to a vocabulary set
² <dimension> à description of an emotion or a related state according to an emotion dimension vocabulary
² <action-tendency> à description of an emotion or a related state according to an emotion action tendency vocabulary

Affective Computing
q PROBLEM 2: How to model the emotions?
EmotionML (https://www.w3.org/TR/emotionml/#s1.1)
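The original slide at this point shows an EmotionML example; since it is not recoverable from the extraction, here is a minimal sketch assembled from the element list above and the examples in the W3C specification (the vocabulary URIs and values are illustrative), parsed with Python's standard library.

```python
import xml.etree.ElementTree as ET

# A minimal EmotionML document: one categorical annotation and one
# dimensional (pleasure/arousal) annotation.
EMOTIONML_DOC = """
<emotionml xmlns="http://www.w3.org/2009/10/emotionml"
           category-set="http://www.w3.org/TR/emotion-voc/xml#big6"
           dimension-set="http://www.w3.org/TR/emotion-voc/xml#pad-dimensions">
  <emotion>
    <category name="happiness" confidence="0.8"/>
  </emotion>
  <emotion>
    <dimension name="pleasure" value="0.9"/>
    <dimension name="arousal" value="0.7"/>
  </emotion>
</emotionml>
"""

NS = {"em": "http://www.w3.org/2009/10/emotionml"}
root = ET.fromstring(EMOTIONML_DOC)
for emotion in root.findall("em:emotion", NS):
    for child in emotion:
        tag = child.tag.split("}")[1]  # strip the XML namespace prefix
        print(tag, dict(child.attrib))
```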

Affective Computing
q How to describe the emotions
q How to model affective data
q Which emotions can be recognized
q How to perform emotion detection
q Research issues
q Demo: Affective Tool

[1] https://www.cl.cam.ac.uk/~cm542/papers/Ubicomp10.pdf

Affective Computing
q PROBLEM 3: Which emotions can be recognized?
DISTINCT STATES MODEL RECOGNITION
EmotionSense à a mobile sensing framework for recognising user's emotions based on built-in smartphone sensors [1].

² Infer participants' emotional states in Ekman's basic set
² Emotion recognition based on speech processing techniques (see the sketch below)
² Neighbour awareness via Bluetooth scanning
² Position awareness via GPS sampling
² Activity awareness via accelerometer sampling
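EmotionSense classifies short audio samples against per-emotion Gaussian Mixture Models. The sketch below is my illustration, not the authors' code: feature extraction, training data and the emotion list are simplified placeholders; it only shows the basic "score each GMM, pick the best" pattern with scikit-learn.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

EMOTIONS = ["happy", "sad", "fear", "anger", "neutral"]  # broad classes

def train_emotion_models(training_data: dict[str, np.ndarray], n_components: int = 8):
    """Fit one GMM per emotion on (n_frames, n_features) acoustic features,
    e.g. MFCC vectors extracted from labelled speech."""
    models = {}
    for emotion, features in training_data.items():
        gmm = GaussianMixture(n_components=n_components, covariance_type="diag")
        models[emotion] = gmm.fit(features)
    return models

def classify_sample(models: dict, sample_features: np.ndarray) -> str:
    """Return the emotion whose GMM gives the highest average log-likelihood."""
    scores = {e: m.score(sample_features) for e, m in models.items()}
    return max(scores, key=scores.get)

# Toy usage with random 'features' standing in for real MFCCs.
rng = np.random.default_rng(0)
data = {e: rng.normal(loc=i, size=(200, 13)) for i, e in enumerate(EMOTIONS)}
models = train_emotion_models(data)
print(classify_sample(models, rng.normal(loc=3, size=(50, 13))))  # likely 'anger'
```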

https://www.cl.cam.ac.uk/~cm542/papers/Ubicomp10.pdf
Affective Computing
q PROBLEM 3: Which emotions can be recognized?

DISTINCT STATES MODEL RECOGNITION — EVALUATION (figures and tables from [1] shown on the slide)

ACCURACY
Table 1 of [1] clusters 14 narrow emotions into 5 broad classes (Happy, Sad, Fear, Anger, Neutral); some narrow emotions in a group are hard to distinguish (e.g. "happy" and "interest"), and grouping them increases classification accuracy. Emotion recognition accuracy converges to about 71% for broad emotions once the audio sample length exceeds 5 seconds (Figures 8-10 of [1]: emotion recognition accuracy, latency and energy consumption vs audio sample length).

Table 2 of [1]. Confusion matrix for broad emotions [%]:
| Emotion | Happy | Sad | Fear | Anger | Neutral |
| Happy | 58.67 | 4 | 0 | 8 | 29.33 |
| Sad | 4 | 60 | 0 | 8 | 28 |
| Fear | 8 | 4 | 60 | 8 | 20 |
| Anger | 6.66 | 2.66 | 9.34 | 64 | 17.33 |
| Neutral | 6 | 5.33 | 0 | 4 | 84.66 |

LOCAL vs REMOTE COMPUTATION
Speaker recognition accuracy converges to about 90% for sample lengths above 4 seconds, which corresponds to a latency of 55 seconds with local computation on the phone (Figures 4-6 of [1]: speaker recognition accuracy, latency and energy consumption vs audio sample length). Classifying samples remotely on a back-end server over 3G is more efficient than local computation on a Nokia 6210 in terms of both latency and energy consumption. Figure 7 of [1] shows how injected Brownian noise degrades speaker recognition accuracy.

EVALUATION
The EmotionSense adaptation framework is built on the Pyke knowledge-based inference engine: rules set each sensor's sampling interval using a back-off function bounded by MIN_INTERVAL and MAX_INTERVAL (e.g. 180 s and 800 s for the GPS sensor). The system was evaluated with micro-benchmarks on a Nokia 6210 (12 users, 24 hours of sensor traces) and a 10-day deployment with 18 users.

Figure 3 of [1]. A back-off function for updating the sampling interval:

    def update(value, currentInterval):
        if value == 1:
            samplingInterval = MIN_INTERVAL
        elif value == 0:
            samplingInterval = min(2*currentInterval, MAX_INTERVAL)
        return samplingInterval

Figure 2 of [1]. An example rule to set the sampling rate of the GPS sensor:

    set_location_sampling_interval
        foreach
            facts.fact($factName, $value)
            check $factName == 'Activity'
            facts.fact($actionName, $currentInterval)
            check $actionName == 'LocationInterval'
            $interval = update($value, $currentInterval)
        assert
            facts.fact('action', 'LocationInterval', $interval)
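To show how the back-off function above drives adaptive sensing, here is a small self-contained simulation. It is my illustration: the interval bounds are the GPS values reported in [1], everything else (the epoch loop and movement flags) is invented. The interval is reset on movement and doubled while the user is idle.

```python
MIN_INTERVAL = 180   # seconds, GPS rule lower bound reported in [1]
MAX_INTERVAL = 800   # seconds, GPS rule upper bound reported in [1]

def update(value, current_interval):
    """Back-off: reset on movement (value == 1), double while idle (value == 0)."""
    if value == 1:
        return MIN_INTERVAL
    return min(2 * current_interval, MAX_INTERVAL)

# Simulated accelerometer 'movement detected' flags over consecutive epochs.
movement = [1, 0, 0, 0, 0, 1, 0, 0]
interval = MIN_INTERVAL
for epoch, moving in enumerate(movement):
    interval = update(moving, interval)
    print(f"epoch {epoch}: moving={moving} -> sample GPS every {interval}s")
# Intervals: 180, 360, 720, 800, 800, 180, 360, 720
```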
Thesewas used events to test are the generated accuracy by of comparing the speaker a recorded involving 18 users. Users were members of the local De- Sense system, the function update() is based on a back- recognitionaudio component. sample with We two varied pre-loaded the sample models, length the from background partment of Computer Science. Since the system did not off mechanism as shown in Figure 3. If the user is moving 1 to 15 secondsGMM (same and each as sample that used was for classified speaker against recognition 14 bench- require any user-phone interaction, the fact that the partici- then the sampling interval is set to a minimum value other- possible models. We used 15 samples per user per sample wise it is increased by doubling each time until it reaches a length, resulting in a total of 150 test samples per sample maximum value. The sampling interval stays at this max- length. Figure 4 shows the speaker recognition accuracy imum as long as user is idle, but, as soon as movement with respect to the sample length. As the sample length was is detected, it is set to a minimum value. In addition to increased the accuracy improved, converging at around 90% the GPS sensor, we have similar rules for microphone sen- for sample lengths greater than 4. From Figure 5, it can be sor, Bluetooth sensor, and accelerometer sensor. This way, seen that this corresponds to a latency of 55 seconds in the users can write very simple functions to adapt the system case of local computation on the phone. [1] https://arxiv.org/pdf/1410.5816.pdf Affective Computing q PROBLEM 3: Which emotions can be recognized? STRESS CONDITION RECOGNITION ² Automatic recognition of daily stress as a 2-class classification problem (non-stressed vs stressed) [1] ² Inputs: (i) people activities, as detected through their smartphones; (ii) weather conditions; (iii) personality traits (Big Five Model is used). Ø Fetures extracted from call and sms logs and from Bluetooth hits, e.g.: (i) the amount of calls, of sms and of proximity interactions; (ii) the diversity of calls, of sms, and of proximity interactions; and (iii) regularity in user behaviors. AFFECTIVE COMPUTING: THEORY, TECHNIQUES AND APPLICATIONS MARCO DI FELICE 19 tiles, quantiles corresponding to 0.5, 1, 1.5 and 2 standard ality traits [37]. Concerning regularity features, we mea- https://arxiv.org/pdf/1410.5816.pdf deviations (applying Chebyshev’s inequality), variance and sured the time elapsed between calls, the time elapsed be- standard deviation functions. Moreover, for each basic fea- tween sms exchanges and the time elapsed between call and ture we calculated the same functions as above for 2 and 3 sms. More precisely, we consider both the average and the days backward-moving window to account for the possibility variance of the inter-event time of one’s call, sms and sum that past events influenced the current stress state. thereof (call+sms). Noticeably, in fact, even when two users In the following subsections we will describe more in detail have the same inter-event time for both call and sms, that the 25 call and sms basic features and the 9 proximity basic quantity can be di↵erent for their sum. features. Diversity measures how evenly an individual’s time is dis- tributed among others. 
https://arxiv.org/pdf/1410.5816.pdf
Affective Computing
q PROBLEM 3: Which emotions can be recognized?
STRESS CONDITION RECOGNITION — SOURCE COMPARISON

(Excerpt from [1]) The call and sms features fall under four broad categories: (i) general phone usage, (ii) active behaviors, (iii) regularity, and (iv) diversity; the Bluetooth proximity features fall under three categories: general proximity information, diversity, and regularity (Tables 2 and 3 of [1] list the 25 call/sms and 9 proximity basic features). Diversity measures how evenly an individual's time is distributed among others and is captured by the entropy of contacts, the unique-contacts-to-interactions ratio and the number of unique contacts, computed both on calls and on sms. Entropy is computed with the Miller-Madow bias correction:

$\hat{H}_{MM}(\theta) \equiv -\sum_{i=1}^{\hat{m}} \theta_{ML,i} \log \theta_{ML,i} + \frac{\hat{m}-1}{2N}$   (1)

where $\hat{m}$ is the number of bins with nonzero $\theta$-probability and $\theta_{ML}$ is the maximum likelihood estimate of $\theta$.

Table 6 of [1]. Model metrics comparison for feature subsets:
| Model | Accuracy | Kappa | Sensitivity | Specificity | F1 |
| Our Multifactorial Model | 72.28 | 37.52 | 52.72 | 83.35 | 57.89 |
| Baseline Majority Classifier | 63.84 | 0.00 | 100.00 | 0.00 | 0.00 |
| Weather Only | 36.16 | 0.00 | 100.00 | 0.00 | 0.00 |
| Personality Only | 36.16 | 0.00 | 100.00 | 0.00 | 0.00 |
| Bluetooth+Call+Sms | 48.59 | 6.80 | 73.80 | 34.32 | 50.94 |
| Personality+Weather | 43.55 | 2.96 | 81.90 | 21.83 | 51.20 |
| Personality+Bluetooth+Call+Sms | 46.40 | 7.01 | 83.17 | 25.57 | 52.88 |
| Weather+Bluetooth+Call+Sms | 49.60 | -5.45 | 38.45 | 55.91 | 35.55 |

The problem was modeled as a binary classification with label 0 ("not stressed", self-report scores <= 4) and label 1 ("stressed", scores > 4). Combining all three data sources, the multifactorial model detects daily stress with 72.28% accuracy. Of the 32 selected features, 11 are proximity features, 6 come from call data and 6 from sms data; entropy-based (diversity) features play an important predictive role, consistent with social psychology findings on the role of face-to-face interactions, strong ties, and the richness/diversity of social interactions in determining stress levels.
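As an illustration of the diversity features, the sketch below (my own example data, not the authors' code) computes a user's entropy of call contacts with the Miller-Madow correction of Equation (1).

```python
import math
from collections import Counter

def contact_entropy_mm(contact_log: list[str]) -> float:
    """Miller-Madow corrected entropy of how a user's interactions
    are spread over contacts: H_ML + (m_hat - 1) / (2N)."""
    counts = Counter(contact_log)
    n = len(contact_log)
    h_ml = -sum((c / n) * math.log(c / n) for c in counts.values())
    m_hat = len(counts)  # bins with nonzero probability
    return h_ml + (m_hat - 1) / (2 * n)

# A user who calls mostly one contact (low diversity) vs. an even spread.
skewed = ["alice"] * 8 + ["bob", "carol"]
even = ["alice", "bob", "carol", "dave", "eve"] * 2
print(contact_entropy_mm(skewed))  # ~0.74
print(contact_entropy_mm(even))    # ~1.81
```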

[1] Sarah Butt, James G. Phillips, "Personality and self reported mobile phone use", Computers in Human Behavior 24 (2008) 346-360

Affective Computing
q PROBLEM 3: Which emotions can be recognized?
BIG FIVE PERSONALITY TRAITS RECOGNITION
² Personality traits can explain patterns of mobile phone usage (extraverted, neurotic, disagreeable, and unconscientious individuals spent more time writing and receiving SMS) [1].
² The study [1] collected NEO-FFI and SEI (self-esteem) questionnaires from 112 participants together with self-reported phone use; descriptive statistics for the personality measures are reported in Table 1 below.

Table 1 (from [1]). Totals, means, standard deviations, minimum and maximum values, and skew for SEI, neuroticism, extraversion, openness, agreeableness and conscientiousness total scores (N = 112):
| Predictors | n | M | SD | Min. | Max. | Skew |
| Neuroticism | 112 | 20.73 | 8.04 | 2 | 38 | .12 |
| Extraversion | 112 | 30.26 | 6.57 | 14 | 43 | -.14 |
| Agreeableness | 112 | 31.82 | 6.14 | 8 | 46 | -1.0 |
| Conscientiousness | 112 | 34.5 | 6.54 | 15 | 48 | -.45 |
| SEI | 112 | 75.32 | 16.5 | 32 | 100 | -.58 |
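To illustrate how personality traits can "explain" usage patterns, here is a minimal, hypothetical sketch (invented data; not the analysis of [1], which used its own statistical tests in SPSS) regressing a weekly SMS count on Big Five scores with scikit-learn.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n = 112  # same sample size as Table 1, but the data itself is synthetic

# Columns: neuroticism, extraversion, agreeableness, conscientiousness, openness
traits = rng.normal(loc=[21, 30, 32, 34, 28], scale=[8, 6.5, 6, 6.5, 6], size=(n, 5))

# Synthetic 'weekly SMS' outcome: more SMS for extraverted, neurotic,
# disagreeable and unconscientious participants (direction taken from [1]).
sms = (0.8 * traits[:, 0] + 1.2 * traits[:, 1] - 0.9 * traits[:, 2]
       - 0.7 * traits[:, 3] + rng.normal(scale=5, size=n))

model = LinearRegression().fit(traits, sms)
for name, coef in zip(["N", "E", "A", "C", "O"], model.coef_):
    print(f"{name}: {coef:+.2f}")
print("R^2:", round(model.score(traits, sms), 2))
```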

[1] Natasha Jaques, Sara Taylor, Asaph Azaria, Asma Ghandeharioun, Akane Sano, Rosalind Picard, "Predicting students' happiness from physiology, phone, mobility, and behavioral data", 2015 International Conference on Affective Computing and Intelligent Interaction (ACII)

Affective Computing
q PROBLEM 3: Which emotions can be recognized?
HAPPINESS OR INDIVIDUAL STATE RECOGNITION
² Build an AI system that can automatically detect when a college student is becoming vulnerable to depression [1].

² Input sources
Ø Physiological data: electrodermal activity (EDA) (a measure of physiological stress), and 3-axis accelerometer (a measure of steps and physical activity)
Ø Survey data: questions related to academic activity, sleep, drug and alcohol use…
Ø Phone data: phone call, SMS, and usage patterns
Ø Location data: coordinates logged throughout the day

[1] Natasha Jaques, Sara Taylor, Asaph Azaria, Asma Ghandeharioun, Akane Sano, Rosalind Picard, "Predicting students' happiness from physiology, phone, mobility, and behavioral data", 2015 International Conference on Affective Computing and Intelligent Interaction (ACII)

Affective Computing
q PROBLEM 3: Which emotions can be recognized?
HAPPINESS OR INDIVIDUAL STATE RECOGNITION
² Classify each student as: happy or sad.

² Test different sources/ML classification algorithms [1].

TABLE III (from [1]): Classifier, parameter settings and accuracy results for each modality

| Modality | Dataset Size | # Features | Classifier | Parameter Settings | Validation | Baseline | Test |
| Physiology | 933 | 426 | SVM | C=100.0, RBF kernel, gamma=.0001 | 68.37% | 51.79% | 64.62% |
| Survey | 1110 | 32 | SVM | C=100.0, RBF kernel, gamma=.01 | 71.26% | 50.86% | 62.50% |
| Phone | 1072 | 289 | RF | Num trees=40, Max depth=infinite | 66.67% | 51.98% | 55.95% |
| Mobility | 905 | 15 | SVM | C=100.0, RBF kernel, gamma=1 | 69.95% | 53.65% | 65.10% |
| All | 768 | 200 | SVM | C=0.1, Linear kernel | 72.84% | 53.94% | 68.48% |
(Validation, Baseline and Test columns report classification accuracy.)
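The per-modality classifiers in Table III are standard SVM/Random Forest models; the sketch below (synthetic data, not the study's dataset) shows how the Physiology row's settings map onto scikit-learn.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(42)
# Synthetic stand-in for the physiology modality: 933 samples, 426 features,
# binary labels (happy=1 / sad=0).
X = rng.normal(size=(933, 426))
y = (X[:, :5].sum(axis=1) + rng.normal(scale=0.5, size=933) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
scaler = StandardScaler().fit(X_train)

# Parameter settings taken from the Physiology row of Table III.
clf = SVC(C=100.0, kernel="rbf", gamma=0.0001)
clf.fit(scaler.transform(X_train), y_train)
print("test accuracy:", round(clf.score(scaler.transform(X_test), y_test), 3))
```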

(Excerpt from [1]) The SVM and RF classifiers tended to produce the best results on this dataset; the best single model (all modalities combined) identifies students that are unhappy with 68.48% accuracy on the held-out test set. Because some modalities are missing data on certain days (for example, if a participant forgets to wear their sensor), an ensemble classifier was also built: the best classifier from each of the best three modalities votes, each vote weighted by that classifier's validation accuracy, and a classifier simply abstains when its modality has no data for a given day (see the sketch below). The ensemble reaches 70.17% accuracy on the held-out test set. Mobility offers high performance with few features (whether a person spends time outdoors and deviates from normal routine is strongly related to whether they feel happy that day), and physiology also performs well, suggesting that wearable devices are a promising way to detect changes in happiness, especially if they can monitor sleep quality. The best accuracy on novel data, 70.2%, may be sufficient to guide interventions intended to prevent depression, especially if these are only triggered after a consistent pattern of unhappiness over several days or weeks. Limitations: individual differences are not modelled (Multi-Task Learning is left as future work), and long-term effects of behaviors on happiness are not considered.

TABLE IV — Confusion matrix for the ensemble classifier (held-out test set)
                 Predicted Happy    Predicted Sad
Actual Happy           77                40
Actual Sad             31                90
Acknowledgments of the excerpted study: the authors thank Dr. Charles Czeisler, Dr. Elizabeth Klerman and Conor O'Brien for their help in running the user study; the work was supported by the MIT Media Lab Consortium, the Robert Wood Johnson Foundation Wellbeing Initiative, NIH Grant R01GM105018, Samsung, and Canada's NSERC program.
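The weighted-vote ensemble described in the excerpt can be sketched in a few lines of Python. This is a minimal illustration under assumed names: the modality list, the weights and the label encoding are placeholders, not values taken from the study.

def ensemble_predict(predictions, weights):
    """Weighted majority vote over per-modality classifiers.
    predictions: modality -> predicted label (+1 = happy, -1 = sad),
                 or None when that modality has no data for the day
                 (the classifier then abstains from the vote).
    weights:     modality -> validation accuracy of that classifier.
    """
    score = 0.0
    for modality, label in predictions.items():
        if label is None:        # missing modality: abstain
            continue
        score += weights[modality] * label
    return "happy" if score >= 0 else "sad"

# Hypothetical example: three best modalities weighted by validation accuracy
weights = {"mobility": 0.70, "physiology": 0.68, "phone": 0.65}
day = {"mobility": 1, "physiology": -1, "phone": None}   # phone data missing that day
print(ensemble_predict(day, weights))                    # -> 'happy'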

[1] Eugenia Politou, Efthimios Alepis, Constantinos Patsakis, "A survey on mobile affective computing", Computer Science Review, 25, 2017, 79-100
Affective Computing
q PROBLEM 3: Which emotions can be recognized?
AFFECT RECOGNITION OBJECTIVES IN THE LITERATURE OF AFFECTIVE COMPUTING [1]
Affect recognition objective (Table 3 in [1]): Stress 35.7%, Emotion 33.3%, Wellbeing and user behaviour 14.3%, Personality 11.9%, Mood 4.8%.
Excerpt from the survey [1]: many studies employ the six "basic" emotions proposed by Ekman (anger, disgust, fear, happiness, sadness and surprise), while others extend this set with further distinct states such as bored, contented, excited, nervous, relaxed or upset, or with states relevant to affect-sensitive learning environments (e.g. confusion). The most commonly used representation is the binary or triadic state model (e.g. happy, not happy, neutral), convenient for machine learning but considered too simplistic by psychologists for real-world affect recognition; dimensional and appraisal-based models remain a minority. Most studies (78.6%) rely on user questionnaires or annotation techniques to obtain ground truth. Among smartphone usage data, the most common sources are call logs (26.4%), SMS logs (22%) and application logs (14.3%); among embedded sensors, GPS (27.6%), accelerometer (22.4%) and Bluetooth (14.5%). Stress inference dominates the literature (35.7% of studies), followed by emotion and mood recognition (38.1% combined), personality inference (about 12%) and wellbeing/behaviour inference. Open challenges include multimodal fusion, resource and energy constraints, affect modelling and representation, the lack of a common evaluation approach, cultural differences, privacy and cost.
AFFECTIVE COMPUTING: THEORY, TECHNIQUES AND APPLICATIONS  MARCO DI FELICE  24
Affective Computing
q How to describe the emotions
q How to model affective data
q Which emotions can be recognized
q How to perform emotion detection
q Research issues
q Demo: Affective Tool

AFFECTIVE COMPUTING: THEORY, TECHNIQUES AND APPLICATIONS MARCO DI FELICE 25 Affective Computing q PROBLEM 4: How does emotion recognition work?

² Unimodal features (single input source)
q Visual modality
q Audio modality
q Textual modality
² Multi-modal features (multiple input sources)

AFFECTIVE COMPUTING: THEORY, TECHNIQUES AND APPLICATIONS  MARCO DI FELICE  26
Affective Computing
q PROBLEM 4: How does emotion recognition work?
² Facial expressions are the primary cues for understanding emotions and sentiments.
² FACS (Facial Action Coding System) → coding of facial expressions in terms of Action Units (AU), i.e. contraction or relaxation of one or more muscles.
http://mplab.ucsd.edu/grants/project1/research/face-detection.html
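As a rough illustration of how FACS output could feed emotion recognition, the sketch below maps a few commonly cited Action Unit combinations to basic emotions; the specific prototypes and the upstream AU detector are assumptions, not part of the slide.

# Approximate AU prototypes often cited for some basic emotions:
# AU6 cheek raiser, AU12 lip corner puller, AU1 inner brow raiser,
# AU2 outer brow raiser, AU4 brow lowerer, AU5 upper lid raiser,
# AU15 lip corner depressor, AU26 jaw drop.
EMOTION_PROTOTYPES = {
    "happiness": {6, 12},
    "sadness": {1, 4, 15},
    "surprise": {1, 2, 5, 26},
}

def emotion_from_aus(active_aus):
    """Return the first emotion whose prototype AUs are all active, else 'neutral'."""
    for emotion, prototype in EMOTION_PROTOTYPES.items():
        if prototype <= active_aus:      # subset test
            return emotion
    return "neutral"

print(emotion_from_aus({6, 12}))         # -> 'happiness'
print(emotion_from_aus({4, 7, 23}))      # -> 'neutral' (no prototype fully matches)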

AFFECTIVE COMPUTING: THEORY, TECHNIQUES AND APPLICATIONS  MARCO DI FELICE  27
O. D. Lara and M. A. Labrador, A survey on Human Activity Recognition using Wearable Sensors, IEEE Communications Surveys & Tutorials, 15(2), 2013
Affective Computing
q HAR systems involve two steps: training and classification
Training (flattened diagram): SENSOR SAMPLING → FEATURE EXTRACTION → CONTEXT DATA + CATEGORY → TRAINING SET → TRAINING → CLASSIFICATION MODEL
Testing (flattened diagram): SENSOR SAMPLING → FEATURE EXTRACTION → CONTEXT DATA → CLASSIFICATION MODEL → AFFECTIVE STATE
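A minimal sketch of the two phases above with scikit-learn; the synthetic sensor windows, the statistical features and the Random Forest classifier are illustrative assumptions, not part of the cited survey.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

def extract_features(window):
    # Simple statistical features computed on one window of raw sensor samples
    return [window.mean(), window.std(), window.min(), window.max()]

# Training phase: labelled windows -> feature vectors + categories -> classification model
rng = np.random.default_rng(0)
windows = [rng.normal(loc=loc, size=128) for loc in (0.0, 1.0) * 50]   # toy sensor windows
labels = ["calm", "stressed"] * 50                                     # toy affective categories
X_train = np.array([extract_features(w) for w in windows])
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, labels)

# Classification phase: new sensor window -> features -> predicted affective state
new_window = rng.normal(loc=0.9, size=128)
print(model.predict([extract_features(new_window)]))                   # e.g. ['stressed']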

ACTIVITY-AWARE COMPUTING: APPLICATIONS AND TECHNIQUES  MARCO DI FELICE  28
https://towardsdatascience.com/a-comprehensive-guide-to-convolutional-neural-networks-the-eli5-way-3bd2b1164a53
Affective Computing
q PROBLEM 4: How does emotion recognition work?

² Convolutional Neural Network (CNN): a class of deep neural networks.
² Connectivity pattern between neurons resembles the organization of the animal visual cortex.
² A ConvNet is able to successfully capture the spatial and temporal dependencies in an image through the application of relevant filters.
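A minimal ConvNet for classifying facial-expression images, sketched with Keras; the 48x48 grayscale input and the seven emotion classes are assumptions borrowed from common facial-expression datasets, not from the slide.

import tensorflow as tf
from tensorflow.keras import layers

# Stacked convolution + pooling layers learn spatial filters;
# a dense head maps the pooled features to 7 emotion classes.
model = tf.keras.Sequential([
    layers.Input(shape=(48, 48, 1)),
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dense(7, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.summary()
# model.fit(train_images, train_labels, epochs=10)   # requires a labelled expression dataset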

AFFECTIVE COMPUTING: THEORY, TECHNIQUES AND APPLICATIONS MARCO DI FELICE 29 Affective Computing q PROBLEM 4: How does emotion recognition work?

² Unimodal features (single input source)
q Visual modality
q Audio modality
q Textual modality
² Multi-modal features (multiple input sources)

AFFECTIVE COMPUTING: THEORY, TECHNIQUES AND APPLICATIONS  MARCO DI FELICE  30
Affective Computing
q PROBLEM 4: How does emotion recognition work?
² Acoustic parameters are dependent on personality traits.
q Mel Frequency Cepstral Coefficients (MFCC): short-term power spectrum of a voice or of a sound
q Spectral Flux: measure of how quickly the power spectrum is changing
q Spectral Centroid: center of mass of the power spectrum, brightness of the sound
q Pause duration and Pitch
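These acoustic parameters can be computed with an audio library; the sketch below uses librosa (an assumption, no tool is named on this slide) and a hypothetical input file.

import numpy as np
import librosa

y, sr = librosa.load("utterance.wav", sr=16000)            # hypothetical audio file

mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)         # short-term power spectrum summary
centroid = librosa.feature.spectral_centroid(y=y, sr=sr)   # "brightness" of the sound

# Spectral flux: frame-to-frame change of the normalised magnitude spectrum
S = np.abs(librosa.stft(y))
S = S / (S.sum(axis=0, keepdims=True) + 1e-10)
flux = np.sqrt((np.diff(S, axis=1) ** 2).sum(axis=0))

# Fundamental frequency (pitch) contour; NaN on unvoiced frames
f0, voiced_flag, voiced_prob = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr)

features = np.concatenate(
    [mfcc.mean(axis=1), [centroid.mean(), flux.mean(), np.nanstd(f0)]])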

AFFECTIVE COMPUTING: THEORY, TECHNIQUES AND APPLICATIONS  MARCO DI FELICE  31
Affective Computing
q PROBLEM 4: How does emotion recognition work?
² OpenSMILE (https://www.audeering.com/opensmile/)
² Modular and flexible feature extractor for signal processing and machine learning applications
² Emotion recognition from audio parameters.
Excerpt from the speech emotion recognition paper shown on this slide (Fig. 1 blocks: Time-based segmentation, Feature extraction, Dimensionality reduction, Classification): the pitch histogram and other acoustic features are extracted from each segment of the utterance; an optional principal components analysis (PCA) step performs feature selection, and the selected features are fed to an SVM classifier. Time-based segmentation uses the Relative Time Intervals (RTI) approach, compared against the traditional Global Time Intervals (GTI) approach (the whole utterance without segmentation); combining segmentation with the pitch probability distribution features, which capture local dynamic information, achieves better results.
AFFECTIVE COMPUTING: THEORY, TECHNIQUES AND APPLICATIONS  MARCO DI FELICE  32
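openSMILE also ships a Python wrapper; the snippet below is a sketch based on my understanding of the opensmile package (package name, feature set and file path are assumptions to be verified against the current documentation).

import opensmile

# Utterance-level functionals of the eGeMAPS acoustic feature set
smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,
    feature_level=opensmile.FeatureLevel.Functionals,
)
features = smile.process_file("utterance.wav")   # pandas DataFrame, one row per file
print(features.shape)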

Affective Computing
q PROBLEM 4: How does emotion recognition work?

² Unimodal features (single input source)
q Visual modality
q Audio modality
q Textual modality
² Multi-modal features (multiple input sources)

AFFECTIVE COMPUTING: THEORY, TECHNIQUES AND APPLICATIONS MARCO DI FELICE 33 Affective Computing q PROBLEM 4: How does emotion recognition work? Sentiment analysis systems can be categorized into two main classes:

Knowledge-based systems
² Use of linguistic patterns to understand the sentence structure based on its lexical dependency tree.
² Employ semantic analysis of text, which allows the aggregation of conceptual and affective information (bag of concepts).
Statistics-based systems
² Assume the availability of a large dataset annotated with polarity or emotion labels; employ statistical and/or machine-learning algorithms.
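A toy contrast between the two classes of systems; the tiny lexicon, the example sentences and the TF-IDF + logistic regression pipeline are illustrative assumptions, not techniques taken from the cited literature.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Knowledge-based (lexicon) approach: score a sentence against a polarity lexicon
LEXICON = {"good": 1, "great": 1, "happy": 1, "bad": -1, "boring": -1, "sad": -1}

def lexicon_polarity(text):
    return sum(LEXICON.get(word, 0) for word in text.lower().split())

print(lexicon_polarity("the movie is boring but the soundtrack is great"))   # -> 0

# Statistics-based approach: learn polarity from a labelled corpus (toy data)
texts = ["great movie", "boring plot", "happy ending", "bad acting"]
labels = ["pos", "neg", "pos", "neg"]
clf = make_pipeline(TfidfVectorizer(), LogisticRegression()).fit(texts, labels)
print(clf.predict(["a great but slightly boring movie"]))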

AFFECTIVE COMPUTING: THEORY, TECHNIQUES AND APPLICATIONS MARCO DI FELICE 34 110 S. Poria et al. / Information Fusion 37 (2017) 98–125

Affective Computing

q PROBLEM 4: How does affectionthe is recognition work? 110 S. Poria et al. / Information Fusion 37 (2017) 98–125 car old

old very

but expensive rather not

it expensive

is

the is AFFECTIVE COMPUTING: THEORY, TECHNIQUES AND APPLICATIONS MARCO DI FELICE 35 car old old very old old very very but but but expensive rather not expensive not rather expensive not rather it expensive

is

Fig. 8. Example of how sentic patterns work on the sentence “The car is very old but it is rather not expensive”. commonsense concepts, for tasks such as opinion holder detec- Fig. 8 shows how linguistic patterns can function like logic gates in tion [142] , knowledge expansion [149] , subjectivity detection [150] , olda circuit. Onevery of the earliest works in the study of linguistic pat- event summarization [151] , short text message classification [152] , terns for sentiment analysis was carried out by [161] , where a cor- old very sarcasm detection [153] , Twitter sentiment classification [154] , de- pus and some seed adjective sentiment words were used to find ception detection [155] , user profiling [156] , emotion additional sentiment adjectivesbut in the corpus. Their technique ex- [157] , and business intelligence [158] . ploited a set of linguistic rules on connectives (‘and’, ‘or’, ‘but’, ‘ei- but ther/or’, ‘neither/nor’) to identify sentiment words and their orien- expensive not rather tations. In this way, they defined the idea of sentiment consistency . 4.3.2. Use of linguistic patterns expensive not rather Whilst machine learning methods, for supervised training of the Kanayama et al. [144] extended the approach by introduc- sentiment analysis system, are predominant in literature, a num- ing definitions of intra-sentential (within a sentence) and inter- ber of unsupervised methods such as linguistic patterns can also sentential (between neighboring sentences) sentiment consistency. be found. In theory, sentence structure is key to carry out senti- Negation plays a major role in detecting the polarity of sentences. ment analysis, as a simple change in the word order can flip the In [162] , Jia et al. carried out an experiment to identify negations polarity of a sentence. Linguistic patterns aim to better understand in text usingFig. 8. Example linguisticof how senticcluespatterns andwork showedon the sentence a significant“The car is very performanceold but it is rather not expensive”. sentence structure based on its lexical dependency tree, which can improvement over the state of the art. However, when negations be used to calculate sentiment polarity. commonsenseare concepts,implicit, for taskse.g., suchcannot as opinionbe holderrecognized detec- byFig.an 8 showsexplicit how linguisticnegation patterns can function like logic gates in

In 2014, [159] proposed a novel sentiment analysis frame-tion [142] , knowledgeidentifier, expansionsarcasm [149] , detectionsubjectivity detectionneeds to[150]be , considereda circuit. Oneas ofwell. the earliest works in the study of linguistic pat- work which incorporates computational intelligence, linguistics,event summarization In [151][163] , short, three text messageconceptual classificationlayers, [152]each , ternsof forwhich sentimentconsists analysis ofwas carried out by [161] , where a cor- and commonsense computing [160] in an attempt to bettersarcasm un- detection8 textual [153] , Twitterfeatures, sentimentwas classificationproposed [154]to , de-grasp pusimplicit and somenegations. seed adjectiveA sentiment words were used to find derstand sentiment orientation and flow in natural languageception text. detectionmethod [155] , userto exploitprofiling [156]discourse , emotionrelations visualizationfor theadditionaldetection sentimentof adjectivestweets in the corpus. Their technique ex- [157] , and business intelligence [158] . ploited a set of linguistic rules on connectives (‘and’, ‘or’, ‘but’, ‘ei- ther/or’, ‘neither/nor’) to identify sentiment words and their orien- 4.3.2. Use of linguistic patterns tations. In this way, they defined the idea of sentiment consistency . Whilst machine learning methods, for supervised training of the Kanayama et al. [144] extended the approach by introduc- sentiment analysis system, are predominant in literature, a num- ing definitions of intra-sentential (within a sentence) and inter- ber of unsupervised methods such as linguistic patterns can also sentential (between neighboring sentences) sentiment consistency. be found. In theory, sentence structure is key to carry out senti- Negation plays a major role in detecting the polarity of sentences. ment analysis, as a simple change in the word order can flip the In [162] , Jia et al. carried out an experiment to identify negations polarity of a sentence. Linguistic patterns aim to better understand in text using linguistic clues and showed a significant performance sentence structure based on its lexical dependency tree, which can improvement over the state of the art. However, when negations be used to calculate sentiment polarity. are implicit, e.g., cannot be recognized by an explicit negation In 2014, [159] proposed a novel sentiment analysis frame- identifier, sarcasm detection needs to be considered as well. work which incorporates computational intelligence, linguistics, In [163] , three conceptual layers, each of which consists of and commonsense computing [160] in an attempt to better un- 8 textual features, was proposed to grasp implicit negations. A derstand sentiment orientation and flow in natural language text. method to exploit discourse relations for the detection of tweets 110 S. Poria et al. / Information Fusion 37 (2017) 98–125
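In the spirit of the linguistic patterns in Fig. 8, the sketch below flips the polarity of a lexicon word when the dependency parse marks it as negated; spaCy, its small English model and the tiny lexicon are assumptions, not the tools used by the cited authors.

import spacy

nlp = spacy.load("en_core_web_sm")        # assumes the small English model is installed
LEXICON = {"old": -0.3, "expensive": -0.6, "good": 0.7}

def sentence_polarity(text):
    score = 0.0
    for token in nlp(text):
        value = LEXICON.get(token.lemma_.lower())
        if value is None:
            continue
        # Flip the sign if the word or its head carries a negation dependent ("not expensive")
        negated = any(child.dep_ == "neg" for child in token.children) or \
                  any(child.dep_ == "neg" for child in token.head.children)
        score += -value if negated else value
    return score

print(sentence_polarity("The car is very old but it is rather not expensive"))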

Affective Computing
q PROBLEM 4: How does emotion recognition work?
Fig. 8 (repeated from the previous slide): sentic patterns on the dependency tree of "The car is very old but it is rather not expensive".
AFFECTIVE COMPUTING: THEORY, TECHNIQUES AND APPLICATIONS  MARCO DI FELICE  36
Eugenia Politou, Efthimios Alepis, Constantinos Patsakis, "A survey on mobile affective computing", Computer Science Review, 25, 2017, 79-100
Affective Computing
q PROBLEM 4: How does emotion recognition work?

SENSOR DATA SOURCES IN AFFECTIVE COMPUTING STUDIES [1]

DATA SOURCES IN AFFECTIVE COMPUTING STUDIES [1]
AFFECTIVE COMPUTING: THEORY, TECHNIQUES AND APPLICATIONS  MARCO DI FELICE  37

² Unimodal features (single input source)
q Visual modality
q Audio modality
q Textual modality
² Multi-modal features (multiple input sources)

AFFECTIVE COMPUTING: THEORY, TECHNIQUES AND APPLICATIONS  MARCO DI FELICE  38
Affective Computing
q PROBLEM 4: How does emotion recognition work?
Early fusion
² It fuses the features extracted from the various modalities (e.g. visual, audio and textual features) into a single vector, which is then processed (see the sketch below).
Decision-level fusion
² The features of each modality are examined and classified independently, and the results are fused as a decision vector.
Rule-based fusion
² Multimodal information is fused by statistical rule-based methods such as linear weighted fusion, majority voting or custom-defined rules.
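A compact sketch of the first two strategies; the random per-modality feature arrays, the logistic-regression classifiers and the equal weights are placeholders, not the setup of any cited study.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
N = 200
# Placeholder per-modality feature matrices for N samples
visual, audio, text = rng.normal(size=(N, 8)), rng.normal(size=(N, 6)), rng.normal(size=(N, 10))
y = rng.integers(0, 2, size=N)

# Early (feature-level) fusion: concatenate all modality features into a single vector
X_early = np.hstack([visual, audio, text])
early_clf = LogisticRegression(max_iter=1000).fit(X_early, y)

# Decision-level fusion: one classifier per modality, then fuse the predicted probabilities
modalities = {"visual": visual, "audio": audio, "text": text}
clfs = {name: LogisticRegression(max_iter=1000).fit(X, y) for name, X in modalities.items()}
weights = {"visual": 1 / 3, "audio": 1 / 3, "text": 1 / 3}   # linear weighted fusion, equal weights
probs = sum(weights[name] * clfs[name].predict_proba(X)[:, 1] for name, X in modalities.items())
fused_labels = (probs >= 0.5).astype(int)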

AFFECTIVE COMPUTING: THEORY, TECHNIQUES AND APPLICATIONS  MARCO DI FELICE  39
Affective Computing
q PROBLEM 4: How does emotion recognition work?
Multimodal sentiment analysis, consisting in harvesting sentiments from Web videos via a model that uses audio, visual and textual modalities [1].
TEXT FEATURES (linguistic patterns)
VISUAL FEATURES
AUDIO FEATURES
Ø MFCC
Ø Spectral Centroid
Ø Spectral Flux
Ø Beat Histogram
Excerpt from [1]: EmoSenticNet extends SenticNet with about 13,741 commonsense concepts labelled with the emotions anger, joy, disgust, sadness, surprise and fear, and is blended with ConceptNet to support emotive reasoning over text. Visual features are built from 66 facial characteristic points extracted with Luxand FSDK 1.7 (used to compute distances between points, e.g. eye and mouth measurements) together with GAVAM head-pose features; per-frame features are averaged over each video segment, and an ELM classifier on facial expressions alone reaches 68.60% accuracy under 10-fold cross validation. Audio features are extracted per segment with openEAR (6373 features, including MFCC, spectral centroid, spectral flux, beat histogram, pause duration and pitch). Textual features come from concept-level analysis with sentic patterns: dependency-based rules for subject nouns, direct objects, complements, negation, modifiers, prepositional phrases and adverbial clauses extract concepts from each sentence. The per-modality features are then fused, either at feature level or at decision level, and a supervised classifier determines the polarity of each video segment.
AFFECTIVE COMPUTING: THEORY, TECHNIQUES AND APPLICATIONS  MARCO DI FELICE  40
[1] Soujanya Poria, Erik Cambria, Newton Howard, Guang-Bin Huang, Amir Hussain, "Fusing audio, visual and textual clues for sentiment analysis from multimodal content", Neurocomputing 174 (2016) 50–59
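As a rough analogue of the facial characteristic points and distance features used in [1], the sketch below extracts dlib's 68 facial landmarks and computes a few inter-point distances; dlib, OpenCV, the predictor file and the chosen point pairs are assumptions rather than the Luxand FSDK setup of the paper.

import numpy as np
import cv2
import dlib

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")   # model file assumed present

def facial_distance_features(image_path):
    """Distances between a few landmark pairs (eye widths, mouth width/opening, brow gap)."""
    gray = cv2.cvtColor(cv2.imread(image_path), cv2.COLOR_BGR2GRAY)
    faces = detector(gray)
    if not faces:
        return None
    shape = predictor(gray, faces[0])
    pts = np.array([[shape.part(i).x, shape.part(i).y] for i in range(68)])
    pairs = [(36, 39), (42, 45), (48, 54), (51, 57), (21, 22)]   # illustrative point pairs
    return [float(np.linalg.norm(pts[a] - pts[b])) for a, b in pairs]

# Per-segment vector, averaging per-frame features over a video segment (as done in [1]):
# segment_vector = np.mean([facial_distance_features(p) for p in frame_paths], axis=0)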

Affective Computing S. Poria et al. / Neurocomputing 174 (2016) 50–59 57 Table 5 Table 7 Results of feature-level fusion. Comparison of classifiers. q PROBLEM 4: How does emotionCombination of modalities recognition work? Precision Recall Classifier Recall (%) Training time Accuracy of the experiment carried out on Textual Modality 0.619 0.59 SVM 77.03 2.7 min Accuracy of the experiment carried out on Audio Modality 0.652 0.671 ELM 77.10 25 s Two fusion schemes are investigatedAccuracy in [1]: of the experiment carried out on Video Modality 0.681 0.676 ANN 57.81 2.9 min Experiment using only visual and text-based features 0.7245 0.7185 Result obtained using visual and audio-based features 0.7321 0.7312 ² Early fusion (i.e. combine all featuresResult obtained using within audio and text-based a single featuresvector 0.7115 0.7102) Accuracy of feature-level fusion of three modalities 0.782 0.771 ² Linear weighted S.fusion, with Poria et al. / Neurocomputingequal 174 (2016) 50 –weights59 for each source 57 a small difference in accuracy between the ELM and SVM classifiers. In terms of training time, the ELM outperformed SVM and ANN Table 5 Table 7 Results of feature-level fusion. ComparisonTable 6 of classifiers. by a huge margin (Table 7). As our eventual goal is to develop a Results of decision-level fusion. real-time multimodal sentiment analysis engine, so we preferred Combination of modalities Precision Recall Classifier Recall (%) Training time the ELM as a classifier which provided the best performance in Combination of modalities Precision Recall terms of both accuracy and training time. Accuracy of the experiment carried out on Textual Modality 0.619 0.59 SVM 77.03 2.7 min Accuracy of the experiment carried out on Audio Modality 0.652 0.671 ELMExperiment using only visual 77.10 and text-based features 0.683 25 s 0.6815 Accuracy of the experiment carried out on Video Modality 0.681 0.676 ANNResult obtained using visual 57.81 and audio-based features 0.7121 2.9 min 0.701 Experiment using only visual and text-based features 0.7245 0.7185 Result obtained using audio and text-based features 0.664 0.659 11. Conclusion and future work Result obtained using visual and audio-based features 0.7321 0.7312 Accuracy of decision-level fusion of three modalities 0.752 0.734 Result obtained using audio and text-based features 0.7115 0.7102 We have presented a multimodal sentiment analysis frame- Accuracy of feature-level fusion of three modalities 0.782 0.771 work, which includes sets of relevant features for text and visual a small difference in accuracy between the ELM and SVM data, as well as a simple technique for fusing the features extracted classiResultsfiers. for feature-level fusion are shown in Table 5, from which it from different modalities. In particular, our textual sentiment AFFECTIVE COMPUTING: THEORY, TECHNIQUES AND APPLICATIONS Incan terms be seen of training that our time, method the outperforms ELM outperformed[1] by 16.00% SVM in and terms ANN of 41 analysis module has been enriched by sentic-computing-based MARCO DI FELICEby aaccuracy. huge margin (Table 7). As our eventual goal is to develop a Table 6 features, which have offered significant improvement in the Results of decision-level fusion. real-timeTable multimodal 6 shows the sentiment experimental analysis results engine, of decision-level so we preferred fusion. 
Combination of modalities | Precision | Recall
Only visual and text-based features | 0.683 | 0.6815
Visual and audio-based features | 0.7121 | 0.701
Audio and text-based features | 0.664 | 0.659
Decision-level fusion of the three modalities | 0.752 | 0.734

Tables 5 and 6 show the experimental results obtained when only audio and text, visual and text, and audio and visual modalities were used for the experiment. It is clear from these tables that the accuracy improves dramatically when audio, visual and textual modalities are used together. Finally, Table 5 also shows the experimental results obtained when either the visual or the text modality alone was used. Results for feature-level fusion are shown in Table 5, from which it can be seen that our method outperforms [1] by 16.00% in terms of accuracy. Table 6 shows the experimental results of decision-level fusion.

10.1. Feature analysis

We also analyzed the importance of each feature used in the classification task. The best accuracy was obtained when all features were used together. However, GAVAM features were found to be superior to the features extracted by Luxand FSDK 1.7. Using only GAVAM features, we obtained an accuracy of 57.80% for the visual-features-based sentiment analysis task; for the same task, 55.64% accuracy was obtained when we used only the features extracted by Luxand FSDK 1.7. For the audio-based sentiment analysis task, MFCC and Spectral Centroid were found to produce a lower impact on the overall accuracy of the sentiment analysis system; however, excluding those features led to a degradation of accuracy for the audio-based task. We also experimentally evaluated the role of certain audio features such as time-domain zero crossing, root mean square and compactness, but did not obtain a higher accuracy using any of them. In the case of text-based sentiment analysis, we found that concept-gram features play a major role compared to SenticNet-based features. In particular, SenticNet-based features mainly helped detect associated sentiments in text in an unsupervised way. Note that our aim is to develop a multimodal sentiment analysis system where sentiment is extracted from text in an unsupervised way using SenticNet as a knowledge base.

10.2. Performance comparison of different classifiers

On the same training and test sets, we ran the classification experiment using SVM, ANN and ELM. ELM outperformed ANN by 12% in terms of accuracy (see Table 7). However, we observed only ...; it was the ELM classifier that provided the best performance in terms of both accuracy and training time.

11. Conclusion and future work

We have presented a multimodal sentiment analysis framework, which includes sets of relevant features for text and visual data, as well as a simple technique for fusing the features extracted from different modalities. In particular, our textual sentiment analysis module has been enriched by sentic-computing-based features, which have offered a significant improvement in the performance of our textual sentiment analysis system. Visual features also play a key role in outperforming the state of the art. As discussed in the literature, gaze- and smile-based facial expression features are usually found to be very useful for sentiment classification. Our future research will aim to incorporate gaze and smile features for facial-expression-based sentiment classification, in addition to focusing on the use of the audio modality for the multimodal sentiment analysis task. Furthermore, we will explore the possibility of developing a culture- and language-independent multimodal sentiment classification framework. Finally, we will strive to improve the decision-level fusion process using a cognitively inspired fusion engine. Subsequently, we will work on reducing the time complexity of our methods, in order to get closer to the ambitious goal of developing a real-time system for multimodal sentiment analysis. Hence, another aspect of our future work will be to effectively analyze and appropriately address the system's time complexity requirements in order to create an efficient and reliable multimodal sentiment analysis engine.

References

[1] L.-P. Morency, R. Mihalcea, P. Doshi, Towards multimodal sentiment analysis: harvesting opinions from the web, in: Proceedings of the 13th International Conference on Multimodal Interfaces, ACM, Alicante, Spain, 2011, pp. 169-176.
[2] E. Cambria, N. Howard, J. Hsu, A. Hussain, Sentic blending: scalable multimodal fusion for continuous interpretation of semantics and sentics, in: IEEE SSCI, Singapore, 2013, pp. 108-117.
[3] G.-B. Huang, Q.-Y. Zhu, C.-K. Siew, Extreme learning machine: theory and applications, Neurocomputing 70 (1) (2006) 489-501.
[4] G.-B. Huang, E. Cambria, K.-A. Toh, B. Widrow, Z. Xu, New trends of learning in computational intelligence, IEEE Comput. Intell. Mag. 10 (2) (2015) 16-17.
[5] G. Huang, G.-B. Huang, S. Song, K. You, Trends in extreme learning machines: a review, Neural Netw. 61 (2015) 32-48.
[6] S. Decherchi, P. Gastaldo, R. Zunino, E. Cambria, J. Redi, Circular-ELM for the reduced-reference assessment of perceived image quality, Neurocomputing 102 (2013) 78-89.
[7] E. Cambria, P. Gastaldo, F. Bisio, R. Zunino, An ELM-based model for affective analogical reasoning, Neurocomputing 149 (2015) 443-455.
[8] E. Principi, S. Squartini, E. Cambria, F. Piazza, Acoustic template-matching for automatic emergency state detection: an ELM based algorithm, Neurocomputing 149 (2015) 426-434.
[9] S. Poria, E. Cambria, A. Hussain, G.-B. Huang, Towards an intelligent framework for multimodal affective data analysis, Neural Netw. 63 (2015) 104-116.
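To make the decision-level fusion idea above more concrete, here is a minimal sketch (not the authors' implementation) that combines the scores of three independent per-modality sentiment classifiers with fixed weights; the weights, class labels and score values are invented for the example.

// Minimal sketch of decision-level fusion of per-modality sentiment scores.
// The weights and scores are illustrative only, not taken from the paper.
public class DecisionLevelFusion {

    // Each classifier returns the probability of the "positive" class, in [0, 1].
    static double fuse(double textScore, double audioScore, double visualScore) {
        // Hypothetical modality weights; in practice they would be tuned on a validation set.
        double wText = 0.4, wAudio = 0.3, wVisual = 0.3;
        return wText * textScore + wAudio * audioScore + wVisual * visualScore;
    }

    public static void main(String[] args) {
        double fused = fuse(0.8, 0.6, 0.7);
        String label = fused >= 0.5 ? "positive" : "negative";
        System.out.printf("fused score = %.2f -> %s%n", fused, label);
    }
}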
Affective Computing
q How to describe the emotions
q How to model affective data
q Which emotions can be recognized
q How to perform emotion detection
q Research issues
q Demo: Affective Tool

Affective Computing
q Open Research Issues
² Privacy
   ² Issues might arise not only for the users of emotion-aware systems, but also for others in their vicinity.
   ² Data might describe not only the current affective state of an individual, but also his/her predicted future state.
² Implementation on resource-constrained devices
² Theoretical foundations of affective computing
² Building culturally transparent systems

Affective Computing
q How to describe the emotions
q How to model affective data
q Which emotions can be recognized
q How to perform emotion detection
q Research issues
q Demo: Affective Tool


Affective Computing
https://www.affectiva.com
https://www.affectiva.com/wp-content/uploads/2017/03/McDuff_2016_Affdex.pdf

... empathetic and provide a much more natural user interface.

The face is one of the richest channels of expression. It communicates both emotions and social signals. The Facial Action Coding System (FACS) [3] is the most comprehensive and widely used objective taxonomy for coding facial behavior. Facial action units (AUs) are the building blocks of facial expressions. Manual coding of FACS is extremely laborious and time consuming. Furthermore, it is impractical for any real-time or scaled application. In the past, systems for automated coding of facial expressions have been constrained due to the limited availability of training data. Using a web-based framework, similar to [6], we collected videos of hundreds of thousands of individuals. These videos were coded by expert FACS coders to provide a rich dataset of facial expression examples. We trained state-of-the-art facial action and emotion classifiers using this data. In this paper we present the AFFDEX software development kit (SDK). The SDK provides an easy interface for processing multiple faces within a video or live stream in real-time. The SDK has cross-platform capabilities. In addition to facial action and emotion expression classifiers, the SDK has classifiers for determining gender and whether the person is wearing glasses.

Automated Facial Coding
Our system for automated facial coding has four main components: 1) face and facial landmark detection, 2) face texture feature extraction, 3) facial action classification and 4) emotion expression modelling. Figure 1 shows an overview.

Figure 1: Automated facial coding pipeline. 1) Detection of face(s) and localization of the key facial landmarks on each face. 2) Extraction of texture features using HOG. 3) Classification of facial actions. 4) Modeling of prototypic emotions using EMFACS.

Face and Facial Landmark Detection
Face detection is performed using the Viola-Jones face detection algorithm [10]. Landmark detection is then applied to each facial bounding box and 34 landmarks are identified. If the confidence of the landmark detection is below a threshold, the bounding box is ignored. The facial landmarks, head pose and intraocular distance for each face are exposed in the SDK.

Facial Actions
Histogram of Oriented Gradient (HOG) features [2] are extracted from the image region of interest defined by the facial landmark points. Support Vector Machine (SVM) classifiers, trained on 10,000s of manually coded facial images collected from around the world, are used to provide scores from 0 to 100 for each facial action. For details of the training and testing scheme see [8].

Emotion Expressions
The emotion expressions (Anger, Disgust, Fear, Joy, Sadness, Surprise and Contempt) are based on combinations of facial actions. This coding was built on the EMFACS [4] emotional facial action coding system. The emotion expressions are given a similar score from 0 (absent) to 100 (present).

Gender and Glasses
In addition to the facial action and emotion expression classifiers, the SDK has classifiers for determining gender and whether the person is wearing glasses.

The classifiers have two operating modes: static and causal. The static classifiers allow classification of single images. The causal classifiers leverage temporal information available in video sequences to further increase the accuracy of the facial expression measures.
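To make the four-stage pipeline more concrete, the skeleton below mirrors the same flow: detect faces and landmarks, extract HOG texture features, score facial action units, then map action-unit scores to an emotion. Every interface and method name here is a hypothetical placeholder, not the AFFDEX SDK API.

import java.util.List;

// Skeleton of an automated facial coding pipeline in the spirit of Figure 1.
// All interfaces below are hypothetical placeholders, not the AFFDEX SDK API.
interface FaceDetector      { List<float[]> detectLandmarks(byte[] image); }          // 1) faces + landmarks
interface FeatureExtractor  { float[] hog(byte[] image, float[] landmarks); }         // 2) HOG texture features
interface ActionClassifier  { double score(float[] features, String actionUnit); }    // 3) per-AU score 0..100
interface EmotionModel      { double emotion(String name, double[] auScores); }       // 4) EMFACS-style mapping

class FacialCodingPipeline {
    private final FaceDetector detector;
    private final FeatureExtractor extractor;
    private final ActionClassifier classifier;
    private final EmotionModel emotions;

    FacialCodingPipeline(FaceDetector d, FeatureExtractor e, ActionClassifier c, EmotionModel m) {
        this.detector = d; this.extractor = e; this.classifier = c; this.emotions = m;
    }

    // Returns a joy score in [0, 100] for the first detected face, or -1 if no face is found.
    double joyScore(byte[] image) {
        List<float[]> faces = detector.detectLandmarks(image);
        if (faces.isEmpty()) return -1;
        float[] features = extractor.hog(image, faces.get(0));
        double[] auScores = {
            classifier.score(features, "AU6"),    // cheek raiser
            classifier.score(features, "AU12")    // lip corner pull
        };
        return emotions.emotion("Joy", auScores);
    }
}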

Affective Computing
https://www.affectiva.com

² Measures 7 emotions and 20 facial expressions.
² In addition to the facial action and emotion expression classifiers, the SDK provides classifiers for determining gender and whether the person is wearing glasses.
² Two operating modes: static and causal.

Performance Evaluation
The system was tested on an independent set of 10,000 images to verify the generalizability of the algorithms. We did not control the lighting or pose of the participants. Figure 2 shows example frames.

(Example images taken from the world's largest dataset of facial videos, from over 75 countries, in very diverse lighting and surroundings.)

AFFDEX SDK
We have created a Software Development Kit (SDK) to allow easy integration of the software into other applications. The hardware available impacts the number of frames that can be processed per second. Typically it is possible to achieve frame rates of 10 frames per second (FPS) on mobile devices and 30 FPS on laptop/desktop devices. Our demonstration will allow users to get real-time feedback on their facial expressions. Figure 3 shows an example of the mobile interface for our demo. Participants will be able to try interacting with an iPad and an Android tablet. Figure 4 shows the desktop demo that will allow users to test the multi-face capability on a larger screen. The SDK demo application code and the SDK are available for download at: http://www.affectiva.com/sdk

Applications
Emotion sensing offers a vast amount of potential for improving human-computer and human-human interactions. Below are some of the emerging applications that the SDK supports:

Video Conferencing. Non-verbal cues are critical to effective communication. In remote interactions these cues are lost. Systems that could recover affective signals would make remote communication easier.

Automated Tutors / Online Education. As distance learning becomes more popular, automated measurement of learners' emotional states becomes more critical. Having access to affective data could help educators improve the quality of content.

Life-logging and Health. Emotions impact many aspects of our daily lives. However, we often have difficulty reflecting on our emotional states. Life-logging tools and devices that help track emotions would be very useful [5,7,9] and would provide a way of linking lifestyle patterns to changes in mood.

Gaming. Computer games that respond to human emotions offer a new dimension to game-play, with characters that adapt naturally to the player.

Table 1: The emotion expressions detected by the AFFDEX SDK. Each emotion is given a score from 0 (absent) to 100 (present).
Disgust, Fear, Joy, Sadness, Surprise, Anger, Contempt

Table 2: The facial actions detected by the SDK. Each action is given a score from 0 (absent) to 100 (present). *Smirk is defined as an asymmetric lip corner pull (either on the right or left side of the face, but not both).
AU1 In. Brow Raise, AU2 Out. Brow Raise, AU4 Brow Furrow, AU9 Nose Wrinkle, AU10 Upper Lip Raise, AU12 Lip Corner Pull, AU15 Lip Depress, AU17 Chin Raise, AU18 Lip Pucker, AU20 Lip Press, AU25 Mouth Open, AU28 Lip Suck, AU43 Eyes Closed, Smirk*
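The difference between the static and causal operating modes is essentially whether temporal information from the video is exploited. Purely as an illustration of the idea (the SDK's actual causal classifiers are not documented here), per-frame emotion scores can be smoothed with an exponential moving average:

// Illustrative exponential moving average over per-frame emotion scores (0..100).
// This only sketches the idea of a "causal" mode; it is not the SDK's implementation.
public class CausalSmoother {
    private final double alpha;   // smoothing factor in (0, 1]; smaller = smoother
    private double state = Double.NaN;

    public CausalSmoother(double alpha) { this.alpha = alpha; }

    public double update(double frameScore) {
        state = Double.isNaN(state) ? frameScore : alpha * frameScore + (1 - alpha) * state;
        return state;
    }

    public static void main(String[] args) {
        CausalSmoother joy = new CausalSmoother(0.3);
        for (double s : new double[]{10, 80, 85, 20, 90}) {
            System.out.printf("raw=%.0f smoothed=%.1f%n", s, joy.update(s));
        }
    }
}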

Affective Computing
https://www.affectiva.com

Affective Computing
https://knowledge.affectiva.com/v3.2/docs/analyze-a-photo-2

CREATE THE JAVA CLASS

public class MyActivity extends Activity implements Detector.FaceListener {
    @Override
    protected void onCreate(Bundle savedInstanceState) {
        super.onCreate(savedInstanceState);
        detector.setFaceListener(this);   // 'detector' is the Affdex Detector created for this Activity
    }
}
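For MyActivity to actually satisfy Detector.FaceListener, its callbacks must also be implemented. As far as I recall from the Affdex Android SDK these are onFaceDetectionStarted() and onFaceDetectionStopped(); verify the exact names against the SDK javadoc.

// Hedged sketch: listener callbacks to add inside MyActivity (names from memory, check the SDK docs).
@Override
public void onFaceDetectionStarted() {
    // a face has entered the frame, e.g. show the measurement overlay
}

@Override
public void onFaceDetectionStopped() {
    // no face is visible any more, e.g. hide the overlay
}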

ENABLE/DISABLE THE DETECTION MODULES

detector.setDetectAllExpressions(true);
detector.setDetectAllEmotions(true);
detector.setDetectAllEmojis(true);
detector.setDetectAllAppearances(true);

Affective Computing
https://knowledge.affectiva.com/v3.2/docs/analyze-a-photo-2

INITIALIZE THE DETECTOR

detector.start();
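The slide shows only the start() call; the detector object itself has to be created and configured first. A plausible sequence for analysing a single photo with the Affdex Android SDK is sketched below; the PhotoDetector constructor signature is quoted from memory and should be checked against the official docs.

// Hedged sketch: creating and configuring the detector before calling start().
PhotoDetector detector = new PhotoDetector(this);   // 'this' is an Android Context, e.g. the Activity
detector.setDetectAllEmotions(true);                // enable the modules needed (see previous slide)
detector.setImageListener(this);                    // the Activity implements Detector.ImageListener
detector.start();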

PROCESS A PHOTO

byte[] frameData;   // raw image bytes, e.g. a YUV preview frame from the camera
int width = 640;
int height = 480;
Frame.ByteArrayFrame frame = new Frame.ByteArrayFrame(frameData, width, height, Frame.COLOR_FORMAT.YUV_NV21);
detector.process(frame);

Affective Computing
https://knowledge.affectiva.com/v3.2/docs/analyze-a-photo-2

PROCESS THE RESULTS

public void onImageResults(List<Face> faces, Frame image, float timestamp) {
    // For each face found
    for (int i = 0; i < faces.size(); i++) {
        Face face = faces.get(i);
        int faceId = face.getId();

        // Appearance
        Face.GENDER genderValue = face.appearance.getGender();
        Face.GLASSES glassesValue = face.appearance.getGlasses();
        Face.AGE ageValue = face.appearance.getAge();
        Face.ETHNICITY ethnicityValue = face.appearance.getEthnicity();

        // Some emotions
        float joy = face.emotions.getJoy();
        float anger = face.emotions.getAnger();
        float disgust = face.emotions.getDisgust();
        ....
    }
}


Affective Computing
AFFECTIVE DRIVE: example of an Android application built with the Affdex SDK (author: Luca D'Ambrosio).

App guide
1. Pressing the "Start Trip" button tells the system to start tracking the sensors.
2. By pressing the settings button, the user can decide to start tracking his/her trip with the default thresholds or change them according to his/her needs. (The default thresholds were chosen after running several tests and analysing the results reported by the system.)
3. The user can read a guide to the application by clicking the "?" button, or open the settings and app-explanation screens from the corresponding buttons.
4. While the user's driving behaviour is being monitored, the class predicted by the system is shown on the screen.
5. When a dangerous situation is detected, the camera is activated: it tracks the user's face and warns him/her if the measured values exceed the thresholds configured in the system.

Architecture and design choices (thesis Sec. 2.1): the chapter describes how the system works and how it was designed, specifying its architecture and the components used. The application follows the scheme shown in Fig. 2.1 (AffectiveDrive application scheme).

Application functionality (thesis Sec. 2.2): the classification model was built from a dataset downloaded from the Internet [35] and used in [36]. This solution was preferred because no car was available and because collecting driving data would require performing dangerous and risky manoeuvres, better suited to expert drivers. The works consulted ([18], [16], [8]) adopted the same solution to protect the driver's safety.

Screenshots: Fig. 2.4 (initial view of the application), Fig. 2.5 (settings screen), Fig. 2.6 (app-explanation view), Fig. 2.7 (main view showing the trend of the driving behaviour during the trip).
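As a rough sketch of the behaviour described in steps 4 and 5 (watch the predicted driving class, and when a dangerous situation is detected start tracking the driver's face and warn when the configured thresholds are exceeded), consider the following. All class and method names are invented for illustration; none of this code comes from the thesis.

// Illustrative only: threshold logic in the spirit of the AffectiveDrive steps above.
public class DrivingMonitor {
    enum DrivingClass { NORMAL, AGGRESSIVE, DANGEROUS }

    private final double drowsinessThreshold;   // configurable in the settings screen
    private boolean faceTrackingActive = false;

    public DrivingMonitor(double drowsinessThreshold) {
        this.drowsinessThreshold = drowsinessThreshold;
    }

    // Called for every new prediction of the driving-behaviour classifier (step 4).
    public void onPredictedClass(DrivingClass predicted) {
        if (predicted == DrivingClass.DANGEROUS && !faceTrackingActive) {
            faceTrackingActive = true;   // step 5: activate the camera / face tracking
            System.out.println("Dangerous driving detected: starting face tracking");
        }
    }

    // Called with a face-derived measure (e.g. a drowsiness score) while tracking is active.
    public void onFaceMeasure(double drowsinessScore) {
        if (faceTrackingActive && drowsinessScore > drowsinessThreshold) {
            System.out.println("Warning: driver appears drowsy");   // alert the driver
        }
    }
}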