A Comparison of Humans and Machine Learning Classifiers Detecting Emotion from Faces of People with Different Coverings

Harisu Abdullahi Shehu · Will N. Browne · Hedwig Eisenbarth

Received: 9th August 2021 / Accepted: date
Abstract Partial face coverings such as sunglasses and face masks unintentionally obscure facial expressions, causing a loss of accuracy when humans and computer systems attempt to categorise emotion. Experimental psychology would benefit from understanding how computational systems differ from humans when categorising emotions, as automated recognition systems become available. We investigated the impact of sunglasses and different face masks on the ability to categorize emotional facial expressions in humans and computer systems, to directly compare accuracy in a naturalistic context. Computer systems, represented by the VGG19 deep learning algorithm, and humans assessed images of people with varying emotional facial expressions and with four different types of coverings, i.e. unmasked, with a mask covering the lower face, with a partial mask with a transparent mouth window, and with sunglasses. Furthermore, we introduced an approach to minimize misclassification costs for the computer systems while maintaining a reasonable accuracy.

Computer systems performed significantly better than humans when no covering was present (>15% difference). However, the achieved accuracy differed significantly depending on the type of covering and, importantly, the emotion category. We also found that, while humans mainly classified unclear expressions as neutral when the fully covering mask was used, the misclassification varied in the computer systems, which would affect their performance in automated recognition systems. Importantly, the variation in the misclassification can be adjusted by varying the balance of categories in the training set.

Keywords COVID-19 · Emotion classification · Facial expression · Facial features · Face masks

Harisu Abdullahi Shehu (Corresponding author)
School of Engineering and Computer Science, Victoria University of Wellington, 6012 Wellington
E-mail: [email protected]
ORCID: 0000-0002-9689-3290

Will N. Browne
School of Electrical Engineering and Robotics, Queensland University of Technology, Brisbane QLD 4000, Australia
Tel: +61 45 2468 148
E-mail: [email protected]
ORCID: 0000-0001-8979-2224

Hedwig Eisenbarth
School of Psychology, Victoria University of Wellington, 6012 Wellington
Tel: +64 4 463 9541
E-mail: [email protected]
ORCID: 0000-0002-0521-2630
1 Introduction
Facial expressions convey information regarding human emotion in non-verbal communication. Thus, an accurate categorization of a person's emotional facial expression allows understanding of their needs, intentions, and potential actions. This is valuable insight for psychological experiments, but time-consuming to collect manually. Hence, computational methods are needed to analyze facial expressions, but how do they differ from human classifiers, especially in different circumstances?
Face coverings such as sunglasses and face masks are worn for different reasons. For instance, people use sunglasses to protect the eyes from sunlight or to improve their appearance, and people often use face masks to prevent the spread of infectious diseases such as the coronavirus (COVID-19) or for hygienic reasons in hospitals or food preparation. However, these coverings not only hide parts of a face but might also affect social interaction, as they obscure parts of facial expressions, which might contain socially relevant information even if the expression does not represent the emotional state of a person. While sunglasses cover an estimated 10-15% of the face, approximately 60-65% of the face is covered by face masks – exact numbers are hard to determine, as this varies across people, depending on the size of the face (Carbon, 2020). More importantly, these coverings obscure different parts of the face that convey information about emotional expressions to a varying extent (Roberson et al., 2012).
Different studies have shown that humans are far from perfect at assessing the emotional states of people by just looking at their face (Picard et al., 2001). Further, humans remain far from perfect even when it comes to categorizing emotional labels from facial expressions (Lewinski et al., 2014). A further occlusion of the face leads to an even greater decrease in the performance of humans in categorizing emotion from faces. For instance, it has been observed that a partial occlusion of the eyes (Noyes et al., 2021) impaired the performance achieved by humans, especially for categorizing sad expressions (Miller et al., 2020; Wegrzyn et al., 2017; Yuki et al., 2007). On the other hand, while negative emotions such as sadness and fear are the most difficult for humans to categorize (Camras & Allison, 1985; Guarnera et al., 2018) compared to positive emotions like happiness (Calvo & Lundqvist, 2008; Hugenberg, 2005; Van den Stock et al., 2007; Wacker et al., 2003), obscuring the mouth leads to less accurate categorization even for happiness (Wegrzyn et al., 2017).
More specifically, Carbon (2020) investigated the impact of face masks on humans evaluating six different categories of emotion (anger, disgust, fear, happy, neutral, and sad) from images of people. They found a significant decrease in the performance of their human participants across all expressions except for fear and neutral expressions. In addition, emotional expressions such as happy, sad, and anger were repeatedly confused with neutral expressions, whereas other emotions like disgust were confused with anger expressions. In accordance with Carbon, several studies confirmed the detrimental effect of face masks on emotion categorization by humans (Grundmann et al., 2021; Marini et al., 2021; Noyes et al., 2021). However, none of these studies directly compared the impact of face masks on humans with that on machine learning classifiers (computer systems).
In contrast to humans, computer systems have done well in categorizing emotion from in-distribution face images (images from the same dataset), achieving an accuracy of over 95%. Nevertheless, the performance of these systems is greatly affected when it comes to categorizing emotion from natural imagery, where the sun is free to shine and people are free to move, dramatically changing the appearance of images received by these computer systems (Picard et al., 2001). For instance, a change in the colour information of the pixels, or in the foreground colour of the whole or a small region of the face, which are nonlinearly confounded with emotional expressions, can lead to a decrease in the performance of these models by up to 25% (Shehu et al., 2020a; Shehu, Siddique, et al., 2021).
Furthermore, face coverings such as sunglasses and face masks reduce the performance of computer systems by up to 74% (Shehu, Browne, et al., 2021). However, it is unknown to what extent the same type of sunglasses and face masks would affect the performance of humans in direct comparison to computer systems in emotion categorization of the same naturalistic stimuli. Moreover, the ecological validity of these computer systems remains questionable, as they are mainly trained and evaluated with in-distribution data.
We aim to analyze the limitations of humans and computer systems in categorizing emotion from covered and uncovered faces, as well as the misclassification cost of the computer systems.

Initially, we will compare the impact of sunglasses and different types of face masks (i.e. fully covering masks and the recently introduced face masks with a transparent window (Coleman, 2020)) on humans' and computer systems' ability to categorize emotional facial expressions. This direct comparison using the same stimuli across different emotion categories and different naturalistic face coverings addresses open questions about the differential relevance of facial features for both types of observers.
With respect to the computer systems, classifier induction often assumes that the best distribution of classes within the data is an equal distribution, i.e. where there is no class imbalance (Grzymala-Busse, 2005; Japkowicz et al., 2000). However, research has shown that this assumption is often not true and might lead to a model that is very costly when it misclassifies an instance (Ciraco et al., 2005; Kukar et al., 1999; Pazzani et al., 1994). This might also be true for emotion classification; e.g. when the fully covering mask was used to cover the face, emotion categories like happiness were classified poorly by the computer systems, with the majority of the happy expression images misclassified as anger expressions (Shehu, Browne, et al., 2021).

Thus, a secondary objective is to investigate how the data distribution affects misclassification between humans and computer systems. This has practical importance: when positive expressions such as happiness are misclassified as negative emotional expressions like anger or sadness, artificial intelligence systems such as robots might refrain from interacting with a person, taking actions on the misbelief that the person is angry or sad. Conversely, the computer systems might continuously try to interact with a person if an expression such as anger or sadness were misclassified as happiness, which could pose a potential threat to the safety of humans interacting with these computer systems.
This is important because it is not only the error rate that matters. In most cases, what matters most is "where" the misclassification occurs, i.e. the negative effect that may be caused by computer systems misclassifying human facial expressions. For these reasons, the cost of misclassification needs to be assessed for computer systems trained with data that is oversampled based on an expression that is considered to pose less threat. This is anticipated to change the behaviour of the computer systems such that they misclassify unclear expressions as expressions that will pose less threat to the safety of humans interacting with them.
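The idea of biasing a classifier through the training distribution can be sketched as follows. This is a minimal illustration, not the paper's exact procedure; the choice of the neutral category as the "low-threat" class and the duplication factor are assumptions made for the example.

```python
import random

def oversample_class(images, labels, target_label, factor=2, seed=0):
    """Duplicate training examples of one class (e.g. a low-threat
    category such as 'neutral') so that ambiguous inputs become more
    likely to be misclassified into that class rather than a high-cost one."""
    extra = [(img, lab) for img, lab in zip(images, labels)
             if lab == target_label]
    augmented = list(zip(images, labels)) + extra * (factor - 1)
    random.Random(seed).shuffle(augmented)
    imgs, labs = zip(*augmented)
    return list(imgs), list(labs)
```

With `factor=2`, every example of the target class appears twice in the training set while all other classes are left unchanged, shifting the decision boundary toward the oversampled category.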
Therefore, as an additional analysis, we will introduce an approach to minimize the misclassification costs for the computer systems while maintaining an equivalent or relatively similar accuracy. Considering that these coverings are more ecologically valid, as they are widely used nowadays, understanding these limitations and regulating the cost of misclassifying faces with these coverings can inform the production of artificial emotionally intelligent systems (Berryhill et al., 2019) in real-world situations.
2 Method

This study involved collecting and analyzing data from humans and computer systems to compare their accuracy and confidence in categorizing emotion from covered and uncovered faces.
2.1 Human sample

Ethical approval for this study was obtained from the Human Ethics Committee (HEC) of Victoria University of Wellington and the study was preregistered on OSF (https://osf.io/mgx9p). Participants gave informed consent for participation in the study and could enter a raffle/draw to win a $50 voucher as compensation.
All participants were recruited online via SurveyCircle and social media sites such as Facebook, Twitter, and Instagram through snowball sampling. A power calculation, based on the effect size obtained for categorizing the "afraid" class in Eisenbarth et al.'s (2008) study, was performed to estimate the required sample size, which we found to be 52. However, the study aimed to recruit at least 100 participants to allow for potential missing data.
Of 154 people who started the survey, 82 participants completed the full study. Of those 82 participants, 43 were male, 37 were female, and 2 either did not specify their gender or identified themselves as belonging to other groups. The age of the participants ranged from 19 to 70, with a mean of 27.88 (SD = 7.53). 28 of the participants self-identified as white, 11 as Black or African American, 24 as Asian, 4 as Arab, 3 as multicultural, and the remaining 12 either did not answer the question or identified themselves as belonging to other groups.
Data were collected via SoSciSurvey (https://www.soscisurvey.de/) between the 21st of December 2020 and the 21st of February 2021, during the COVID-19 pandemic.
2.2 Computer systems sample

The computer systems sample was represented by artificial emotion classification systems. One of the most commonly used deep learning algorithms, VGG19 with 19 weight layers, was used as an example of a machine learning algorithm (computer system). VGG19 was chosen as it is a well-suited standard model and has been shown to achieve high performance (Cheng & Zhou, 2020; Mohan et al., 2021; Shehu, Siddique, et al., 2021).
The VGG19 model was set up using the Keras API (2021) to have the same number of layers as originally introduced in the VGG paper (Simonyan & Zisserman, 2014). The model was set up to run for 200 epochs with an initial learning rate of 0.001. The learning rate was reduced by 10% after 80, 100, 120, and 160 epochs, and by 5% after 180 epochs to avoid overfitting the data¹. Note that the computer systems were trained with images from the last-half frames of the CK+ dataset based on the technique developed by Shehu et al. (2020b).

¹ Overfitting occurs when a model learns details, including noise, in the training data to such an extent that it negatively impacts the performance of the model on new data. As a result, the model performs very well on the training data but poorly on the test data.
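The epoch-based schedule described above can be sketched as a Keras-style scheduler function. This is an illustrative reconstruction, assuming "reduced by 10%" means multiplying the current rate by 0.9 (and by 0.95 after epoch 180); the authors' exact implementation is not given.

```python
def lr_schedule(epoch, base_lr=0.001):
    """Return the learning rate for a given epoch.
    Assumed multiplicative decay: x0.9 after epochs 80, 100, 120, 160
    and a further x0.95 after epoch 180."""
    lr = base_lr
    for milestone in (80, 100, 120, 160):
        if epoch >= milestone:
            lr *= 0.9          # "reduced by 10%"
    if epoch >= 180:
        lr *= 0.95             # "reduced by 5%"
    return lr

# With Keras, this could be attached as a callback, e.g.:
# model.fit(..., callbacks=[keras.callbacks.LearningRateScheduler(lr_schedule)])
```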
The accuracy of the VGG19 model is obtained by dividing the number of correctly predicted test images by the total number of presented test images, which in this case is 35 for each type of covering. In VGG19, the prediction for each image returns the probability of associating the image with a particular emotion category, where the probabilities sum to 1 ($\sum_{i=1}^{n} P_i = 1$). The predicted class is the class with the highest probability ($\arg\max_{x \in [1, \dots, n]} P(x)$, where $n$ is the number of emotion categories). These probabilities are usually considered as the confidence of the model.
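Under these definitions, prediction, confidence, and accuracy can be computed from the softmax output as follows (a sketch with NumPy; the array shapes are assumptions for illustration):

```python
import numpy as np

def predict_and_score(probs, true_labels):
    """probs: (n_images, n_categories) softmax outputs; each row sums to 1.
    Returns predicted classes, per-image confidence, and overall accuracy."""
    preds = probs.argmax(axis=1)              # class with highest probability
    confidence = probs.max(axis=1)            # that probability = model confidence
    accuracy = (preds == true_labels).mean()  # correct / total presented images
    return preds, confidence, accuracy
```

For example, with the 35 test images per covering described above, `accuracy` is simply the fraction of the 35 that were predicted correctly.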
Due to the stochastic nature of the training process and the non-deterministic nature of the VGG19 model, the model was run 30 times and the results presented are an average over the 30 runs. Thus, the study compared the survey results obtained from 82 human participants with those of the 30 computer systems (or machine runs).
2.3 Stimulus Material

We included seven categories of emotion, i.e. the six basic emotions defined by Ekman (1971), as well as the neutral expression. The experiment consisted of four blocks presenting facial expressions unmasked, masked, partially masked, and with sunglasses. The unmasked block consisted of images with nothing added to the face. Five images were used for each of the seven categories, i.e. a total of 35 images. The images mainly comprised different people, such that participants could not infer the facial expression from a previously seen image. Similarly, the masked block consisted of 35 emotional images of people with face masks added to the images. The partially masked block consisted of 35 emotional images of people with a partial mask, i.e. a face mask with a transparent window, added to the images, and finally, the sunglasses block consisted of 35 images of people with sunglasses added to the images. All the images used in the experiment are emotional images from the CK+ dataset, as images from this dataset have recently been used in a considerable number and a wide range of studies (Li et al., 2018; Mollahosseini et al., 2016; Zhang et al., 2019). However, only one block (unmasked) consisted of original images from the CK+ dataset. The remaining three blocks (mask, partial mask, and sunglasses) consisted of simulated images with different kinds of face masks, as well as sunglasses. Face masks and sunglasses were added to the images using a strategy that we implemented in our previous research (Shehu, Browne, et al., 2021). Fig. 1a shows sample original and simulated emotional images of the six basic emotions, plus the neutral expression used here.
2.4 Descriptive questions

Figure 1: Sample images of faces: a) unmasked (first row), masked (second row), partial mask (third row), and sunglasses (last row) from the CK+ dataset. Left to right: anger, disgust, fear, happy, neutral, sad, and surprise expressions (Shehu et al., 2020b). b) Example trial for the human sample.

Human participants were asked to answer a set of questions about their age, gender, and ethnicity. Furthermore, we asked the participants three additional questions to understand the extent of their prior experience with people wearing masks.
The three questions were: "1) For how long did you interact with people wearing masks on a daily basis during the past 3 months?", "2) On average, for how much time per day (in percentage) did you interact with people wearing masks during the past 3 months?", and "3) Of the people you saw on a day-to-day basis during the past 3 months, approximately what percentage were wearing masks on average?". All questions were asked prior to taking the survey and participants answered using a slider ranging from 0 ("not at all") to 100 ("very often"). A mean exposure score (Qmean) was computed by averaging the three questions (M = 36.15, SD = 27.62).
2.5 Emotion categorization task

Fig. 1b shows a sample trial. For each trial, participants were required to choose the emotion category that the image most closely relates to, as well as to rate their confidence in choosing the category of the emotion. There was a total of 140 trials across the 7 categories. All emotion categorization questions were required to be answered, and participants were not taken to the next page until both the emotion category and the confidence rating had been given (see Supplementary Materials).

The data collected, as well as the stimulus set used, can be accessed via OSF (https://osf.io/mgx9p/).
2.6 Analysis

Group and stimulus effects on accuracy and confidence ratings were tested using a 2×4×7 repeated-measures ANOVA with the factor Emotion category with 7 levels (anger, disgust, fear, happy, neutral, sad, and surprise), the factor Group (computer systems vs humans) and the factor Covering (unmasked, masked, partial mask, and sunglasses). Post-hoc tests included 2×4 repeated-measures ANOVAs and t-tests.

For all multiple tests, such as the post-hoc analyses after the initial overall ANOVAs, the alpha level was adjusted using Bonferroni corrections.
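The Bonferroni adjustment itself is simple: the per-test alpha is the family-wise alpha divided by the number of comparisons. The comparison counts below are illustrative assumptions; they are consistent with the adjusted levels reported in the Results (0.0083 and 0.0014), which correspond arithmetically to 0.05/6 and 0.05/35.

```python
def bonferroni_alpha(family_alpha, n_tests):
    """Per-test significance level under Bonferroni correction."""
    return family_alpha / n_tests

# round(bonferroni_alpha(0.05, 6), 4)  -> 0.0083
# round(bonferroni_alpha(0.05, 35), 4) -> 0.0014
```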
3 Results

This section presents the accuracy ratings obtained from humans compared with computer systems in categorizing emotion from covered and uncovered faces.
3.1 Emotion categorization

Initially, a 3-way repeated-measures ANOVA (2×4×7) was performed with the between-subject factor Group (i.e. humans vs computer systems) and the within-subject factors Covering (i.e. unmasked, mask, partial mask, and sunglasses) and Emotion (i.e. anger, disgust, fear, happy, neutral, sad, and surprise) to analyze the achieved accuracy. We found significant main effects for Group, F(1, 3080) = 80.25, p < .001, η² = 0.01, Covering, F(3, 3080) = 40.44, p < .001, η² = 0.09, and Emotion, F(6, 3080) = 81.87, p < .001, η² = 0.02. All interactions between the factors were significant (see Table 1a), as was the three-way interaction of Group × Covering × Emotion, F(18, 3080) = 28.11, p < .001, η² = 0.09.

Thereafter, the alpha level was adjusted using Bonferroni correction (α = 0.0083) before post-hoc tests. A significant effect of emotion category (i.e. anger, disgust, fear, happy, neutral, sad, and surprise) was found for humans, F(6, 567) = 33.18, p < .001, η² = 0.26, and computer systems, F(6, 203) = 33.14, p < .001, η² = 0.50, for the unmasked condition (see Fig. 2a).
Fig. 2a presents violin plots for the categorization accuracy of unmasked images of people by both humans and computer systems. As can be seen from Fig. 2a, the computer systems classify almost all the classes correctly, achieving an overall accuracy of 98.48%, compared to humans, who achieved an overall accuracy of 82.72%. This decrease in the overall performance of humans is based on significantly lower accuracy in categorizing anger, disgust, and fear expressions, with fear contributing the most (error rate > 30%) to the significantly worse performance (see Table 1b) in humans. However, Bonferroni-corrected post-hoc t-tests showed no significant difference in the categorization accuracies achieved by the computer systems across emotion categories, except for the neutral expression (see Table 1c), which was categorized with significantly lower accuracy than the other emotion categories.

As can be seen, the classification accuracy achieved by the computer systems decreased significantly, from an average accuracy of 98.48% when there was no mask to 24.0% when there were masks on the face (see Fig. 2b). Not only did the average accuracy decrease, but the accuracy achieved in each class also decreased significantly (> 68%), except for anger (less than 16%). Although the overall accuracy achieved by humans with masks on the face also decreased in each category (compared to when there was no mask), the decrease is not large (< 26%) compared to the decrease seen in the computer systems' accuracy (up to 74.48%). Also, post-hoc tests showed that differences between emotion categories were significant for both humans, F(6, 567) = 64.85, p < .001, η² = 0.41, and computer systems, F(6, 203) = 125.79, p < .001, η² = 0.79 (α = 0.0083), for the masked faces. Post-hoc two-sample t-tests on the accuracies in each category obtained from humans and the computer systems showed significant differences (p < .001) for most of the emotion categories at α = 0.0014 (see Tables 1d and 1e).
Figure 2: Violin plots showing accuracy obtained from categorizing a) unmasked, b) masked, c) partial mask, and d) sunglasses emotions by humans and computer systems.

The violin plots presented in Fig. 2c show accuracies obtained from categorizing images by humans and computer systems when a partial mask (mask
                                 Sum Square   df      F       p      Eta Squared
Intercept                        1.76E+05     1.0     438.05  <.001  0.08
C(Group)                         3.23E+04     1.0     80.25   <.001  0.01
C(Covering)                      4.88E+04     3.0     40.44   <.001  0.09
C(Emotion)                       1.98E+05     6.0     81.87   <.001  0.02
C(Group):C(Covering)             1.78E+04     3.0     14.73   <.001  0.09
C(Group):C(Emotion)              1.94E+05     6.0     80.44   <.001  0.01
C(Covering):C(Emotion)           1.56E+05     18.0    21.55   <.001  0.07
C(Group):C(Covering):C(Emotion)  2.03E+05     18.0    28.11   <.001  0.09
Residual                         1.24E+06     3080.0  -       -      -
(a) ANOVA comparing humans and computer systems
         Anger          Disgust        Fear           Happy          Neutral        Sad            Surprise
Anger    -              0.123 (-0.40)  0.341 (0.25)   <.001 (-1.63)  <.001 (-1.00)  <.001 (-0.96)  <.001 (-1.20)
Disgust  0.123 (-0.40)  -              0.025 (0.60)   <.001 (-1.08)  0.049 (-0.52)  0.048 (-0.52)  0.008 (-0.72)
Fear     0.341 (0.25)   0.025 (0.60)   -              <.001 (-1.59)  <.001 (-1.11)  1.00 (-1.08)   1.00 (-1.27)
Happy    <.001 (-1.63)  <.001 (-1.08)  <.001 (-1.59)  -              0.008 (0.71)   0.056 (0.51)   0.113 (0.42)
Neutral  <.001 (-1.00)  0.049 (-0.52)  <.001 (-1.11)  0.008 (0.71)   -              0.832 (-0.06)  0.354 (-0.24)
Sad      <.001 (-0.96)  0.048 (-0.52)  <.001 (-1.08)  0.056 (0.51)   0.832 (-0.06)  -              0.558 (-0.15)
Surprise <.001 (-1.20)  0.008 (-0.72)  <.001 (-1.27)  0.113 (0.42)   0.354 (-0.24)  0.558 (-0.15)  -
(b) unmasked by humans
         Anger         Disgust       Fear          Happy         Neutral       Sad           Surprise
Anger    -             1.00 (0)      1.00 (0)      1.00 (0)      <.001 (1.51)  1.00 (0)      1.00 (0)
Disgust  1.00 (0)      -             1.00 (0)      1.00 (0)      <.001 (1.51)  1.00 (0)      1.00 (0)
Fear     1.00 (0)      1.00 (0)      -             1.00 (0)      <.001 (1.51)  1.00 (0)      1.00 (0)
Happy    1.00 (0)      1.00 (0)      1.00 (0)      -             <.001 (1.51)  1.00 (0)      1.00 (0)
Neutral  <.001 (1.51)  <.001 (1.51)  <.001 (1.51)  <.001 (1.51)  -             <.001 (1.51)  <.001 (1.51)
Sad      1.00 (0)      1.00 (0)      1.00 (0)      1.00 (0)      <.001 (1.51)  -             1.00 (0)
Surprise 1.00 (0)      1.00 (0)      1.00 (0)      1.00 (0)      <.001 (1.51)  1.00 (0)      -
(c) unmasked by computer systems
         Anger          Disgust        Fear           Happy          Neutral        Sad            Surprise
Anger    -              0.027 (0.59)   0.230 (-0.32)  <.001 (-1.36)  <.001 (-2.00)  0.276 (0.29)   <.001 (-0.91)
Disgust  0.027 (0.59)   -              0.002 (-0.85)  <.001 (-2.07)  <.001 (-2.78)  0.327 (-0.26)  <.001 (-1.67)
Fear     0.230 (-0.32)  0.002 (-0.85)  -              0.002 (-0.85)  <.001 (-1.40)  0.036 (0.56)   0.106 (0.43)
Happy    <.001 (-1.36)  <.001 (-2.07)  0.002 (-0.85)  -              0.014 (-0.66)  <.001 (1.58)   0.020 (0.62)
Neutral  <.001 (-2.00)  <.001 (-2.78)  <.001 (-1.40)  0.014 (-0.66)  -              <.001 (2.18)   <.001 (1.38)
Sad      0.276 (0.29)   0.327 (-0.26)  0.036 (0.56)   <.001 (1.58)   <.001 (2.18)   -              <.001 (-1.17)
Surprise <.001 (-0.91)  <.001 (-1.67)  0.106 (0.43)   0.020 (0.62)   <.001 (1.38)   <.001 (-1.17)  -
(d) masked by humans
         Anger         Disgust        Fear           Happy          Neutral        Sad            Surprise
Anger    -             <.001 (5.03)   <.001 (2.90)   <.001 (6.05)   <.001 (3.61)   <.001 (4.69)   <.001 (4.19)
Disgust  <.001 (5.03)  -              <.001 (-1.80)  <.001 (1.00)   0.017 (-0.64)  0.072 (-0.48)  <.001 (-1.12)
Fear     <.001 (2.90)  <.001 (-1.80)  -              <.001 (2.63)   0.002 (0.87)   <.001 (1.45)   <.001 (0.94)
Happy    <.001 (6.05)  <.001 (1.00)   <.001 (2.63)   -              <.001 (-1.24)  <.001 (-1.62)  <.001 (-2.36)
Neutral  <.001 (3.61)  0.017 (-0.64)  0.002 (0.87)   <.001 (-1.24)  -              0.230 (0.32)   0.605 (-0.13)
Sad      <.001 (4.69)  0.072 (-0.48)  <.001 (1.45)   <.001 (-1.62)  0.230 (0.32)   -              0.017 (-0.64)
Surprise <.001 (4.19)  <.001 (-1.12)  <.001 (0.94)   <.001 (-2.36)  0.605 (-0.13)  0.017 (-0.64)  -
(e) masked by computer systems
Table 1: Statistical test results presenting a) the ANOVA obtained from comparing humans and computer systems based on their achieved accuracy in categorizing people's emotions with different types of coverings; b) post-hoc t-test results showing the p-value and Cohen's d effect size obtained from comparing unmasked emotion categories for humans and c) computer systems; d) post-hoc t-test results showing the p-value and Cohen's d effect size obtained from comparing masked emotion categories for humans and e) computer systems. Note that the α level in all the post-hoc comparisons has been adjusted to 0.0014 using Bonferroni correction.
with a transparent window) was used to cover the face. In contrast to the fully covering mask, the classification accuracy obtained by the computer systems increased greatly for happiness and surprise expressions with partial masks (0% vs 66.67% and 20% vs 73.33%). Also, while the accuracy achieved by humans increased across all the emotion categories, the largest increases were found for happy, disgust, sad, and surprise expressions (all > 19% increase). Post-hoc tests revealed significant differences for both humans, F(6, 567) = 64.01, p < .001, η² = 0.40, and computer systems, F(6, 203) = 235.28, p < .001, η² = 0.87 (α = 0.0083). Post-hoc two-sample unpaired t-tests showed significant differences (p < .001) for most emotion categories (see Tables 2a and 2b).

The violin plots presented in Fig. 2d show the accuracy obtained from categorizing faces with sunglasses by humans and computer systems. As can be seen, the accuracy achieved by the computer systems decreased most for sad and surprise expressions with sunglasses (all > 27%). Notably, the accuracy achieved by humans in categorizing happiness remained the same with or without sunglasses. This shows that putting on sunglasses does not reduce humans' happiness detection. As expected, there was a large decrease (> 16%) in the accuracy for fear expressions when the eyes were covered. Post-hoc tests showed that these differences between emotion categories were significant for both humans, F(6, 567) = 74.15, p < .001, η² = 0.44, and computer systems, F(6, 203) = 13.85, p < .001, η² = 0.29 (α = 0.0083), for faces with sunglasses. Post-hoc two-sample unpaired t-tests showed significant differences (p < .001) for most of the emotion categories (see Tables 2c and 2d).
We also tested for differences within each emotion category across the different types of covering (i.e. unmasked, mask, partial mask, and sunglasses). A significant main effect was found for all emotion categories (all p < .001), except for the neutral expression, F(3, 324) = 0.037, p = .99, η² = 0.003 (see Fig. 3a). Although there was a significant main effect, F(3, 324) = 5.36, p < .001 (p = 0.00129), η² = 0.05, for humans categorizing the fear expression across the different types of coverings (see Fig. 3b), the effect size was very small (η² = 0.05) compared with other emotion categories (all η² > 0.2), like anger (see Fig. 3c), that have a larger effect.

It can thus be suggested that humans might be equally poor (see Fig. 3b) at categorizing the fear emotion category with or without a covering on the face, as the effect size is not as large as it is for other expressions (all η² > 0.2, except the neutral expression, η² = 0.003).

Based on the three additional questions asked about exposure to face-masked people, we added Qmean as a factor to a linear regression to investigate whether people's experience with masked individuals might help them achieve higher accuracy in categorizing masked emotion images. The results (see Table 2e) showed significant main effects for the factors Covering, F(3, 6) = 122.19, p < .001, η² = 0.09, and Qmean, F(6, 70) = 3.73, p < .001, η² = 0.07, where Qmean represents the average of the three questions. However, there was no significant interaction between Covering and Qmean, F(6, 210) = 0.62, p = 1.00, η² = 0.03.
         Anger          Disgust        Fear           Happy          Neutral        Sad            Surprise
Anger    -              0.051 (-0.52)  0.015 (-0.65)  <.001 (-2.49)  <.001 (-2.02)  <.001 (-1.13)  <.001 (-2.24)
Disgust  0.051 (-0.52)  -              0.695 (-0.10)  <.001 (-1.66)  <.001 (-1.29)  0.047 (-0.53)  <.001 (-1.49)
Fear     0.015 (-0.65)  0.695 (-0.10)  -              <.001 (-1.66)  <.001 (-1.25)  0.096 (-0.44)  <.001 (-1.46)
Happy    <.001 (-2.49)  <.001 (-1.66)  <.001 (-1.66)  -              0.112 (0.42)   <.001 (1.26)   0.621 (0.13)
Neutral  <.001 (-2.02)  <.001 (-1.29)  <.001 (-1.25)  0.112 (0.42)   -              0.002 (0.83)   0.330 (-0.26)
Sad      <.001 (-1.13)  0.047 (-0.53)  0.096 (-0.44)  <.001 (1.26)   0.002 (0.83)   -              <.001 (-1.06)
Surprise <.001 (-2.24)  <.001 (-1.49)  <.001 (-1.46)  0.621 (0.13)   0.330 (-0.26)  <.001 (-1.06)  -
(a) partial mask by humans
          | Anger         | Disgust       | Fear          | Happy         | Neutral       | Sad           | Surprise
Anger     | -             | <.001 (7.29)  | <.001 (6.11)  | <.001 (-2.35) | 0.566 (0.15)  | <.001 (8.22)  | <.001 (-2.16)
Disgust   | <.001 (7.29)  | -             | 0.310 (-0.27) | <.001 (-7.49) | <.001 (-3.30) | 0.319 (0.26)  | <.001 (-5.56)
Fear      | <.001 (6.11)  | 0.310 (-0.27) | -             | <.001 (-6.85) | <.001 (-3.07) | 0.078 (0.47)  | <.001 (-5.28)
Happy     | <.001 (-2.35) | <.001 (-7.49) | <.001 (-6.85) | -             | <.001 (1.72)  | <.001 (7.91)  | 0.104 (-0.43)
Neutral   | 0.566 (0.15)  | <.001 (-3.30) | <.001 (-3.07) | <.001 (1.72)  | -             | <.001 (3.43)  | <.001 (-1.82)
Sad       | <.001 (8.22)  | 0.319 (0.26)  | 0.078 (0.47)  | <.001 (7.91)  | <.001 (3.43)  | -             | <.001 (-5.72)
Surprise  | <.001 (-2.16) | <.001 (-5.56) | <.001 (-5.28) | 0.104 (-0.43) | <.001 (-1.82) | <.001 (-5.72) | -
(b) partial mask by computer systems
          | Anger         | Disgust       | Fear          | Happy         | Neutral       | Sad           | Surprise
Anger     | -             | 0.274 (0.29)  | 0.002 (0.86)  | <.001 (-1.64) | <.001 (-0.95) | 0.024 (0.60)  | <.001 (-1.30)
Disgust   | 0.274 (0.29)  | -             | 0.031 (0.57)  | <.001 (-1.87) | <.001 (-1.21) | 0.292 (0.28)  | <.001 (-1.55)
Fear      | 0.002 (0.86)  | 0.031 (0.57)  | -             | <.001 (-2.34) | <.001 (-1.73) | 0.169 (-0.36) | <.001 (-2.05)
Happy     | <.001 (-1.64) | <.001 (-1.87) | <.001 (-2.34) | -             | 0.012 (0.68)  | <.001 (2.57)  | 0.158 (0.37)
Neutral   | <.001 (-0.95) | <.001 (-1.21) | <.001 (-1.73) | 0.012 (0.68)  | -             | <.001 (1.68)  | 0.207 (-0.33)
Sad       | 0.024 (0.60)  | 0.292 (0.28)  | 0.169 (-0.36) | <.001 (2.57)  | <.001 (1.68)  | -             | <.001 (-2.12)
Surprise  | <.001 (-1.30) | <.001 (-1.55) | <.001 (-2.05) | 0.158 (0.37)  | 0.207 (-0.33) | <.001 (-2.12) | -
(c) sunglasses by humans

          | Anger         | Disgust       | Fear          | Happy         | Neutral       | Sad           | Surprise
Anger     | -             | 0.570 (0.15)  | 0.244 (0.31)  | 0.576 (-0.15) | <.001 (1.27)  | <.001 (1.72)  | <.001 (1.49)
Disgust   | 0.570 (0.15)  | -             | 0.566 (0.15)  | 0.354 (-0.24) | <.001 (1.12)  | <.001 (1.22)  | <.001 (1.24)
Fear      | 0.244 (0.31)  | 0.566 (0.15)  | -             | 0.158 (-0.38) | <.001 (0.99)  | <.001 (0.94)  | <.001 (1.05)
Happy     | 0.576 (-0.15) | 0.354 (-0.24) | 0.158 (-0.38) | -             | <.001 (1.27)  | <.001 (1.47)  | <.001 (1.42)
Neutral   | <.001 (1.27)  | <.001 (1.12)  | <.001 (0.99)  | <.001 (1.27)  | -             | 0.084 (-0.46) | 0.550 (-0.16)
Sad       | <.001 (1.72)  | <.001 (1.22)  | <.001 (0.94)  | <.001 (1.47)  | 0.084 (-0.46) | -             | 0.156 (0.38)
Surprise  | <.001 (1.49)  | <.001 (1.24)  | <.001 (1.05)  | <.001 (1.42)  | 0.550 (-0.16) | 0.156 (0.38)  | -
(d) sunglasses by computer systems
Source                | Sum Square | df     | F      | p     | Eta Squared
C(Emotion)            | 5.28E+05   | 6.0    | 172.42 | <.001 | 0.27
C(Covering)           | 1.87E+05   | 3.0    | 122.19 | <.001 | 0.09
C(Qmean)              | 1.33E+05   | 70.0   | 3.73   | <.001 | 0.07
C(1/ID)               | 5.13E+04   | 81.0   | 1.24   | 0.076 | 0.03
C(Covering):C(Qmean)  | 6.62E+04   | 210.0  | 0.62   | 1.00  | 0.03
Residual              | 1.02E+06   | 1995.0 | -      | -     | -
(e) ANOVA for human sample including Qmean
Table 2: Statistical test results presenting a) post-hoc test (t-test) results showing the p-value and Cohen's d effect size obtained from comparing partial-mask emotion categories by humans and b) by computer systems; c) post-hoc test (t-test) results showing the p-value (and Cohen's d effect size) obtained from comparing sunglasses emotion categories by humans and d) by computer systems; e) regression results relating humans' exposure to masked people to the accuracy they achieved in categorizing emotions with different coverings. Note that the α level in all the post-hoc comparisons has been adjusted to 0.0014 using Bonferroni correction.
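The Eta Squared column of Table 2e can be recovered from the reported sums of squares alone, since η² is each effect's share of the total sum of squares. A minimal sketch, with the values hard-coded from the table:

```python
# Sums of squares transcribed from Table 2e (human sample, Qmean model).
ss = {
    "Emotion": 5.28e5,
    "Covering": 1.87e5,
    "Qmean": 1.33e5,
    "1/ID": 5.13e4,
    "Covering:Qmean": 6.62e4,
    "Residual": 1.02e6,
}
total = sum(ss.values())

# Eta squared: effect sum of squares divided by the total sum of squares.
eta_sq = {name: round(value / total, 2) for name, value in ss.items()}
print(eta_sq["Covering"], eta_sq["Qmean"])  # 0.09 0.07, as in Table 2e
```

Dividing each effect's sum of squares by the grand total reproduces the table's η² values (e.g. 0.27 for Emotion, 0.09 for Covering) to two decimal places.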
[Figure 3 panels: (a) neutral by humans, (b) fear by humans, (c) anger by humans, (d) neutral and anger expressions with and without label change]

Figure 3: Violin plots showing the accuracy obtained a) from categorizing the neutral expression by humans across four different coverings, b) from categorizing the fear expression by humans across four different coverings, c) from categorizing the anger expression by humans across four different coverings, and d) for the neutral and anger expressions before and after the label change.
3.2 Misclassification Analysis
As an additional analysis, we investigated the types of misclassifications made by humans and computer systems.
The results showed that humans mainly misclassified fully covered, and therefore unclear, expressions as neutral. The misclassifications varied for the computer systems, however: they mainly classified unclear emotions as anger when fully covering masks were used, while partially covered facial expressions were mainly classified as happiness or surprise (see Fig. 4).
As misclassifications can result in costly negative outcomes and complications for humans interacting with computer systems, we aimed to regulate the cost of misclassification in these computer systems. To make them resemble human classification behavior, we altered the training so that the models would also misclassify unclear expressions as neutral.
We hypothesize that humans mainly misclassify unclear emotions as neutral because it is the expression they are exposed to predominantly on a day-to-day basis. In other words, humans classify unclear expressions as neutral because it is the expression they see on people's faces most of the time. Therefore, by oversampling the neutral-expression images used to train the computer systems, we anticipated that the models would behave more similarly to humans, as they would see more neutral expressions compared to the other emotion categories.
3.2.1 Method
Since classifying any emotion category as a neutral expression is considered less costly than classifying it as another emotion category such as anger, happiness, or sadness, the misclassification cost of the models was evaluated without taking the samples that were misclassified as neutral into account. In particular, the quality of the classifier is not measured based on total misclassification, but rather using Equation 1 below:

cost = (Err − Err_neutral) / No. test samples    (1)
where
– Err represents the misclassified samples, i.e. the total misclassification
– Err_neutral represents the samples misclassified as neutral
– No. test samples represents the total number of test sample images
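Equation 1 can be sketched directly from these definitions. The function and label names below are illustrative (the paper does not prescribe an implementation):

```python
def misclassification_cost(y_true, y_pred, neutral="neutral"):
    """Equation (1): count misclassifications, subtract those that were
    'excused' by being predicted as neutral, and normalize by the number
    of test samples. Label strings are assumptions for illustration."""
    err = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    err_neutral = sum(1 for t, p in zip(y_true, y_pred)
                      if t != p and p == neutral)
    return (err - err_neutral) / len(y_true)

# Four test images, three errors, two of which were predicted as neutral:
print(misclassification_cost(
    ["anger", "happy", "sad", "fear"],
    ["neutral", "happy", "happy", "neutral"]))  # -> 0.25
```

Note that a classifier predicting neutral for every unclear face incurs zero cost under this measure, which is exactly the behavior the oversampling scheme tries to encourage.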
3.2.2 Results
In Table 3, the column Oversample ratio represents the oversampling ratio of the category neutral against all other emotion categories. For instance, the value "2:2" means no oversampling, whereas the ratio "4:2" means there were twice as many neutral-expression images in the training set as images from the other emotion categories. Different ratios were chosen, based on the number of available images in the CK+ dataset, so as to find the ratio that gives a much lower misclassification cost while maintaining a reasonable accuracy.
Oversample ratio | Misclassification cost                  | Accuracy
(Neutral:All)    | Mask | Partial mask | Sunglasses | Avg  | Mask | Partial mask | Sunglasses | Avg
2:2              | 0.61 | 0.47         | 0.18       | 0.41 | 0.24 | 0.33         | 0.80       | 0.46
3:2              | 0.40 | 0.54         | 0.23       | 0.39 | 0.20 | 0.14         | 0.68       | 0.34
4:2              | 0.26 | 0.34         | 0.17       | 0.26 | 0.17 | 0.23         | 0.71       | 0.37
5:2              | 0.25 | 0.39         | 0.10       | 0.25 | 0.25 | 0.17         | 0.66       | 0.36

Table 3: Table presenting the accuracies and the misclassification costs obtained from classifying covered images with and without oversampling neutral-expression images.
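The ratio scheme above can be sketched as simple duplication of neutral training images. This is a sketch under the assumption of equally sized non-neutral classes (as the 2:2 baseline implies); the helper and label names are illustrative, not the authors' code:

```python
import random

def oversample_neutral(images_by_label, ratio=(5, 2), seed=0):
    """Duplicate neutral-expression training images until there are
    ratio[0] neutral images for every ratio[1] images of each other
    class. Assumes the non-neutral classes are equally sized."""
    rng = random.Random(seed)
    out = {label: list(imgs) for label, imgs in images_by_label.items()}
    per_class = len(next(imgs for label, imgs in out.items()
                         if label != "neutral"))
    target = per_class * ratio[0] // ratio[1]
    # Randomly re-draw existing neutral images until the target is met.
    while len(out["neutral"]) < target:
        out["neutral"].append(rng.choice(images_by_label["neutral"]))
    return out

data = {label: [f"{label}_{i}" for i in range(10)]
        for label in ["anger", "happy", "neutral", "sad"]}
balanced = oversample_neutral(data, ratio=(4, 2))
print(len(balanced["neutral"]), len(balanced["anger"]))  # -> 20 10
```

With ratio (2, 2) the data are returned unchanged, matching the no-oversampling baseline row of Table 3.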
The average misclassification cost obtained in each of the oversampled cases was less than the average cost without oversampling. Thus, oversampling reduced the cost of misclassification, with the lowest misclassification cost for the model trained with data oversampled at a 5:2 ratio.
Fig. 4a, 4b, and 4c present the confusion matrices² obtained from classifying mask, partial-mask, and sunglasses images by humans; Fig. 4d, 4e, and 4f present the confusion matrices obtained from classifying mask, partial-mask, and sunglasses images when no oversampling was applied; and Fig. 4g, 4h, and 4i present the confusion matrices obtained from classifying mask, partial-mask, and sunglasses images with the best oversampling ratio (i.e. 5:2) applied to the training images. Humans mainly misclassified unclear expressions as neutral when the face was covered by a full mask. For the computer systems, while the misclassifications varied across coverings when no oversampling was applied, after oversampling the majority of the misclassified images were misclassified as neutral. Although the average accuracy achieved by the model without oversampling is larger (> 10%) than the average accuracy achieved by the model with the best oversampling ratio (5:2), the average misclassification cost achieved by the model with the best oversampling ratio was significantly lower (16% lower) than the average misclassification cost of the model with no oversampling, t(58) = 30.98, p < .001.
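The reported t(58) is consistent with an independent-samples t-test with pooled variance over two groups of 30 runs (30 + 30 − 2 = 58 degrees of freedom). A sketch of that test; the per-run costs below are illustrative placeholders, not the paper's data:

```python
import math
import statistics

def pooled_t(a, b):
    """Independent-samples t statistic with pooled variance and
    df = n1 + n2 - 2, the form implied by the reported t(58)."""
    n1, n2 = len(a), len(b)
    sp2 = (((n1 - 1) * statistics.variance(a)
            + (n2 - 1) * statistics.variance(b)) / (n1 + n2 - 2))
    t = ((statistics.mean(a) - statistics.mean(b))
         / math.sqrt(sp2 * (1 / n1 + 1 / n2)))
    return t, n1 + n2 - 2

# 30 hypothetical per-run costs near the no-oversampling average (0.41)
# versus 30 near the 5:2 average (0.25):
no_over = [0.41 + 0.001 * i for i in range(30)]
best = [0.25 + 0.001 * i for i in range(30)]
t, df = pooled_t(no_over, best)
print(df, t > 0)  # -> 58 True
```

With 30 runs per condition, the degrees of freedom come out to 58, matching the test statistic reported above.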
4 Discussion
This work aimed to compare the performance of humans and computer systems in categorizing emotion from faces of people with sunglasses and face masks, as well as to compare the decrease in accuracy caused by each covering. The work also introduced an approach that reduces the misclassification cost for the computer systems.
In addition to replicating previous findings that computer systems (with 98.48% accuracy) perform better than humans (with 82.72% accuracy) in categorizing emotions from uncovered/unmasked images of the same dataset (Barrett et al., 2019; Krumhuber et al., 2019; Krumhuber et al., 2020; Matsumoto & Hwang, 2011), we found that face masks (both partial and fully covering) and sunglasses reduced the accuracy of both humans and computer systems in emotion categorization. Consistent with Carbon (2020), we also found that humans mainly classify faces with fully covering masks as neutral.
While face masks impair emotion categorization in both humans and computer systems by a significant amount, emotion categories are affected differently depending on the type of covering used. For instance, the fully covering mask mainly obscured disgust, happiness, and sadness for computer systems, whereas anger, disgust, and sadness were more difficult to categorize for humans. The partial masks mainly impaired anger categorization

² A confusion matrix is a table used to visualize the performance of an algorithm: each row represents the number of instances in a predicted class and each column the number of instances in an actual class (or vice versa).
Figure 4: Confusion matrices obtained from classifying a) mask, b) partial-mask, and c) sunglasses images by humans; from classifying d) mask, e) partial-mask, and f) sunglasses images by computer systems without oversampling; and from classifying g) mask, h) partial-mask, and i) sunglasses images by computer systems with oversampling [5:2]. Note that the 82 human participants each classified 35 images per covering, i.e. a total of 410 images per emotion category and 2870 per covering. Similarly, the 30 computer systems each classified 35 images per covering, i.e. a total of 150 images per emotion category and 1050 per covering. An = anger, Di = disgust, Fe = fear, Ha = happy, Ne = neutral, Sa = sadness, and Su = surprise.
in humans, whereas for computer systems it was mainly disgust, fear, and sadness categorization that was impaired. And while the use of sunglasses mainly reduced fear categorization ability in humans, for computer systems it was neutral and surprise.
Previous work (Delicato & Mason, 2015; Shehu, Siddique, et al., 2021) has shown the relevance of the mouth for correctly categorizing happiness and surprise in both humans and computer systems. Therefore, since the partial masks allow the mouth to be visible through the transparent window, it was anticipated that happiness and surprise would be categorized more accurately when the partial mask is applied. Indeed, careful observation of the results showed a large increase in the classification accuracy of the happiness and surprise expressions in both computer systems and humans. While there was an increase of 19% (for happy) and 28% (for surprise) in the human sample, there was an increase of up to 66% for happy and up to 55% for surprise in the computer systems when the partial mask was used.
While we found a significant difference in the accuracy of humans categorizing each emotion category across the different coverings, no significant difference was found for the neutral expression across coverings. As humans classify unclear emotions as neutral, they were necessarily highly accurate in categorizing the neutral expression (see Fig. 3a).
In contrast to humans, the classification of unclear emotions varies significantly in computer systems. For instance, the computer systems mainly classified unclear emotions as anger when fully covering masks were used, whereas unclear emotions were mainly classified as happiness or surprise when there was a partial mask on the face. An initial observation of the confusion matrix obtained from categorizing masked emotion images (see Fig. 4d) suggested that the computer systems might classify unclear emotions as anger simply because the anger label appears first after one-hot encoding³ the labels fed to the classifier. Therefore, we changed the label name of the neutral expression to 'aneutral' so that the neutral expression appears first after the labels are one-hot encoded. The violin plots in Fig. 3d show the accuracy obtained for the neutral and anger expressions before and after the labels were changed. As can be seen from Fig. 3d, the accuracy obtained before and after the label change was nearly the same for both the anger and neutral expressions. In addition, we found no significant difference in the achieved accuracies before and after the label was changed. Thus, label order was not important.
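The label-order check can be illustrated in a few lines, under the assumption (common to encoders that sort class names, e.g. scikit-learn's LabelBinarizer or alphabetically ordered class folders) that classes are sorted before one-hot encoding:

```python
def one_hot_index(labels, target):
    """Column assigned to `target` when classes are sorted alphabetically
    before one-hot encoding (the ordering assumed in this check)."""
    return sorted(set(labels)).index(target)

emotions = ["anger", "disgust", "fear", "happy", "neutral", "sad", "surprise"]
print(one_hot_index(emotions, "anger"))      # -> 0: anger encodes first

# Renaming neutral to 'aneutral' pushes it ahead of anger alphabetically:
renamed = ["aneutral" if e == "neutral" else e for e in emotions]
print(one_hot_index(renamed, "aneutral"))    # -> 0
print(one_hot_index(renamed, "anger"))       # -> 1
```

The rename swaps which class occupies the first one-hot column; since accuracy was unchanged after the swap, the anger bias cannot be an artifact of label position.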
Although the accuracy achieved by the computer systems is relatively low when the fully covering masks were used, the above-mentioned observation suggests that the predictions of the computer systems were not fully random. One possible reason why the images with fully covering masks were mainly classified as anger could be that the features extracted⁴ by the computer systems when the fully covering masks were used have high similarities with those of anger expression images.

³ One-hot encoding is a process whereby categorical variables are converted into a form that can be provided to machine learning algorithms.
⁴ Feature extraction is the process by which an initial set of raw data (in this case image pixels) is reduced to manageable groups for processing.
The classification accuracy of humans changes depending on the type of covering; there is no general order of increase or decrease in the achieved accuracy. However, based on the results obtained in this study, the classification accuracy of the computer systems correlates with the amount of occlusion: the larger the occlusion, the larger the decrease, and vice versa. For instance, the highest accuracy was achieved with nothing added to the image. The achieved accuracy decreased when sunglasses (covering 10-15% of the face) were used. And although a further decrease was observed when partial masks (covering 30-35%) were used, the largest decrease occurred with the fully covering face masks (covering 60-65%). It is important to bear in mind that this observation is only valid for the types of coverings used in this research, as it is unknown whether the decrease is driven by the percentage of the face that is covered or by the important features that are covered. This is an important issue for future research.
We also found that people's exposure to mask wearers does not affect their performance in categorizing masked emotions. Although more exposure to people wearing masks was related to increased accuracy across all facial expressions, we found no evidence that exposure to masked individuals helps people to be specifically better at categorizing masked emotions (see Table 2e). Taken together, these results suggest that people perform worst at categorizing masked emotion images independent of their exposure to masked individuals. In other words, being around more masked people does not increase the ability to categorize covered emotional facial expressions.
The findings based on the misclassification analysis suggest that it might sometimes be important to feed imbalanced data to classifiers. In situations like this, where feeding balanced data leads to costly misclassifications, training the classifier on data oversampled towards a certain (less risky) expression can be preferable to feeding data with an equal distribution from each expression, as the latter results in a higher cost of misclassification.
Not only did oversampling targeted at a specified class reduce the misclassification cost of the covered images, it also increased the classification accuracy when classifying the fully covering mask images: the accuracy achieved by the oversampling method (ratio 5:2) was greater than the accuracy achieved when no oversampling was used. Nevertheless, oversampling the data can also come at the cost of decreased classification accuracy elsewhere; e.g. the accuracies obtained from classifying the partial-mask and sunglasses images were reduced after the neutral expression was oversampled.
This gives rise to the question of which performance measure to prefer: is the accuracy or the misclassification cost more relevant, taking into account the protective, mediating, and moderating effects of these measures (Brown & Singh, 2014; Rogers, 2000)? While there is no general answer to this question, both the accuracy and the misclassification cost need to be considered. For instance, misclassifying anger or sadness as happiness could have extreme consequences in future decision making for artificial intelligence systems such as robots. At the same time, an organization will be more likely to introduce robots that can accurately understand people's emotions into shared workspaces compared to less accurate ones.
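One illustrative way to weigh the two measures together, using the average figures from Table 3, is to pick the lowest-cost oversampling ratio among those that clear an accuracy floor. The selection rule and the floor value are our illustration, not something the paper prescribes:

```python
# Average misclassification cost and average accuracy per oversampling
# ratio, transcribed from Table 3.
results = {
    "2:2": (0.41, 0.46),
    "3:2": (0.39, 0.34),
    "4:2": (0.26, 0.37),
    "5:2": (0.25, 0.36),
}

def pick_ratio(results, min_accuracy):
    """Lowest-cost ratio among those meeting an accuracy floor."""
    eligible = {r: cost for r, (cost, acc) in results.items()
                if acc >= min_accuracy}
    return min(eligible, key=eligible.get)

print(pick_ratio(results, 0.35))  # -> 5:2
print(pick_ratio(results, 0.40))  # -> 2:2 (only the baseline qualifies)
```

Raising the accuracy floor shifts the choice back towards the no-oversampling baseline, which is exactly the accuracy-versus-cost trade-off discussed above.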
It is also worth stating that the finding that oversampling reduced the cost of misclassification cannot be generalized to all the different types of face coverings. For instance, the misclassification cost with no oversampling for faces with sunglasses was lower (18%) than when oversampling at a 3:2 ratio was used (23%).
Across all images, humans spent approximately 8.47 s to predict the category of an image and 19.77 min on the complete survey, while it took the computer systems less than a second (0.007 s) to classify each image and approximately 1 s to classify all 140 images. Nevertheless, the computer systems took approximately 1 h 20 min to train on a 24 GB Graphical Processing Unit (GPU) machine (Quadro RTX 6000, CUDA version 11.2). Thus, computer systems can predict the expression of the images faster than humans, provided they are already trained. Humans train throughout their lifetime, although they are already 'experts' at recognizing emotions at an early age (Decety, 2010; Pollak & Kistler, 2002).
Implications, limitations, and recommendations
Together, the results obtained from the study showed that while the computer systems outperformed humans with no covering on the face, the performance of the computer systems showed a larger decrease when the face was partially covered. Further, as the computer systems make different mistakes, we introduced a method to reduce their cost of misclassification in order to avoid extreme consequences in future decision making. These findings suggest that greater efforts are needed to develop artificial emotion classification systems that can adapt to more complex situations, e.g. by taking into account contextual features. Thus, care is needed prior to using results from computer systems for emotion recognition in psychological experiments.
The study also has some potential limitations. For instance, the overall findings for what is referred to as computer systems in this research may be somewhat limited to end-to-end deep learning models, as different machine learning techniques might achieve different results depending on the feature extraction technique as well as the algorithm used to categorize the emotion. In addition, the study was performed on a single stimulus set; therefore, the question of whether the findings would generalize to other stimulus sets remains open. Future studies should address this issue by including other stimulus sets as well as other artificial emotion classification methods.
Acknowledgements The authors would like to thank the participants for putting in the effort to complete the survey.
5 Declarations

Funding
This research was not supported by any organization.

Competing interests
The authors declare no competing interests.

Ethics Approval
Ethics approval has been obtained from the Human Ethics Committee (HEC) of Victoria University of Wellington (Application no: 28949).

Consent to participate
All participants gave informed consent for participation in the study.

Consent for publication
All authors have approved the manuscript and agree with submission of the work.

Data availability
Data used in this study can be downloaded from OSF at https://osf.io/mgx9p

Code availability
Code implemented in this study can be downloaded from OSF at https://osf.io/mgx9p

Authors Contribution
Harisu Abdullahi Shehu: Conceived and designed the experiments, performed the experiments, analyzed the data, wrote the paper; Will Browne: Contributed materials/analysis tools, supervision, writing – review & editing; Hedwig Eisenbarth: Contributed materials/analysis tools, supervision, writing – review & editing.
References

Barrett, L. F., Adolphs, R., Marsella, S., Martinez, A. M., & Pollak, S. D. (2019). Emotional expressions reconsidered: Challenges to inferring emotion from human facial movements. Psychological Science in the Public Interest, 20(1), 1–68.
Berryhill, J., Heang, K. K., Clogher, R., & McBride, K. (2019). Hello, world: Artificial intelligence and its use in the public sector.
Brown, J., & Singh, J. P. (2014). Forensic risk assessment: A beginner's guide. Archives of Forensic Psychology, 1(1), 49–59.
Calvo, M. G., & Lundqvist, D. (2008). Facial expressions of emotion (KDEF): Identification under different display-duration conditions. Behavior Research Methods, 40(1), 109–115.
Camras, L. A., & Allison, K. (1985). Children's understanding of emotional facial expressions and verbal labels. Journal of Nonverbal Behavior, 9(2), 84–94.
Carbon, C.-C. (2020). Wearing face masks strongly confuses counterparts in reading emotions. Frontiers in Psychology, 11, 2526.
Cheng, S., & Zhou, G. (2020). Facial expression recognition method based on improved VGG convolutional neural network. International Journal of Pattern Recognition and Artificial Intelligence, 34(07), 2056003.
Ciraco, M., Rogalewski, M., & Weiss, G. (2005). Improving classifier utility by altering the misclassification cost ratio. Proceedings of the 1st International Workshop on Utility-Based Data Mining, 46–52.
Coleman, T. J. (2020). Coronavirus: Call for clear face masks to be 'the norm'. BBC News. Accessed: 26 May, 2020. https://www.bbc.com/news/world-52764355
Decety, J. (2010). The neurodevelopment of empathy in humans. Developmental Neuroscience, 32(4), 257–267.
Delicato, L. S., & Mason, R. (2015). Happiness is in the mouth of the beholder and fear in the eyes. Journal of Vision, 15(1378).
Eisenbarth, H., Alpers, G. W., Segrè, D., Calogero, A., & Angrilli, A. (2008). Categorization and evaluation of emotional faces in psychopathic women. Psychiatry Research, 159(1-2), 189–195.
Ekman, P., & Friesen, W. V. (1971). Constants across cultures in the face and emotion. Journal of Personality and Social Psychology, 7(2), 124. https://doi.org/10.1109/AFGR.2000.840611
Grundmann, F., Epstude, K., & Scheibe, S. (2021). Face masks reduce emotion-recognition accuracy and perceived closeness. PLoS One, 16(4), e0249792.
Grzymala-Busse, J. W. (2005). Rule induction. Data Mining and Knowledge Discovery Handbook (pp. 277–294). Springer.
Guarnera, M., Magnano, P., Pellerone, M., Cascio, M. I., Squatrito, V., & Buccheri, S. L. (2018). Facial expressions and the ability to recognize emotions from the eyes or mouth: A comparison among old adults, young adults, and children. The Journal of Genetic Psychology, 179(5), 297–310.
Hugenberg, K. (2005). Social categorization and the perception of facial affect: Target race moderates the response latency advantage for happy faces. Emotion, 5(3), 267.
Japkowicz, N. et al. (2000). Learning from imbalanced data sets: A comparison of various strategies. AAAI Workshop on Learning from Imbalanced Data Sets, 68, 10–15.
Keras API. (2021). https://keras.io/
Krumhuber, E. G., Küster, D., Namba, S., Shah, D., & Calvo, M. G. (2019). Emotion recognition from posed and spontaneous dynamic expressions: Human observers versus machine analysis. Emotion.
Krumhuber, E. G., Küster, D., Namba, S., & Skora, L. (2020). Human and machine validation of 14 databases of dynamic facial expressions. Behavior Research Methods, 1–16.
Kukar, M., Kononenko, I., Grošelj, C., Kralj, K., & Fettich, J. (1999). Analysing and improving the diagnosis of ischaemic heart disease with machine learning. Artificial Intelligence in Medicine, 16(1), 25–50.
Lewinski, P., den Uyl, T. M., & Butler, C. (2014). Automated facial coding: Validation of basic emotions and FACS AUs in FaceReader. Journal of Neuroscience, Psychology, and Economics, 7(4), 227.
Li, M., Xu, H., Huang, X., Song, Z., Liu, X., & Li, X. (2018). Facial expression recognition with identity and emotion joint learning. IEEE Transactions on Affective Computing, 1–1. https://doi.org/10.1109/TAFFC.2018.2880201
Marini, M., Ansani, A., Paglieri, F., Caruana, F., & Viola, M. (2021). The impact of facemasks on emotion recognition, trust attribution and re-identification. Scientific Reports, 11(1), 1–14.
Matsumoto, D., & Hwang, H. S. (2011). Evidence for training the ability to read microexpressions of emotion. Motivation and Emotion, 35(2), 181–191.
Miller, E. J., Krumhuber, E. G., & Dawel, A. (2020). Observers perceive the Duchenne marker as signaling only intensity for sad expressions, not genuine emotion. Emotion.
Mohan, K., Seal, A., Krejcar, O., & Yazidi, A. (2021). FER-net: Facial expression recognition using deep neural net. Neural Computing and Applications, 1–12.
Mollahosseini, A., Chan, D., & Mahoor, M. H. (2016). Going deeper in facial expression recognition using deep neural networks. 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), 1–10. https://doi.org/10.1109/WACV.2016.7477450
Noyes, E., Davis, J. P., Petrov, N., Gray, K. L., & Ritchie, K. L. (2021). The effect of face masks and sunglasses on identity and expression recognition with super-recognizers and typical observers. Royal Society Open Science, 8(3), 201169.
Pazzani, M., Merz, C., Murphy, P., Ali, K., Hume, T., & Brunk, C. (1994). Reducing misclassification costs. Machine Learning Proceedings 1994 (pp. 217–225). Elsevier.
Picard, R. W., Vyzas, E., & Healey, J. (2001). Toward machine emotional intelligence: Analysis of affective physiological state. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(10), 1175–1191.
Pollak, S. D., & Kistler, D. J. (2002). Early experience is associated with the development of categorical representations for facial expressions of emotion. Proceedings of the National Academy of Sciences, 99(13), 9072–9076.
Roberson, D., Kikutani, M., Döge, P., Whitaker, L., & Majid, A. (2012). Shades of emotion: What the addition of sunglasses or masks to faces reveals about the development of facial expression processing. Cognition, 125(2), 195–206.
Rogers, R. (2000). The uncritical acceptance of risk assessment in forensic practice. Law and Human Behavior, 24(5), 595–605.
Shehu, H. A., Browne, W. N., & Eisenbarth, H. (2020a). An adversarial attacks resistance-based approach to emotion recognition from images using facial landmarks. 2020 29th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN), 1307–1314.
Shehu, H. A., Browne, W. N., & Eisenbarth, H. (2020b). Emotion categorization from video-frame images using a novel sequential voting technique. International Symposium on Visual Computing, 618–632.
Shehu, H. A., Browne, W. N., & Eisenbarth, H. (2021). Emotion categorization from faces of people with sunglasses and facemasks. Preprint (version 1) available at Research Square. https://doi.org/10.21203/rs.3.rs-375319/v1
Shehu, H. A., Siddique, A., Browne, W. N., & Eisenbarth, H. (2021). Lateralized approach for robustness against attacks in emotion categorization from images. 24th International Conference on Applications of Evolutionary Computation (EvoApplications 2021).
Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
Van den Stock, J., Righart, R., & De Gelder, B. (2007). Body expressions influence recognition of emotions in the face and voice. Emotion, 7(3), 487.
Wacker, J., Heldmann, M., & Stemmler, G. (2003). Separating emotion and motivational direction in fear and anger: Effects on frontal asymmetry. Emotion, 3(2), 167.
Wegrzyn, M., Vogt, M., Kireclioglu, B., Schneider, J., & Kissler, J. (2017). Mapping the emotional face. How individual face parts contribute to successful emotion recognition. PLoS One, 12(5), e0177239.
Yuki, M., Maddux, W. W., & Masuda, T. (2007). Are the windows to the soul the same in the East and West? Cultural differences in using the eyes and mouth as cues to recognize emotions in Japan and the United States. Journal of Experimental Social Psychology, 43(2), 303–311.
Zhang, T., Zheng, W., Cui, Z., Zong, Y., & Li, Y. (2019). Spatial–temporal recurrent neural network for emotion recognition. IEEE Transactions on Cybernetics, 49(3), 839–847. https://doi.org/10.1109/TCYB.2017.2788081
Supplementary Materials

Confidence ratings
With regard to confidence, just as for the emotion categories, a 3-way repeated-measures ANOVA (2×4×7) was performed with the between-subject factor Group (i.e. humans vs computer systems) and the within-subject factors Covering (i.e. unmasked, mask, partial mask, and sunglasses) and Emotion, on the confidence obtained from categorizing each emotion (i.e. anger, disgust, fear, happy, neutral, sad, and surprise). We found significant main effects for Group, F(1, 3080) = 139.85, p < .001, η² = 0.03, Covering, F(3, 3080) = 39.49, p < .001, η² = 0.07, and Emotion, F(6, 3080) = 58.96, p < .001, η² = 0.02. All interactions between the factors were significant (see Fig. 4a), as was the three-way interaction of Group × Covering × Emotion, F(18, 3080) = 28.86, p < .001, η² = 0.01.
Fig. 5a presents the classification confidence obtained in categorizing unmasked emotion images by humans and computer systems. Whereas humans classified the unmasked images with a maximum confidence of only 89% (for the happy expression), the computer systems classified the unmasked images with very high confidence, above 97% for all emotion categories except neutral, which was classified with a confidence of 88.98%. This is understandable, as with no mask on the face the computer systems classified images from all expressions with 100% accuracy, except for the neutral expression, which was classified with an accuracy of 89.33%. Post-hoc tests showed a significant difference in classification confidence across emotion categories for both humans, F(6, 567) = 36.28, p < .001, η² = 0.28, and computer systems, F(6, 203) = 110.71, p < .001, η² = 0.77 (α = 0.0083), when there was no mask on the face. Post-hoc t-tests on the classification confidence of each category for the unmasked emotional images showed a significant difference (p < .001) for most emotion categories, for both humans and computer systems (see Tables 4b and 4c).
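The post-hoc procedure used throughout this section, i.e. pairwise two-sample t-tests between emotion categories with Cohen's d effect sizes and a Bonferroni-adjusted α, can be sketched as follows. The function names and simulated confidence data are hypothetical assumptions for illustration, not the authors' code.

```python
from itertools import combinations
import numpy as np
from scipy import stats

def cohens_d(a, b):
    """Cohen's d using the pooled standard deviation of the two samples."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * np.var(a, ddof=1)
                  + (nb - 1) * np.var(b, ddof=1)) / (na + nb - 2)
    return (np.mean(a) - np.mean(b)) / np.sqrt(pooled_var)

def pairwise_posthoc(confidence_by_emotion, alpha=0.05):
    """Bonferroni-corrected pairwise t-tests over all emotion pairs.

    Returns {(emotion_i, emotion_j): (p, d, significant)}.
    """
    pairs = list(combinations(confidence_by_emotion, 2))
    alpha_adj = alpha / len(pairs)  # Bonferroni-adjusted threshold
    out = {}
    for e1, e2 in pairs:
        a, b = confidence_by_emotion[e1], confidence_by_emotion[e2]
        t, p = stats.ttest_ind(a, b)
        out[(e1, e2)] = (p, cohens_d(a, b), p < alpha_adj)
    return out

# Simulated per-emotion confidence ratings (means and n are made up).
rng = np.random.default_rng(1)
data = {e: rng.normal(mu, 8, 30)
        for e, mu in [("anger", 60), ("happy", 85), ("neutral", 70)]}
results = pairwise_posthoc(data)
```

With all seven emotion categories there are 21 pairs, so the adjusted threshold becomes 0.05/21; the tables report both the p-value and Cohen's d for each pair in this format.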
Fig. 5b presents the confidence obtained from categorizing masked emotion images by humans and computer systems. Here, the computer systems achieved higher classification confidence than humans only for the anger expression, with a confidence of 81.98% compared to 59.06% for humans. Surprisingly, while humans classified the happiness images with very high confidence (> 89%-point) compared to the other emotion categories (all < 84%-point), the classification confidence of the computer systems was lowest for the happiness expression (< 0.001%-point). This shows that the computer systems have nearly zero confidence in the classification of happiness when the mouth is covered with a fully covering face mask. Also, post-hoc tests showed a significant difference in classification confidence across emotion categories for both humans, F(6, 567) = 49.70, p < .001, η² = 0.35, and computer systems, F(6, 203) = 205.75, p < .001, η² = 0.86 (α = 0.0083), when there were masks on the face. Post-hoc two-sample t-tests on the classification confidence of each category for the masked emotional images, for both humans and computer systems, showed a significant difference (p < .001) for most emotion categories (see Tables 4d and 4e).
Figure 5: Violin plots showing confidence obtained from categorizing a) unmasked, b) masked, c) partial mask, and d) sunglasses emotion images by humans and computer systems.
Fig. 5c presents the confidence obtained from categorizing images by humans and computer systems when the partial mask was used as the type of covering. As can be seen from the figure, the confidence of the computer systems in categorizing the happy and surprise expressions increased when the partial mask was used. However, the classification confidence of humans in categorizing the happiness and surprise expressions decreased with partial masks compared to fully covering masks on the face. Nevertheless, post-hoc tests showed a significant difference in classification confidence across emotion categories for both humans, F(6, 567) = 68.13, p < .001, η² = 0.42, and computer systems, F(6, 203) = 300.06, p < .001, η² = 0.90 (α = 0.0083), when partial masks were used. Post-hoc two-sample t-tests on the classification confidence of each category for the partial-mask emotional images (by both humans and computer systems) also showed a significant difference for most emotion categories (see Tables 5a and 5b).
Fig. 5d presents the confidence obtained by humans and computer systems from categorizing sunglasses images. Here, the confidence achieved by both humans and computer systems decreased in all classes (except for surprise in humans) when there were sunglasses on the face, compared with when no covering was added to the face. The computer systems experienced their largest decreases for the neutral, sad, and surprise expressions (all > 28% decrease), whereas for humans the largest decreases were for the disgust, fear, and sad expressions (all > 17% decrease). Post-hoc tests showed a significant difference in classification confidence across emotion categories for both humans, F(6, 567) = 82.47, p < .001, η² = 0.47, and computer systems, F(6, 203) = 12.58, p < .001, η² = 0.27 (α = 0.0083), when sunglasses were used as the type of covering. Post-hoc two-sample t-tests on the classification confidence of each category for the sunglasses emotional images, by both humans and computer systems, showed a significant difference for most emotion categories (see Tables 5c and 5d).
                                 Sum Square   df      F       p      Eta Squared
Intercept                        9.08E+04     1.0     243.79  <.001  0.05
C(Group)                         5.21E+04     1.0     139.85  <.001  0.03
C(Covering)                      4.41E+04     3.0     39.49   <.001  0.07
C(Emotion)                       1.32E+05     6.0     58.96   <.001  0.02
C(Group):C(Covering)             2.02E+04     3.0     18.04   <.001  0.09
C(Group):C(Emotion)              1.72E+05     6.0     76.87   <.001  0.01
C(Covering):C(Emotion)           1.28E+05     18.0    19.03   <.001  0.06
C(Group):C(Covering):C(Emotion)  1.94E+05     18.0    28.86   <.001  0.1
Residual                         1.15E+06     3080.0  -       -      -
(a) ANOVA comparing humans and computer systems
         Anger         Disgust       Fear          Happy         Neutral       Sad           Surprise
Anger    -             0.159 (-0.37) 0.257 (0.30)  <.001 (-1.67) <.001 (-1.03) <.001 (-1.03) <.001 (-1.21)
Disgust  0.159 (-0.37) -             0.019 (0.63)  <.001 (-1.10) 0.038 (-0.55) 0.027 (-0.59) 0.008 (-0.72)
Fear     0.257 (0.30)  0.019 (0.63)  -             <.001 (-1.83) <.001 (-1.25) <.001 (-1.24) <.001 (-1.41)
Happy    <.001 (-1.67) <.001 (-1.10) <.001 (-1.83) -             0.017 (0.64)  0.086 (0.45)  0.129 (0.40)
Neutral  <.001 (-1.03) 0.038 (-0.55) <.001 (-1.25) 0.017 (0.64)  -             0.727 (-0.09) 0.429 (-0.21)
Sad      <.001 (-1.03) 0.027 (-0.59) <.001 (-1.24) 0.086 (0.45)  0.727 (-0.09) -             0.718 (-0.09)
Surprise <.001 (-1.21) 0.008 (-0.72) <.001 (-1.41) 0.129 (0.40)  0.429 (-0.21) 0.718 (-0.09) -
(b) unmasked by humans
         Anger         Disgust       Fear          Happy         Neutral       Sad           Surprise
Anger    -             0.887 (0.04)  0.036 (-0.55) <.001 (-1.69) <.001 (2.93)  <.001 (1.64)  <.001 (-1.16)
Disgust  0.887 (0.04)  -             0.122 (-0.40) 0.003 (-0.82) <.001 (2.93)  <.001 (1.61)  0.025 (-0.61)
Fear     0.036 (-0.55) 0.122 (-0.40) -             0.062 (-0.51) <.001 (2.96)  <.001 (1.72)  0.496 (-0.20)
Happy    <.001 (-1.69) 0.003 (-0.82) 0.062 (-0.51) -             <.001 (2.99)  <.001 (1.80)  <.001 (1.00)
Neutral  <.001 (2.93)  <.001 (2.93)  <.001 (2.96)  <.001 (2.99)  -             <.001 (-2.28) <.001 (-2.97)
Sad      <.001 (1.64)  <.001 (1.61)  <.001 (1.72)  <.001 (1.80)  <.001 (-2.28) -             <.001 (-1.76)
Surprise <.001 (-1.16) 0.025 (-0.61) 0.496 (-0.20) <.001 (1.00)  <.001 (-2.97) <.001 (-1.76) -
(c) unmasked by computer systems
         Anger         Disgust       Fear          Happy         Neutral       Sad           Surprise
Anger    -             0.049 (0.52)  0.157 (-0.37) <.001 (-1.32) <.001 (-1.50) 0.365 (0.24)  0.002 (-0.86)
Disgust  0.049 (0.52)  -             0.002 (-0.86) <.001 (-1.98) <.001 (-2.10) 0.334 (-0.25) <.001 (-1.54)
Fear     0.157 (-0.37) 0.002 (-0.86) -             0.003 (-0.81) <.001 (-1.03) 0.029 (0.58)  0.171 (-0.36)
Happy    <.001 (-1.32) <.001 (-1.98) 0.003 (-0.81) -             0.245 (-0.31) <.001 (1.54)  0.029 (0.58)
Neutral  <.001 (-1.50) <.001 (-2.10) <.001 (-1.03) 0.245 (-0.31) -             <.001 (1.70)  0.002 (0.84)
Sad      0.365 (0.24)  0.334 (-0.25) 0.029 (0.58)  <.001 (1.54)  <.001 (1.70)  -             <.001 (-1.10)
Surprise 0.002 (-0.86) <.001 (-1.54) 0.171 (-0.36) 0.029 (0.58)  0.002 (0.84)  <.001 (-1.10) -
(d) masked by humans
         Anger         Disgust       Fear          Happy         Neutral       Sad           Surprise
Anger    -             <.001 (5.84)  <.001 (2.88)  <.001 (6.79)  <.001 (4.78)  <.001 (5.53)  <.001 (4.76)
Disgust  <.001 (5.84)  -             <.001 (-2.70) <.001 (1.18)  0.005 (-0.76) 0.082 (-0.46) <.001 (-1.81)
Fear     <.001 (2.88)  <.001 (-2.70) -             <.001 (3.57)  <.001 (1.81)  <.001 (2.37)  <.001 (1.51)
Happy    <.001 (6.79)  <.001 (1.18)  <.001 (3.57)  -             <.001 (-1.63) <.001 (-1.74) <.001 (-3.52)
Neutral  <.001 (4.78)  0.005 (-0.76) <.001 (1.82)  <.001 (-1.63) -             0.137 (0.40)  0.020 (-0.63)
Sad      <.001 (5.53)  0.082 (-0.46) <.001 (2.37)  <.001 (-1.74) 0.137 (0.40)  -             <.001 (-1.31)
Surprise <.001 (4.76)  <.001 (-1.81) <.001 (1.51)  <.001 (-3.52) 0.020 (-0.63) <.001 (-1.31) -
(e) masked by computer systems
Table 4: Statistical test results: a) ANOVA comparing humans and computer systems on their confidence in categorizing people's emotions with different types of covering. b) Post-hoc t-test results showing the p-value (and Cohen's d effect size) for comparisons of classification confidence on unmasked images by humans and c) computer systems. d) Post-hoc t-test results showing the p-value (and Cohen's d effect size) for comparisons of classification confidence on masked images by humans and e) computer systems. Note that the α level in all post-hoc comparisons has been adjusted to 0.0014 using Bonferroni correction.
         Anger         Disgust       Fear          Happy         Neutral       Sad           Surprise
Anger    -             0.036 (-0.56) 0.014 (-0.66) <.001 (-2.70) <.001 (-1.98) <.001 (-1.21) <.001 (-2.34)
Disgust  0.036 (-0.56) -             0.838 (-0.05) <.001 (-1.76) <.001 (-1.21) 0.039 (-0.55) <.001 (-1.50)
Fear     0.014 (-0.66) 0.838 (-0.05) -             <.001 (-1.88) <.001 (-1.25) 0.044 (-0.53) <.001 (-1.58)
Happy    <.001 (-2.70) <.001 (-1.76) <.001 (-1.88) -             0.036 (0.56)  <.001 (1.27)  0.370 (0.24)
Neutral  <.001 (-1.98) <.001 (-1.21) <.001 (-1.25) 0.036 (0.56)  -             0.010 (0.69)  0.237 (-0.31)
Sad      <.001 (-1.21) 0.039 (-0.55) 0.044 (-0.53) <.001 (1.27)  0.010 (0.69)  -             <.001 (-1.00)
Surprise <.001 (-2.34) <.001 (-1.50) <.001 (-1.58) 0.370 (0.24)  0.237 (-0.31) <.001 (-1.00) -
(a) partial mask by humans
         Anger         Disgust       Fear          Happy         Neutral       Sad           Surprise
Anger    -             <.001 (8.02)  <.001 (8.07)  <.001 (-2.49) 0.613 (0.13)  <.001 (9.91)  <.001 (-2.93)
Disgust  <.001 (8.02)  -             0.287 (0.28)  <.001 (-6.62) <.001 (-3.49) 0.007 (0.73)  <.001 (-6.75)
Fear     <.001 (8.07)  0.287 (0.28)  -             <.001 (-6.69) <.001 (-3.57) 0.265 (0.30)  <.001 (-6.82)
Happy    <.001 (-2.49) <.001 (-6.62) <.001 (-6.69) -             <.001 (1.92)  <.001 (7.06)  <.001 (-0.49)
Neutral  0.613 (0.13)  <.001 (-3.49) <.001 (-3.57) <.001 (1.92)  -             <.001 (3.77)  <.001 (-2.31)
Sad      <.001 (9.91)  0.007 (0.73)  0.265 (0.30)  <.001 (7.06)  <.001 (3.77)  -             <.001 (-7.15)
Surprise <.001 (-2.93) <.001 (-6.75) <.001 (-6.82) <.001 (-0.49) <.001 (-2.31) <.001 (-7.15) -
(b) partial mask by computer systems
         Anger         Disgust       Fear          Happy         Neutral       Sad           Surprise
Anger    -             0.394 (0.22)  <.001 (0.88)  <.001 (-1.85) <.001 (-0.97) 0.144 (0.38)  <.001 (-1.51)
Disgust  0.394 (0.22)  -             0.021 (0.62)  <.001 (-1.96) <.001 (-1.14) 0.619 (0.13)  <.001 (-1.65)
Fear     <.001 (0.88)  0.021 (0.62)  -             <.001 (-2.76) <.001 (-1.83) 0.041 (-0.54) <.001 (-2.41)
Happy    <.001 (-1.85) <.001 (-1.96) <.001 (-2.76) -             0.005 (0.77)  <.001 (2.41)  0.229 (0.32)
Neutral  <.001 (-0.97) <.001 (-1.14) <.001 (-1.83) 0.005 (0.77)  -             <.001 (1.41)  0.079 (-0.47)
Sad      0.144 (0.38)  0.619 (0.13)  0.041 (-0.54) <.001 (2.41)  <.001 (1.41)  -             <.001 (-2.02)
Surprise <.001 (-1.51) <.001 (-1.65) <.001 (-2.41) 0.229 (0.32)  0.079 (-0.47) <.001 (-2.02) -
(c) sunglasses by humans

         Anger         Disgust       Fear          Happy         Neutral       Sad           Surprise
Anger    -             0.548 (0.16)  0.017 (0.64)  0.771 (0.08)  <.001 (1.42)  <.001 (1.53)  <.001 (1.43)
Disgust  0.548 (0.16)  -             0.202 (0.34)  0.810 (-0.06) <.001 (1.18)  <.001 (1.04)  <.001 (1.09)
Fear     0.017 (0.64)  0.202 (0.34)  -             0.127 (-0.41) <.001 (1.00)  0.004 (0.78)  0.002 (0.87)
Happy    0.771 (0.08)  0.810 (-0.06) 0.127 (-0.41) -             <.001 (1.23)  <.001 (1.10)  <.001 (1.15)
Neutral  <.001 (1.42)  <.001 (1.18)  <.001 (1.00)  <.001 (1.23)  -             0.080 (-0.47) 0.333 (-0.26)
Sad      <.001 (1.53)  <.001 (1.04)  0.004 (0.78)  <.001 (1.10)  0.080 (-0.47) -             0.394 (0.23)
Surprise <.001 (1.43)  <.001 (1.09)  0.002 (0.87)  <.001 (1.15)  0.333 (-0.26) 0.394 (0.23)  -
(d) sunglasses by computer systems
Table 5: Statistical test results for post-hoc tests: a) p-value (and Cohen's d effect size) comparing classification confidence on partial-mask images by humans and b) computer systems. c) p-value (and Cohen's d effect size) comparing classification confidence on sunglasses images by humans and d) computer systems. Note that the α level has been adjusted to 0.0014 using Bonferroni correction.