
A Comparison of Humans and Machine Learning Classifiers Detecting Emotion from Faces of People with Different Coverings

Harisu Abdullahi Shehu · Will N. Browne · Hedwig Eisenbarth

Received: 9 August 2021 / Accepted: date

Abstract Partial coverings such as sunglasses and face masks unintentionally obscure facial expressions, causing a loss of accuracy when humans and computer systems attempt to categorise emotion. Experimental Psychology would benefit from understanding how computational systems differ from humans when categorising emotion as automated recognition systems become available. We investigated the impact of sunglasses and different face masks on the ability to categorize emotional facial expressions in humans and computer systems to directly compare accuracy in a naturalistic context. Computer systems, represented by the VGG19 deep learning algorithm, and humans assessed images of people with varying emotional facial expressions and with four different types of coverings, i.e. unmasked, with a mask covering the lower face, with a partial mask with a transparent mouth window, and with sunglasses. Furthermore, we introduced an approach to minimize misclassification costs for the computer systems while maintaining a reasonable accuracy.

Computer systems performed significantly better than humans when no covering was present (>15% difference). However, the achieved accuracy differed significantly depending on the type of covering and, importantly, the emotion category. It was also noted that while humans mainly classified unclear expressions as neutral when the fully covering mask was used, the misclassification varied in the computer systems, which would affect their performance in automated recognition systems. Importantly, the variation in the misclassification can be adjusted by variations in the balance of categories in the training data.

Keywords COVID-19 · Emotion classification · · Facial features · Face masks

Harisu Abdullahi Shehu (Corresponding author)
School of Engineering and Computer Science, Victoria University of Wellington, 6012 Wellington
E-mail: [email protected]
ORCID: 0000-0002-9689-3290

Will N. Browne
School of Electrical Engineering and Robotics, Queensland University of Technology, Brisbane QLD 4000, Australia
Tel: +61 45 2468 148
E-mail: [email protected]
ORCID: 0000-0001-8979-2224

Hedwig Eisenbarth
School of Psychology, Victoria University of Wellington, 6012 Wellington
Tel: +64 4 463 9541
E-mail: [email protected]
ORCID: 0000-0002-0521-2630

1 Introduction

Facial expressions convey information regarding emotion in non-verbal communication. Thus, an accurate categorization of a person's emotional facial expression allows understanding of their needs, intentions, as well as potential actions. This information is valuable for psychological experiments, but time-consuming to collect manually. Thus, computational methods are needed to analyze facial expressions, but how do they differ from human classifiers, especially in different circumstances?

Face coverings such as sunglasses and face masks are used to cover the face for different reasons. For instance, people use sunglasses to either protect the eyes from sunlight or to improve appearance, and people use face masks often to prevent the spread of infectious diseases such as the coronavirus (COVID-19) or for hygienic reasons in hospitals or food preparation. However, these coverings not only cover parts of a face but might also affect social interaction as they obscure parts of facial expressions, which might contain socially relevant information even if the expression does not represent the emotional state of a person. While sunglasses cover an estimated 10-15% of the face, approximately 60-65% of the face is covered with face masks – exact numbers are hard to tell as this might vary across different people, depending on the size of the face (Carbon, 2020). More importantly, these coverings obscure different parts of the face conveying information about emotional expressions to a varying extent (Roberson et al., 2012).

Different studies have shown that humans are far from perfect in assessing the emotional states of people by just looking at their face (Picard et al., 2001). Further, humans are still far from perfect even when it comes to categorizing emotional labels from facial expressions (Lewinski et al., 2014). A further occlusion of the face leads to an even higher decrease in the performance of humans in categorizing emotion from the faces of people. For instance, it has been observed that a partial occlusion of the eyes (Noyes et al., 2021) impaired the performance achieved by humans, especially for categorizing sad expressions (Miller et al., 2020; Wegrzyn et al., 2017; Yuki et al., 2007). On the other hand, while negative emotions are the most difficult for humans to categorize (Camras & Allison, 1985; Guarnera et al., 2018) compared to positive emotions like happiness (Calvo & Lundqvist, 2008; Hugenberg, 2005; Van den Stock et al., 2007; Wacker et al., 2003), obscuring the mouth leads to less accurate categorization even for happiness (Wegrzyn et al., 2017).

More specifically, Carbon (2020) investigated the impact of face masks on humans in evaluating six different categories of emotion (anger, disgust, fear, happy, neutral, and sad) from images of people. They found a significant decrease in the performance of their human participants across all expressions except for fear and neutral expressions. In addition, emotional expressions such as happy, sad, and anger were repeatedly confused with neutral expressions, whereas other emotions like disgust were confused with anger expressions. In accordance with Carbon, several studies confirmed the detrimental effect of face masks on emotion categorization by humans (Grundmann et al., 2021; Marini et al., 2021; Noyes et al., 2021). However, none of these studies directly compared the impact of face masks on humans with machine learning classifiers (computer systems).

In contrast to humans, computer systems have done well in categorizing emotion from in-distribution face images (images from the same dataset) of people, achieving an accuracy of over 95%. Nevertheless, the performance of these systems is greatly affected when it comes to categorizing emotion from natural imagery, where the sun is free to shine and people are free to move, dramatically changing the appearance of images received by these computer systems (Picard et al., 2001). For instance, a change in the colour information of the pixels or the foreground colour of the whole or a small region of the face, which are nonlinearly confounded with emotional expressions, can lead to a decrease in the performance of these models by up to 25% (Shehu et al., 2020a; Shehu, Siddique, et al., 2021).

Furthermore, face coverings such as sunglasses and face masks affect the performance of computer systems by up to 74% (Shehu, Browne, et al., 2021). However, it is unknown to what extent the same type of sunglasses and face masks would affect the performance of humans in direct comparison to computer systems in emotion categorization of the same naturalistic stimuli. Moreover, the ecological validity of these computer systems remains questionable as they are mainly trained and evaluated with in-distribution data.

We aim to analyze the limitations of humans and computer systems in categorizing emotion of covered and uncovered faces, as well as the misclassification cost of the computer systems:

Initially, we will compare the impact of sunglasses and different types of face masks (i.e. fully covering masks and the recently introduced face masks with a transparent window (Coleman, 2020)) on humans' and computer systems' ability to categorize emotional facial expressions. This direct comparison using the same stimuli across different emotion categories and different naturalistic face coverings addresses open questions about the differential relevance of facial features for both types of observers.

With respect to the computer systems, classifier induction often assumes that the best distribution of classes within the data is an equal distribution, i.e. where there is no class imbalance (Grzymala-Busse, 2005; Japkowicz et al., 2000). However, research has shown that this assumption is often not true and might lead to a classifier that is very costly when it misclassifies an instance (Ciraco et al., 2005; Kukar et al., 1999; Pazzani et al., 1994). This might also be true for emotion classification, e.g. when the fully covering mask is used to cover the face, emotion categories like happiness were classified badly by the computer systems, where the majority of the happy expression images were misclassified as anger expressions (Shehu, Browne, et al., 2021).

Thus, a secondary objective is to investigate how data distribution affects misclassification between humans and computer systems. This has practical importance, as when positive expressions such as happiness are misclassified as negative emotional expressions like anger or sadness, artificial intelligence systems like robots might refrain from interacting with a person, taking actions on the misbelief that they are angry or sad. Conversely, the computer systems might continuously try to interact with a person if an expression such as angry or sad were misclassified as happiness, which could lead to a potential threat to the safety of humans interacting with these computer systems.

This is important because it is not only the error rate that matters; in most cases, what matters most is "where" the misclassification occurs, i.e. the negative effect that may be caused by computer systems misclassifying human facial expressions. For these reasons, the cost of misclassification needs to be assessed for computer systems trained with data that is oversampled based on an expression that is considered to pose less threat. This is anticipated to change the behaviour of the computer systems such that they misclassify unclear expressions as expressions that will cause less threat to the safety of humans interacting with them.

Therefore, as an additional analysis, we will introduce an approach to minimize the misclassification costs for the computer systems while maintaining an equivalent or relatively similar accuracy. Considering that these coverings are more ecologically valid, as they are widely used nowadays, understanding these limitations and regulating the cost of misclassifying faces with these coverings can inform the production of artificial emotionally intelligent systems (Berryhill et al., 2019) when it comes to real-world situations.

2 Method

This study involved collecting and analyzing data from humans and computer systems to compare their accuracy and confidence in categorizing emotion from covered and uncovered faces.

2.1 Human sample

Ethical approval for this study was obtained from the Human Ethics Committee (HEC) of Victoria University of Wellington and the study was preregistered on OSF (https://osf.io/mgx9p). Participants gave informed consent for participation in the study and could enter a raffle/draw to win a $50 voucher as compensation.

All participants were recruited online via SurveyCircle and social media sites such as Facebook, Twitter, Instagram, etc. through snowball sampling. A power calculation based on the effect size obtained from categorizing the "afraid" class in Eisenbarth et al.'s (2008) study was performed to estimate the required sample size, which we found to be 52. However, the study aimed to recruit at least 100 participants to prepare for potential missing data.

Of 154 people who started participating in the survey, 82 participants completed the full study. Of those 82 participants, 43 were male, 37 were female, and 2 either did not specify their gender or identified themselves as being from other groups. The age of the participants ranged from 19 to 70, with a mean of 27.88 (SD = 7.53). 28 of the participants self-identified as white, 11 as Black or African American, 24 as Asian, 4 as Arab, 3 as multicultural, and the remaining 12 either did not answer the question or identified themselves as being from other groups.

Data were collected via SoSciSurvey (https://www.soscisurvey.de/) between the 21st of December 2020 and the 21st of February 2021, during the COVID-19 pandemic.

2.2 Computer systems sample

The computer systems sample was represented by artificial emotion classification systems. One of the most commonly used deep learning algorithms, VGG19 with 19 weight layers, was used as an example of a machine learning algorithm (computer system). VGG19 was chosen as an example of a computer system as it is a well-suited standard model and has been shown to achieve high performance (Cheng & Zhou, 2020; Mohan et al., 2021; Shehu, Siddique, et al., 2021).

The VGG19 model was set up using the Keras API (2021) to have the same number of layers as originally introduced in the VGG paper (Simonyan & Zisserman, 2014). The model was set up to run for 200 epochs with an initial learning rate of 0.001. The learning rate was reduced by 10% after 80, 100, 120, and 160 epochs, and by 5% after 180 epochs to avoid overfitting [1] the data. Note that the computer systems were trained with images from the last-half frames of the CK+ dataset based on the technique developed by Shehu et al. (2020b).

[1] Overfitting occurs when a model learns details, including noise, in the training data to such an extent that it negatively impacts the performance of the model on new data. As a result, the model performs very well on the training data but poorly on the test data.
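The training script itself is not reproduced in the paper; below is a minimal sketch, assuming the Keras API, of how the reported setup (a standard VGG19 trained from scratch for 200 epochs with an initial learning rate of 0.001 and step-wise reductions at the stated epochs) could be expressed. The input size, optimizer, and the multiplicative interpretation of "reduced by 10%/5%" are illustrative assumptions, not the authors' code.

```python
# Minimal sketch (not the authors' exact script) of the reported VGG19 setup.
from tensorflow import keras

NUM_CLASSES = 7  # anger, disgust, fear, happy, neutral, sad, surprise

model = keras.applications.VGG19(
    weights=None,               # trained from scratch, standard VGG19 architecture
    input_shape=(224, 224, 3),  # assumed input size
    classes=NUM_CLASSES,
)
model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=1e-3),  # initial learning rate 0.001
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)

def lr_schedule(epoch, lr):
    # "Reduced by 10% after 80, 100, 120, 160 epochs, and by 5% after 180 epochs",
    # interpreted here as multiplicative reductions (factors 0.9 and 0.95).
    if epoch in (80, 100, 120, 160):
        return lr * 0.9
    if epoch == 180:
        return lr * 0.95
    return lr

# x_train, y_train: last-half CK+ frames with one-hot labels (preparation not shown).
# model.fit(x_train, y_train, epochs=200,
#           callbacks=[keras.callbacks.LearningRateScheduler(lr_schedule)])
```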

The accuracy of the VGG19 model is obtained by dividing the number of correctly predicted test images by the total number of presented test images, which in this case is 35 for each type of covering. In VGG19, the prediction of each image returns the probability of associating the image with a particular emotion category, where the summation of all the probabilities is equal to 1 ($\sum_{i=1}^{n} P_i = 1$). The predicted class is the class with the highest probability ($\arg\max_{x \in \{1,\ldots,n\}} P(x)$, where $n$ is the number of emotion categories). These probabilities are usually considered as the confidence of the model.

Due to the stochastic nature of the training process and the non-deterministic nature of the VGG19 model, the model was run 30 times and the results presented are an average obtained over the 30 runs. Thus, the study compared the survey results obtained from 82 human participants with the 30 computer systems (or machine runs).

2.3 Stimulus Material

We included seven categories of emotion, i.e. the six basic emotions defined by Ekman (1971), as well as the neutral expression. The experiment consisted of four blocks presenting facial expressions unmasked, masked, partially masked, and with sunglasses. The unmasked block consisted of images with nothing added to the face. Five images were used for each of the seven categories, i.e. a total of 35 images. The images mainly comprised images of different people, such that participants could not infer the facial expression from a previously seen image of the same person. Similarly, the masked block consisted of 35 emotional images of people with face masks added to the images. The partially masked block consisted of 35 emotional images of people with a partial mask, i.e. a face mask with a transparent window, added to the images and, finally, the sunglasses block consisted of 35 images of people with sunglasses added to the images. All the images used in the experiment are emotional images from the CK+ dataset, as images from this dataset have recently been used in a considerable number and a wide range of studies (Li et al., 2018; Mollahosseini et al., 2016; Zhang et al., 2019). However, only one block (unmasked) consisted of original images from the CK+ dataset. The remaining three blocks (mask, partial mask, and sunglasses) consisted of simulated images with different kinds of face masks, as well as sunglasses. Face masks and sunglasses were added to the images using a strategy that we implemented in our previous research (Shehu, Browne, et al., 2021). Fig. 1a shows sample original and simulated emotional images of the six basic emotions, plus the neutral expression, used here.
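The exact simulation strategy is the one from Shehu, Browne, et al. (2021) and is not reproduced in this paper. Purely to illustrate the general idea of simulated coverings, the sketch below pastes a transparent covering graphic over the lower half of a face image with Pillow; the file names, placement, and scaling are hypothetical and do not describe the authors' method.

```python
from PIL import Image

def add_covering(face_path: str, covering_path: str, out_path: str) -> None:
    """Illustrative only: paste a covering graphic (with alpha channel) onto a face image."""
    face = Image.open(face_path).convert("RGBA")
    covering = Image.open(covering_path).convert("RGBA")

    w, h = face.size
    covering = covering.resize((w, h // 2))           # assume the covering spans the lower half
    face.paste(covering, (0, h // 2), mask=covering)  # alpha channel controls transparency
    face.convert("RGB").save(out_path)

# add_covering("ck_face.png", "mask.png", "ck_face_masked.png")  # hypothetical file names
```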

2.4 Descriptive questions

Human participants were asked to answer a set of questions such as their age, gender, and ethnicity. Furthermore, we asked the participants three additional questions to understand the extent of their prior experience with people wearing masks.

Figure 1: Sample images of faces: a) unmasked (first row), masked (second row), partial mask (third row), and sunglasses (last row) from the CK+ dataset. Left to right: anger, disgust, fear, happy, neutral, sad, and surprise expressions (Shehu et al., 2020b). b) Example trial for the human sample.

The three questions asked of the participants were: "1) For how long did you interact with people wearing masks on a daily basis during the past 3 months?", "2) On average, for how much time per day (in percentage) did you interact with people wearing masks during the past 3 months?", and "3) Of the people you saw on a day-to-day basis during the past 3 months, approximately what percentage were wearing masks on average?". All questions were asked prior to taking the survey and participants answered using a slider ranging from 0 ("not at all") to 100 ("very often"). A mean exposure score (Qmean) was computed by averaging the three questions (M = 36.15, SD = 27.62).

2.5 Emotion categorization task

Fig. 1b shows a sample trial. For each trial, participants were required to choose the emotion category that the image most closely relates to, as well as to rate their confidence in choosing the category of the emotion. There was a total of 140 trials across the 7 categories. All emotion categorization questions were required to be answered, and participants were not taken to the next page unless both the emotion category and the confidence rating had been provided (see Supplementary Materials).

The data collected, as well as the stimulus set used, can be accessed via OSF (https://osf.io/mgx9p/).

2.6 Analysis

Group and stimulus effects on accuracy and confidence ratings were tested using a 2×4×7 repeated-measures ANOVA with the factor Emotion category with 7 levels (anger, disgust, fear, happy, neutral, sad, and surprise), the factor Group (computer systems vs humans), and the factor Covering (unmasked, masked, partial mask, and sunglasses). Post-hoc tests included 2×4 repeated-measures ANOVAs and t-tests.

For all multiple tests, such as post-hoc analyses after the initial overall ANOVAs, the alpha level was adjusted using Bonferroni corrections.
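The C(Group), C(Covering), and C(Emotion) labels in Table 1a suggest a formula-style model. A minimal sketch of such a 2×4×7 factorial analysis with statsmodels is shown below, assuming a long-format table with one accuracy value per observer (or machine run), covering, and emotion category; the file name and column names are assumptions, and the sketch ignores the repeated-measures structure handled in the actual analysis.

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Assumed long-format data: columns accuracy, Group, Covering, Emotion.
df = pd.read_csv("accuracy_long.csv")  # hypothetical file name

# 2 x 4 x 7 factorial model with all interactions, mirroring the effects in Table 1a.
model = ols("accuracy ~ C(Group) * C(Covering) * C(Emotion)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))
```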

3 Results

This section presents the accuracy ratings obtained from humans compared with computer systems in categorizing emotion from covered and uncovered faces.

3.1 Emotion categorization

Initially, a 3-way repeated-measures ANOVA (2×4×7) was performed with the between-subject factor Group (i.e. humans vs computer systems) and the within-subject factors Covering (i.e. unmasked, mask, partial mask, and sunglasses) and Emotion (i.e. anger, disgust, fear, happy, neutral, sad, and surprise) to analyze the achieved accuracy. We found significant main effects for Group, F(1, 3080) = 80.25, p < .001, η² = 0.01, Covering, F(3, 3080) = 40.44, p < .001, η² = 0.09, and Emotion, F(6, 3080) = 81.87, p < .001, η² = 0.02. All interactions between the factors were significant (see Table 1a), as well as the three-way interaction of Group × Covering × Emotion, F(18, 3080) = 28.11, p < .001, η² = 0.09.

Thereafter, the alpha level was adjusted using Bonferroni correction (α = 0.0083) before post-hoc tests. A significant difference between the emotion categories (i.e. anger, disgust, fear, happy, neutral, sad, and surprise) was found for humans, F(6, 567) = 33.18, p < .001, η² = 0.26, and computer systems, F(6, 203) = 33.14, p < .001, η² = 0.50, for the unmasked condition (see Fig. 2a).

Fig. 2a presents the violin plots for the categorization accuracy of unmasked images of people by both humans and computer systems. As can be seen from Fig. 2a, the computer systems classified almost all the classes correctly, achieving an overall accuracy of 98.48%, compared to humans who achieved an overall accuracy of 82.72%. This decrease in the overall performance of humans is based on significantly lower accuracy in categorizing anger, disgust, and fear expressions, with fear contributing the most (error rate > 30%) to the significantly worse performance (see Table 1b) in humans. However, Bonferroni-corrected post-hoc t-tests showed no significant difference in the categorization accuracies achieved by the computer systems across emotion categories, except for the neutral expression (see Table 1c), which was categorized with a significantly lower accuracy compared to other emotion categories.

As can be seen, the classification accuracy achieved by the computer systems significantly decreased from an average accuracy of 98.48% when there was no mask to 24.0% when there were masks on the face (see Fig. 2b). Not only did the average accuracy decrease, but the accuracy achieved in categorizing each class also decreased significantly (> 68%), except for anger (less than 16%). Although the overall accuracy achieved by humans with masks on the face also decreased in each category (from when there was no mask), the decrease is not large (< 26%) compared to the decrease seen in the computer systems' accuracy (up to 74.48%). Also, post-hoc tests showed that differences between emotion categories were significant for both humans, F(6, 567) = 64.85, p < .001, η² = 0.41, and computer systems, F(6, 203) = 125.79, p < .001, η² = 0.79 (α = 0.0083), for the masked faces. Post-hoc two-sample t-tests for the accuracies in each category obtained from humans and the computer systems showed significant differences (p < .001) for most of the emotion categories at α = 0.0014 (see Tables 1d and 1e).

Figure 2: Violin plots showing the accuracy obtained from categorizing a) unmasked, b) masked, c) partial mask, and d) sunglasses emotion images by humans and computer systems.

(a) ANOVA comparing humans and computer systems

Effect  Sum Square  df  F  p  Eta Squared
Intercept  1.76E+05  1.0  438.05  <.001  0.08
C(Group)  3.23E+04  1.0  80.25  <.001  0.01
C(Covering)  4.88E+04  3.0  40.44  <.001  0.09
C(Emotion)  1.98E+05  6.0  81.87  <.001  0.02
C(Group):C(Covering)  1.78E+04  3.0  14.73  <.001  0.09
C(Group):C(Emotion)  1.94E+05  6.0  80.44  <.001  0.01
C(Covering):C(Emotion)  1.56E+05  18.0  21.55  <.001  0.07
C(Group):C(Covering):C(Emotion)  2.03E+05  18.0  28.11  <.001  0.09
Residual  1.24E+06  3080.0  -  -  -

(b) unmasked by humans (cells: p-value with Cohen's d in parentheses; columns in the order Anger, Disgust, Fear, Happy, Neutral, Sad, Surprise)
Anger:  -, 0.123 (-0.40), 0.341 (0.25), <.001 (-1.63), <.001 (-1.00), <.001 (-0.96), <.001 (-1.20)
Disgust:  0.123 (-0.40), -, 0.025 (0.60), <.001 (-1.08), 0.049 (-0.52), 0.048 (-0.52), 0.008 (-0.72)
Fear:  0.341 (0.25), 0.025 (0.60), -, <.001 (-1.59), <.001 (-1.11), 1.00 (-1.08), 1.00 (-1.27)
Happy:  <.001 (-1.63), <.001 (-1.08), <.001 (-1.59), -, 0.008 (0.71), 0.056 (0.51), 0.113 (0.42)
Neutral:  <.001 (-1.00), 0.049 (-0.52), <.001 (-1.11), 0.008 (0.71), -, 0.832 (-0.06), 0.354 (-0.24)
Sad:  <.001 (-0.96), 0.048 (-0.52), <.001 (-1.08), 0.056 (0.51), 0.832 (-0.06), -, 0.558 (-0.15)
Surprise:  <.001 (-1.20), 0.008 (-0.72), <.001 (-1.27), 0.113 (0.42), 0.354 (-0.24), 0.558 (-0.15), -

(c) unmasked by computer systems (same layout)
Anger:  -, 1.00 (0), 1.00 (0), 1.00 (0), <.001 (1.51), 1.00 (0), 1.00 (0)
Disgust:  1.00 (0), -, 1.00 (0), 1.00 (0), <.001 (1.51), 1.00 (0), 1.00 (0)
Fear:  1.00 (0), 1.00 (0), -, 1.00 (0), <.001 (1.51), 1.00 (0), 1.00 (0)
Happy:  1.00 (0), 1.00 (0), 1.00 (0), -, <.001 (1.51), 1.00 (0), 1.00 (0)
Neutral:  <.001 (1.51), <.001 (1.51), <.001 (1.51), <.001 (1.51), -, <.001 (1.51), <.001 (1.51)
Sad:  1.00 (0), 1.00 (0), 1.00 (0), 1.00 (0), <.001 (1.51), -, 1.00 (0)
Surprise:  1.00 (0), 1.00 (0), 1.00 (0), 1.00 (0), <.001 (1.51), 1.00 (0), -

(d) masked by humans (same layout)
Anger:  -, 0.027 (0.59), 0.230 (-0.32), <.001 (-1.36), <.001 (-2.00), 0.276 (0.29), <.001 (-0.91)
Disgust:  0.027 (0.59), -, 0.002 (-0.85), <.001 (-2.07), <.001 (-2.78), 0.327 (-0.26), <.001 (-1.67)
Fear:  0.230 (-0.32), 0.002 (-0.85), -, 0.002 (-0.85), <.001 (-1.40), 0.036 (0.56), 0.106 (0.43)
Happy:  <.001 (-1.36), <.001 (-2.07), 0.002 (-0.85), -, 0.014 (-0.66), <.001 (1.58), 0.020 (0.62)
Neutral:  <.001 (-2.00), <.001 (-2.78), <.001 (-1.40), 0.014 (-0.66), -, <.001 (2.18), <.001 (1.38)
Sad:  0.276 (0.29), 0.327 (-0.26), 0.036 (0.56), <.001 (1.58), <.001 (2.18), -, <.001 (-1.17)
Surprise:  <.001 (-0.91), <.001 (-1.67), 0.106 (0.43), 0.020 (0.62), <.001 (1.38), <.001 (-1.17), -

(e) masked by computer systems (same layout)
Anger:  -, <.001 (5.03), <.001 (2.90), <.001 (6.05), <.001 (3.61), <.001 (4.69), <.001 (4.19)
Disgust:  <.001 (5.03), -, <.001 (-1.80), <.001 (1.00), 0.017 (-0.64), 0.072 (-0.48), <.001 (-1.12)
Fear:  <.001 (2.90), <.001 (-1.80), -, <.001 (2.63), 0.002 (0.87), <.001 (1.45), <.001 (0.94)
Happy:  <.001 (6.05), <.001 (1.00), <.001 (2.63), -, <.001 (-1.24), <.001 (-1.62), <.001 (-2.36)
Neutral:  <.001 (3.61), 0.017 (-0.64), 0.002 (0.87), <.001 (-1.24), -, 0.230 (0.32), 0.605 (-0.13)
Sad:  <.001 (4.69), 0.072 (-0.48), <.001 (1.45), <.001 (-1.62), 0.230 (0.32), -, 0.017 (-0.64)
Surprise:  <.001 (4.19), <.001 (-1.12), <.001 (0.94), <.001 (-2.36), 0.605 (-0.13), 0.017 (-0.64), -

Table 1: Statistical test results presenting a) the ANOVA obtained from comparing humans and computer systems based on their achieved accuracy in categorizing people's emotions with different types of coverings; b) post-hoc t-test results showing the p-value and Cohen's d effect size obtained from comparing unmasked emotion categories by humans and c) computer systems; d) post-hoc t-test results showing the p-value and Cohen's d effect size obtained from comparing masked emotion categories by humans and e) computer systems. Note that the α level in all the post-hoc comparisons has been adjusted to 0.0014 using Bonferroni correction.

The violin plots presented in Fig. 2c show the accuracies obtained from categorizing images by humans and computer systems when a partial mask (a mask with a transparent window) was used to cover the face. In contrast to the use of a fully covering mask, the classification accuracy obtained by the computer systems saw a great increase in categorizing happiness and surprise expressions with partial masks (0% vs 66.67% and 20% vs 73.33%). Also, while the accuracy achieved by humans also increased across all the emotion categories, the largest increase was found for happy, disgust, sad, and surprise expressions (all > 19% increase). Post-hoc tests revealed significant differences for both humans, F(6, 567) = 64.01, p < .001, η² = 0.40, and computer systems, F(6, 203) = 235.28, p < .001, η² = 0.87 (α = 0.0083). Post-hoc two-sample unpaired t-tests showed significant differences (p < .001) for most emotion categories (see Tables 2a and 2b).

The violin plots presented in Fig. 2d show the accuracy obtained from categorizing faces with sunglasses by humans and computer systems. As can be seen, the accuracy achieved by the computer systems decreased most for sad and surprise expressions with sunglasses (all > 27%). Notably, the accuracy achieved by humans in categorizing happiness remained the same both with and without sunglasses. This shows that putting on sunglasses does not reduce humans' happiness detection. As expected, there was a large decrease (> 16%) in the accuracy for fear expressions when the eyes were covered. Post-hoc tests showed that these differences between emotion categories were significant for both humans, F(6, 567) = 74.15, p < .001, η² = 0.44, and computer systems, F(6, 203) = 13.85, p < .001, η² = 0.29 (α = 0.0083), for faces with sunglasses. Post-hoc two-sample unpaired t-tests showed significant differences (p < .001) for most of the emotion categories (see Tables 2c and 2d).

We also tested for differences within each emotion category across the different types of covering (i.e. unmasked, mask, partial mask, and sunglasses). A significant main effect was found for all emotion categories (all p < .001), except for the neutral expression, F(3, 324) = 0.037, p = .99, η² = 0.003 (see Fig. 3a). Although there was a significant main effect, F(3, 324) = 5.36, p < .001 (p = 0.00129), η² = 0.05, for humans categorizing the fear expression across the different types of coverings (see Fig. 3b), the effect size was very small (η² = 0.05) compared with other emotion categories (all η² > 0.2), like anger (see Fig. 3c), that have a larger effect.

It can thus be suggested that humans might be equally poor (see Fig. 3b) at categorizing the fear emotion category with or without a covering on the face, as the effect size is not as large as it is for other expressions (all η² > 0.2, except the neutral expression, η² = 0.003).

Based on the three additional questions asked about exposure to face-masked people, we added Qmean as a factor to a linear regression to investigate whether people's experience with masked individuals might help them achieve higher accuracy in categorizing masked emotion images. The results (see Table 2e) showed significant main effects for the factors Covering, F(3, 6) = 122.19, p < .001, η² = 0.09, and Qmean, F(6, 70) = 3.73, p < .001, η² = 0.07, where Qmean represents the average of the three questions. However, there was no significant interaction between Covering and Qmean, F(6, 210) = 0.62, p = 1.00, η² = 0.03.

(a) partial mask by humans (cells: p-value with Cohen's d in parentheses; columns in the order Anger, Disgust, Fear, Happy, Neutral, Sad, Surprise)
Anger:  -, 0.051 (-0.52), 0.015 (-0.65), <.001 (-2.49), <.001 (-2.02), <.001 (-1.13), <.001 (-2.24)
Disgust:  0.051 (-0.52), -, 0.695 (-0.10), <.001 (-1.66), <.001 (-1.29), 0.047 (-0.53), <.001 (-1.49)
Fear:  0.015 (-0.65), 0.695 (-0.10), -, <.001 (-1.66), <.001 (-1.25), 0.096 (-0.44), <.001 (-1.46)
Happy:  <.001 (-2.49), <.001 (-1.66), <.001 (-1.66), -, 0.112 (0.42), <.001 (1.26), 0.621 (0.13)
Neutral:  <.001 (-2.02), <.001 (-1.29), <.001 (-1.25), 0.112 (0.42), -, 0.002 (0.83), 0.330 (-0.26)
Sad:  <.001 (-1.13), 0.047 (-0.53), 0.096 (-0.44), <.001 (1.26), 0.002 (0.83), -, <.001 (-1.06)
Surprise:  <.001 (-2.24), <.001 (-1.49), <.001 (-1.46), 0.621 (0.13), 0.330 (-0.26), <.001 (-1.06), -

(b) partial mask by computer systems (same layout)
Anger:  -, <.001 (7.29), <.001 (6.11), <.001 (-2.35), 0.566 (0.15), <.001 (8.22), <.001 (-2.16)
Disgust:  <.001 (7.29), -, 0.310 (-0.27), <.001 (-7.49), <.001 (-3.30), 0.319 (0.26), <.001 (-5.56)
Fear:  <.001 (6.11), 0.310 (-0.27), -, <.001 (-6.85), <.001 (-3.07), 0.078 (0.47), <.001 (-5.28)
Happy:  <.001 (-2.35), <.001 (-7.49), <.001 (-6.85), -, <.001 (1.72), <.001 (7.91), 0.104 (-0.43)
Neutral:  0.566 (0.15), <.001 (-3.30), <.001 (-3.07), <.001 (1.72), -, <.001 (3.43), <.001 (-1.82)
Sad:  <.001 (8.22), 0.319 (0.26), 0.078 (0.47), <.001 (7.91), <.001 (3.43), -, <.001 (-5.72)
Surprise:  <.001 (-2.16), <.001 (-5.56), <.001 (-5.28), 0.104 (-0.43), <.001 (-1.82), <.001 (-5.72), -

(c) sunglasses by humans (same layout)
Anger:  -, 0.274 (0.29), 0.002 (0.86), <.001 (-1.64), <.001 (-0.95), 0.024 (0.60), <.001 (-1.30)
Disgust:  0.274 (0.29), -, 0.031 (0.57), <.001 (-1.87), <.001 (-1.21), 0.292 (0.28), <.001 (-1.55)
Fear:  0.002 (0.86), 0.031 (0.57), -, <.001 (-2.34), <.001 (-1.73), 0.169 (-0.36), <.001 (-2.05)
Happy:  <.001 (-1.64), <.001 (-1.87), <.001 (-2.34), -, 0.012 (0.68), <.001 (2.57), 0.158 (0.37)
Neutral:  <.001 (-0.95), <.001 (-1.21), <.001 (-1.73), 0.012 (0.68), -, <.001 (1.68), 0.207 (-0.33)
Sad:  0.024 (0.60), 0.292 (0.28), 0.169 (-0.36), <.001 (2.57), <.001 (1.68), -, <.001 (-2.12)
Surprise:  <.001 (-1.30), <.001 (-1.55), <.001 (-2.05), 0.158 (0.37), 0.207 (-0.33), <.001 (-2.12), -

(d) sunglasses by computer systems (same layout)
Anger:  -, 0.570 (0.15), 0.244 (0.31), 0.576 (-0.15), <.001 (1.27), <.001 (1.72), <.001 (1.49)
Disgust:  0.570 (0.15), -, 0.566 (0.15), 0.354 (-0.24), <.001 (1.12), <.001 (1.22), <.001 (1.24)
Fear:  0.244 (0.31), 0.566 (0.15), -, 0.158 (-0.38), <.001 (0.99), <.001 (0.94), <.001 (1.05)
Happy:  0.576 (-0.15), 0.354 (-0.24), 0.158 (-0.38), -, <.001 (1.27), <.001 (1.47), <.001 (1.42)
Neutral:  <.001 (1.27), <.001 (1.12), <.001 (0.99), <.001 (1.27), -, 0.084 (-0.46), 0.550 (-0.16)
Sad:  <.001 (1.72), <.001 (1.22), <.001 (0.94), <.001 (1.47), 0.084 (-0.46), -, 0.156 (0.38)
Surprise:  <.001 (1.49), <.001 (1.24), <.001 (1.05), <.001 (1.42), 0.550 (-0.16), 0.156 (0.38), -

(e) ANOVA for human sample including Qmean

Effect  Sum Square  df  F  p  Eta Squared
C(Emotion)  5.28E+05  6.0  172.42  <.001  0.27
C(Covering)  1.87E+05  3.0  122.19  <.001  0.09
C(Qmean)  1.33E+05  70.0  3.73  <.001  0.07
C(1/ID)  5.13E+04  81.0  1.24  0.076  0.03
C(Covering):C(Qmean)  6.62E+04  210.0  0.62  1.00  0.03
Residual  1.02E+06  1995.0  -  -  -

Table 2: Statistical test results presenting a) post-hoc t-test results showing the p-value and Cohen's d effect size obtained from comparing partial mask emotion categories by humans and b) computer systems; c) post-hoc t-test results showing the p-value and Cohen's d effect size obtained from comparing sunglasses emotion categories by humans and d) computer systems; e) regression results from comparing humans based on their exposure to masked people and their achieved accuracy in categorizing people's emotions with different coverings. Note that the α level in all the post-hoc comparisons has been adjusted to 0.0014 using Bonferroni correction.

Figure 3: Violin plots showing the accuracy obtained a) from categorizing the neutral expression by humans across the four different coverings, b) from categorizing the fear expression by humans across the four different coverings, c) from categorizing the anger expression by humans across the four different coverings, and d) for the neutral and anger expressions before and after the label change.

3.2 Misclassification Analysis

As an additional analysis, we investigated the types of misclassification made by humans and computer systems.

The results showed that humans mainly misclassified fully covered and therefore unclear expressions as neutral. However, the misclassifications varied for the computer systems. For instance, the computer systems mainly classified unclear emotions as anger when fully covering masks were used, while partially covered facial expressions were mainly classified as happiness or surprise (see Fig. 4).

As such misclassifications can result in costly negative outcomes and complications for humans interacting with computer systems, we aimed to regulate the cost of misclassification in these computer systems. To make them resemble human classification behavior, we aimed to alter the training so that they would also misclassify unclear expressions as neutral.

We hypothesize that humans mainly misclassify unclear emotions as neutral because it is the expression they are exposed to predominantly on a day-to-day basis. In other words, humans would classify unclear expressions as neutral because it is the expression they see on people's faces most of the time. Therefore, by oversampling the neutral expression images used to train the computer systems, we anticipated that the models would behave more similarly to humans, as they would see more neutral expressions compared to other emotion categories.
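A minimal sketch of this kind of oversampling, under the assumption that it is done by duplicating neutral training samples until a chosen neutral-to-other ratio is reached (the ratios themselves are evaluated in Section 3.2.2), is given below; the function and variable names are illustrative.

```python
import numpy as np

def oversample_neutral(x_train, y_train, neutral_class, ratio=(5, 2), seed=0):
    """Duplicate neutral samples so their count is about ratio[0]/ratio[1] times
    the average number of images per non-neutral category."""
    rng = np.random.default_rng(seed)
    neutral_idx = np.where(y_train == neutral_class)[0]
    other_idx = np.where(y_train != neutral_class)[0]

    per_other_class = len(other_idx) / len(np.unique(y_train[other_idx]))
    target = int(round(per_other_class * ratio[0] / ratio[1]))

    extra = rng.choice(neutral_idx, size=max(target - len(neutral_idx), 0), replace=True)
    keep = np.concatenate([np.arange(len(y_train)), extra])
    return x_train[keep], y_train[keep]

# Example: x_bal, y_bal = oversample_neutral(x_train, y_train, neutral_class=4, ratio=(5, 2))
```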

3.2.1 Method

Since classifying any emotional category as a neutral expression is considered less costly compared to when it is classified as another emotional category such as anger, happy, or sad, the misclassification cost of the models was evaluated without taking the samples that were misclassified as neutral into account. In particular, the quality of the classifier is not measured based on the total misclassification but rather using Equation 1 below:

$$\text{cost} = \frac{\text{Err} - \text{Err}_{\text{neutral}}}{\text{No. test samples}} \quad (1)$$

where
– Err represents the misclassified samples, i.e. the total misclassification
– Err_neutral represents the samples misclassified as neutral
– No. test samples represents the total number of test sample images
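Equation 1 can be computed directly from the predicted and true labels; a short sketch (variable names are illustrative) is shown below.

```python
import numpy as np

def misclassification_cost(y_true, y_pred, neutral_class) -> float:
    """Equation 1: (Err - Err_neutral) / number of test samples."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    err = np.sum(y_pred != y_true)                                        # total misclassifications
    err_neutral = np.sum((y_pred != y_true) & (y_pred == neutral_class))  # misclassified as neutral
    return (err - err_neutral) / len(y_true)

# Example: cost = misclassification_cost(y_true, y_pred, neutral_class=4)  # neutral index assumed
```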

3.2.2 Results

In Table 3, the column Oversample ratio represents the oversampling ratio of the category neutral against all other emotion categories. For instance, the value "2:2" means no oversampling, whereas the ratio "4:2" means there were twice as many neutral expression images in the training set compared to images from the other emotion categories. Different ratios were chosen based on the number of available images in the CK+ dataset, so as to find the ratio that gives a much lower misclassification cost while maintaining a reasonable accuracy.

Oversample ratio (Neutral:All)  Misclassification cost (Mask / Partial mask / Sunglasses)  Avg cost  Accuracy (Mask / Partial mask / Sunglasses)  Avg accuracy
2:2  0.61 / 0.47 / 0.18  0.41  0.24 / 0.33 / 0.80  0.46
3:2  0.40 / 0.54 / 0.23  0.39  0.20 / 0.14 / 0.68  0.34
4:2  0.26 / 0.34 / 0.17  0.26  0.17 / 0.23 / 0.71  0.37
5:2  0.25 / 0.39 / 0.10  0.25  0.25 / 0.17 / 0.66  0.36

Table 3: Accuracies and misclassification costs obtained from classifying covered images with and without oversampling of neutral expression images.

The average misclassification cost obtained in each of the oversampled cases was less than the average cost without oversampling. Thus, oversampling reduced the cost of misclassification, with the lowest misclassification cost for the model trained with oversampled data at a 5:2 ratio.

Fig. 4a, 4b, and 4c present the confusion matrices [2] obtained from classifying mask, partial mask, and sunglasses images by humans, Fig. 4d, 4e, and 4f present the confusion matrices obtained from classifying mask, partial mask, and sunglasses images when no oversampling was applied, and Fig. 4g, 4h, and 4i present the confusion matrices obtained from classifying mask, partial mask, and sunglasses images with the best oversampling ratio (i.e. 5:2) applied to the training images. Humans mainly misclassified unclear expressions as neutral when the face was fully covered by a mask. However, while the misclassification varied across coverings when no oversampling was applied to the computer systems, after oversampling, the majority of the misclassified images were misclassified as neutral. Although the average accuracy achieved by the model without oversampling is larger (> 10%) than the average accuracy achieved by the model with the best oversampling ratio (5:2), the average misclassification cost achieved by the model with the best oversampling ratio was significantly lower (16% lower) than the average misclassification cost of the model with no oversampling, t(58) = 30.98, p < .001.

4 Discussion

This work aimed to compare the performance of humans and computer systems in categorizing emotion from the faces of people with sunglasses and face masks, as well as to compare the decrease caused by each covering. The work also introduces an approach that reduces the misclassification cost for the computer systems.

In addition to replicating previous findings that computer systems (with 98.48% accuracy) perform better than humans (with 82.72% accuracy) in categorizing emotions from uncovered/unmasked images from the same dataset (Barrett et al., 2019; Krumhuber et al., 2019; Krumhuber et al., 2020; Matsumoto & Hwang, 2011), we found that face masks (both partial and fully covering masks) and sunglasses reduced the accuracy of both humans and computer systems in emotion categorization. Consistent with Carbon (2020), we also found that humans mainly classify faces with fully covering masks as neutral.

While face masks impair emotion categorization in both humans and computer systems by a significant amount, emotion categories are affected differently depending on the type of covering used. For instance, the use of a fully covering mask mainly obscured disgust, happiness, and sadness for computer systems, whereas anger, disgust, and sadness were more difficult to categorize for humans. The partial masks mainly impaired anger categorization in humans, whereas computer systems were mainly impaired for disgust, fear, and sadness. And while the use of sunglasses mainly reduced fear categorization ability in humans, it was neutral and surprise for computer systems.

[2] A confusion matrix is a table used to visualize the performance of an algorithm. In a confusion matrix where each row represents the number of instances in a predicted class, the columns represent the number of instances in an actual class (or vice versa).

Figure 4: Confusion matrices obtained from classifying a) mask, b) partial mask, and c) sunglasses images by humans; from classifying d) mask, e) partial mask, and f) sunglasses images by computer systems without oversampling; and from classifying g) mask, h) partial mask, and i) sunglasses images by computer systems with oversampling [5:2]. Note that the 82 human participants each classified 35 images for each covering, i.e. a total of 410 images for each emotion category and 2870 for each covering. Similarly, the 30 computer systems each classified 35 images for each covering, i.e. a total of 150 images for each emotion category and 1050 for each covering. An = anger, Di = disgust, Fe = fear, Ha = happy, Ne = neutral, Sa = sadness and Su = surprise.

Previous work (Delicato & Mason, 2015; Shehu, Siddique, et al., 2021) has shown the relevance of the mouth for correctly categorizing happiness and surprise in both humans and computer systems. Therefore, since the partial masks allow the mouth to be visible through the transparent window, it was anticipated that happiness and surprise would be categorized more accurately when the partial mask is applied. Indeed, a careful inspection of the results showed a large increase in the classification accuracy of happiness and surprise expressions in both computer systems and humans. While there was an increase of 19% (for happy) and 28% (for surprise) in the human sample, there was an increase of up to 66% for the category happy and up to 55% for the surprise expression in computer systems when the partial mask was used.

While we found a significant difference in the accuracy of humans categorizing each emotion category across the different coverings, no significant difference was found for the neutral expression across the coverings. As humans classify unclear emotions as neutral, they were necessarily highly accurate in categorizing the neutral expression (see Fig. 3a).

In contrast to humans, the classification of unclear emotions varied significantly in the computer systems. For instance, the computer systems mainly classified unclear emotions as anger expressions when fully covering masks were used. However, unclear emotions were mainly classified as happiness or surprise when there was a partial mask on the face. An initial observation of the confusion matrix obtained from categorizing masked emotion images (see Fig. 4d) suggested that the computer systems might mainly classify unclear emotions as anger expressions simply because the anger expression appears first after one-hot encoding [3] the labels that were fed to the classifier. Therefore, we changed the label name of the neutral expression to 'aneutral' such that the neutral expression appears first after the labels are one-hot encoded. The violin plots in Fig. 3d show the accuracy obtained for the neutral and anger expressions before and after the labels were changed. As can be seen from Fig. 3d, the accuracy obtained before and after the label change was nearly the same or relatively similar for both the anger and neutral expressions. In addition, we found no significant difference in the achieved accuracies before and after the label was changed. Thus, label order was not important.
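The label-order manipulation described above can be illustrated in a few lines; the sketch below uses pandas.get_dummies as an example of alphabetical one-hot encoding (an assumption about the encoding step), showing that renaming 'neutral' to 'aneutral' only changes which category occupies the first one-hot column.

```python
import pandas as pd

labels = ["anger", "disgust", "fear", "happy", "neutral", "sad", "surprise"]

# Alphabetical encoding: 'anger' occupies the first one-hot column.
print(pd.get_dummies(pd.Series(labels)).columns.tolist())
# ['anger', 'disgust', 'fear', 'happy', 'neutral', 'sad', 'surprise']

# Renaming 'neutral' to 'aneutral' makes it sort before 'anger', so the neutral
# expression occupies the first column instead; nothing else about the data changes.
renamed = ["aneutral" if label == "neutral" else label for label in labels]
print(pd.get_dummies(pd.Series(renamed)).columns.tolist())
# ['aneutral', 'anger', 'disgust', 'fear', 'happy', 'sad', 'surprise']
```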

Although the accuracy achieved by the computer systems is relatively low when the fully covering masks were used, the above-mentioned observation suggests that the predictions of the computer systems were not fully random. One possible reason why the images with fully covering masks were mainly classified as anger could be that the features produced (extracted) [4] by the computer systems when the fully covering masks were used have high similarities with those of anger expression images.

[3] One-hot encoding is a process whereby categorical variables are converted into a form that can be provided to machine learning algorithms.
[4] Feature extraction is the process by which an initial set of raw data (in this case image pixels) is reduced to manageable groups for processing.

Depending on the type of covering, the classification accuracy of humans changes; there is no general order of increase or decrease in the achieved accuracy. However, based on the results obtained in this study, the classification accuracy of the computer systems correlates with the amount of occlusion: the larger the occlusion, the larger the decrease, and vice-versa. For instance, the highest accuracy was achieved with nothing added to the image. The achieved accuracy decreased when sunglasses (covering 10-15% of the face) were used. And although a further decrease was observed when partial masks (covering 30-35%) were used, the largest decrease occurred when the fully covering face masks (covering 60-65%) were used. It is important to bear in mind that this observation is only valid for the types of coverings used in this research, as it is unknown whether the decrease is based on the percentage of the face that is covered or on which important features are covered. This is an important issue for future research.

We also found that people's exposure to people wearing masks does not affect their performance in categorizing emotion. Although more exposure to people wearing masks was related to increased accuracy across all facial expressions, we found no evidence that people's exposure to masked individuals helps them to be specifically better at categorizing masked emotions (see Table 2e). Taken together, these results suggest that people perform worst at categorizing masked emotion images independent of their exposure to masked individuals. In other words, being around more masked people does not increase the ability to categorize covered emotional facial expressions.

The findings based on the misclassification analysis suggest that it might be important to feed imbalanced data to classifiers. For instance, in situations like this, where feeding balanced data might lead to differences in misclassifications, feeding the classifier with data that is oversampled based on a certain (less risky) expression is more important than feeding data with an equal distribution from each expression, as the latter results in a high cost of misclassification.

Not only did oversampling reduce the misclassification cost of the covered images when a specified class was targeted, it also increased the classification accuracy when classifying the fully covering mask images: the accuracy achieved by the oversampling method (ratio 5:2) was greater than the accuracy achieved when no oversampling was used. Nevertheless, oversampling the data might also come at the cost of decreasing the classification accuracy, e.g. the accuracies obtained from classifying the partial mask and the sunglasses images were reduced after the neutral expression was oversampled.

This gives rise to the question of which performance measure to prefer: is the accuracy or the misclassification cost more relevant, taking into account the protective, mediating, and moderating effects of these measures (Brown & Singh, 2014; Rogers, 2000)? While there is no general answer to this question, both the accuracy and the misclassification cost need to be considered. For instance, misclassifying anger or sadness as happiness could have extreme consequences for future decision making in artificial intelligence systems like robots. At the same time, an organization is more likely to introduce robots that can accurately understand people's emotions into shared workspaces compared to less accurate ones.

It is also worth stating that the finding that oversampling reduced the cost of misclassification cannot be generalized to all the different types of face coverings. For instance, the misclassification cost with no oversampling for faces with sunglasses was lower (18%) compared to when oversampling (ratio 3:2) was used (23%).

Across all images, humans spent approximately 8.47 s to predict the category of an image and 19.77 min on the complete survey, while it took the computer systems less than a second (0.007 s) to classify each image and approximately 1 second to classify all 140 images. Nevertheless, the computer systems took approximately 1 hr 20 min to train on a 24 GB Graphical Processing Unit (GPU) machine (Quadro RTX 6000, CUDA version 11.2). Thus, computer systems can predict the expression of the images faster than humans, provided they are already trained. Humans train throughout their lifetime, although they are already 'experts' at recognizing emotions at an early age (Decety, 2010; Pollak & Kistler, 2002).

Implications, limitations, and recommendations

Together, the results obtained from the study showed that while the computer systems outperformed humans with no covering on the face, the performance of the computer systems showed a larger decrease when the face was partially covered. Further, as the computer systems make different mistakes, we introduced a method to reduce the cost of misclassification for the computer systems in order to avoid extreme consequences in future decision making. These findings suggest that greater efforts are needed to develop artificial emotion classification systems that can adapt to more situations, e.g. by taking into account contextual features. Thus, care is needed prior to using results from computer systems in psychological experiments.

The study also has some potential limitations. For instance, the overall findings for what is referred to as computer systems in this research may be somewhat limited to end-to-end deep learning models, as different machine learning techniques might achieve different results depending on the feature extraction technique, as well as the algorithm used to categorize the emotion. In addition, the study was performed on a single stimulus set; therefore, the question of whether the findings would generalize to other stimulus sets remains open. Future studies should address this issue by including other stimulus sets, as well as other artificial emotion classification methods.

Acknowledgements The authors would like to thank the participants for putting in the effort to complete the survey.

5 Declarations

Funding
This research was not supported by any organization.

Competing interests
The authors declare no competing interests.

Ethics Approval
Ethics approval has been obtained from the Human Ethics Committee (HEC) of Victoria University of Wellington (Application no: 28949).

Consent to participate
All participants gave informed consent for participation in the study.

Consent for publication
All authors have approved the manuscript and agree with the submission of the work.

Data availability
Data used in this study can be downloaded from OSF at https://osf.io/mgx9p

Code availability
Code implemented in this study can be downloaded from OSF at https://osf.io/mgx9p

Authors' Contributions
Harisu Abdullahi Shehu: Conceived and designed the experiments, performed the experiments, analyzed the data, wrote the paper; Will Browne: Contributed materials/analysis tools, supervision, writing – review & editing; Hedwig Eisenbarth: Contributed materials/analysis tools, supervision, writing – review & editing.

References

Barrett, L. F., Adolphs, R., Marsella, S., Martinez, A. M., & Pollak, S. D. (2019). Emotional expressions reconsidered: Challenges to inferring emotion from human facial movements. Psychological Science in the Public Interest, 20(1), 1–68.
Berryhill, J., Heang, K. K., Clogher, R., & McBride, K. (2019). Hello, world: Artificial intelligence and its use in the public sector.
Brown, J., & Singh, J. P. (2014). Forensic risk assessment: A beginner's guide. Archives of Forensic Psychology, 1(1), 49–59.
Calvo, M. G., & Lundqvist, D. (2008). Facial expressions of emotion (KDEF): Identification under different display-duration conditions. Behavior Research Methods, 40(1), 109–115.
Camras, L. A., & Allison, K. (1985). Children's understanding of emotional facial expressions and verbal labels. Journal of Nonverbal Behavior, 9(2), 84–94.
Carbon, C.-C. (2020). Wearing face masks strongly confuses counterparts in reading emotions. Frontiers in Psychology, 11, 2526.
Cheng, S., & Zhou, G. (2020). Facial expression recognition method based on improved VGG convolutional neural network. International Journal of Pattern Recognition and Artificial Intelligence, 34(07), 2056003.
Ciraco, M., Rogalewski, M., & Weiss, G. (2005). Improving classifier utility by altering the misclassification cost ratio. Proceedings of the 1st International Workshop on Utility-Based Data Mining, 46–52.
Coleman, T. J. (2020). Coronavirus: Call for clear face masks to be 'the norm'. BBC News. Accessed: 26 May, 2020. https://www.bbc.com/news/world-52764355
Decety, J. (2010). The neurodevelopment of empathy in humans. Developmental Neuroscience, 32(4), 257–267.
Delicato, L. S., & Mason, R. (2015). Happiness is in the mouth of the beholder and fear in the eyes. Journal of Vision, 15(1378).
Eisenbarth, H., Alpers, G. W., Segrè, D., Calogero, A., & Angrilli, A. (2008). Categorization and evaluation of emotional faces in psychopathic women. Psychiatry Research, 159(1-2), 189–195.
Ekman, P., & Friesen, W. V. (1971). Constants across cultures in the face and emotion. Journal of Personality and Social Psychology, 17(2), 124. https://doi.org/10.1109/AFGR.2000.840611
Grundmann, F., Epstude, K., & Scheibe, S. (2021). Face masks reduce emotion-recognition accuracy and perceived closeness. PLoS One, 16(4), e0249792.
Grzymala-Busse, J. W. (2005). Rule induction. Data Mining and Knowledge Discovery Handbook (pp. 277–294). Springer.
Guarnera, M., Magnano, P., Pellerone, M., Cascio, M. I., Squatrito, V., & Buccheri, S. L. (2018). Facial expressions and the ability to recognize emotions from the eyes or mouth: A comparison among old adults, young adults, and children. The Journal of Genetic Psychology, 179(5), 297–310.
Hugenberg, K. (2005). Social categorization and the perception of facial affect: Target race moderates the response latency advantage for happy faces. Emotion, 5(3), 267.
Japkowicz, N. et al. (2000). Learning from imbalanced data sets: A comparison of various strategies. AAAI Workshop on Learning from Imbalanced Data Sets, 68, 10–15.
Keras API. (2021). https://keras.io/
Krumhuber, E. G., Küster, D., Namba, S., Shah, D., & Calvo, M. G. (2019). Emotion recognition from posed and spontaneous dynamic expressions: Human observers versus machine analysis. Emotion.
Krumhuber, E. G., Küster, D., Namba, S., & Skora, L. (2020). Human and machine validation of 14 databases of dynamic facial expressions. Behavior Research Methods, 1–16.
Kukar, M., Kononenko, I., Grošelj, C., Kralj, K., & Fettich, J. (1999). Analysing and improving the diagnosis of ischaemic heart disease with machine learning. Artificial Intelligence in Medicine, 16(1), 25–50.
Lewinski, P., den Uyl, T. M., & Butler, C. (2014). Automated facial coding: Validation of basic emotions and FACS AUs in FaceReader. Journal of Neuroscience, Psychology, and Economics, 7(4), 227.
Li, M., Xu, H., Huang, X., Song, Z., Liu, X., & Li, X. (2018). Facial expression recognition with identity and emotion joint learning. IEEE Transactions on Affective Computing, 1–1. https://doi.org/10.1109/TAFFC.2018.2880201
Marini, M., Ansani, A., Paglieri, F., Caruana, F., & Viola, M. (2021). The impact of facemasks on emotion recognition, trust attribution and re-identification. Scientific Reports, 11(1), 1–14.
Matsumoto, D., & Hwang, H. S. (2011). Evidence for training the ability to read microexpressions of emotion. Motivation and Emotion, 35(2), 181–191.
Miller, E. J., Krumhuber, E. G., & Dawel, A. (2020). Observers perceive the Duchenne marker as signaling only intensity for sad expressions, not genuine emotion. Emotion.
Mohan, K., Seal, A., Krejcar, O., & Yazidi, A. (2021). FER-net: Facial expression recognition using deep neural net. Neural Computing and Applications, 1–12.
Mollahosseini, A., Chan, D., & Mahoor, M. H. (2016). Going deeper in facial expression recognition using deep neural networks. 2016 IEEE Win-

679 ter Conference on Applications of Computer Vision (WACV), 1–10.

680 https://doi.org/10.1109/WACV.2016.7477450

681 Noyes, E., Davis, J. P., Petrov, N., Gray, K. L., & Ritchie, K. L. (2021). The

682 effect of face masks and sunglasses on identity and expression recogni-

683 tion with super-recognizers and typical observers. Royal Society Open

684 Science, 8 (3), 201169.

685 Pazzani, M., Merz, C., Murphy, P., Ali, K., Hume, T., & Brunk, C. (1994).

686 Reducing misclassification costs. Machine learning proceedings 1994

687 (pp. 217–225). Elsevier.

688 Picard, R. W., Vyzas, E., & Healey, J. (2001). Toward machine emotional in-

689 telligence: Analysis of affective physiological state. IEEE transactions

690 on pattern analysis and machine intelligence, 23 (10), 1175–1191.

691 Pollak, S. D., & Kistler, D. J. (2002). Early experience is associated with

692 the development of categorical representations for facial expressions

693 of emotion. Proceedings of the National Academy of , 99 (13),

694 9072–9076.

695 Roberson, D., Kikutani, M., D¨oge,P., Whitaker, L., & Majid, A. (2012).

696 Shades of emotion: What the addition of sunglasses or masks to faces 24 Shehu H. A. et al.

697 reveals about the development of facial expression processing. Cogni-

698 tion, 125 (2), 195–206.

699 Rogers, R. (2000). The uncritical of risk assessment in forensic

700 practice. and human behavior, 24 (5), 595–605.

701 Shehu, H. A., Browne, W. N., & Eisenbarth, H. (2020a). An adversarial attacks

702 resistance-based approach to emotion recognition from images using

703 facial landmarks. 2020 29th IEEE International Conference on Robot

704 and Human Interactive Communication (RO-MAN), 1307–1314.

705 Shehu, H. A., Browne, W. N., & Eisenbarth, H. (2020b). Emotion categoriza-

706 tion from video-frame images using a novel sequential voting tech-

707 nique. International Symposium on Visual Computing, 618–632.

708 Shehu, H. A., Browne, W. N., & Eisenbarth, H. (2021). Emotion categorization

709 from faces of people with sunglasses and facemasks. Preprint (version

710 1) available at Research Square. https://doi.org/10.21203/rs.3.rs-

711 375319/v1]

712 Shehu, H. A., Siddique, A., Browne, W. N., & Eisenbarth, H. (2021). Lateral-

713 ized approach for robustness against attacks in emotion categorization

714 from images. 24th International Conference on Applications of Evo-

715 lutionary Computation (EvoApplications 2021).

716 Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for

717 large-scale image recognition. arXiv preprint arXiv:1409.1556.

718 Van den Stock, J., Righart, R., & De Gelder, B. (2007). Body expressions

719 influence recognition of emotions in the face and voice. Emotion, 7 (3),

720 487.

721 Wacker, J., Heldmann, M., & Stemmler, G. (2003). Separating emotion and

722 motivational direction in fear and anger: Effects on frontal asymmetry.

723 Emotion, 3 (2), 167.

724 Wegrzyn, M., Vogt, M., Kireclioglu, B., Schneider, J., & Kissler, J. (2017).

725 Mapping the emotional face. how individual face parts contribute to

726 successful emotion recognition. PloS one, 12 (5), e0177239.

727 Yuki, M., Maddux, W. W., & Masuda, T. (2007). Are the windows to the

728 the same in the east and west? cultural differences in using the

729 eyes and mouth as cues to recognize emotions in and the united

730 states. Journal of Experimental Social Psychology, 43 (2), 303–311.

731 Zhang, T., Zheng, W., Cui, Z., Zong, Y., & Li, Y. (2019). Spatial–temporal

732 recurrent neural network for emotion recognition. IEEE Transactions

733 on , 49 (3), 839–847. https://doi.org/10.1109/TCYB.

734 2017.2788081

Supplementary Materials

Confidence ratings

With regard to confidence, just as for the emotion categories, a 3-way repeated-measures ANOVA (2×4×7) was performed with the between-subject factor Group (i.e. humans vs computer systems) and the within-subject factors Covering (i.e. unmasked, mask, partial mask, and sunglasses) and Emotion (i.e. anger, disgust, fear, happy, neutral, sad, and surprise), with the confidence obtained from categorizing each emotion as the dependent variable. We found significant main effects of Group, F(1, 3080) = 139.85, p < .001, η² = 0.03, Covering, F(3, 3080) = 39.49, p < .001, η² = 0.07, and Emotion, F(6, 3080) = 58.96, p < .001, η² = 0.02. All two-way interactions between the factors were significant (see Table 4a), as was the three-way interaction of Group × Covering × Emotion, F(18, 3080) = 28.86, p < .001, η² = 0.01.
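
The effect labels in Table 4a (e.g. C(Group):C(Covering)) follow the patsy/statsmodels formula convention, so an ANOVA of this form can be reproduced along the lines sketched below. The sketch is purely illustrative and not the released analysis code: the file name, the long-format column names (Group, Covering, Emotion, Confidence), and the eta-squared computation are assumptions.

```python
# Minimal sketch (not the authors' code): a 2 x 4 x 7 factorial ANOVA on
# confidence ratings with statsmodels. File and column names are assumptions.
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

ratings = pd.read_csv("confidence_ratings.csv")  # hypothetical long-format file

# Full factorial model with all two- and three-way interactions, mirroring the
# C(Group), C(Covering) and C(Emotion) terms listed in Table 4a.
model = ols("Confidence ~ C(Group) * C(Covering) * C(Emotion)", data=ratings).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

# Classical eta squared per effect, assumed here to be SS_effect / SS_total.
anova_table["eta_sq"] = anova_table["sum_sq"] / anova_table["sum_sq"].sum()
print(anova_table)
```

Note that this OLS-based sketch treats every rating as independent; a strict repeated-measures analysis would additionally model the within-subject structure, for example with a mixed-effects model.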

Fig. 5a presents the classification confidence obtained in categorizing unmasked emotion images by humans and computer systems. Whereas humans classified the unmasked images with a maximum confidence of only 89% (achieved for the happy expression), the computer systems classified them with very high confidence, exceeding 97% for all emotion categories except the neutral expression, which was classified with a confidence of 88.98%. This is understandable, as with no mask on the face the artificial system categorized every expression with an accuracy of 100%, except for the neutral expression, which was categorized with an accuracy of 89.33%. Post-hoc tests showed a significant difference in classification confidence across emotion categories for both humans, F(6, 567) = 36.28, p < .001, η² = 0.28, and computer systems, F(6, 203) = 110.71, p < .001, η² = 0.77 (α = 0.0083), when there was no mask on the face. Post-hoc two-sample t-tests for the differences in classification confidence between each pair of categories for the unmasked emotional images, by both humans and computer systems, showed a significant difference (p < .001) for most of the emotion categories (see Tables 4b and 4c).
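
Pairwise comparison tables such as Tables 4b and 4c (p-values with Cohen's d in parentheses) can be generated with two-sample t-tests under a Bonferroni-adjusted α. The following sketch is a generic illustration rather than the released analysis code: the data layout, the pooled-standard-deviation definition of Cohen's d, and the use of an independent-samples test are assumptions.

```python
# Sketch of pairwise post-hoc comparisons between emotion categories:
# two-sample t-tests with Cohen's d and a Bonferroni-corrected alpha.
from itertools import combinations
import numpy as np
from scipy import stats

def cohens_d(a, b):
    """Cohen's d using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * np.var(a, ddof=1) + (nb - 1) * np.var(b, ddof=1)) / (na + nb - 2)
    return (np.mean(a) - np.mean(b)) / np.sqrt(pooled_var)

def pairwise_posthoc(confidence_by_emotion, family_alpha=0.05):
    """Compare every pair of categories; returns {(cat_a, cat_b): (p, d, significant)}."""
    pairs = list(combinations(sorted(confidence_by_emotion), 2))
    alpha = family_alpha / len(pairs)  # Bonferroni-adjusted alpha
    results = {}
    for cat_a, cat_b in pairs:
        a, b = confidence_by_emotion[cat_a], confidence_by_emotion[cat_b]
        _, p = stats.ttest_ind(a, b)  # independent two-sample t-test
        results[(cat_a, cat_b)] = (p, cohens_d(a, b), p < alpha)
    return results

# Toy usage with random data; real confidence ratings would come from the study data.
rng = np.random.default_rng(0)
demo = {name: rng.normal(loc, 10, size=40)
        for name, loc in [("anger", 70), ("happy", 90), ("neutral", 75)]}
for pair, (p, d, sig) in pairwise_posthoc(demo).items():
    print(pair, f"p={p:.3f}", f"d={d:.2f}", "significant" if sig else "n.s.")
```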

Fig. 5b presents the confidence obtained from categorizing masked emotion images by humans and computer systems. Here, the computer systems achieved higher classification confidence than the humans only for the anger expression, reaching 81.98% in contrast to 59.06% for humans. Surprisingly, while humans classified the happiness images with very high confidence (> 89%) compared to the other emotion categories (all < 84%), the classification confidence of the computer systems was lowest for the happiness expression (< 0.001%). This shows that the computer systems have nearly zero confidence in the classification of happiness when the mouth is covered with a fully covering face mask. Also, post-hoc tests showed a significant difference in classification confidence across emotion categories for both humans, F(6, 567) = 49.70, p < .001, η² = 0.35, and computer systems, F(6, 203) = 205.75, p < .001, η² = 0.86 (α = 0.0083), when masks were on the face. Post-hoc two-sample t-tests for the differences in classification confidence between each pair of categories for the masked emotional images, by both humans and computer systems, showed a significant difference (p < .001) for most of the emotion categories (see Tables 4d and 4e).
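
For the computer systems, one common way to obtain a per-image confidence value is to take the softmax probability assigned to the predicted class; whether the study used exactly this definition is not restated here, so the Keras-based sketch below is purely illustrative. The model file name, the 224 × 224 input size, the preprocessing, and the label ordering are all assumptions.

```python
# Illustrative sketch (not the authors' code): reading a classifier's confidence
# as the softmax probability of its predicted class.
import numpy as np
import tensorflow as tf

LABELS = ["anger", "disgust", "fear", "happy", "neutral", "sad", "surprise"]  # assumed order

model = tf.keras.models.load_model("emotion_vgg19.h5")  # hypothetical fine-tuned VGG19

def predict_with_confidence(image_path):
    img = tf.keras.preprocessing.image.load_img(image_path, target_size=(224, 224))
    x = tf.keras.preprocessing.image.img_to_array(img)[np.newaxis]
    x = tf.keras.applications.vgg19.preprocess_input(x)
    probs = model.predict(x, verbose=0)[0]  # softmax over the seven categories
    idx = int(np.argmax(probs))
    return LABELS[idx], float(probs[idx])   # predicted label and its confidence

label, confidence = predict_with_confidence("masked_face.jpg")  # hypothetical image
print(f"{label}: {confidence:.2%}")
```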

Figure 5: Violin plots showing the confidence obtained from categorizing (a) unmasked, (b) masked, (c) partial mask, and (d) sunglasses emotion images by humans and computer systems.

Fig. 5c presents the confidence obtained from categorizing images by humans and computer systems when the partial mask was used as the type of covering. As can be seen from the figure, the confidence of the computer systems in categorizing the happy and surprise expressions increased with the partial mask, whereas the classification confidence of humans for these two expressions decreased compared to when the fully covering mask was on the face. Nevertheless, post-hoc tests showed a significant difference in classification confidence across emotion categories for both humans, F(6, 567) = 68.13, p < .001, η² = 0.42, and computer systems, F(6, 203) = 300.06, p < .001, η² = 0.90 (α = 0.0083), when the partial mask was used. Post-hoc two-sample t-tests for the differences in classification confidence between each pair of categories for the partial mask emotional images (by both humans and computer systems) also showed a significant difference for most of the emotion categories (see Tables 5a and 5b).

Fig. 5d presents the confidence obtained by humans and computer systems from categorizing sunglasses images. Here, the confidence achieved by both humans and computer systems decreased in all classes (except for surprise in humans) when there were sunglasses on the face compared with when no covering was added to the face. The artificial system experienced its largest decreases for the neutral, sad, and surprise expressions (all > 28% decrease), whereas for humans the largest decreases were seen for the disgust, fear, and sad expressions (all > 17% decrease). Post-hoc tests showed a significant difference in classification confidence across emotion categories for both humans, F(6, 567) = 82.47, p < .001, η² = 0.47, and computer systems, F(6, 203) = 12.58, p < .001, η² = 0.27 (α = 0.0083), when sunglasses were used as the type of covering. Post-hoc two-sample t-tests for the differences in classification confidence between each pair of categories for the sunglasses emotional images, by both humans and computer systems, showed a significant difference for most of the emotion categories (see Tables 5c and 5d).

Effect | Sum Square | df | F | p | Eta Squared
Intercept | 9.08E+04 | 1.0 | 243.79 | <.001 | 0.05
C(Group) | 5.21E+04 | 1.0 | 139.85 | <.001 | 0.03
C(Covering) | 4.41E+04 | 3.0 | 39.49 | <.001 | 0.07
C(Emotion) | 1.32E+05 | 6.0 | 58.96 | <.001 | 0.02
C(Group):C(Covering) | 2.02E+04 | 3.0 | 18.04 | <.001 | 0.09
C(Group):C(Emotion) | 1.72E+05 | 6.0 | 76.87 | <.001 | 0.01
C(Covering):C(Emotion) | 1.28E+05 | 18.0 | 19.03 | <.001 | 0.06
C(Group):C(Covering):C(Emotion) | 1.94E+05 | 18.0 | 28.86 | <.001 | 0.1
Residual | 1.15E+06 | 3080.0 | - | - | -

(a) ANOVA comparing humans and computer systems

Category | Anger | Disgust | Fear | Happy | Neutral | Sad | Surprise
Anger | - | 0.159 (-0.37) | 0.257 (0.30) | <.001 (-1.67) | <.001 (-1.03) | <.001 (-1.03) | <.001 (-1.21)
Disgust | 0.159 (-0.37) | - | 0.019 (0.63) | <.001 (-1.10) | 0.038 (-0.55) | 0.027 (-0.59) | 0.008 (-0.72)
Fear | 0.257 (0.30) | 0.019 (0.63) | - | <.001 (-1.83) | <.001 (-1.25) | <.001 (-1.24) | <.001 (-1.41)
Happy | <.001 (-1.67) | <.001 (-1.10) | <.001 (-1.83) | - | 0.017 (0.64) | 0.086 (0.45) | 0.129 (0.40)
Neutral | <.001 (-1.03) | 0.038 (-0.55) | <.001 (-1.25) | 0.017 (0.64) | - | 0.727 (-0.09) | 0.429 (-0.21)
Sad | <.001 (-1.03) | 0.027 (-0.59) | <.001 (-1.24) | 0.086 (0.45) | 0.727 (-0.09) | - | 0.718 (-0.09)
Surprise | <.001 (-1.21) | 0.008 (-0.72) | <.001 (-1.41) | 0.129 (0.40) | 0.429 (-0.21) | 0.718 (-0.09) | -

(b) unmasked by humans

Category | Anger | Disgust | Fear | Happy | Neutral | Sad | Surprise
Anger | - | 0.887 (0.04) | 0.036 (-0.55) | <.001 (-1.69) | <.001 (2.93) | <.001 (1.64) | <.001 (-1.16)
Disgust | 0.887 (0.04) | - | 0.122 (-0.40) | 0.003 (-0.82) | <.001 (2.93) | <.001 (1.61) | 0.025 (-0.61)
Fear | 0.036 (-0.55) | 0.122 (-0.40) | - | 0.062 (-0.51) | <.001 (2.96) | <.001 (1.72) | 0.496 (-0.20)
Happy | <.001 (-1.69) | 0.003 (-0.82) | 0.062 (-0.51) | - | <.001 (2.99) | <.001 (1.80) | <.001 (1.00)
Neutral | <.001 (2.93) | <.001 (2.93) | <.001 (2.96) | <.001 (2.99) | - | <.001 (-2.28) | <.001 (-2.97)
Sad | <.001 (1.64) | <.001 (1.61) | <.001 (1.72) | <.001 (1.80) | <.001 (-2.28) | - | <.001 (-1.76)
Surprise | <.001 (-1.16) | 0.025 (-0.61) | 0.496 (-0.20) | <.001 (1.00) | <.001 (-2.97) | <.001 (-1.76) | -

(c) unmasked by computer systems

Category | Anger | Disgust | Fear | Happy | Neutral | Sad | Surprise
Anger | - | 0.049 (0.52) | 0.157 (-0.37) | <.001 (-1.32) | <.001 (-1.50) | 0.365 (0.24) | 0.002 (-0.86)
Disgust | 0.049 (0.52) | - | 0.002 (-0.86) | <.001 (-1.98) | <.001 (-2.10) | 0.334 (-0.25) | <.001 (-1.54)
Fear | 0.157 (-0.37) | 0.002 (-0.86) | - | 0.003 (-0.81) | <.001 (-1.03) | 0.029 (0.58) | 0.171 (-0.36)
Happy | <.001 (-1.32) | <.001 (-1.98) | 0.003 (-0.81) | - | 0.245 (-0.31) | <.001 (1.54) | 0.029 (0.58)
Neutral | <.001 (-1.50) | <.001 (-2.10) | <.001 (-1.03) | 0.245 (-0.31) | - | <.001 (1.70) | 0.002 (0.84)
Sad | 0.365 (0.24) | 0.334 (-0.25) | 0.029 (0.58) | <.001 (1.54) | <.001 (1.70) | - | <.001 (-1.10)
Surprise | 0.002 (-0.86) | <.001 (-1.54) | 0.171 (-0.36) | 0.029 (0.58) | 0.002 (0.84) | <.001 (-1.10) | -

(d) masked by humans

Category | Anger | Disgust | Fear | Happy | Neutral | Sad | Surprise
Anger | - | <.001 (5.84) | <.001 (2.88) | <.001 (6.79) | <.001 (4.78) | <.001 (5.53) | <.001 (4.76)
Disgust | <.001 (5.84) | - | <.001 (-2.70) | <.001 (1.18) | 0.005 (-0.76) | 0.082 (-0.46) | <.001 (-1.81)
Fear | <.001 (2.88) | <.001 (-2.70) | - | <.001 (3.57) | <.001 (1.81) | <.001 (2.37) | <.001 (1.51)
Happy | <.001 (6.79) | <.001 (1.18) | <.001 (3.57) | - | <.001 (-1.63) | <.001 (-1.74) | <.001 (-3.52)
Neutral | <.001 (4.78) | 0.005 (-0.76) | <.001 (1.82) | <.001 (-1.63) | - | 0.137 (0.40) | 0.020 (-0.63)
Sad | <.001 (5.53) | 0.082 (-0.46) | <.001 (2.37) | <.001 (-1.74) | 0.137 (0.40) | - | <.001 (-1.31)
Surprise | <.001 (4.76) | <.001 (-1.81) | <.001 (1.51) | <.001 (-3.52) | 0.020 (-0.63) | <.001 (-1.31) | -

(e) masked by computer systems

Table 4: Statistical test results presenting a) the ANOVA obtained from comparing humans and computer systems based on their confidence in categorizing people's emotions with different types of covering; b) post-hoc (t-test) p-values (and Cohen's d effect sizes) obtained from comparing the classification confidence for unmasked images between emotion categories for humans and c) for computer systems; d) post-hoc (t-test) p-values (and Cohen's d effect sizes) obtained from comparing the classification confidence for masked images between emotion categories for humans and e) for computer systems. Note that the α level in all the post-hoc comparisons has been adjusted to 0.0014 using Bonferroni correction.

Category | Anger | Disgust | Fear | Happy | Neutral | Sad | Surprise
Anger | - | 0.036 (-0.56) | 0.014 (-0.66) | <.001 (-2.70) | <.001 (-1.98) | <.001 (-1.21) | <.001 (-2.34)
Disgust | 0.036 (-0.56) | - | 0.838 (-0.05) | <.001 (-1.76) | <.001 (-1.21) | 0.039 (-0.55) | <.001 (-1.50)
Fear | 0.014 (-0.66) | 0.838 (-0.05) | - | <.001 (-1.88) | <.001 (-1.25) | 0.044 (-0.53) | <.001 (-1.58)
Happy | <.001 (-2.70) | <.001 (-1.76) | <.001 (-1.88) | - | 0.036 (0.56) | <.001 (1.27) | 0.370 (0.24)
Neutral | <.001 (-1.98) | <.001 (-1.21) | <.001 (-1.25) | 0.036 (0.56) | - | 0.010 (0.69) | 0.237 (-0.31)
Sad | <.001 (-1.21) | 0.039 (-0.55) | 0.044 (-0.53) | <.001 (1.27) | 0.010 (0.69) | - | <.001 (-1.00)
Surprise | <.001 (-2.34) | <.001 (-1.50) | <.001 (-1.58) | 0.370 (0.24) | 0.237 (-0.31) | <.001 (-1.00) | -

(a) partial mask by humans

Category | Anger | Disgust | Fear | Happy | Neutral | Sad | Surprise
Anger | - | <.001 (8.02) | <.001 (8.07) | <.001 (-2.49) | 0.613 (0.13) | <.001 (9.91) | <.001 (-2.93)
Disgust | <.001 (8.02) | - | 0.287 (0.28) | <.001 (-6.62) | <.001 (-3.49) | 0.007 (0.73) | <.001 (-6.75)
Fear | <.001 (8.07) | 0.287 (0.28) | - | <.001 (-6.69) | <.001 (-3.57) | 0.265 (0.30) | <.001 (-6.82)
Happy | <.001 (-2.49) | <.001 (-6.62) | <.001 (-6.69) | - | <.001 (1.92) | <.001 (7.06) | <.001 (-0.49)
Neutral | 0.613 (0.13) | <.001 (-3.49) | <.001 (-3.57) | <.001 (1.92) | - | <.001 (3.77) | <.001 (-2.31)
Sad | <.001 (9.91) | 0.007 (0.73) | 0.265 (0.30) | <.001 (7.06) | <.001 (3.77) | - | <.001 (-7.15)
Surprise | <.001 (-2.93) | <.001 (-6.75) | <.001 (-6.82) | <.001 (-0.49) | <.001 (-2.31) | <.001 (-7.15) | -

(b) partial mask by computer systems

Category | Anger | Disgust | Fear | Happy | Neutral | Sad | Surprise
Anger | - | 0.394 (0.22) | <.001 (0.88) | <.001 (-1.85) | <.001 (-0.97) | 0.144 (0.38) | <.001 (-1.51)
Disgust | 0.394 (0.22) | - | 0.021 (0.62) | <.001 (-1.96) | <.001 (-1.14) | 0.619 (0.13) | <.001 (-1.65)
Fear | <.001 (0.88) | 0.021 (0.62) | - | <.001 (-2.76) | <.001 (-1.83) | 0.041 (-0.54) | <.001 (-2.41)
Happy | <.001 (-1.85) | <.001 (-1.96) | <.001 (-2.76) | - | 0.005 (0.77) | <.001 (2.41) | 0.229 (0.32)
Neutral | <.001 (-0.97) | <.001 (-1.14) | <.001 (-1.83) | 0.005 (0.77) | - | <.001 (1.41) | 0.079 (-0.47)
Sad | 0.144 (0.38) | 0.619 (0.13) | 0.041 (-0.54) | <.001 (2.41) | <.001 (1.41) | - | <.001 (-2.02)
Surprise | <.001 (-1.51) | <.001 (-1.65) | <.001 (-2.41) | 0.229 (0.32) | 0.079 (-0.47) | <.001 (-2.02) | -

(c) sunglasses by humans

Category | Anger | Disgust | Fear | Happy | Neutral | Sad | Surprise
Anger | - | 0.548 (0.16) | 0.017 (0.64) | 0.771 (0.08) | <.001 (1.42) | <.001 (1.53) | <.001 (1.43)
Disgust | 0.548 (0.16) | - | 0.202 (0.34) | 0.810 (-0.06) | <.001 (1.18) | <.001 (1.04) | <.001 (1.09)
Fear | 0.017 (0.64) | 0.202 (0.34) | - | 0.127 (-0.41) | <.001 (1.00) | 0.004 (0.78) | 0.002 (0.87)
Happy | 0.771 (0.08) | 0.810 (-0.06) | 0.127 (-0.41) | - | <.001 (1.23) | <.001 (1.10) | <.001 (1.15)
Neutral | <.001 (1.42) | <.001 (1.18) | <.001 (1.00) | <.001 (1.23) | - | 0.080 (-0.47) | 0.333 (-0.26)
Sad | <.001 (1.53) | <.001 (1.04) | 0.004 (0.78) | <.001 (1.10) | 0.080 (-0.47) | - | 0.394 (0.23)
Surprise | <.001 (1.43) | <.001 (1.09) | 0.002 (0.87) | <.001 (1.15) | 0.333 (-0.26) | 0.394 (0.23) | -

(d) sunglasses by computer systems

Table 5: Statistical test results for post-hoc tests: a) p-values (and Cohen's d effect sizes) comparing the classification confidence for partial mask images between emotion categories for humans and b) for computer systems; c) p-values (and Cohen's d effect sizes) comparing the classification confidence for sunglasses images between emotion categories for humans and d) for computer systems. Note that the α level has been adjusted to 0.0014 using Bonferroni correction.