Decoding facial expressions that produce valence ratings with human-like accuracy

THESIS

Presented in Partial Fulfillment of the Requirements for the Degree Master of Arts in the Graduate School of The Ohio State University

By

Nathaniel Blake Haines, B.A.

Graduate Program in Psychology

The Ohio State University

2017

Master's Examination Committee:

Woo-Young Ahn,

Theodore Beauchaine, Advisor

Jennifer Cheavens

 Copyright by

Nathaniel Blake Haines

2017

Abstract

Facial expressions are fundamental to human interaction, including the conveyance of threat, cooperative intent, and internal emotional states. In research settings, facial expressions are typically coded manually by trained human coders; however, advances in computer vision and machine learning (CVML) allow for a more efficient and less labor-intensive alternative. Unfortunately, current CVML implementations achieve only moderate accuracy when rating positive and negative affect intensity, a shortcoming that has limited their adoption. Here, using over 6,000 video recordings from human subjects, we show that CVML models rate positive and negative emotion intensity with human-like accuracy. Additionally, we show that these same models identify theoretically meaningful patterns of facial movement that are strongly associated with human ratings at the individual-subject level. Our results suggest that CVML both provides an efficient method to automate valence intensity coding and rapidly identifies individual differences in facial expression recognition.


Acknowledgments

I thank my committee and Jeffrey Cohn for guidance and comments made on previous versions of this manuscript.


Vita

2011...... Jonathan Alder High School

2015 ...... B.A. Psychology, The Ohio State University

to present ...... Graduate Course Assistant, Department of Psychology, The Ohio State University

Publications

Ahn, W.-Y., Haines, N., & Zhang, L. (2017). Revealing neuro-computational mechanisms of reinforcement learning and decision-making with the hBayesDM package. Computational Psychiatry.

Rogers, A. H., Seager, I., Haines, N., Hahn, H., Aldao, A., & Ahn, W.-Y. (2017). The indirect effect of emotion regulation on minority stress and problematic substance use in lesbian, gay, and bisexual individuals. Frontiers in Psychology.

Fields of Study

Major Field: Psychology


Table of Contents

Abstract ...... ii

Acknowledgments...... iii

Vita ...... iv

List of Tables ...... vi

List of Figures ...... vii

Chapter 1: Introduction ...... 1

Chapter 2: Method ...... 5

Participants ...... 5

Emotion-Evoking Task ...... 5

Exclusion Criteria ...... 8

Manual Coding Procedure ...... 9

Automated Coding Procedure ...... 10

Machine Learning Models ...... 13

Model Fitting Procedures ...... 17

Chapter 3: Results ...... 23

Chapter 4: Discussion ...... 35

References ...... 39


List of Tables

Table 1. Facial Action Units Detected by FACET ...... 12

Table 2. Correlations Between Human- and Computer-generated Emotion Ratings ...... 25


List of Figures

Figure 1. Task Flow for Automated and Manual Coding ...... 7

Figure 2. Machine Learning Pipeline...... 16

Figure 3. Positive and Negative Facial Expression Prediction Accuracy ...... 24

Figure 4. Variance of Emotion Intensity and Model Performance ...... 27

Figure 5. Model Performance Across Ethnicities ...... 29

Figure 6. Facial Actions Associated with Positive and Negative Emotion ...... 31

Figure 7. Facial Expressions Attended by Individual Coders...... 33

Figure 8. Permutation Tests on Number of Ratings Necessary for Inference ...... 34


Chapter 1: Introduction

The ability to effectively communicate emotion is essential to maintain adaptive functioning in modern society. Of all the ways that we communicate emotion, facial expressions are one of the most common, with specific functions including the conveyance of threat (Reed, DeScioli, & Pinker, 2014), cooperative intent (Reed, Zeglen,

& Schmidt, 2012), and internal emotional states (e.g., Ekman, 1992). Not surprisingly, the ability to both produce and recognize facial expressions of emotion is of interest to researchers throughout the social and behavioral sciences.

To measure facial expressions in psychological studies, we often train human coders to manually code facial expressions, where training follows one of two protocols:

1) content coding or 2) anatomical coding. Content coding involves training coders to label the emotion category and/or intensity of facial expressions based on the emotional construct of interest (e.g., Southward & Cheavens, 2017), while anatomical coding protocols require coders to label facial expressions in terms of coordinated, observable facial actions. Content coding guides are typically developed by individual research sites to serve a specific function and therefore lack a standardized protocol. Although flexible for researchers, site-specific protocols make it difficult to validly compare results across studies assessing similar emotional constructs. Conversely, the most widely used anatomical coding protocol is the Facial Action Coding System (FACS; Ekman, Friesen,

& Hager, 2002), which describes facial actions using approximately 30 distinct facial

Action Units (AUs; e.g., AU12 indicates pulling back the lip corners). FACS provides a

more standardized and replicable measurement system than content coding approaches, but the time commitment is much greater; training to become certified takes an average of 50–100 hours, and after training a single minute of video can take expert coders upwards of one hour to rate reliably (Bartlett, Hager, Ekman, & Sejnowski, 1999).

To facilitate faster, more replicable facial expression coding, computer scientists have developed computer vision and machine learning (CVML) models which automatically decode the content of facial expressions. In fact, applications of CVML have allowed researchers to automatically identify FACS-based AUs (e.g., Cohn, 2010), pain severity (e.g., Sikka et al., 2015), depression state (e.g., Dibeklioğlu, Hammal, Yang,

& Cohn, 2015), and the presence of basic emotion categories from facial expressions

(e.g., Beszédeš & Culverhouse, 2007; Kotsia & Pitas, 2007). Notably, once CVML models are validated they can be deployed at larger scales and used to automatically code large facial expression datasets in a matter of hours. Additionally, CVML models can be shared across research sites, unlike human coders. It is clear that CVML holds strong promise for emotion research.

Despite its potential, CVML has only been used in a handful of psychology studies. We see two potential barriers impeding the use of CVML in psychology research. First, almost all work on automated analysis of facial expressions has focused on either discrete AUs or basic, prototypical emotions. Dimensional models of emotion

(e.g., Lang, Bradley, & Cuthbert, 1998; Russell, 1980; Watson & Tellegen, 1985) are of major theoretical interest to psychologists, yet they are largely ignored by CVML approaches. For example, CVML models developed to detect basic emotion categories


(e.g., happiness, sadness, anger, etc.) from facial expressions show impressive classification accuracy of up to 99% when applied to new data (Kotsia & Pitas, 2007), but models of continuous valence intensity show moderate correspondence to human-coded positive (r = 0.58) and negative (r = 0.23) affect (Bailenson et al., 2008;

Soleymani, Asghari-Esfeden, & Fu, 2015). It is imperative that CVML models can code basic emotions as well as valence intensity with near human-like accuracy if we expect them to inform psychological theories and to be widely used in psychological research.

Second, and related to the first point, we are unsure how AUs map on to perceived valence intensity. Because FACS was originally developed and used to better understand facial expressions related to basic emotions (see, Reisenzein, Studtmann, & Horstmann,

2013), there is a lack of research on AUs associated with the valence feature of dimensional theories of emotion. In fact, in our review of the literature, we found no such studies linking specific AUs to the perception of valence intensity. Since FACS-based

AU detection comprises such a large area of automatic facial expression detection research, the lack of empirical data linking AUs to dimensional theories of emotion may explain why CVML research is still dominated by basic emotion detection. If we are able to identify strong associations between AUs and perceived valence intensity, the large body of empirical work on automatic detection of AUs may be easily translated and applied to dimensional theories of emotion. Additionally, if AU identification can be extended to individual human coders, interpretable CVML models may be used to determine which AUs are responsible for individual differences in emotion recognition.

With interest in individual differences in emotion recognition becoming more popular

across clinical (e.g., Rubinow & Post, 1992), personality (e.g., Kahler et al., 2012), and developmental psychology research areas (e.g., Isaacowitz et al., 2007), CVML could potentially be used to gain a more mechanistic understanding of how individuals differentially evaluate facial actions.

Our aims for the current study were two-fold. We aimed to demonstrate that: 1) the intensity of positive and negative facial expressions of emotion can be coded automatically and with human-like accuracy by applying CVML to multivariate patterns of facial AUs, and 2) CVML models can identify specific, theoretically meaningful multivariate patterns of facial movement that are strongly associated with FACS-naïve human coders’ perceptual ratings of emotion valence intensity. Following our second aim, we expand our findings by showing that AU identification can be extended to individual-level human coders and that these results are stable after as few as 60 emotion rating trials. Importantly, the data used to train and validate our CVML models were collected from a validated psychology task and contain over 6,000 video-recorded, evoked facial expressions from 125 human subjects who were unaware of the study goals, thus maximizing the generalizability of our findings. Our findings shed light on the mechanisms of emotion recognition from facial expressions and point the way to real-world applications of large-scale emotional facial expression coding.


Chapter 2: Method

Participants

Video recordings and human coder data were collected as part of a larger study

(Southward & Cheavens, 2017). A total of 151 participants gave informed consent and provided data in the original study. Twenty-six participants were excluded in the current study because they did not meet inclusion criteria (see below). A total of 4,632 (4,649 for single segment analysis; see Method Machine Learning Models for details on recording segmentation) recordings from 125 participants (84 females, aged 18 – 35) were used in all reported analyses. The self-reported ethnicities of participants were as follows:

Caucasian (n = 96), East Asian (n = 14), African-American (n = 5), Latino (n = 3), South

Asian (n = 3), and unspecified (n = 4).

Emotion-Evoking Task

We used a validated emotion-evoking task (Fig. 1) to elicit facial expressions of emotion (e.g., Southward & Cheavens, 2017). Participants were asked to view a total of

42 positive and negative images selected from the International Affective Picture System

(IAPS) to balance valence and arousal, where selections were made based on previously reported college-student norms (Lang, Bradley, & Cuthbert, 1995). Images were presented in 6 blocks of 7 trials each, where each block consisted of all positive or all negative images. For each block, participants were asked to either Enhance, React

Normally, or Suppress their naturally evoked emotional expressions to the images. Block order was randomized across participants. Instructions were given so that each valence was paired once with each condition. These instructions effectively increased the variability in facial expressions within each subject; information on the condition was not used to construct/inform CVML models. All images were presented for 10 seconds, with

4 seconds between each image presentation. Participants’ reactions to the images were video-recorded with a computer webcam (Logitech HD C270).


[Figure 1 image: 6 blocks of 7 positive or negative IAPS images (10 seconds per image, 4 s ITI); recordings feed into automated facial Action Unit decoding and manual emotion intensity coding]

Figure 1. Task Flow for Automated and Manual Coding

Notes. Participants (N=125) viewed a total of 42 images each, broken down into 6 blocks of 7 trials. Pictures were presented for 10 seconds, with a 4 second inter-trial-interval (ITI). Each block of images consisted of either positive or negative image content. In each of the 3 blocks containing positive and negative image content, participants were asked to either Enhance, React Normally, or Suppress their emotional expressions, so that each valence type (i.e., positive and negative) was paired once with each task instruction (i.e., Enhance, React Normally, and Suppress). All images were selected from the International Affective Picture System (Lang et al., 1995). Participants’ reactions to the images were video recorded, and their facial expressions were subsequently rated for positive and negative emotion intensity by a team of trained coders. The same recordings were then analyzed by FACET, a computer vision tool which automatically identifies facial Action Units (AUs).


Exclusion Criteria

We originally collected 6,342 recordings (151 participants × 42 clips). Due to experimenter error, one participant’s clips were not recorded correctly, and another 7 participants were only shown 41 clips, resulting in 6,293 usable clips. Of these, 3 clips were corrupt and could not be viewed, thus 6,290 clips were available and used in the current study. To create models that were generalizable to new samples and representative of human coder ratings, all usable recordings underwent a rigorous manual quality check. We used the following rating system to code for recording quality (the number of clips meeting criteria are in parentheses):

• 0 = Participant’s face was off-screen or occluded for 4 or more seconds, or

participant covered or moved face off-screen in response to emotional stimuli.

(558)

• 1 = Participant’s face was off-screen or occluded for 2 to 4 seconds. (123)

• 2 = Participant’s face was off-screen or occluded for 1 to 2 seconds. (771)

• 3 = Participant’s face was off-screen or occluded for less than 1 second. (4,838)

Quality ratings were coded by a trained research assistant. Recordings with a quality rating of 3 were used for all reported analyses. All recordings were converted from .asf to

.wmv video format to be analyzed by FACET (see Method Automated Coding

Procedure). Due to encoding problems that impeded proper reformatting, an additional

30 clips were excluded. When computing the AUC of each AU, any segments in which the participant’s face was detected for 10% or less of the recording time were excluded

(e.g., Sikka et al., 2015). Because we were interested in within-subject model performance, all reported analyses included only participants who had at least 21 (50%) usable recordings. There were slightly more recordings available for the 1-segment (N = 4,649) analyses than the 3-segment (N = 4,632) analyses due to the within-subject inclusion criteria (e.g., if a face was occluded for two seconds or more in one of the 3 segments, the entire case was excluded for the 3-segment analyses but not the 1-segment analyses).

Manual Coding Procedure

A team of three trained human coders, unaware of participants’ task instructions, viewed and rated each 10 second recording for both positive and negative emotion intensity. The presentation of recordings was randomized for each coder. Ratings were collected on a 7-point Likert scale, from 1 (no emotion) to 7 (extreme emotion). Coders completed an initial training phase during which they rated recordings of pre-selected non-study cases and discussed the facial features that influenced their decisions. The following guide was developed during the initial training phase to aid coders in maintaining high agreement:

1 – Flat mouth, bored eyes throughout most of the clip
2 – Slight smirk/frown + no eye change held for half clip, or eyes widening/eyebrows change + no mouth change held for half clip
3 – Slight smirk/frown + no eye change held for whole clip, or eyes widening/eyebrows change + no mouth change held for whole clip (a.k.a. Mona Lisa smile)
4 – Smirk/frown with some eye change; may be ambiguous but involves whole face; held for half clip
5 – Clear smile/frown with some eye change; mouth may be open for positive emotion
6 – Clear smile/frown with eye change; mouth may be open for negative emotion


7 – Obvious and definite emotion; held for most of the clip. If you would call them bubbly, excited, forlorn, or terrified

The goal of this training (and coding guide) was to ensure that all coders could reliably agree on emotion intensity ratings. In addition, coders participated in once-monthly meetings throughout the coding process to ensure high reliability and reduce coder drift.

Agreement between coders across all usable recordings (6,290 recordings) was high, with intraclass correlation coefficients (ICC(3); McGraw & Wong, 1996) of 0.88 and 0.94 for positive and negative ratings, respectively. The ICC(3) measure reported above indicates the absolute agreement of the average human-coder rating within each condition

(Enhance, React Normally, or Suppress) for each of the 125 participants.

Automated Coding Procedure

Each 10 s facial expression recording was analyzed by FACET (Emotient

Analytics, San Diego, CA), a commercial successor to the open-source Computer

Expression Recognition Toolbox (Littlewort, Whitehill, Wu, & Fasel, 2011). FACET is a computer vision tool that detects Action Units (AUs) which reflect coordinated facial muscle movements as described by the Facial Action Coding System (FACS; Ekman et al., 2002). FACET computes evidence scores for each of 20 AUs at a rate of 30 Hz (i.e.

30 times per second), resulting in evidence time-series for each AU (see Table 1 for a list of AUs and their meanings). Each point in the evidence time-series is a continuous number, which is a direct output from the FACET AU classifier, ranging from about -16 to 16, where more positive and more negative numbers indicate increased and decreased probability for the presence of a given AU, respectively. Each evidence time-series was 10 converted to a point estimate by taking the area under the curve (AUC) of the given AU time-series and dividing the AUC by the total length of time that a face was detected throughout the clip. Dividing by the face-time creates a normalized measure that does not give biased weights to clips of varying quality (e.g., clips where the participant’s face is occasionally not detected). All AUC values were calculated using the auc function in the flux R package (Jurasinski, Koebsch, Guenther, & Beetz, 2014). All AU time-series point estimates were used together as predictor (independent) variables to train the ML models to predict human valence intensity ratings. It took FACET less than 3 days to extract AU time-series data from all recordings (running on a standard 8-core desktop computer).


Action Unit Explanation

1 Inner Brow Raiser

2 Outer Brow Raiser

4 Brow Lowerer

5 Upper Lid Raiser

6 Cheek Raiser

7 Lid Tightener

9 Nose Wrinkler

10 Upper Lip Raiser

12 Lip Corner Puller

14 Dimpler

15 Lip Corner Depressor

17 Chin Raiser

18 Lip Puckerer

20 Lip Stretcher

23 Lip Tightener

24 Lip Pressor

25 Lips Part

26 Jaw Drop

28 Lip Suck

43 Eyes Closed

Table 1. Facial Action Units Detected by FACET

Notes. Pictures and descriptions of all Action Units used in the current study. Images were adapted from https://www.cs.cmu.edu/~face/facs.htm.


Machine Learning Models

To account for the temporal dynamics of facial expressions (Ambadar, Schooler,

& Cohn, 2005) and maximize ML performance in predicting humans’ emotion ratings, we split each AU time-series into three equally-sized 3.33 s segments (Fig. 2). By separating the time-series into 3 segments, the model is able to account for primacy and/or recency effects in how the coders rated participants’ emotions. For example, facial expressions at the very beginning of the video may have a stronger effect on coders’ valence judgements than those in the middle, etc. We used the AUC method described above to summarize each segment into a point estimate, creating a total of 60 (20 AUs ×

3 segments) variables to use as predictors to train ML models. AU scores created this way represent the evidence for and duration of a specific facial expression within each given segment of the recording.
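The sketch below illustrates this feature construction for a single clip, producing the 60-element predictor vector (20 AUs × 3 segments). The simulated evidence matrix, the column names, and the simple trapezoidal AUC helper (used here in place of flux::auc) are assumptions for illustration rather than the study's exact pipeline.

```r
# Sketch: build the 60 predictors (20 AUs x 3 segments) for one simulated clip.
set.seed(2)
fps <- 30; secs <- 10; n_frames <- fps * secs
au_names <- paste0("AU", c(1, 2, 4, 5, 6, 7, 9, 10, 12, 14,
                           15, 17, 18, 20, 23, 24, 25, 26, 28, 43))

# One clip: frames x 20 AU evidence scores (simulated)
evidence <- matrix(rnorm(n_frames * length(au_names)),
                   nrow = n_frames, dimnames = list(NULL, au_names))

trapz_auc <- function(x, y) sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)

segment_scores <- function(evid, fps, n_seg = 3) {
  t_sec  <- seq_len(nrow(evid)) / fps
  seg_id <- cut(t_sec, n_seg, labels = FALSE)        # split the clip into thirds
  out <- sapply(colnames(evid), function(au) {
    sapply(seq_len(n_seg), function(s) {
      idx       <- which(seg_id == s & !is.na(evid[, au]))
      face_time <- length(idx) / fps                 # detected face-time in segment
      trapz_auc(t_sec[idx], evid[idx, au]) / face_time
    })
  })
  # Flatten to a named 60-element vector: one row per clip in the final matrix
  setNames(as.vector(out),
           paste(rep(colnames(evid), each = n_seg), "seg", seq_len(n_seg), sep = "_"))
}

x_row <- segment_scores(evidence, fps)   # 60 predictors for this clip
length(x_row)                            # 60
```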

Next, we used ML models to identify multivariate AU patterns that could predict human ratings of emotion intensity (Fig. 2). We tested two models of different complexity to determine a model that optimized the trade-off between interpretability and prediction accuracy. Here, interpretability refers to how easily the model parameters can be meaningfully interpreted, and prediction accuracy refers to how similar the model- and human-generated emotion intensity ratings are. Specifically, we tested two different ML models in predicting human-coded intensity ratings of positive and negative facial expressions of emotion from AU time-series data:


1) Least Absolute Shrinkage and Selection Operator (LASSO). LASSO is a

penalized regression model which penalizes the sum of the absolute values for

beta weights of all predictors (Tibshirani, 1996); this constraint effectively

shrinks all beta weights toward zero by a constant amount. Unimportant

coefficients (i.e., predictors that do not account for much variance in the

dependent variable) are shrunk to zero, and non-zero coefficients are used for

inference. This procedure reduces the chance of overfitting and simplifies

interpretation of a model with many independent variables, as fewer parameters

are used in the final model.

2) Random Forest (RF). The RF model is an ensemble model (i.e., a model that is

composed of many related but different models) that can fit non-linear trends

(Hastie, Tibshirani, & Friedman, 2009). RFs create many decision trees which are

fitted to different bootstrapped (i.e., sampled with replacement) subsets of the

data, where the resulting model uses the average prediction made by all decision

trees to generate predictions. Each individual decision tree provides a highly

variable (i.e., small changes within the training data have large effects on

predictions), yet unbiased estimate of the function of interest. Thus, averaging

over many trees reduces variance without biasing predictions (Hastie et al., 2009).

In the current study, importance scores for each predictor were estimated using

the increase in node impurity, a measure of the change in residual squared error

(i.e., increases in prediction accuracy) that is attributable to the predictor across


all trees (Hastie et al., 2009). RF importance scores represent the magnitude of the

effect that the predictor has on overall prediction accuracy.

Of the LASSO and RF models, the LASSO is easiest to interpret, but the RF can fit more complex data. The LASSO model coefficients can be interpreted like regression beta weights, which allows us to make inference on the effects of various AUs on the coders’ perceptions of valence intensity. Conversely, the tree structure of the RF can implicitly account for interactions between AUs, but the resulting model is more difficult to interpret.
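For reference, the LASSO estimate described above can be written in its standard form (following Tibshirani, 1996), where $y_i$ is the mean human rating for clip $i$, $x_{ij}$ is the $j$-th of the $p = 60$ AU scores, and $\lambda \ge 0$ is the penalty weight tuned by cross-validation (see Parameter tuning below); this is a standard textbook formulation rather than a quotation from the original sources:

$$
\hat{\beta} \;=\; \underset{\beta_0,\,\beta}{\arg\min}\; \sum_{i=1}^{n}\Bigl(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\,\beta_j\Bigr)^{2} \;+\; \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert .
$$

As $\lambda \rightarrow 0$ the penalty vanishes and the fit approaches ordinary least squares; larger values of $\lambda$ shrink more coefficients exactly to zero.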


[Figure 2 image: AU evidence time-series are partitioned into three segments; a time-normalized Area Under the Curve (AUC) score is computed for each segment, yielding 60 predictors; AU summary scores enter the machine learning models as predictors with human coder ratings as dependent variables; the data are divided into training (66%) and test (34%) sets; two models (LASSO and Random Forest) are fit and out-of-sample predictions are made on the test set]

Figure 2. Machine Learning Pipeline

Notes. Each AU time-series was partitioned into thirds to account for the temporal dynamics of facial expressions. For each of the 3 segments, we computed the normalized (i.e., divided by segment length) Area Under the Curve (AUC), which captures the strength of a given AU for that time-segment. All AUC scores (60 total) were used as independent variables in machine learning models. To compare models, we separated the data into training (3,049 recordings) and test (1,583 recordings) sets. We fit the models to the training-sets, and made out-of-sample predictions on the test-sets. Model performance was assessed by comparing the Pearson’s and intraclass correlations between computer- and human-generated ratings in the test sets.

Model Fitting Procedures

After excluding poor quality recordings, we split the data into training and test sets, fit each model and optimized tuning parameters on the training set using methods specific to each model, and then made predictions on the unseen test set to assess performance.

Outcome Variables. Given the high agreement between human coders across all recordings (see Method Manual Coding Procedure), their mean ratings for each clip were used as outcome (dependent) variables to train the ML models.

Model Performance. Model performance refers to how similar the model- and human-generated valence intensity ratings are. To assess model performance, the 3-segment AU data were separated into a training set (66% of the data; 3,049 recordings) and a test set

(34% of the data; 1,583 recordings). The data were separated randomly with respect to participants so that the training and test data contained 66% and 34% of each participant’s clips, respectively; this separation ensures that training is conducted with all participants, thus creating a more generalizable final model. Separate models were fit to positive and negative human ratings. To achieve robust estimates of out-of-sample prediction accuracy, we permuted the selection of training/test sets 1,000 times (Ahn &

Vassileva, 2016; Ahn, Ramesh, Moeller, & Vassileva, 2016), each time following the fitting procedures outlined above. We used Pearson’s and ICC(1) coefficients to check model performance on training- and test-sets. The Pearson’s correlation measures the

amount of variance in human ratings captured by the model, whereas the ICC(1) measures the absolute agreement between human- and model-predicted ratings.

Therefore, high Pearson’s and ICC(1) coefficients indicate that the model is capturing a large amount of variance in and generating ratings using a similar scale as human coders, respectively. We used ICC(1), as opposed to other ICC methods (see, McGraw & Wong,

1996), because we were interested in absolute agreement across all clips, regardless of condition/participant. One-way models were used to compute ICCs in all cases. An ICC between 0.75 and 1.00 is considered “excellent” and an ICC between 0.60 and 0.74 is considered “good” (Cicchetti, 1994).
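As an illustration of this evaluation step, the sketch below computes a Pearson correlation and a one-way, single-measure ICC between simulated human and model-predicted test-set ratings; the irr package and the simulated data are assumptions, since the text does not state which software computed the ICCs.

```r
# Sketch: compare model-predicted and human test-set ratings with Pearson's r
# and a one-way ICC (absolute agreement). Simulated ratings for illustration.
library(irr)

set.seed(3)
human_test <- runif(1583, min = 1, max = 7)          # mean human ratings (1-7 scale)
model_test <- human_test + rnorm(1583, sd = 0.8)     # stand-in model predictions

pearson_r <- cor(human_test, model_test)

# ICC(1): one-way random effects, single measure, absolute agreement
icc1 <- icc(cbind(human_test, model_test),
            model = "oneway", type = "agreement", unit = "single")

c(pearson = pearson_r, ICC1 = icc1$value)
```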

Model performance across participants of different ethnicities. To test if our largely

Caucasian (n = 93) sample biased predictions made within other ethnic groups (n = 27), we conducted Bayesian independent samples t-tests on within-subject model performance

(i.e. correlation between human-rated and computer-predicted emotion intensity ratings) between Caucasians and other ethnicities. Bayesian analyses were carried out using JASP

(JASP, 2016; Marsman & Wagenmakers, 2016), an open-source toolkit for applying

Bayesian statistics. We used Bayesian methods because they allowed us to interpret evidence in favor of a null effect (i.e., no group differences), which cannot be achieved using traditional frequentist methods (Rouder, Speckman, Sun, Morey, & Iverson, 2009).

We used Bayes Factors to interpret model evidence. Because Bayes Factors are computed using differences between prior and posterior distributions, the choice of prior is important to interpret the end results. We assumed that the differences between groups

were distributed along a Cauchy distribution centered at zero with a width of 0.707; the width represents the interquartile range, so 0.707 translates to a 50% confidence that the true effect size (i.e. the true difference between Caucasians and other ethnicities) lies between -0.707 and 0.707. This particular parameterization of the Cauchy width assigns more prior probability to a zero-effect than a uniform distribution does, thus requiring more posterior evidence in favor of the null hypothesis to generate large Bayes Factors

(Wagenmakers, Morey, & Lee, 2016).
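The analysis itself was run in JASP, but an equivalent sketch in R using the BayesFactor package with the same Cauchy prior width is shown below; the package choice and the simulated within-subject correlations are assumptions for illustration.

```r
# Sketch: Bayesian independent-samples t-test on within-subject model
# performance, Cauchy prior width 0.707. The study used JASP; BayesFactor
# and the simulated correlations are assumptions for illustration.
library(BayesFactor)

set.seed(4)
perf_caucasian <- rnorm(93, mean = 0.80, sd = 0.15)  # within-subject r, group 1
perf_other     <- rnorm(27, mean = 0.80, sd = 0.15)  # within-subject r, group 2

bf10 <- ttestBF(x = perf_caucasian, y = perf_other, rscale = 0.707)
bf01 <- 1 / extractBF(bf10)$bf   # evidence for the null (no group difference)
bf01
```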

Model Inference. Model inference refers to the interpretation of model parameters after the model is fit to the data (analogous to interpreting beta weights in a regression). To make inference on human coders, models were fit to the single segment AU data without a separate test set. Thus, models were fit to a training set containing 100% of the data

(4,649 recordings). We used this method to identify features that were robust across all samples (see, Ahn et al., 2016; Ahn & Vassileva, 2016). Feature importance metrics (beta weights for LASSO and Increase in Node Impurity for RF) were extracted from the models to make inferences on how human coders generated their facial expression ratings. The predictor variables in our models reflect evidence for specific AUs as detected by FACET, and the model parameters estimated for each predictor (AU) can therefore be interpreted as contributions of specific facial expressions to human-generated ratings. In this way, the strength of parameter estimates/importance weights reflects the facial expressions to which human coders were attending when generating their ratings. ICCs were used to compare the agreement between importance profiles for the full (averaged coders) and individual-coder models. Because the importance scores from the RF reflect the importance of given AUs across all decision trees within the RF, we treated RF importance scores as “averaged” values when computing the ICCs. One-way models were used in all cases. To determine the number of trials necessary to identify a stable AU importance profile at the individual coder level, we used ICCs to compare the profiles of each model fit from each individual coder’s full data (i.e., 4,649 recordings per coder) to profiles generated with subsets of the full data (i.e., 30, 50, 100, etc. recordings per coder). Using this method, ICCs of 0 and 1 indicate that the full feature importance profile was not recovered at all or recovered entirely, respectively, using a subset of all possible recordings.

Determining the minimum number of ratings necessary for individual-level inference. We performed permutation tests to find the minimum number of ratings necessary to accurately infer which AUs the coders attended while generating emotion ratings. For each of the 3 coders, we performed the following steps: 1) randomly sample n recordings rated by coder i, 2) fit the RF model to the subset of n recordings/ratings according to the model fitting procedures outlined above, 3) compute the ICC(2) of the extracted RF feature importances (i.e., Increase in Node Impurity) between the permuted model and the model fit to all recordings/ratings (Fig. 7) from coder i, and 4) iterate steps 1-3 twenty times for each value of n (note that different subsets of n recordings/ratings were selected for each of these twenty iterations). We varied n ∈ {30, 40, 50, 60, 70, 80,


90, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 1200, 1400, 1600, 1800, 2000,

2500, 3000}.
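A compact sketch of steps 1–4 for a single coder follows; the simulated data, the reduced number of iterations, and the twoway/agreement ICC specification are assumptions for illustration, while the RF settings mirror those described under Parameter tuning below.

```r
# Sketch of the subsampling procedure for one coder: fit the RF to n randomly
# sampled clips, then compare its AU importance profile to the full-data
# profile with an ICC(2). Simulated data are an assumption for illustration.
library(randomForest)
library(irr)

set.seed(5)
n_clips <- 4649; p <- 20                                  # single-segment AU scores
X <- matrix(rnorm(n_clips * p), ncol = p,
            dimnames = list(NULL, paste0("AU", seq_len(p))))
y <- as.numeric(X %*% rnorm(p)) + rnorm(n_clips)          # stand-in coder ratings

full_fit <- randomForest(x = X, y = y, ntree = 500, mtry = floor(p / 3))
full_imp <- importance(full_fit)[, "IncNodePurity"]       # full-data AU profile

subsample_icc <- function(n) {
  idx <- sample(n_clips, n)                               # step 1: sample n clips
  fit <- randomForest(x = X[idx, ], y = y[idx],           # step 2: refit the RF
                      ntree = 500, mtry = floor(p / 3))
  sub_imp <- importance(fit)[, "IncNodePurity"]
  icc(cbind(full_imp, sub_imp), model = "twoway",         # step 3: ICC(2) between
      type = "agreement", unit = "single")$value          # sub- and full-data profiles
}

# Step 4 (abbreviated): a few iterations per sample size rather than twenty
sapply(c(30, 60, 100, 200), function(n) mean(replicate(3, subsample_icc(n))))
```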

Parameter tuning. The LASSO model contains a single tuning parameter λ (λ ≥ 0), which controls the strength of the penalty and thus the effective degrees of freedom of the model; as λ approaches zero, the model approaches a non-penalized multiple regression model. We tuned λ using 10-fold cross-validation (Kohavi, 1995). General k-fold cross-validation follows these steps: 1) the training data are split into k different folds, 2) k–1 folds are used to fit the model, 3) predictions are made on the left-out fold, 4) an out-of-sample prediction error is calculated on the left-out fold, and 5) steps 2 through 4 are repeated until each of the k folds has been left out. The mean squared error over all k folds is minimized by a grid search over various values of λ (λ ≥ 0). To estimate beta weights, the above cross-validation steps were iterated 1,000 times. Beta weights that survived (i.e., were not shrunk to zero) were recorded for each iteration, and confidence intervals for each beta weight were calculated based on the variation of the beta weight estimate across cross-validation iterations. The above steps are described in extensive detail in a previous study (Ahn et al., 2014). We fit the LASSO model using the easyml R package (Hendricks & Ahn, 2017), which provides a wrapper function for the glmnet R package (Friedman, Hastie, & Tibshirani, 2010).
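A minimal sketch of the 10-fold cross-validation step, calling glmnet directly rather than through the easyml wrapper used in the study, is shown below; the simulated predictor matrix and response are assumptions.

```r
# Sketch: tune the LASSO penalty (lambda) with 10-fold cross-validation and
# count the beta weights that survive shrinkage. Simulated data.
library(glmnet)

set.seed(6)
n <- 3049; p <- 60                                   # training clips x AU scores
X <- matrix(rnorm(n * p), ncol = p)
y <- 0.8 * X[, 1] - 0.5 * X[, 2] + rnorm(n)          # stand-in mean human ratings

cv_fit <- cv.glmnet(X, y, alpha = 1, nfolds = 10)    # alpha = 1 -> LASSO penalty

best_lambda <- cv_fit$lambda.min                     # lambda minimizing CV error
betas <- as.matrix(coef(cv_fit, s = "lambda.min"))   # beta weights at that lambda

c(lambda = best_lambda, surviving = sum(betas[-1, 1] != 0))
```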

The RF model contains 2 tuning parameters, namely: 1) ntrees–the number of decision trees used in the forest, and 2) mtry–the number of predictors to sample from at each decision node (i.e., “split”) in a tree. A grid search over ntrees ∈ {100, 200, 300, … , 1000} showed that out-of-bag prediction accuracy converged by 500 trees for both positive and negative datasets. A grid search over mtry ∈ {1, 2, 3, … , 20} revealed negligible differences in out-of-bag prediction accuracy for values ranging from 5 to 20. Because RFs do not over-fit the data with an increasing number of trees (Hastie, Tibshirani, & Friedman, 2009), we set ntrees = 500 for models presented in all reported analyses to ensure convergence. Because initial grid searches over mtry failed to improve the model, we set mtry heuristically (Hastie et al., 2009) as mtry = p/3, where p represents the number of predictors (i.e., AU scores) in an n × p matrix (n = number of cases) used to train the model. The RF model was fit using the easyml R package (Hendricks & Ahn, 2017), which provides a wrapper function for the randomForest R package (Liaw & Wiener, 2002).
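A corresponding sketch with the randomForest package (again bypassing the easyml wrapper) is given below, using the settings described above; the simulated data and feature names are assumptions.

```r
# Sketch: fit the RF with ntree = 500 and mtry = p/3, then extract
# increase-in-node-impurity importances. Simulated data for illustration.
library(randomForest)

set.seed(7)
n <- 3049; p <- 60
X <- matrix(rnorm(n * p), ncol = p,
            dimnames = list(NULL, paste0("feat", seq_len(p))))
y <- 0.8 * X[, 1] - 0.5 * X[, 2] + rnorm(n)          # stand-in mean human ratings

rf_fit <- randomForest(x = X, y = y, ntree = 500, mtry = floor(p / 3))

# Increase in node impurity per predictor, largest (most important) first
imp <- sort(importance(rf_fit)[, "IncNodePurity"], decreasing = TRUE)
head(imp)
```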


Chapter 3: Results

The RF model showed higher correlations between human- and computer-generated emotion intensity ratings within both the training and test sets compared to the

LASSO model (Table 2), so further results are reported only for the RF model. Fig. 3a shows the performance of the RF model for one representative sample of the 1,000 permutations performed to estimate out-of-sample accuracy, and Fig. 3b shows performance across all 1,000 permutations of training/test sets.


[Figure 3 image: panels (a)–(c) showing predicted versus actual ratings in the training and test sets and distributions of across- and within-subject correlations between computer- and human-generated ratings]

Figure 3. Positive and Negative Facial Expression Prediction Accuracy

Notes. (a) Performance of the random forest (RF) model for positive and negative human coder ratings in the training (3,049 recordings) and test (1,583 recordings) sets, collapsed across subjects. Pearson’s correlations for computer- versus human-generated ratings are superimposed on their respective graphs; correlations are all significant (ps < 0.001). (b) Distributions of across-subject Pearson’s correlations between model-predicted and actual human coder ratings for the RF model. We created 1,000 different splits of the training and test sets, fit a separate model to each training set, and then made predictions on each respective test set (panel (a) shows the results of a single representative permutation). We stored the Pearson’s correlations between predicted and actual ratings for each iteration. The distributions therefore represent uncertainty in prediction accuracy. The means of the distributions (superimposed on respective graphs) are represented by dashed red lines. (c) Distributions of within-subjects Pearson’s correlation coefficients for positive and negative human coder ratings in the training (all 125 subjects) and test (122 subjects for the single permutation shown in panel (a); correlations could not be computed for 3 subjects who had 0 variance in human ratings) sets. The red dashed lines represent the median (i.e., 50th percentile) within-subject Pearson’s correlations for the given distribution.


              Pearson’s Correlation        Intraclass Correlation
              Training       Test          Training       Test
Model         (+)    (–)     (+)    (–)    (+)    (–)     (+)    (–)
LASSO         .84    .65     .84    .65    .83    .59     .83    .58
RF            .89    .77     .90    .75    .88    .72     .89    .73

Table 2. Correlations Between Human- and Computer-generated Emotion Ratings

Notes. (+) = Positive ratings, (–) = Negative Ratings. All ps < .001.


We also checked within-subject model performance by computing correlations between human and model predictions for each subject separately. We found the same trend (Fig. 3c); the RF consistently outperformed the LASSO across training and test sets. While the RF model performed excellently for most participants in the positive

(median r = 0.91 and ICC(1) = 0.82) and negative (median r = 0.74 and ICC(1) = 0.50) emotion test-sets, five participants within the positive and seven participants within the negative emotion test-set yielded negative correlations between human- and computer-generated emotion ratings (Fig. 3c). Further analyses of within-subject model performance across all participants revealed a significant positive association between within-subject (log) standard deviation (SD) in human- and computer-generated ratings and within-subject model performance (i.e., predicted versus actual rating correlations;

Fig. 4). This relation was found in positive and negative training and test sets (all rs ≥

0.52, ps < 0.001), suggesting that the RF model performs better as subjects express a wider range of emotional intensity.


[Figure 4 image: within-subject correlations plotted against the log of within-subject rating SD; panel (a) human ratings (r = 0.52, 0.72, 0.52, 0.56), panel (b) computer ratings (r = 0.55, 0.73, 0.61, 0.62), with n = 122–125 per panel]

Figure 4. Variance of Emotion Intensity and Model Performance

Notes. (a) Pearson’s correlations between within-subject model performance (Pearson’s r; see Fig. 3c) and the logarithm of within-subject human rating standard deviation (SD). Human-rated SDs were computed as the logarithm of the SD of human coders’ ratings across a given participant’s recordings. Cases with zero variance in human ratings (i.e., all ratings were “1”) are excluded from this analysis. Correlations and the number of subjects included in each comparison are superimposed on their respective graphs. All correlations are significant (ps < 0.001). (b) Pearson’s correlations between within-subject model performance (see Fig. 3c) and the logarithm of within-subject computer rating standard deviation. Computer-rated SDs were computed in the same way as human-rated SDs, but the model estimates were used in place of the true human ratings. All correlations are significant (ps < 0.001).


Bayesian t-tests comparing model performance between participants identifying as Caucasian (n = 93) versus other ethnicities (n = 27) revealed that our largely Caucasian sample likely did not bias predictions made within other ethnicities (Fig. 5). We found moderate evidence in favor of the null hypothesis in both training (Bayes Factors of 4.3 and 4.4 for positive and negative ratings, respectively) and test sets (Bayes Factors of 3.5 and 4.3 for positive and negative ratings, respectively). The Bayes Factors suggest that the differences in model performance we found between Caucasians and other ethnicities are about 3.5-4.4 times more likely under a model assuming no group differences than one assuming differences. Taken together, these results suggest that our largely

Caucasian sample did not substantially bias computer-rated emotion intensities made within different ethnic groups.


[Figure 5 image: prior and posterior distributions for positive and negative ratings within the training and test sets]

Figure 5. Model Performance Across Ethnicities

Notes. Graphical depictions of the Bayesian independent t-tests conducted to compare within-subject model (RF, 3-segment) performance between Caucasians (n = 93) and other ethnicities (n = 27). Prior (dotted lines) and posterior (solid lines) distributions are shown for positive and negative ratings and within both training and test sets. Bayes Factors are superimposed on each comparison, along with a pie chart representing the relative evidence of the alternative (BF10; data | H1) and null (BF01; data | H0) hypotheses. The Bayes Factor is computed as the ratio of the probability densities of the prior and posterior distributions at the point of comparison (i.e., 0 in our comparison). Results show moderate evidence in favor of the null hypothesis (i.e., no difference between groups), indicating that within-subject model performance was not substantially biased by our largely Caucasian sample.


To identify which facial expressions human coders used to generate positive and negative emotion ratings, we examined the importance of all AUs in predicting human emotion ratings. Importance and beta weights extracted from the RF and LASSO models, respectively, were largely in agreement (Fig. 6). Note that importance weights for the RF do not indicate directional effects, but instead reflect the relative importance of a given

AU in predicting positive/negative emotion intensity. Both the RF and LASSO identified

AUs 12 (Lip Corner Puller), 14 (Dimpler), and 18 (Lip Puckerer) as important predictors of positive emotion ratings, and AUs 4 (Brow Lowerer) and 10 (Upper Lip Raiser) as important predictors of negative emotion ratings. Additionally, the RF model identified

AU12 (Lip Corner Pull), AU6 (Cheek Raiser), and AU25 (Lips Part) as 3 of the 4 most important AUs for predicting positive emotion. The validity of the RF model is further supported by the fact that it identified both AU6 and AU12 as strong predictors of positive ratings, which was not the case with the LASSO. However, the LASSO revealed directional effects which are not detectable with the RF. AU 14 (Dimpler) was identified as a negative predictor of positive intensity ratings. The importance of AUs for positive and negative emotion ratings were largely independent. In fact, when the ICC(2) is computed by treating positive and negative importance weights for each AU as averaged ratings from two “coders”, the ICC(2) is negative and non-significant (ICC(2) = -0.11, p

= 0.58), which would only be expected if different facial expressions were important for the coders to rate positive versus negative valence.


[Figure 6 image: (a) RF increase in node impurity and (b) LASSO mean beta weights for each Action Unit, shown separately for positive and negative ratings]

Figure 6. Facial Actions Associated with Positive and Negative Emotion

Notes. (a) Feature importances extracted from the RF model for positive and negative human coder ratings. Increase in node impurity is a measure of the change in residual squared error (i.e., increase in prediction accuracy) that is attributable to the predictor across all trees in the RF. While it does not capture directional effects, the increase in node impurity can capture interactions between predictors, which can affect a predictor’s effect on prediction accuracy. (b) Feature importances extracted from the LASSO model for positive and negative human coder ratings. The beta weights represent individual contributions of the respective AU to positive and negative ratings. Red and blue beta weights indicate features which are positively and negatively predictive of the respective emotion. Visual depictions of the 5 strongest AUs for predicting positive and negative ratings are shown on the RF and LASSO graphs (see Table 1 for a complete list of AUs).


To reveal potential individual differences among human coders, we also fit the RF model to each human coder’s ratings separately (Fig. 7). Although some differences were found between coders, they showed similarly-ordered importance profiles, indicating that they attended to similar AUs while generating emotion ratings. Agreement between all three individual coders’ importance profiles supported this claim, where ICC(3)s were high for both positive (ICC(3) = 0.95) and negative (ICC(3) = 0.93) importance profiles.

Of note, importance scores for positive ratings across all coders were largely clustered around the strongest AUs (i.e., AUs 6, 12, and 18), and scores quickly dropped off in strength past these AUs. In contrast, importance scores for negative ratings across coders were spread out across all AUs, suggesting that coders were attending to a wider variety of AUs while generating negative, in comparison to positive, ratings. Permutation tests on the number of ratings necessary for individual-level inference (Fig. 8) revealed that for positive ratings, ICC(2)s for all 3 coders reached 0.75 (regarded as “excellent” agreement; see Cicchetti, 1994) after 60 ratings. For negative ratings, ICC(2)s for all 3 coders reached 0.75 after 200 ratings. These results suggest that the facial expressions attended by coders as they rate valence intensity may be reliably estimated by collecting

200 ratings.


[Figure 7 image: RF increase in node impurity for each Action Unit, shown for Coders 1–3 and separately for positive and negative ratings]

Figure 7. Facial Expressions Attended by Individual Coders

Notes. Feature importances extracted from the RF model fit to individual coders using the single segment data (4,649 recordings). Coders all show similarly ordered importance profiles, suggesting that they attended to similar facial expressions while generating emotion ratings. Note that positive importance estimates are condensed into few predictors (i.e., AUs 6, 12, and 18), while negative importance measures are more spread out throughout all predictors. Also, Coder 3 shows higher importance estimates for most AUs compared to Coders 1 and 2, suggesting that they coded facial expressions more consistently (with respect to the detected AUs) than the others. Agreement between all three individual coders’ importance profiles was high, with ICC(3)s of 0.95 and 0.93 for positive and negative ratings, respectively.


Figure 8. Permutation Tests on Number of Ratings Necessary for Inference

Notes. Grid searches over the number of recordings/ratings necessary to achieve reliable estimates of AU importances for each valence-coder pair (coders are labelled 1, 2, and 3 and appear in the same order as in Fig. 7). Reliability is indexed by the ICC(2) between AU importance profiles (i.e., Increase in Node Impurity) extracted from the model fit to all the recordings that coders rated versus the model fit to subsets of recordings that they rated. The RF model was fit to each sample of size n along the x-axis, AU importance profiles were extracted from the model, and ICC(2)s were then calculated between the given sample and full-data AU importance profile scores. We iterated this procedure 20 times within each sample size to estimate the variation in estimates across recordings. Shading reflects 2 standard errors from the mean ICC within each sample size across all 20 iterations. The red dashed line indicates an ICC(2) of 0.75, which is considered “excellent”. For positive ratings, the ICC(2) reached 0.75 after only 60 recordings/ratings for each coder. For negative ratings, coders 1 and 3 reached an ICC(2) of 0.75 by 100 recordings/ratings, whereas the remaining coder reached an ICC(2) of 0.75 by 200 recordings/ratings.


Chapter 4: Discussion

In the current study, we showed that CVML can achieve human-like performance when rating video-recorded facial expressions for positive and negative affect intensity.

Our results provide support for the use of CVML as a valid, efficient alternative to the human coders often used in facial expression studies, and with further validation we expect these findings to expand the possibilities of future facial expression research.

With CVML, we could dramatically reduce the time it takes to generate emotion ratings; it took less than 3 days to automatically extract AUs from 6,260 video recordings and train ML models to generate valence intensity ratings (using a standard desktop computer), whereas it took six months to train three human coders and then have them rate the video clips. Note that five human coders previously spent one month training and coding the same dataset, but their data had to be discarded due to low reliabilities

(ICC(3)s of .33 and .37 for positive and negative ratings, respectively). This underscores the necessity of time-intensive training in teaching human coders to reliably rate facial expressions of emotion.

In addition to excellent model performance, we have also shown that ML model parameters can be used to identify the facial actions (AUs) attended to by human coders while generating valence intensity ratings. Specifically, the RF model identified AU12

(Lip Corner Pull), AU6 (Cheek Raiser), and AU25 (Lips Part) for positive emotion ratings; together these AUs represent the core components of a genuine smile (Ekman et al., 2002; Korb, With, Niedenthal, Kaiser, & Grandjean, 2014). Note that AU12 (Lip

Corner Pull) and AU6 (Cheek Raiser) interact to portray a Duchenne Smile, which can

indicate genuine happiness (Ekman, Davidson, & Friesen, 1990), and previous research has demonstrated that accurate observer-coded enjoyment ratings rely on AU6 (Frank,

Ekman, & Friesen, 1993). Additionally, the LASSO model identified AU14 as a strong negative predictor of positive ratings. This finding is consistent with empirical evidence demonstrating that activation of AU14 (Dimpler) is observed when people mask smiles while embarrassed (i.e., it is a smile control; Keltner, 1995), and it suggests that CVML is sensitive enough to pick up on subtle differences in facial expressions that affect coders’ perceptions of valence intensity. Together, the AUs identified by the RF and LASSO models suggest that positive and negative facial expressions occupy separate dimensions

(Belsky, Hsieh, & Crnic, 1996; Miyamoto, Uchida, & Ellsworth, 2010). For example, the five most important predictors extracted from positive and negative RF models had no overlap, indicating that different patterns of facial expressions were used by coders to generate positive versus negative ratings. This is the first study to identify specific AUs associated with the perception of both positive and negative affect, thus providing a bridge between methods used to study basic and dimensional theories of emotions using emotional facial expressions.

The models used in the current study predicted positive emotion intensity with greater accuracy than negative emotion intensity, a common finding among related studies (Bailenson et al., 2008; Beszédeš & Culverhouse, 2007). These results may be due to the larger number of discrete facial actions associated with negative compared to positive emotional expressions. Supporting this claim, we found that importance scores for negative, but not positive, emotion ratings were spread across many different AUs (Fig.


6a). This evidence suggests that a wider range of facial expressions were used by coders when generating negative rather than positive emotion ratings. Future studies may address this by using CVML models which can detect more than the 20 AUs used in the current study.

Our interpretation of the computer-vision coded AUs in this study is potentially limited because we did not compare the reliability of AU detection between FACET and human FACS experts. Additionally, FACET only detects 20 of the approximately 30

AUs described by FACS, so it is possible that there were other important AUs that the coders used when generating valence ratings that we were unable to capture. However, our models showed excellent out-of-sample prediction accuracy, and we identified theoretically meaningful patterns of AUs for positive and negative emotion intensity that are consistent with prior studies (e.g., components of the Duchenne Smile were important for making positive ratings). It is unlikely that we would achieve these results if FACET did not reliably detect similar, important AUs which represented the intensity of positive and negative facial expressions produced by our 125 participants. Lastly, as computer vision advances, we expect that more AUs will be easy to detect. CVML provides a scalable methodology which can be re-applied to previously collected facial expression recordings as technology progresses.

While the current study investigated emotion valence intensity, our methodology could be easily extended to identify individual differences in recognition of other emotional constructs (e.g. arousal). We showed that AU identification can be extended to individual human coders, and that these results become stable after only 60 and 200

ratings for positive and negative valences, respectively (see Fig. 8). Because the clips in our task were 10 seconds long and coders rated positive/negative emotion intensity after each recording, the task used in the current study could be condensed to about 30 minutes and still reveal individual differences in the AUs that coders attend to while generating valence intensity ratings. These results are in line with growing trends in emotion research, where focus is shifting from group comparisons of discrete emotion categories to more fine-grained characterizations of individual-level emotion intensities (Chang,

Gianaros, Manuck, Krishnan, & Wager, 2015). The ability to identify individual differences in facial expression recognition has implications for various fields studying human behavior, including: human-computer interaction (e.g., Cowie et al., 2001), neuroscience (e.g., Srinivasan, Golomb, & Martinez, 2016), anthropology (e.g., Crivelli,

Russell, Jarillo, & Fernández-Dols, 2016), and basic psychopathology. The opportunities are particularly pronounced for psychopathology research, where deficits and/or biases in recognizing facial expressions of emotion are associated with a number of psychiatric disorders, including autism, alcoholism, and depression (e.g., Celani, Battacchi, &

Arcidiacono, 1999; Philippot et al., 1999; Rubinow & Post, 1992). Our methodology provides a framework through which both normal and abnormal emotion recognition can be studied efficiently and mechanistically. This line of research could potentially identify the facial expressions that are most associated with emotion recognition deficits, which could lead to rapid and cost-efficient markers of emotion recognition in psychopathology

(see, Ahn & Busemeyer, 2016).


References

Ahn, W.-Y., & Busemeyer, J. R. (2016). Challenges and promises for translating computational tools into clinical practice. Current Opinion in Behavioral Sciences, 11, 1–7. http://doi.org/10.1016/j.cobeha.2016.02.001

Ahn, W.-Y., Kishida, K. T., Gu, X., Lohrenz, T., Harvey, A., Alford, J. R., et al. (2014). Nonpolitical images evoke neural predictors of political ideology. Current Biology, 24(22), 2693–2699. http://doi.org/10.1016/j.cub.2014.09.050

Ahn, W.-Y., Ramesh, D., Moeller, F. G., & Vassileva, J. (2016). Utility of machine-learning approaches to identify behavioral markers for substance use disorders: Impulsivity dimensions as predictors of current cocaine dependence. Frontiers in Psychiatry, 7(11), 290. http://doi.org/10.3389/fpsyt.2016.00034

Ahn, W.-Y., & Vassileva, J. (2016). Machine-learning identifies substance-specific behavioral markers for opiate and stimulant dependence. Drug and Alcohol Dependence, 161, 247–257. http://doi.org/10.1016/j.drugalcdep.2016.02.008

Ambadar, Z., Schooler, J. W., & Cohn, J. F. (2005). Deciphering the enigmatic face. Psychological Science, 16(5), 403–410. http://doi.org/10.1111/j.0956-7976.2005.01548.x

Bailenson, J. N., Pontikakis, E. D., Mauss, I. B., Gross, J. J., Jabon, M. E., Hutcherson, C. A. C., et al. (2008). Real-time classification of evoked emotions using facial feature tracking and physiological responses. International Journal of Human-Computer Studies, 66(5), 303–317. http://doi.org/10.1016/j.ijhcs.2007.10.011

Bartlett, M. S., Hager, J. C., Ekman, P., & Sejnowski, T. J. (1999). Measuring facial expressions by computer image analysis. Psychophysiology, 36(2), 253–263.

Belsky, J., Hsieh, K.-H., & Crnic, K. (1996). Infant positive and negative :

One dimension or two? Developmental Psychology, 32(2), 289–298.

http://doi.org/10.1037/0012-1649.32.2.289

Beszédeš, M., & Culverhouse, P. (2007). Facial emotions and emotion intensity levels

classification and classification evaluation. British Machine Vision ….

Celani, G., Battacchi, M. W., & Arcidiacono, L. (1999). The Understanding of the

Emotional Meaning of Facial Expressions in People with Autism. Journal of Autism

and Developmental Disorders, 29(1), 57–66.

http://doi.org/10.1023/A:1025970600181

Chang, L. J., Gianaros, P. J., Manuck, S. B., Krishnan, A., & Wager, T. D. (2015). A

Sensitive and Specific Neural Signature for Picture-Induced Negative Affect. PLoS

Biol, 13(6), e1002180. http://doi.org/10.1371/journal.pbio.1002180

Cicchetti, D. V. (1994). Guidelines, criteria, and rules of thumb for evaluating normed

and standardized assessment instruments in psychology. Psychological Assessment,

6(4), 284–290. http://doi.org/10.1037/1040-3590.6.4.284

Cohn, J. F. (2010). Advances in behavioral science using automated facial image analysis

and synthesis [social sciences]. IEEE Signal Processing Magazine.

Cowie, R., Douglas-Cowie, E., Tsapatsoulis, N., Votsis, G., Kollias, S., Fellenz, W., &

Taylor, J. G. (2001). Emotion recognition in human-computer interaction. IEEE

Signal Processing Magazine, 18(1), 32–80. http://doi.org/10.1109/79.911197

Crivelli, C., Russell, J. A., Jarillo, S., & Fernández-Dols, J.-M. (2016). The gasping

40

face as a threat display in a Melanesian society. Proceedings of the National

Academy of Sciences, 113(44), 12403–12407.

http://doi.org/10.1073/pnas.1611622113

Ekman, P., Davidson, R. J., & Friesen, W. V. (1990). The Duchenne smile: Emotional

expression and brain physiology: II. Journal of Personality and Social Psychology,

58(2), 342–353. http://doi.org/10.1037/0022-3514.58.2.342

Ekman, P., Friesen, W., & Hager, J. C. (2002). Facial action coding system (2nd ed.).

Salt Lake City, UT.

Frank, M. G., Ekman, P., & Friesen, W. V. (1993). Behavioral markers and

recognizability of the smile of enjoyment. Journal of Personality and Social

Psychology, 64(1), 83–93. http://doi.org/10.1037/0022-3514.64.1.83

Friedman, J., Hastie, T., & Tibshirani, R. (2010). Regularization Paths for Generalized

Linear Models via Coordinate Descent. Journal of Statistical Software, 33(1), 1–968.

http://doi.org/10.1109/TPAMI.2005.127

Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning.

New York, NY: Springer New York. http://doi.org/10.1007/978-0-387-84858-7

Hendricks, P., & Ahn, W.-Y. (2017). Easyml: Easily Build And Evaluate Machine

Learning Models. bioRxiv, 137240. http://doi.org/10.1101/137240

Isaacowitz, D. M., Löckenhoff, C. E., Lane, R. D., Wright, R., Sechrest, L., Riedel, R., &

Costa, P. T. (2007). Age differences in recognition of emotion in lexical stimuli and

facial expressions. Psychology and …, 22(1), 147–159. http://doi.org/10.1037/0882-

7974.22.1.147

41

JASP Team (2017). JASP (Version 0.8.1.2)[Computer software].

Jurasinski, G., Koebsch, F., Guenther, A., & Beetz, S. (2014). flux: Flux rate calculation

from dynamic closed chamber measurements. Retrieved from https://CRAN.R-

project.org/package=flux

Kahler, C. W., Kathryn McHugh, R., Leventhal, A. M., Colby, S. M., Gwaltney, C. J., &

Monti, P. M. (2012). High among smokers predicts slower recognition of

positive facial emotion. Personality and Individual Differences, 52(3), 444–448.

http://doi.org/10.1016/j.paid.2011.11.009

Keltner, D. (1995). Signs of appeasement: Evidence for the distinct displays of

embarrassment, amusement, and shame. Journal of Personality and Social

Psychology, 68(3), 441–454. http://doi.org/10.1037/0022-3514.68.3.441

Kohavi, R. (1995). A study of cross-validation and bootstrap for accuracy estimation and

model selection. Ijcai.

Korb, S., With, S., Niedenthal, P., Kaiser, S., & Grandjean, D. (2014). The Perception

and Mimicry of Facial Movements Predict Judgments of Smile Authenticity. PLoS

ONE, 9(6), e99194. http://doi.org/10.1371/journal.pone.0099194

Kotsia, I., & Pitas, I. (2007). Facial Expression Recognition in Image Sequences Using

Geometric Deformation Features and Support Vector Machines. IEEE Transactions

on Image Processing, 16(1), 172–187. http://doi.org/10.1109/TIP.2006.884954

Lang, P. J., Bradley, M. M., & Cuthbert, B. N. (1995). International Affective Picture

System (IAPS).

Lang, P. J., Bradley, M. M., & Cuthbert, B. N. (1998). Emotion, motivation, and :

42

brain mechanisms and psychophysiology. Biological Psychiatry, 44(12), 1248–1263.

http://doi.org/10.1016/S0006-3223(98)00275-3

Liaw, A., & Wiener, M. (2002). Classification and regression by randomForest. R News.

Littlewort, G., Whitehill, J., Wu, T., & Fasel, I. (2011). The computer expression

recognition toolbox (CERT). Automatic Face & ….

http://doi.org/10.1109/FG.2011.5771414

Marsman, M., & Wagenmakers, E.-J. (2016). Bayesian benefits with JASP. European

Journal of Developmental Psychology, 4, 1–11.

http://doi.org/10.1080/17405629.2016.1259614

McGraw, K. O., & Wong, S. P. (1996). Forming inferences about some intraclass

correlation coefficients. Psychological Methods, 1(1), 30–46.

http://doi.org/10.1037/1082-989X.1.1.30

Miyamoto, Y., Uchida, Y., & Ellsworth, P. C. (2010). Culture and mixed emotions: Co-

occurrence of positive and negative emotions in Japan and the United States.

Emotion, 10(3), 404–415. http://doi.org/10.1037/a0018430

Philippot, P., Kornreich, C., Blairy, S., Baert, I., Dulk, A. D., Bon, O. L., et al. (1999).

Alcoholics’Deficits in the Decoding of Emotional Facial Expression. Alcoholism,

Clinical and Experimental Research, 23(6), 1031–1038.

http://doi.org/10.1111/j.1530-0277.1999.tb04221.x

Reed, L. I., Zeglen, K. N., & Schmidt, K. L. (2012). Facial expressions as honest signals

of cooperative intent in a one-shot anonymous Prisoner's Dilemma game. Evolution

and Human Behavior, 33(3), 200–209.

43

http://doi.org/10.1016/j.evolhumbehav.2011.09.003

Reisenzein, R., Studtmann, M., & Horstmann, G. (2013). Coherence between Emotion

and Facial Expression: Evidence from Laboratory Experiments. Emotion Review,

5(1), 16–23. http://doi.org/10.1177/1754073912457228

Rouder, J. N., Speckman, P. L., Sun, D., Morey, R. D., & Iverson, G. (2009). Bayesian t

tests for accepting and rejecting the null hypothesis. Psychonomic Bulletin & Review,

16(2), 225–237. http://doi.org/10.3758/PBR.16.2.225

Rubinow, D. R., & Post, R. M. (1992). Impaired recognition of affect in facial expression

in depressed patients. Biological Psychiatry, 31(9), 947–953.

http://doi.org/10.1016/0006-3223(92)90120-O

Russell, J. A. (1980). A circumplex model of affect. Journal of Personality and Social

Psychology.

Sikka, K., Sikka, K., Ahmed, A. A., Diaz, D., Diaz, D., Goodwin, M. S., et al. (2015).

Automated Assessment of Children's Postoperative Pain Using Computer Vision.

Pediatrics, 136(1), e124–e131. http://doi.org/10.1542/peds.2015-0029

Soleymani, M., Asghari-Esfeden, S., & Fu, Y. (2015). Analysis of EEG Signals and

Facial Expressions for Continuous Emotion Detection - IEEE Xplore Document.

http://doi.org/10.1109/TAFFC.2015.2436926

Southward, M. W., & Cheavens, J. S. (2017). Assessing the Relation Between Flexibility

in Emotional Expression and Symptoms of Anxiety and Depression: The Roles of

Context Sensitivity and Feedback Sensitivity. Journal of Social and Clinical …,

36(2), 142–157. http://doi.org/10.1521/jscp.2017.36.2.142

44

Srinivasan, R., Golomb, J. D., & Martinez, A. M. (2016). A Neural Basis of Facial

Action Recognition in Humans. Journal of Neuroscience, 36(16), 4434–4442.

http://doi.org/10.1523/JNEUROSCI.1704-15.2016

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the

Royal Statistical Society: Series B ( …. http://doi.org/10.2307/2346178

Wagenmakers, E.-J., Morey, R. D., & Lee, M. D. (2016). Bayesian Benefits for the

Pragmatic Researcher. Current Directions in Psychological Science, 25(3), 169–176.

http://doi.org/10.1177/0963721416643289

Watson, D., & Tellegen, A. (1985). Toward a consensual structure of mood.

Psychological Bulletin.

45