Faculty of Psychology and Educational Sciences

Beyond our means? A multivariate perspective on implicit measures

Tom Everaert

Master dissertation submitted to obtain the degree of Master of Statistical Data Analysis

Promotor: Prof. Dr. Jan De Neve
Dept. of Data Analysis
Academic year 2015-2016

The author and the promoter give permission to consult this master dissertation and to copy it or parts of it for personal use. Each other use falls under the restrictions of the copyright, in particular concerning the obligation to mention explicitly the source when using results of this master dissertation.

Tom Everaert September 5, 2016

Foreword

In this master dissertation, a study is described that was conducted as part of my post-doctoral research at the Faculty of Psychology and Educational Sciences of Ghent University. Because research in social psychology is often performed in a quick and dirty fashion, peculiarities in the data are frequently overlooked. I therefore set out to demonstrate some methods that would allow for easy and insightful overviews of the complex data that are often associated with psychological studies. I approached my promotor, Prof. Dr. Jan De Neve, with this research question and he agreed to guide me for this master dissertation.

All data were gathered by myself and were obtained in accordance with the ethical guidelines of Ghent University and the Faculty of Psychology and Educational Sciences. The data were anonymized and stored on the server of the Learning and Implicit Processes Lab of the Department of Experimental-Clinical and Health Psychology according to the data management plan of the faculty.

Table of Contents

Introduction
  1.1. Implicit measures
  1.2. A multivariate perspective
  1.3. Principal component analysis
  1.4. Multidimensional scaling
Study 1
  2.1. Method
    2.1.1. Participants
    2.1.2. Materials
    2.1.3. Procedure
  2.2. Data analysis
    2.2.1. Conventional analyses
    2.2.2. Principal component analysis
    2.2.3. Multidimensional unfolding
  2.3. Results
    2.3.1. Conventional analyses
    2.3.2. Principal component analysis
    2.3.3. Multidimensional unfolding
  2.4. Discussion
Study 2
  3.1. Method
    3.1.1. Participants and materials
    3.1.2. Procedure
  3.2. Data analysis
    3.2.1. Conventional analyses
    3.2.2. Principal component analysis
    3.2.3. Multidimensional unfolding
  3.3. Results
    3.3.1. Conventional analyses
    3.3.2. Principal component analysis
    3.3.3. Multidimensional unfolding
  3.4. Discussion
General Discussion
  4.1. Findings
  4.2. PCA or MDS
  4.3. Other techniques
    4.3.1. Non-linear decompositions of proportions
    4.3.2. Correspondence analysis
    4.3.3. ...
  4.5. Other measures: multi-way data
  4.5. Conclusions
References
Appendix
  6.1. Checking assumptions different age AMP scores across race
  6.2. Correspondence analysis

Abstract

The liking of a person, stimulus, or entity is an essential predictor of behavior towards it. Conventional measures of liking through self-report, however, are often biased because of the tendency to give socially desirable answers or the failure to remember one’s likings. New measures have therefore been developed that are sensitive to somebody’s automatic evaluations and are purported to be free of the biases mentioned above. These implicit measures generally assess automatic evaluations through their effects on performance on an unrelated task. The data obtained from these measures are rather complex compared to simple self-report and require some calculation to obtain a score for an automatic evaluation. In a research tradition where quick and dirty analyses are often preferred in the interest of fast publication, researchers often proceed blindly with calculating such scores instead of carefully scrutinizing the data first.

The current report provides a demonstration of exploratory techniques that can be used to yield quick but insightful summaries of implicit measures data. As such data are multivariate in nature, multivariate decomposition techniques such as principal component analysis (PCA) and multidimensional scaling (MDS) can be applied to scan the data for anomalies and peculiar patterns. Two studies are reported in which PCA and MDS are applied to implicit measures data. Components and dimensions emerged that seemed to align with different automatic evaluations for black faces compared to white faces, and for old faces compared to young faces. Inspection of such components and dimensions revealed atypical stimuli and participants that could be taken into account in subsequent analyses. The advantages and disadvantages of different techniques are discussed, as well as their possible applications to complex multi-way data.


Introduction

1.1. Implicit measures

The prediction of human behavior is one of the core goals of psychological science. To that end, many psychological processes have been studied with regard to their potency for prediction. One such process that consistently emerged as an important predictor is evaluation. Since somebody’s evaluation of an object, person, or entity reflects someone’s liking of it, it should come as no surprise that it is a powerful predictor of behavior (Allport, 1935; Martin & Levy, 1978). We tend to purchase products and approach persons we like whereas we tend to ignore products and avoid persons we dislike.

The possible predictive power of evaluations makes their correct assessment critical. Traditional approaches to measuring them relied heavily on conscious self-report through questionnaires or interviews. Such forms of measurement have proven fruitful in the past but do not guarantee a reliable report of evaluations. There are two major impediments to the accurate assessment of evaluations through self-report. First, one might be reluctant to report some evaluations due to cultural sensitivities, peer pressure, or a strong appreciation for privacy. Second, one might not be consciously aware of some evaluations, making it impossible to report them. Evaluations of other races or minorities are often biased because of such tendencies. For instance, people might not want to admit to harboring racist attitudes and therefore not report them. Furthermore, people might honestly deny having such attitudes while harboring them at a subconscious level.

The problems associated with self-report prompted a search for measures that are impervious to the aforementioned biases. Several decades worth of research spawned a host of so-called implicit measures designed to measure one’s automatic evaluation of a stimulus of interest (e.g. a picture of a black face). These measures generally operate by assessing the effect of presenting the stimulus of interest on the performance on an unrelated task. The current report investigates one such implicit measure, the affect misattribution procedure (AMP), which has steadily gained in popularity since its inception (Payne, Cheng, Govorun, & Stewart, 2005).

Figure 1. Illustration of two possible trials in the AMP.

The AMP is a fairly straightforward implicit measure that operates through a rather simple computer task. Participants are shown several Chinese symbols for a short time in the center of the computer screen. They are instructed to judge, with one of two buttons on the keyboard, whether each Chinese symbol is “less pleasant” or “more pleasant” than what they think is the “average” Chinese symbol. The presentation of each Chinese symbol is preceded by the short presentation of a picture of the stimulus of interest, such as a black face or a white face (see Figure 1, for an illustration). Although these stimuli, denoted as primes, are irrelevant for the task, they do influence performance. The likelihood of observing a “more pleasant” response to a Chinese symbol tends to be higher when it is preceded by a stimulus that is evaluated positively compared to a stimulus that is evaluated negatively.

To measure the automatic preference for white faces over black faces, an AMP can be administered in which several white and black faces are presented as primes. A participant that prefers white faces over black faces is more likely to emit “more pleasant” responses to symbols preceded by white faces compared to symbols preceded by black faces. In contrast, a participant that prefers black faces over white faces is more likely to emit “more pleasant” responses to symbols preceded by black faces compared to symbols preceded by white faces. The tendency to prefer white faces over black faces can be converted into a score by subtracting the proportion of “more pleasant” responses after black faces from the proportion of “more pleasant” responses after white faces. In the study reported in this work, for instance, an AMP was administered that consisted of 120 trials in which black faces and white faces were presented as primes on 60 trials each. If a hypothetical participant emitted a “more pleasant” response 28 times after seeing a black prime and 41 times after seeing a white prime, one can simply calculate the proportions associated with these frequencies (i.e. .467 and .683, respectively) and subtract the proportion for black faces from the proportion for white faces to obtain an automatic preference score of .216.
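The score calculation described above can be sketched in a few lines of Python; the counts below are the hypothetical ones from the example:

```python
# Hypothetical counts from the example above: "more pleasant" responses
# out of 60 trials per prime category.
n_trials = 60
pleasant_black = 28
pleasant_white = 41

# Proportions of "more pleasant" responses, rounded as in the text
p_black = round(pleasant_black / n_trials, 3)   # 0.467
p_white = round(pleasant_white / n_trials, 3)   # 0.683

# Automatic preference score for white faces over black faces
score = round(p_white - p_black, 3)
print(score)  # 0.216
```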

Such scores have been shown to have favorable psychometric properties such as reliability and validity. They have been found to predict actual behavior even when self-reported evaluations were already taken into account. Several studies, for instance, have demonstrated that automatic preference scores can predict voting behavior, the distance one chooses to be seated from a black experimenter, and other interesting behaviors (see Greenwald, Poehlman, Uhlmann, & Banaji, 2009, for a review).

1.2. A multivariate perspective

Although the scores derived from implicit measure performance have been shown to possess desirable properties, summarizing somebody’s performance into a single number is not likely to paint a complete picture of automatic evaluations. The procedures in implicit measures generate data from many participants on many different stimuli. Such datasets are often diverse and very heterogeneous regarding participants and stimuli. Many interesting peculiarities might be overlooked when blindly proceeding to calculate automatic preference scores. First, atypical participants and stimuli can potentially distort the calculated preference scores as well as the conclusions that follow from them. Although researchers tend to scan data for outlying observations, this process is usually limited to the participants while stimuli are often left unattended. A stimulus that, for possibly unknown reasons, elicits qualitatively different automatic evaluations could remain unidentified and might severely influence eventual preference scores. Pictures of Nelson Mandela or Adolf Hitler, for instance, might not be suitable to measure automatic evaluations regarding race. Second, the selection of participants and stimuli alike might contain partial confounds that are hard to tease apart without taking into account specific participants and stimuli. Suppose, for instance, an implicit measure designed to capture automatic preferences for the young over the elderly, with primes selected to be pictures of young faces and old faces. Aging, however, is a process associated with many different features, such as an increasing prevalence of glasses. A selection of pictures with young faces will therefore contain fewer faces with glasses than a selection of pictures with old faces. The score that was calculated to measure preference for young faces over old faces might therefore reflect preference for faces without glasses over faces with glasses.
It therefore seems warranted to take such confounds into account by analyzing the individual stimuli instead of carelessly aggregating data over stimuli. Third, the data might contain unanticipated patterns that could provide interesting venues for future research. Some stimuli might elicit automatic evaluative behaviors that were different from what was hypothesized, clusters of participants could be characterized by fundamentally different performance patterns, and so on.

The aforementioned concerns can of course be dealt with by carefully scrutinizing the data. Inspection of the data for outliers and unanticipated patterns is vital for a subsequent, correct, analysis, however tedious this process might be. This process is often disregarded, however, by researchers in a culture where quick analysis is generally preferred over intensive inspection in the interest of fast publication. Furthermore, a thorough investigation provides no guarantee that most interesting data patterns can be uncovered. Techniques that allow for the inspection of data in a more automatic fashion would therefore be indispensable tools in the behavioral researcher’s arsenal of exploratory analysis methods.

A helpful perspective on data produced with implicit measures conceptualizes them as realizations of a multivariate variable, with observations corresponding to participants and variables corresponding to the distinct behaviors evoked by the stimuli. Such data often span many dimensions because of the large number of stimuli used in experimental studies. Although it is this high dimensionality that turns careful data inspection into an arduous endeavor, exploratory techniques can be applied that effectively capture the data in fewer dimensions that are more easily inspected for anomalies and other peculiarities. In the current report, the utility of such techniques will be demonstrated for the analysis of implicit measures data. Special attention will be given to principal component analysis and multidimensional scaling as tools for dimension reduction.

1.3 Principal component analysis

Principal component analysis (PCA) is undoubtedly the most popular technique that accomplishes such a dimension reduction (see Jolliffe, 2002, for a thorough introduction). This method expresses the original data in terms of a few uncorrelated variables, or principal components, that are linear combinations of the original variables and capture a sufficient amount of variance in the data. In the most common case, singular value decomposition is applied to express the rectangular n × p matrix X as the product of three matrices, AGB′. In this decomposition, the n × p matrix A is an orthonormal matrix that contains p column vectors a_1 to a_p, or left-singular vectors, with the scores of the n observations on each of the p possible principal components.

The p × p orthonormal matrix B contains p column vectors b_1 to b_p, or right-singular vectors, with the loadings of the original variables on the principal components, which are the weights of the linear combinations that form the principal components. Finally, the diagonal p × p matrix G has p diagonal elements termed singular values g_1 to g_p that represent the square roots of the variances associated with the principal components, in decreasing order.

This decomposition rearranges the data on p original variables into the same number of principal components and does not actually reduce the dimensionality of the data. A reduction is obtained by selecting those principal components that lead to a solution capturing as much variance as possible in as few dimensions as possible. A solution represented by k < p principal components can be obtained easily by deleting from the matrices A, B, and G the left-singular vectors, right-singular vectors, and singular values that do not correspond to the k retained principal components. The resulting matrix product yields a rank-k approximation of the original matrix X. This approximation can be conceptualized intuitively as a sum of rank-1 matrices, each corresponding to the outer product of the score and loading vectors of a principal component, weighted by that component's singular value.

\hat{X} = \sum_{s=1}^{k} g_s \, a_s b_s'
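As a sketch of this decomposition, the rank-k approximation can be verified numerically; the data matrix below is randomly generated purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))          # toy data: 20 observations, 5 variables

# Singular value decomposition X = A G B'
A, g, Bt = np.linalg.svd(X, full_matrices=False)

# Rank-k approximation as a weighted sum of rank-1 outer products
k = 2
X_hat = sum(g[s] * np.outer(A[:, s], Bt[s, :]) for s in range(k))

# Equivalent to truncating the three matrices directly
X_hat2 = A[:, :k] @ np.diag(g[:k]) @ Bt[:k, :]
print(np.allclose(X_hat, X_hat2))  # True
```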

As mentioned above, PCA can be very beneficial to the inspection of data. The most vital principal components can be investigated for anomalies at the level of the participants (i.e. the scores) and the level of the stimuli (i.e. the loadings). Moreover, some outliers might be so prominent in the data that they are assigned their own principal component, with one large score or loading for the outlier and near-zero values for the other observations or variables.

This inspection can be aided further by easily interpretable visualizations of the principal components in the form of biplots. A biplot represents the product of two matrices in a single plot. Suppose, for instance, a 3 × 3 matrix K that can be expressed as the product of a 3 × 2 matrix M′ and a 2 × 3 matrix N.

K = M'N = \begin{pmatrix} 2 & 2 \\ 3 & 2 \\ 1 & 3 \end{pmatrix} \begin{pmatrix} 1 & 4 & 2 \\ 2 & 2 & 3 \end{pmatrix} = \begin{pmatrix} 6 & 12 & 10 \\ 7 & 16 & 12 \\ 7 & 10 & 11 \end{pmatrix}

The column vectors of M and N can be plotted in the same space, with one matrix often represented by points and the other by vectors. A biplot representing this matrix product is shown in Figure 2. The left matrix of the product is represented by points and the right matrix by vectors. The elements in matrix K can be calculated by taking the dot products of the row vectors in M′ with the column vectors in N. In the biplot, the same elements can be obtained by orthogonally projecting the points onto the vectors and scaling the lengths of these projections by the lengths of the vectors (see Greenacre, 2010, for a concise overview).

Figure 2. A biplot representing the abovementioned matrices.
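The dot-product reading of the biplot can be checked numerically with the matrices from the worked example:

```python
import numpy as np

# The matrices from the worked example above
M_t = np.array([[2, 2], [3, 2], [1, 3]])   # M', 3 x 2: plotted as points
N = np.array([[1, 4, 2], [2, 2, 3]])       # N,  2 x 3: plotted as vectors
K = M_t @ N

# Each element of K is the dot product of a point (row of M') with a
# vector (column of N) -- geometrically, the orthogonal projection of the
# point onto the vector, scaled by the vector's length.
print(K[1, 1])            # 16
print(M_t[1] @ N[:, 1])   # 16
```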

The scores and loadings on a subset of principal components can be represented in a biplot, although some additional operations are necessary to express the singular value decomposition as a product of two matrices instead of three. To that end, the decomposition can be re-expressed as either X = (AG)B′ or X = A(GB′). Usually, the score vectors in A are multiplied by the respective singular values to yield principal coordinates, whereas the loading vectors are not and therefore yield standard coordinates. The axes of the biplot represent the principal components and are therefore termed principal axes. The biplots associated with pairs of principal components can be inspected visually for anomalies and interesting patterns in the data.

In the current research, principal component analysis was applied to data produced by the AMP and its utility for data inspection was investigated. One obstacle associated with the application of PCA, however, regards the selection of the eventual principal components that are used to generate a low-dimensional representation of the data. The dimensionality of the solution is often decided on by inspection of the scree plot, which is a plot of the variance captured by each principal component. Simple rules of thumb for principal component selection include selecting the first k components that capture more than 80% of the variance in the data, or the first k components for which the last component is associated with an “elbow” in the scree plot. Although such rules are very straightforward to apply, psychological data are often riddled with noise to such an extent that a decisive answer cannot be reached.

This problem is not unique to multivariate decomposition methods, as finding the optimal set of parameters/dimensions is a fairly common part of the building and selection of statistical models. To that end, cross-validation is perhaps the most valuable and widely used technique. It is applied most commonly in the building of supervised models, such as linear regression, and follows a straightforward algorithm. Box 1 provides an outline of one such algorithm, namely leave-one-out cross-validation. Cross-validation techniques differ from one another mainly with regard to the number of observations that are removed from the dataset with each iteration of the algorithm. The retained deviations are often squared and summed to produce the sum of squared prediction errors (SSPE), which can be used as a measure of goodness-of-fit for subsequent model selection. Whereas the rules of thumb for variance decomposition do not provide a measure that reaches a minimal error at the optimal dimensionality, the SSPE can reach a minimum for an optimal set of dimensions or parameters in the model.

Box 1. Leave-one-out cross-validation

1. Set i ← 1
2. Temporarily remove observation i from the dataset
3. Fit the model to the remaining data
4. Use the model to predict the outcome associated with observation i
5. Retain the deviation of the predicted outcome from the observed outcome
6. Set i ← i + 1
7. Repeat steps 2 to 6 until all observations have been removed once
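A minimal sketch of the steps in Box 1, applied to a toy linear regression with simulated data:

```python
import numpy as np

# Leave-one-out cross-validation for a simple linear regression,
# following Box 1; the data are simulated purely for illustration.
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=30)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=30)

errors = []
for i in range(len(x)):                        # steps 1, 6, 7
    mask = np.arange(len(x)) != i              # step 2: remove observation i
    X_train = np.column_stack([np.ones(mask.sum()), x[mask]])
    beta, *_ = np.linalg.lstsq(X_train, y[mask], rcond=None)  # step 3: fit
    y_pred = beta[0] + beta[1] * x[i]          # step 4: predict observation i
    errors.append(y[i] - y_pred)               # step 5: retain the deviation

sspe = np.sum(np.square(errors))               # sum of squared prediction errors
print(sspe > 0)  # True
```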

While this algorithm has a fairly simple implementation in supervised learning methods, its application to unsupervised learning methods, such as the aforementioned variance decomposition techniques, is far from trivial. Removing an observation from the dataset in this setting does not allow one to generate a prediction for that observation because no predictors are present in the data. Wold (1978; see also Eastment & Krzanowski, 1982) proposed a cross-validatory approach to PCA that would allow for a more informed dimensionality selection. The algorithm is very similar to the basic cross-validation outlined above but relies on removing observations as well as variables from the data to predict a single data point. This more complex algorithm is outlined in Box 2. Squaring and summing all retained deviations yields a sum of squared prediction errors (SSPE), similar to the one described in the standard cross-validation algorithm. The SSPE of a solution of dimensionality k is often divided by the traditional sum of squared errors of a solution with k − 1 components to yield the statistic R.

R = \frac{SSPE(k)}{\sum_{i=1}^{n} \sum_{j=1}^{p} \left( {}_{(k-1)}\hat{x}_{ij} - x_{ij} \right)^2}

With (k−1)x̂_ij denoting the predicted value of x_ij obtained from a PCA solution with k − 1 components. A solution that yields R < 1 can be taken as advice for the addition of component k. This cross-validatory approach to principal component analysis was used in the current report to select principal components that reflect most of the systematic variance in the data.

Box 2. Cross-validation for principal component analysis

1. Set dimensionality k
2. Set i ← 1, j ← 1
3. Temporarily remove observation i from the dataset and store the result in X^{(−i)·}
4. Temporarily remove variable j from the dataset and store the result in X^{·(−j)}
5. Perform a PCA on both datasets:
   a. X^{(−i)·} = A^{(−i)·} G^{(−i)·} B′^{(−i)·}
   b. X^{·(−j)} = A^{·(−j)} G^{·(−j)} B′^{·(−j)}
6. Use both models to predict observation x_ij based on the first k components, using the scores associated with the left-singular vectors of X^{·(−j)} and the loadings associated with the right-singular vectors of X^{(−i)·}:

\hat{x}_{ij} = \sum_{s=1}^{k} \left( a_{is}^{\cdot(-j)} \sqrt{g_s^{\cdot(-j)}} \right) \left( \sqrt{g_s^{(-i)\cdot}} \, b_{sj}^{(-i)\cdot} \right)

7. Retain the deviation of the predicted outcome from the observed outcome
8. Set j ← j + 1
9. Repeat steps 4 to 8 until all variables have been removed once
10. Set j ← 1, i ← i + 1
11. Repeat steps 3 to 10 until all observations have been removed once
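The procedure in Box 2 can be sketched as follows; the data are randomly generated, and note that in practice the sign indeterminacy of singular vectors can distort individual predictions:

```python
import numpy as np

def wold_predict(X, i, j, k):
    """Predict X[i, j] from a rank-k PCA, per Box 2: scores come from the
    data with variable j removed, loadings from the data with row i removed.
    Sign indeterminacy of the two separate SVDs is ignored in this sketch."""
    X_no_i = np.delete(X, i, axis=0)
    X_no_j = np.delete(X, j, axis=1)
    A_i, g_i, Bt_i = np.linalg.svd(X_no_i, full_matrices=False)
    A_j, g_j, Bt_j = np.linalg.svd(X_no_j, full_matrices=False)
    # Row i keeps its index in X_no_j; column j keeps its index in X_no_i.
    return sum(
        (A_j[i, s] * np.sqrt(g_j[s])) * (np.sqrt(g_i[s]) * Bt_i[s, j])
        for s in range(k)
    )

# SSPE for a candidate dimensionality k on toy data
rng = np.random.default_rng(2)
X = rng.normal(size=(10, 4))
k = 2
sspe = sum(
    (wold_predict(X, i, j, k) - X[i, j]) ** 2
    for i in range(X.shape[0])
    for j in range(X.shape[1])
)
print(np.isfinite(sspe))  # True
```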

1.4 Multidimensional scaling

Another approach that can be of substantial help for the inspection of data is multidimensional scaling (MDS). This class of techniques allows for the visualization of dissimilarities between several entities in a low-dimensional map with inter-point distances that best approximate the dissimilarities (Borg & Groenen, 2005). Suppose, for instance, an MDS is envisioned on the n × p data matrix X produced by an implicit measure, with rows corresponding to n participants and columns to p administered stimuli. The elements of the matrix correspond to the mean behaviors evoked by the stimuli, such as the proportion of “more pleasant” responses observed after stimulus presentation in case of the AMP. To represent the stimuli in a low-dimensional map, the dissimilarities between the behaviors evoked by the different stimuli need to be calculated first. Dissimilarities can be obtained in various ways: by calculating correlation coefficients and rescaling them to represent (positive) distances, or by simply calculating Euclidean distances or other distance metrics between the columns in the data matrix. These dissimilarities can be represented by the elements δ_ij in a p × p triangular dissimilarity matrix δ, to which an MDS is applied that yields a low-dimensional representation of the stimuli with inter-stimulus distances represented by the elements d_ij in a p × p triangular matrix D. The MDS algorithm is tasked with finding the representation that yields a distance matrix that approximates the dissimilarity matrix as well as possible. The measure that quantifies the error in an MDS solution is called the stress and represents the deviations of the distances in the representation from the dissimilarities in the data.
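The calculation of such a dissimilarity matrix can be sketched on toy data; the proportions below are randomly generated purely for illustration:

```python
import numpy as np

# Toy AMP-style data: proportions of "more pleasant" responses for
# 15 participants (rows) x 6 stimuli (columns); the values are assumed.
rng = np.random.default_rng(3)
X = rng.uniform(0, 1, size=(15, 6))

# Euclidean distances between columns (stimuli) as dissimilarities
p = X.shape[1]
delta = np.zeros((p, p))
for i in range(p):
    for j in range(p):
        delta[i, j] = np.linalg.norm(X[:, i] - X[:, j])

# Alternative: rescale correlations so similar stimuli get small values
corr = np.corrcoef(X, rowvar=False)
delta_corr = 1.0 - corr
print(np.allclose(delta, delta.T))  # True
```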

To attain the stress of a solution, the closeness of the distances to the dissimilarities is first approximated by simply performing a regression of the distances in 푫 on their corresponding dissimilarities in 휹.

d_{ij} = f(\delta_{ij}) + \varepsilon_{ij}

The regression that is applied depends on the measurement level associated with the dissimilarities (i.e. ratio, interval, ordinal). At one extreme, the dissimilarities might truly reflect distances, measured on a ratio scale, and a regression with only a slope parameter suffices. At the other extreme, only the relative ordering of the dissimilarities is informative, not the size of the differences between them. In this case, the dissimilarities are measured on an ordinal scale, and monotone regression is applied (Kruskal, 1964).

Similar to traditional regression, the error associated with the MDS solution can subsequently be measured by the sum of squared errors associated with the regression. Rather than taking the mean of this sum to attain the mean squared error, the error sum of squares is normed by dividing it by the sum of squared distances in the representation and taking the square root. This normed sum of squared errors is called the stress-1, hereafter referred to as stress.

\text{Stress} = \sqrt{\frac{\sum \left[ f(\delta_{ij}) - d_{ij} \right]^2}{\sum d_{ij}^2}}
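The stress calculation for the ratio case can be sketched as follows, assuming a slope-only regression f(δ) = bδ and toy data:

```python
import numpy as np

# Stress-1 for a toy 2-D configuration of 8 points
rng = np.random.default_rng(4)
Z = rng.normal(size=(8, 2))                      # candidate configuration

# Pairwise distances in the configuration (upper triangle only)
iu = np.triu_indices(8, k=1)
d = np.sqrt(((Z[iu[0]] - Z[iu[1]]) ** 2).sum(axis=1))

# Hypothetical target dissimilarities: the true distances plus noise
delta = d + rng.normal(scale=0.1, size=d.shape)

# Least-squares slope for the ratio regression f(delta) = b * delta
b = (delta @ d) / (delta @ delta)

# Normed error sum of squares: stress-1
stress = np.sqrt(np.sum((b * delta - d) ** 2) / np.sum(d ** 2))
print(0 <= stress < 1)  # True
```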

Other stress formulas exist that mainly differ with regard to the normalizing denominator. Stress provides a good criterion of badness-of-fit that can be minimized using iterative procedures. In these procedures, an initial MDS representation is gradually improved until the stress reaches a minimum or further improvement is negligible. In practice, the aforementioned algorithm is run several times to obtain representations of various dimensionalities. The solution that optimally combines a low stress with low dimensionality is generally chosen as the eventual MDS solution. This solution often corresponds to the “elbow” in a plot of stress against dimensionality.

Unlike the representation obtained with PCA, MDS does not provide nested solutions. The first two dimensions of a 3-dimensional representation can therefore be radically different from the first two dimensions of a 4-dimensional representation. The algorithm fits the best possible representation to the data regardless of whether the first dimension captures most of the variance or not. Since MDS is concerned only with fitting the distances as well as possible, it often yields good representations in fewer dimensions than PCA and therefore provides simpler solutions that are easier to interpret. Another large advantage of MDS over PCA pertains to the measurement level of the data. MDS can be applied easily to ordinal data, which makes it extremely suitable for the modeling of noisy psychological data. The algorithm also yields good representations even if a sizable portion of the data is missing (Spence & Domoney, 1974).

In a few instances, MDS can rely on PCA rather than iterative stress minimization to retrieve a representation of the data, as is the case in classical metric scaling or principal coordinate analysis (Torgerson, 1952). These algorithms are limited to dissimilarities of interval scale or ratio scale and can often be improved further by stress minimization methods. An MDS solution that is retrieved with PCA is therefore often used as a starting representation for iterative stress minimization methods.

The limitation of traditional MDS algorithms compared to PCA pertains to the fact that PCA jointly models participants and stimuli as scores and loadings in the singular value decomposition, whereas traditional MDS is limited to the representation of either the participants or the stimuli. Specialized MDS algorithms are available, however, that allow for the joint representation of the participants and the stimuli. These multidimensional unfolding models (Borg & Groenen, 2005; De Leeuw, 2005) work by effectively extending the traditional dissimilarity matrix δ so that participants and stimuli are represented in both ways of the matrix. Figure 3 displays a conceptual example of an input matrix used for multidimensional unfolding. Whereas classical multidimensional scaling methods would be applied only to the dissimilarities in submatrix A for representing the stimuli or in submatrix D for representing the participants, multidimensional unfolding operates on the entire matrix, where C = B′, and A and D are set to missing. Submatrices C and B contain preferences that are recoded to reflect distances. In the AMP, for instance, a participant’s high probability of “more pleasant” responses for a stimulus should result in a small distance. The preference scores are therefore subtracted from 1 to yield scores that are more in line with distances. Multidimensional scaling on this extended matrix yields a representation of the participants and the stimuli in the same space. The distance between a participant and a stimulus in this space reflects that participant’s preference for that stimulus. Participants that are close to each other in this space have similar preferences, while stimuli that are close to each other elicit similar preferences.

Figure 3. An example of an input matrix for multidimensional unfolding.
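The construction of such an extended matrix can be sketched as follows; the preference matrix is randomly generated, and the exact block layout of Figure 3 is assumed (participant block top-left, stimulus block bottom-right):

```python
import numpy as np

# Toy preference matrix P: participants x stimuli, proportions of
# "more pleasant" responses (values assumed for illustration).
rng = np.random.default_rng(5)
n, p = 5, 4
P = rng.uniform(0, 1, size=(n, p))

# Recode preferences as distances: a high preference -> a small distance
B = 1.0 - P
C = B.T                                      # the text sets C = B'

# Extended unfolding matrix; the within-participant and within-stimulus
# blocks (submatrices D and A in the text) are set to missing.
extended = np.full((n + p, n + p), np.nan)
extended[:n, n:] = B
extended[n:, :n] = C

print(np.isnan(extended[:n, :n]).all())  # True
```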

Hence, the representation yielded by multidimensional unfolding methods is fairly straightforward and easily interpreted in terms of distances as preferences. Such a representation is very different from the visual representations that are derived from PCA solutions, such as biplots. There, the preference of a participant for a specific stimulus in the space defined by two principal components cannot be assessed by that participant’s distance from the stimulus but by the orthogonal projection of the point defined by the participant onto the vector defined by the stimulus. As far as interpretability of the visual representation is concerned, multidimensional unfolding clearly outperforms PCA biplots. Multidimensional unfolding solutions should be interpreted with caution, however. Since the data used to fit the model are very sparse (i.e. submatrices A and D are missing), there is a very real possibility of obtaining a degenerate solution, meaning the fit criterion is optimized well but the resulting solution is nonsensical (e.g. all points collapse onto one location, or several points cluster together). Degeneracy should always be kept in mind when interpreting a multidimensional unfolding solution.

Interpretation can be aided further by adding extra axes to the plot that correspond to features of the stimuli. If the stimuli used in an experiment were pictures of white faces and black faces, an extra axis could be added to the plot that maximally separates the faces according to their race. A straightforward regression of the stimulus feature of interest on the coordinates can be applied to that end. This can be achieved with linear regression in the continuous case (e.g. ratings of friendliness, trustworthiness, …), whereas logistic regression can be applied in the categorical case. If the MDS solution completely separates the stimuli according to a categorical feature, traditional logistic regression by maximum likelihood estimation might go awry because of this perfect separation. Specialized logistic regression methods, such as bias-reduced logistic regression by penalized maximum likelihood estimation (Firth, 1993), lead to more reliable results in that case. The parameter estimates returned by the regression model can be used to project the regression equation onto the space defined by the MDS solution.
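For the continuous case, the regression step can be sketched as follows; the MDS coordinates and the friendliness ratings below are assumed toy values:

```python
import numpy as np

# Toy 2-D MDS coordinates for 10 stimuli, plus a hypothetical continuous
# feature (friendliness ratings) constructed from them for illustration.
rng = np.random.default_rng(6)
coords = rng.normal(size=(10, 2))
friendliness = coords @ np.array([1.5, -0.5]) + rng.normal(scale=0.2, size=10)

# Linear regression of the feature on the two MDS dimensions
X = np.column_stack([np.ones(10), coords])
beta, *_ = np.linalg.lstsq(X, friendliness, rcond=None)

# The slope coefficients give the direction of the feature axis in the map
direction = beta[1:] / np.linalg.norm(beta[1:])
print(direction.shape)  # (2,)
```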

Similar to PCA, the optimal dimensionality is often hard to determine using traditional rules of thumb. Some helpful techniques have been documented, such as jackknifing (De Leeuw & Meulman, 1986) and permutation tests (briefly mentioned by De Leeuw & Mair, 2009), but the available literature is scant to non-existent to my knowledge. Guidance on applying these techniques to special MDS algorithms, such as multidimensional unfolding, is even rarer. Given the widely accepted reputation of cross-validation, an attempt was made to adapt the cross-validatory method for PCA to multidimensional unfolding (Box 3). For each data point, two MDS solutions are obtained: one of a dataset from which the observation is deleted and one of a dataset from which the variable is deleted. Combining the two solutions to predict the data point poses some additional difficulties compared to the PCA case, however. Because MDS is concerned only with preserving distances as well as possible, two solutions derived from nearly identical datasets might have a vastly different appearance. After all, distances are unaffected by several transformations, such as rotation, reflection, and translation. The transformations that optimally fit one representation to another can be found with a technique termed Procrustes analysis (Gower & Dijksterhuis, 2004). Suppose Y is an n × k matrix containing the target configuration of n stimulus points in k dimensions and X is another n × k configuration matrix. Procrustes analysis fits X to Y by finding the optimal dilation factor s, k × k rotation matrix T, and translation vector t of length k, so that tr((Y − (sXT + 1t′))′(Y − (sXT + 1t′))) is minimized.

Once this analysis is applied to make the two solutions align optimally, the distance between the point corresponding to the observation and the point corresponding to the variable can be used for prediction.
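The Procrustes fit itself can be computed in a few lines. The sketch below (Python with SciPy, whereas the analyses reported here were run in R) recovers a configuration that has been rotated, dilated, and translated, using `orthogonal_procrustes` for the rotation matrix T and the closed-form solutions for the dilation factor s and translation vector t; the toy configurations are made up for the illustration.

```python
import numpy as np
from scipy.linalg import orthogonal_procrustes

rng = np.random.default_rng(1)

# Target configuration Y (n points in k dimensions) and a copy X that
# has been rotated, scaled, and translated. All interpoint distances
# are proportional, so an MDS fit could legitimately return either one.
Y = rng.normal(size=(8, 2))
theta = 0.7
R_true = np.array([[np.cos(theta), -np.sin(theta)],
                   [np.sin(theta),  np.cos(theta)]])
X = 0.5 * Y @ R_true + np.array([3.0, -2.0])

# Procrustes fit of X to Y: centre both configurations, find the
# optimal rotation T, then solve for the dilation s and the
# translation t implied by the centroids.
Xc = X - X.mean(axis=0)
Yc = Y - Y.mean(axis=0)
T, _ = orthogonal_procrustes(Xc, Yc)
s = np.trace(Yc.T @ Xc @ T) / np.trace(Xc.T @ Xc)
t = Y.mean(axis=0) - s * X.mean(axis=0) @ T
X_fitted = s * X @ T + t

print(np.abs(X_fitted - Y).max())  # ~0: the target configuration is recovered
```

Note that `orthogonal_procrustes` may return a reflection as well as a rotation, which is harmless here since distances are also invariant under reflection.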

Box 3. Cross-validation for multidimensional unfolding for a given dimensionality

1. Set dimensionality k
2. Set i ← 1, j ← 1
3. Temporarily remove observation i from the dataset and store the result in X(−i)·
4. Temporarily remove variable j from the dataset and store the result in X·(−j)
5. Perform multidimensional unfolding on both datasets:
   5.a. Retain the configuration obtained from X·(−j) in the (n + (p − 1)) × k matrix Y·(−j)
   5.b. Retain the configuration obtained from X(−i)· in the ((n − 1) + p) × k matrix Y(−i)·
6. Match configuration Y·(−j) to Y(−i)· with Procrustes analysis and store the result in Ŷ·(−j)
7. Calculate the distance between participant point Ŷi·(−j) and stimulus point Y((n−1)+j)·(−i) and retain its deviation from the observed dissimilarity in X
8. Set j ← j + 1
9. Repeat steps 4 to 8 until all variables have been removed once
10. Set j ← 1, i ← i + 1
11. Repeat steps 3 to 10 until all observations have been removed once
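The loop of Box 3 can be sketched as follows. Two loudly flagged assumptions: the `unfold` helper below is a crude stand-in for genuine multidimensional unfolding (the analyses reported here used the smacof package in R), completing the rectangular dissimilarity matrix with grand-mean imputation before running ordinary metric MDS; and the Procrustes step matches the two configurations on their shared points only, since the two solutions do not contain the same sets of points.

```python
import numpy as np
from scipy.linalg import orthogonal_procrustes
from sklearn.manifold import MDS

def unfold(D, k, seed=0):
    """Crude stand-in for multidimensional unfolding: complete the
    rectangular participant-by-stimulus dissimilarity matrix into a
    square one by imputing the unobserved within-set blocks with the
    grand mean, then run metric MDS on the completed matrix."""
    n, p = D.shape
    full = np.full((n + p, n + p), D.mean())
    full[:n, n:] = D
    full[n:, :n] = D.T
    np.fill_diagonal(full, 0.0)
    mds = MDS(n_components=k, dissimilarity="precomputed",
              random_state=seed, n_init=1)
    return mds.fit_transform(full)  # rows: participants first, then stimuli

def align(X, Y, shared_x, shared_y):
    """Procrustes-fit configuration X onto Y using their shared points:
    translation, rotation/reflection, and dilation."""
    mx, my = X[shared_x].mean(axis=0), Y[shared_y].mean(axis=0)
    Xc, Yc = X[shared_x] - mx, Y[shared_y] - my
    T, _ = orthogonal_procrustes(Xc, Yc)
    s = np.trace(Yc.T @ Xc @ T) / np.trace(Xc.T @ Xc)
    return s * (X - mx) @ T + my

def cv_rmse(D, k):
    """Box 3: for every cell (i, j), refit without observation i and
    without variable j, align the two solutions, and predict D[i, j]."""
    n, p = D.shape
    sq_err = 0.0
    for i in range(n):
        Y_noi = unfold(np.delete(D, i, axis=0), k)      # (n-1)+p points
        for j in range(p):
            Y_noj = unfold(np.delete(D, j, axis=1), k)  # n+(p-1) points
            # Shared points: all participants except i, all stimuli except j.
            shared_noj = [r for r in range(n) if r != i] + \
                         [n + c for c in range(p - 1)]
            shared_noi = list(range(n - 1)) + \
                         [(n - 1) + c for c in range(p) if c != j]
            A = align(Y_noj, Y_noi, shared_noj, shared_noi)
            # Participant i lives in Y_noj (row i); stimulus j in Y_noi.
            pred = np.linalg.norm(A[i] - Y_noi[(n - 1) + j])
            sq_err += (pred - D[i, j]) ** 2
    return np.sqrt(sq_err / (n * p))

rng = np.random.default_rng(2)
D = rng.uniform(0.1, 1.0, size=(6, 4))  # toy 6 participants x 4 stimuli
score = cv_rmse(D, k=2)
print(round(score, 3))
```

With a proper unfolding routine substituted for `unfold`, the same loop yields the R-values plotted against dimensionality in the results below.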

1.5 Overview

In the current research, the utility of multivariate dimension reduction techniques for implicit measures data was investigated. In a first study, analyses were performed on data supplied by the AMP, a well-known measure that yields data that are easily subjected to PCA and MDS. In a second study, the AMP was adapted slightly to yield data that are more suitable for analysis with multivariate reduction techniques. The quality of the resulting solution was compared with the quality of the solution obtained in the first study. In a final discussion, the results and their implications are discussed. Special attention is given to the application of multivariate reduction techniques to data supplied by other implicit measures, which are often more complex.

Study 1 Dissecting the AMP

In Study 1, data were analyzed from an AMP study designed to measure preferences regarding race and age. The stimuli that were used in the study were pictures of faces that could be either white or black, and young or old. The resulting data could be used to measure preferences for white faces over black faces, and young faces over old faces. The AMP was administered together with self-report measures of preference for race or age, a measure of racism, and a hypothetical donation to a charity in favor of racial minorities or the elderly. Hence, the design allowed for the assessment of the associations of AMP scores with scores from other measures.

2.1. Method

2.1.1. Participants

The data were taken from an experimental study run at Ghent University in December of 2015. A sample of 53 participants was recruited with an online recruiting system to participate in the current study. Two participants failed to supply demographic data; the remaining 51 participants consisted of 37 females and 14 males (M_age = 24.10, SD_age = 6.58). Participants were compensated with €5 for their participation in the 30-minute study.

2.1.2. Materials

The primes were 20 head-and-shoulders pictures of male faces (Figure 4) that varied with regard to race (black vs. white) and age (young vs. old). Each of the 4 possible combinations of race and age therefore contained 5 pictures. These primes were selected from a set of 40 pictures that has been used frequently in AMP studies by Gawronski and colleagues (e.g. Gawronski, Cunningham, LeBel, & Deutsch, 2010; Gawronski & Ye, 2015). The targets were 260 Chinese ideographs that were presented in black on a white background. A random pattern of black and white pixels was used as a backward mask for the targets. The study also included a small 10-item questionnaire aimed at assessing modern racism (MRS; McConahay, 1986).

Figure 4. The stimuli used in the experiment with the accompanying code names. The first letter denotes the race (B = black, W = white). The second letter denotes the age (O = old, Y = young). The color associated with the categories will be used in the plots below.

2.1.3. Procedure

Participants were tested individually and seated in cubicles in front of the computer screen. After supplying some basic demographic data, participants performed several tasks to measure their preferences regarding race and age.

During the AMP task, participants were presented with 120 AMP trials, for which 120 unique Chinese ideographs were drawn randomly from the available set and the 20 prime pictures were presented 6 times each. The time-course of an AMP trial is outlined in Figure 1. Participants were presented with a fixation cross, followed by a prime, a target, and a final backward mask. Participants indicated that the Chinese ideograph was “less pleasant” or “more pleasant” than the average Chinese ideograph with the left ctrl or the right ctrl key on the keyboard, respectively. Responses emitted after 1500 ms had elapsed resulted in a 750-ms feedback message that prompted the participant to respond faster (i.e. “!!!TE TRAAG!!!”, Dutch for “too slow”).

During the rating task, the 20 primes were presented sequentially alongside a 9-point rating scale that ranged from “highly unpleasant” (zeer onaangenaam) to “highly pleasant” (zeer aangenaam). Participants were asked to rate the pleasantness of each of the primes on this rating scale. To prevent any influence from the explicit evaluation task on the performance in the implicit evaluation task, the AMP was always presented before the rating task.

After performing these tasks, participants completed a computerized version of the modern racism scale (MRS) in which each of the 10 items was presented separately on screen alongside a 5-point Likert scale. At the end of the experiment, participants were told they could pick up their payment of €5. Before receiving payment, however, they were asked to hypothetically divide the amount over 2 possible charities and themselves. Both charities aimed at fighting poverty, one for minorities and one for the elderly.

2.2. Data analysis

2.2.1. Conventional analyses

First, traditional analyses were performed on the data provided by the AMP, the rating task, the MRS, and the hypothetical charity donations. Analysis of the AMP was conducted after excluding trials with latencies below 150 ms or above 1500 ms from the data. AMP scores were calculated for race and age by subtracting the proportion of “more pleasant” responses after a black prime or an old prime from the proportion of “more pleasant” responses after a white prime or a young prime, respectively. Similar scores were calculated from the rating data by calculating the difference in mean pleasantness ratings between black and white faces, and old and young faces. The scores on the MRS were calculated by simply summing the scores of all items after reversing the scales of the items that were formulated in a negative manner.
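A minimal sketch of this scoring procedure is given below, in Python with hypothetical column names (the trial data shown are made up; only the race score is computed, the age score follows the same pattern on an age column).

```python
import pandas as pd

# Hypothetical trial-level AMP data: one row per trial; 'pleasant'
# codes a "more pleasant" response as 1, anything else as 0.
trials = pd.DataFrame({
    "participant": [1, 1, 1, 1, 2, 2, 2, 2],
    "race":        ["white", "white", "black", "black"] * 2,
    "rt_ms":       [420, 1600, 510, 130, 640, 700, 820, 300],
    "pleasant":    [1, 1, 0, 1, 1, 0, 0, 1],
})

# Exclude trials with latencies below 150 ms or above 1500 ms,
# as in the conventional analysis.
valid = trials[(trials.rt_ms >= 150) & (trials.rt_ms <= 1500)]

# Race AMP score: P("more pleasant" | white prime) minus
# P("more pleasant" | black prime), per participant.
props = (valid.groupby(["participant", "race"])["pleasant"]
              .mean().unstack("race"))
amp_race = props["white"] - props["black"]
print(amp_race)
```

Positive scores thus indicate a preference for white over black primes, and scores near zero indicate no differential responding.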

The reliability (i.e. internal consistency) of the AMP scores and rating scores was calculated by performing 10,000 runs in which the data of each participant were divided randomly into two subsets. In each run, scores were calculated for each subset and subsequently correlated across participants. The 10,000 correlations were then averaged and Spearman-Brown corrected to compensate for the halving of the number of trials. The resulting correlations were taken as reliability coefficients for the AMP and rating scores on both dimensions.
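A sketch of this split-half procedure is shown below, simplified to mean scores per half rather than full AMP difference scores, and run on simulated trial data; the Spearman-Brown step applies the standard correction r_SB = 2r / (1 + r) for a test of doubled length.

```python
import numpy as np

def split_half_reliability(trial_scores, n_runs=1000, seed=0):
    """trial_scores: (participants x trials) array of trial-level values.
    In each run, split every participant's trials randomly in half,
    score each half (here: a simple mean), correlate the half-scores
    across participants, then average and Spearman-Brown correct."""
    rng = np.random.default_rng(seed)
    n_part, n_trials = trial_scores.shape
    half = n_trials // 2
    rs = np.empty(n_runs)
    for run in range(n_runs):
        perm = rng.permutation(n_trials)
        a = trial_scores[:, perm[:half]].mean(axis=1)
        b = trial_scores[:, perm[half:]].mean(axis=1)
        rs[run] = np.corrcoef(a, b)[0, 1]
    r = rs.mean()
    return 2 * r / (1 + r)  # Spearman-Brown correction for halved length

# Toy check: participants with stable trial-level tendencies should
# yield a reasonably high corrected reliability.
rng = np.random.default_rng(1)
true_means = rng.uniform(0.2, 0.8, size=30)
data = rng.binomial(1, true_means[:, None], size=(30, 40)).astype(float)
rel = split_half_reliability(data, n_runs=200)
print(round(rel, 2))
```

In the actual analyses, the half-score would of course be the AMP difference score computed within each random half, but the averaging and correction logic is identical.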

The associations between the various scores were assessed by means of correlations. In addition, multivariate linear models were run to assess whether charity donations were associated with the AMP scores alone and with the rating scores alone. A third model containing both predictors was then compared with the model containing only the rating scores, to assess whether the AMP scores were associated with charity donating even when the explicit rating scores were already taken into account.

2.2.2. Principal component analysis

The data were represented in a matrix with participants for rows and stimuli for columns. The elements of the matrix corresponded to the proportion of “more pleasant” responses a participant made after the presentation of a particular stimulus. This matrix was subjected to a straightforward PCA. The loadings of the stimuli on the principal components were inspected to find out which response patterns were dominant in the data. The solution was also scanned for principal components that define atypical stimuli, with high loadings for the outliers and low loadings for the other stimuli. The scores on the different principal components were inspected for outliers and other peculiarities as well.
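A minimal sketch of this step (with simulated proportions; the actual analysis was run in R): PCA via the singular value decomposition of the column-centred participant-by-stimulus matrix, yielding the variance proportions for a screeplot, the stimulus loadings, and the participant scores.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical participant-by-stimulus matrix of "more pleasant"
# proportions: 53 participants x 20 stimuli, with a participant-varying
# offset baked into the last 10 columns so that at least one component
# has a race-like pattern to find.
X = rng.uniform(0, 1, size=(53, 20))
X[:, 10:] += rng.normal(0.2, 0.1, size=(53, 1))
X = np.clip(X, 0, 1)

# PCA via the SVD of the column-centred matrix.
Xc = X - X.mean(axis=0)
U, sing, Vt = np.linalg.svd(Xc, full_matrices=False)
var_explained = sing**2 / np.sum(sing**2)  # screeplot heights
loadings = Vt.T                             # stimuli x components
scores = U * sing                           # participants x components

print(np.round(var_explained[:3], 3))
```

Inspecting columns of `loadings` for systematic sign patterns across stimulus categories, and rows of `scores` for outlying participants, mirrors the inspections reported in the results.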

After this investigation, the cross-validatory approach to PCA was applied to find out which components predict performance more consistently. Cross-validation was performed for each increasing sequence of principal components. Bias-reduced logistic regression was used to find out to what extent linear combinations of the loadings on the selected components can separate the categories of the races and the ages.

The results of the PCA inspection were used in a subsequent step to assess whether exclusion of atypical participants and stimuli substantially changed the results of the conventional analyses.

2.2.3. Multidimensional unfolding

A multidimensional scaling algorithm was applied to the data after recoding the proportions to reflect distances by subtracting them from 1. The dimensionality of the optimal solution was decided on through inspection of the stress associated with various dimensionalities, the cross-validation outlined above, and finally through visual inspection of the plot. Multidimensional unfolding was performed in R (R Core Team, 2015) using the unfolding function of the smacof package, which minimizes stress through iterative majorization (De Leeuw & Mair, 2009). Axes were added to the plot by performing the aforementioned bias-reduced logistic regression of the categorical stimulus features race and age on the coordinates of the stimuli. The quality of the axes was assessed by the fit of the logistic regressions. The results of the investigation of the low-dimensional representation of the data were compared to the results of the PCA data inspection to find out their relative strengths and weaknesses. This inspection was supplemented with conventional analyses to find out how the original analyses were improved by the additional MDS investigation.

2.3. Results

2.3.1. Conventional analyses

After deleting trials with response latencies that were identified as outliers (4.59%), AMP scores were calculated for the two prime dimensions (i.e. race and age). The mean AMP score did not differ significantly from zero for race, D = −0.036, σ̂_D = 0.163, t(52) = −1.61, p = .113, nor for age, D = 0.021, σ̂_D = 0.108, t(52) = 1.44, p = .157. The race AMP score was found to have moderate reliability, r = .69, t(52) = 6.82, p < .0001, whereas the age AMP score had low reliability, r = .28, t(52) = 2.11, p = .010. The mean rating scores showed significant deviations from zero with regard to race, D = 0.374, σ̂_D = 1.137, t(52) = 2.392, p = .020, but not age, D = 0.132, σ̂_D = 1.267, t(52) = 0.759, p = .452. Both rating scores were found to have moderate reliabilities, r = .59, t(52) = 5.21, p < .0001, for race, and r = .67, t(52) = 6.49, p < .0001, for age. The mean MRS score did not differ significantly from the theoretical midpoint of 25, M = 25.87, σ̂_M = 5.74, t(52) = 1.010, p = .276. When asked to hypothetically divide their payment of €5 amongst two charities and themselves, participants allocated on average €2.43 (σ̂ = €1.60) to a charity benefiting racial minorities, €1.36 (σ̂ = €1.18) to a charity benefiting the elderly, and €1.14 (σ̂ = €0.91) to themselves.

The correlations between AMP scores, rating scores, modern racism scores, and charity allocations are presented in Table 1. The table reveals the traditionally significant correlation between the race scores of the AMP and the race scores of the explicit ratings. Although the race AMP scores did not correlate significantly with the MRS, the explicit rating scores did. As expected, the charity donations correlated significantly with one another, since giving more to one charity necessarily implies giving less to the other two options. Interestingly, the MRS correlated significantly and negatively with allocating money to one’s self, suggesting higher scores on the MRS were associated with keeping less money and thus donating more to charity. The measures associated with age did not reveal any significant correlations, although a marginal association between age AMP scores and age rating scores was observed.

                 Race-    Race-    Modern   Race-    Age-     Age-     Age-
                 AMP      Rating   Racism   Charity  AMP      Rating   Charity
Race - Rating     .29*
Modern Racism     .18      .31*
Race - Charity   -.04     -.13      .11
Age - AMP         .03     -.01      .06      .21
Age - Rating      .11      .10      .04      .05      .26.
Age - Charity     .21      .16     -.06     -.61***  -.22      .04
Self - Charity    .06     -.20     -.34*    -.49***  -.16     -.09      .54***

Table 1. Correlations between the conventional attitude and behavioral measures.
Note. . = p < .10, * = p < .05, ** = p < .01, *** = p < .001

A multivariate linear regression of the variables associated with charity donations on the AMP scores revealed no overall significant effects of the race AMP scores, F(3,48) = 0.958, p = .42, η² = .06, or the age AMP scores, F(3,48) = 0.972, p = .41, η² = .06. The univariate parameter estimates in the model did not reveal any effects that attained significance. Similar multivariate linear regressions of the charity donations on the rating scores showed an overall significant effect of race rating scores, F(3,48) = 3.000, p = .040, η² = .16, although none of the univariate parameter estimates attained significance. The parameter estimates suggested increases in race rating scores are associated with a decrease in donations to charities for minorities, β̂ = −18.94, σ̂_β = 19.70, t(50) = −0.961, p = .341, an increase in donations to charities for the elderly, β̂ = 16.44, σ̂_β = 14.57, t(50) = 1.128, p = .265, and a decrease in allocating money to one’s self, β̂ = −16.06, σ̂_β = 11.16, t(50) = −1.439, p = .156. The contrast between the significant overall effect in the multivariate regression and the small univariate effects suggests it is vital to take the covariance structure of the dependent variables into account. The multivariate regression of charity donations on the age ratings produced no significant overall effect of age ratings, F(3,48) = 0.346, p = .792, η² = .02, and none of the parameter estimates attained a semblance of significance, all ts < 0.622, all ps > .537. Model comparisons between a model that had only the rating score as predictor and a model that had both the rating score and the AMP score as predictors revealed no association between the AMP scores and charity donations when rating scores were already taken into account, F(3,47) = 0.848, p = .475, and F(3,47) = 0.968, p = .416, for race AMP scores and age AMP scores, respectively.

2.3.2. Principal component analysis

A PCA was performed on the dataset containing the proportions of “more pleasant” responses for each stimulus by each participant. Figure 5 presents a screeplot of the proportion of the variance that is captured by each of the 20 principal components. The first two components captured 23.0% and 12.2% of the variance in the data, respectively. The variance proportion associated with the other components dropped very gradually from 9.6% at component 3 to 0.6% at component 20.

Figure 5. Screeplot associated with the principal component analysis performed on the data supplied by the AMP.

Inspection of the variances revealed that traditional rules of thumb for component selection would yield unwieldy solutions. The 5-component solution associated with the elbow in the screeplot captured only 58.3% of the variance in the data, a solution that captures 80% of the variance would need to contain 10 components or more, and a solution with components that capture more variance than the original variables in the dataset would need to contain 7 components. Before proceeding with the selection of components with cross-validation, the loadings and scores associated with the principal components were thoroughly inspected.

Figure 6 presents the loadings of the stimuli on the first 3 components. The first component had loadings that were rather homogeneously negative. This type of component is typically termed an “intercept” component, as it describes the tendencies of the observations that are independent of the variables. In this case, the first component described a participant’s general tendency to emit “more pleasant” responses regardless of the identity of the stimulus. The loadings on the second component were all negative for black faces and positive for white faces, and hence reflected the tendency of the participants to act differently towards black faces than towards white faces.

Figure 6. Plot of the stimulus loadings on the first 3 principal components.

The third component had loadings that were somewhat more erratic, although most old faces tended to have positive loadings and most young faces tended to have negative loadings. While this pattern was rather weak, this component seemed to reflect the tendency of the participants to act differently towards young faces than towards old faces. The weakness of this pattern was not surprising, as the reliability of the regular age AMP score was fairly low as well.

Investigation of the loadings on the smaller components revealed some components that were defined by a high loading of one stimulus and relatively low loadings of all other stimuli. Component 15 and Component 16, for instance, displayed rather high loadings for stimulus BO2 and stimulus WY2, respectively, while low loadings were observed for the other stimuli (Figure 7). The loading patterns on the other components did not seem to reveal any systematic patterns regarding the features of the stimuli.

Figure 7. Plot of the stimulus loadings on principal components 15 and 16.

Inspection of the component scores with boxplots revealed several outliers, mainly on the first principal components (marked in red on the biplots in Figure 8). The score outliers on the first component represented participants that rarely pressed the “more pleasant” button (i.e. participants 13, 34, 40, 46) or pressed it almost all the time (i.e. participant 50). Score outliers on the second component were participants with extreme automatic preferences for black faces over white faces (i.e. participants 22, 41, 51) or extreme automatic preferences for white faces over black faces (i.e. participants 1, 8). Score outliers on the third component were harder to interpret, as the component loadings corresponded only vaguely with a differential response towards young faces and black faces (i.e. participants 8, 38). Supplemental inspection of the biplots showed the outliers in spaces defined by two principal components and led to similar conclusions (Figure 8). Note that inspection with biplots becomes increasingly difficult with higher dimensionalities and would require special software to yield representations in more than two dimensions. The main advantage of the biplot compared to the previous inspections lies in the joint representation of the rows and the columns of the data. The plots allow for the quick detection of outliers at the level of the participants (e.g. participants 22, 40, 51) and the stimuli (e.g. stimuli WO5, WY5).

Figure 8. Biplots of the first three principal components.

The aforementioned exploratory inspection was followed by an attempt to select the most stable sequence of components with the cross-validation method outlined above. The R-values associated with each incremental solution are plotted in Figure 9. All R-values exceeded 1.0 and adding components only seemed to increase them, suggesting that none of the components improved the prediction of omitted data points. Hence, cross-validation did not uncover a subset of stable principal components that could be used for further analysis.

Figure 9. Plot of the R-values against the number of components.

A solution with 5 components was eventually decided upon, as it captured a large part of the variance in the data (58.3%) and had an R-value that was considered sufficiently low, R = 2.79. To find out whether the space defined by these components could separate the categories of the race and age dimensions, bias-reduced logistic regressions of these categories were run on the loadings of the stimuli on the components. These models suggested, unsurprisingly, that a linear combination of the first 5 components could correctly classify the race of the stimuli in 100% of all cases and the age of the stimuli in 90% of the cases. The correct classification of race was driven predominantly by Component 2, z = 2.35, p = .019, whereas the classification of age was driven weakly by Component 3, z = −1.51, p = .130. Hence, although no individual component could clearly classify the stimuli by age, a vector in the vector space spanned by the 5 principal components could do so to a large extent. The stimuli that were not correctly classified were stimuli BY4 and WO1.

The weak results of the conventional analyses and the multitude of score outliers made further investigation difficult. When the stimuli whose age was incorrectly classified by the 5-component solution were excluded from the analyses, the correlation between the age AMP score and the age rating score rose from r = .26, p = .056, to r = .32, p = .021, but no further substantial changes were noted in the analyses.

2.3.3. Multidimensional unfolding

A multidimensional unfolding model was fitted to the data for various dimensionalities. Figure 10a displays the plot of stress against dimensionality. The error in the solutions decreased with dimensionality and displayed an elbow around 5 dimensions. After 7 dimensions, the stress did not seem to decrease substantially anymore. The high minimum observed stress value of .32 suggests the data are very noisy and that many idiosyncrasies are not captured by the solution. Inspection of the R-values obtained through cross-validation (Figure 10b) showed a very steep increase in R that stabilized somewhat after 7 dimensions. Additional dimensions were of such a small magnitude that they no longer influenced prediction. After consideration of the stress plot and the cross-validation plot, a 4-dimensional solution was chosen for its comparatively low stress and an R-value that stayed below 2.0 (R = 1.84).

The solution was obtained after 583 iterations of the iterative majorization algorithm and had a stress level of .35. The first two dimensions of the representation (Figure 11a) clearly related to the race and age of the stimuli in the study. Bias-reduced logistic regression of race on the coordinates of the stimuli suggested main contributions of Dimension 1, z = 1.62, p = .103, and Dimension 2, z = −2.11, p = .035, to a model that classified the race of all faces correctly. A similar model of age on the stimulus coordinates did not suggest clear contributions, although the z-values associated with Dimension 1, z = 1.08, p = .280, and Dimension 2, z = 1.10, p = .269, were somewhat higher compared to the other dimensions. This model correctly classified the age of the stimuli in 70% of the cases. The axes defined by the regressions are plotted on the configuration in Figure 11a.

Figure 10. A plot of stress against dimensionality of the MDS solution and the cross-validation plot of R against dimensionality.

Figure 11. Two-dimensional plots of the dimensions of the multidimensional unfolding solution. The axes represent the logistic regression equations. Higher values on the axes suggest higher log-odds of the stimulus being young compared to old, or white compared to black.

Inspection of the first two dimensions indeed showed a clear separation of black faces from white faces. The separation of old faces from young faces, however, was very crude. Within the group of white faces, the separation of old and young faces was rather good, with stimulus WO1 being the only stimulus not in the cluster defined by its category. The group of black faces, however, was clustered more tightly compared to the white faces. The differences between the ages in this category were therefore obfuscated somewhat. Stimuli BY2 and BY4 seemed to reside in the area of the old faces and stimulus BO5 seemed to reside in the area of the young faces. Dimensions 3 and 4 did not seem to capture an intuitively apparent structure, although they again showed some stimuli at a rather large distance from their other category members. This plot seemed to show larger differences in reactions to white faces, which were spread out more, compared to black faces. The plots also suggested some participants had response tendencies that were quite different from the main cluster of participants in the center. Figure 12a shows some participants were marked by a strong automatic preference for black faces over white faces (e.g. 22, 51), strong automatic preferences for young over old faces (e.g. 38), or had little preference for any of the stimuli (e.g. 46).

The multitude of atypical participants and stimuli did not warrant an analysis with these entities excluded, as the associated loss of power would not guarantee a fair comparison. The solution displayed an interesting pattern with regard to the age dimension, however. The white faces showed a near perfect clustering of the ages, whereas this was not the case for the black faces. This observation was subjected to further statistical testing to assess the hypothesis that participants responded more differently to young compared to old faces when these faces were white rather than black. A participant’s tendency to respond differently to young faces compared to old faces was calculated for black faces and white faces by taking the absolute value of the age AMP scores associated with each race. After checking assumptions (see Appendix), a paired samples t-test on the absolute age AMP scores for black faces compared to white faces suggested insufficient evidence to reject the null-hypothesis that there was no difference between the mean absolute age AMP scores, t(52) = 1.43, p = .16.
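The test described here can be sketched as follows (with simulated scores standing in for the real per-participant age AMP scores, which are not reproduced here):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

# Hypothetical per-participant age AMP scores computed separately
# within black primes and within white primes (53 participants,
# matching the sample size of Study 1).
age_amp_black = rng.normal(0.0, 0.12, size=53)
age_amp_white = rng.normal(0.0, 0.12, size=53)

# The hypothesis concerns the *size* of the age effect within each
# race, so the scores are folded to absolute values before the
# paired samples t-test.
t_stat, p_val = stats.ttest_rel(np.abs(age_amp_white),
                                np.abs(age_amp_black))
print(round(float(t_stat), 2), round(float(p_val), 3))
```

Note that folding to absolute values makes the scores non-normal by construction, which is why the assumption checks reported in the Appendix matter for this particular test.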

2.4. Discussion

In the current study, the application of two exploratory, multivariate reduction techniques to data produced by implicit measures was demonstrated. PCA was chosen for its popularity, simplicity, and its well-documented technical underpinnings, whereas MDS was chosen for its powerful visualizations.

Conventional analyses on the preference scores derived from the performance on the AMP did not reveal the scores derived from the implicit measures to be associated with charity donating behavior, either individually or when explicit scores were taken into account. A further investigation of the data produced by the implicit measure therefore seemed warranted to pinpoint any peculiarities and other sources of noise.

A PCA on the proportions of “more pleasant” responses of the participants to the stimuli suggested noisy data. A couple of components were found to capture a substantial amount of variance, but there was no clear set of components that could account for most of the variance in the data. The noisiness of the data was attested further by an attempt to perform cross-validation on the data for component selection. No component seemed to be stable enough to improve the prediction of omitted data points. Eventually, a 5-component solution was selected that captured approximately 58.3% of the variance in the original dataset. The first component reflected the general tendencies of the participants to emit a “more pleasant” response independent of the type of stimulus. The second component reflected the tendencies of the participants to emit a different response to white faces than to black faces. The third component had a noisier loading pattern, but seemed to reflect the tendencies of the participants to emit a different response to young faces compared to old faces. Investigation of the loadings showed that a linear combination of the retained principal components could lead to a 100% correct classification of the race of the stimuli and a 90% correct classification of the age of the stimuli. Analyses of the component loadings and component scores revealed some stimuli that evoked atypical responses and some participants that had atypical response tendencies. Taking these peculiarities into account did not substantially alter the outcome of the conventional analyses of the data, however.

Similar to PCA, MDS had difficulty fitting a low-dimensional map to the data. The stress associated with a map of almost any given dimensionality never dropped below .32, which is a high value. Inspection of the screeplot and cross-validation both suggested a solution with 4 dimensions might return a stable representation of the data. This high number is rather surprising, as MDS is often thought to yield good representations in fewer dimensions than PCA. Moreover, the first dimensions seemed to show an intuitive structure, whereas the last dimensions of the retained solution seemed to capture mainly noise.

Inspection of lower-dimensional representations showed why it might have been necessary to include dimensions that capture these noisy patterns: without them, the noise would interfere with a low-dimensional representation to such an extent that a degenerate solution is returned. Figure 12, for instance, displays the representation that was obtained by fitting only two dimensions to the data. The representation is clearly different from the representation shown in Figure 11a. Stimuli of different races tend to cluster together whereas the participants tend to form a line in between the clusters. This representation clearly has a whiff of degeneracy, which was further attested by the improvement in the quality of the solution when more dimensions were included.

Figure 12. Two-dimensional unfolding solution.

The investigations with PCA and MDS tended to pinpoint the same participants and stimuli as atypical. The main difference between the representations seemed to lie in the first component. The first component returned by PCA reflected the participants’ general response tendencies. Such a component was not present in the MDS solution, although some participants flagged as atypical on the first principal component were simply far away from the stimuli in the representation or very close to the center. The two-dimensional solution did seem to reflect the first principal component, however (Figure 12). On the line on which the participants were located, participants with a higher tendency to emit “more pleasant” responses were situated at the outer ends, whereas participants with a low tendency to emit such responses were located closer to the center.

The aforementioned analyses provided some interesting insights into data produced by implicit measures and, by consequence, their underlying processes. The sensitivity of the analyses above might have been hampered, however, by the design of the experiment. Each stimulus was presented only six times during the task. Consequently, there are only 7 possible observable proportions per stimulus (i.e. 0, 1/6, 2/6, 3/6, 4/6, 5/6, 1). This constitutes a rather discrete scale that is therefore quite insensitive to small variations in automatic evaluations. Moreover, most participants did not use the full extent of this scale and often produced less variable outcomes and even more ties in the data. These data might lead to suboptimal representations in both PCA and MDS. Especially in MDS, the discrete nature of the data and the ties that go along with it might have led to representations with high stress and little accuracy. The likelihood of obtaining a degenerate solution also increases dramatically with an increase in the discreteness of and the ties in the data.

This problem can be alleviated somewhat by increasing the number of stimulus presentations. For instance, if each stimulus had been presented 20 times, the scale associated with each stimulus would have 21 possible values. There are some practical issues associated with this approach, however. First, the extra stimulus presentations would make the task rather long and tedious to complete. The attention of the participants might wane over time, and their performance by the end of the experiment might be unrepresentative of their actual behavior. Second, the repeated presentation of the same stimulus can change how one feels about the stimulus. The mere exposure effect, for instance, is a well-documented effect in which many presentations of a stimulus actually lead to an increase in the liking of said stimulus (e.g. Bornstein, 1989). Habituation effects, on the other hand, suggest repeated stimulus presentations might numb the automatic evaluations they evoke over time (e.g. Thompson & Spencer, 1966). Hence, simply presenting the stimuli more often and making the task longer might not yield the most reliable results.

3

Study 2: To IMPRES

The problem mentioned in the discussion above was anticipated when the original study was run. The aforementioned tasks were therefore presented alongside a newly-developed task termed the implicit preference scale (IMPRES). In this task, pairs of Chinese symbols are presented simultaneously on the screen while participants are asked to indicate which symbol of the pair they prefer. Each presented symbol in a pair is preceded by the short presentation of a prime (see Figure 13). Participants were hypothesized to be more likely to choose the symbol that is preceded by the prime stimulus (e.g. a picture of a white face) that is preferred over the other prime stimulus (e.g. a picture of a black face). The task was run with each possible pair of the 20 stimuli described above. This resulted in 190 trials in which each stimulus was presented 19 times. An increase in the number of trials of 58.3% thus resulted in a 216.7% increase in stimulus presentations. The scale associated with the stimuli contains 20 possible discrete values and is therefore more sensitive than the scale associated with the stimuli in the previous study. Moreover, the sensitivity of the measure might be increased by the fact that the task relies on a comparison of two stimuli instead of a judgment of a single stimulus on a rather arbitrary scale. It is often easier to judge whether one likes a given stimulus more than another stimulus than to quantify one’s liking of the stimulus in an arbitrary fashion. The drawback of this design, however, is that some dependency between the observations is created. If a certain stimulus is preferred all 19 times, for instance, no other stimulus can be preferred to that extent.

The IMPRES might thus render more accurate depictions of implicit preferences compared to the AMP data, at the cost of introducing some dependence in the data.

Figure 13. Illustration of a trial in the IMPRES. After the presentation of the stimuli, participants indicate which Chinese symbol they prefer by pressing the button at the location that corresponds to that symbol’s location.

3.1. Method

3.1.1. Participants and materials

Both tasks were performed during the previous experiment and therefore used the same sample and stimuli.

3.1.2. Procedure

During the IMPRES, a trial was presented for each possible pairing of the primes, leading to 190 trials in total. Two of the 260 available Chinese ideographs were drawn randomly for each trial and presented as targets. The time-course of this trial is outlined above in Figure 13. The duration of the prime presentation was increased compared to the AMP to compensate for the simultaneous prime presentation. Participants were asked to indicate which ideograph they preferred by pressing the ctrl key at the location that corresponded to that particular ideograph. Similar to the AMP, responses emitted after 1500 ms had elapsed triggered a feedback message prompting faster responding. The IMPRES was presented either before or after the AMP. The order of the tasks was counterbalanced across participants.

3.2. Data analysis

3.2.1. Conventional analyses

The IMPRES data can be analyzed in a conventional sense to obtain preference scores for white faces over black faces and young faces over old faces. Of the 190 pairs that were presented, 100 pairs directly pitted opposing categories of race or age against one another. Scores were derived from these pairs by calculating the proportion of times a face of a particular category was (implicitly) preferred over a face of the other category. A preference for one category over another was then assessed by how far the proportion deviates from 0.50. Reliability indices of these scores were calculated using the bootstrap-inspired method outlined in Study 1. Correlational analyses were run to assess the association of these measures with the measures from the previous study. Additional multivariate linear models were fitted to assess the association between charity donations and the implicit preference scores and rating scores separately. Model comparisons between a model that includes only the rating score and a model that includes both implicit and rating scores were performed to investigate whether the implicit measures predict behavior when rating scores are already taken into account.
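As a rough illustration of this scoring scheme, the sketch below computes preference proportions from simulated pair outcomes and tests them against the indifference point of 0.50. The data, the simple split-half stand-in for the bootstrap-inspired reliability method of Study 1 (which is not reproduced here), and all variable names are hypothetical.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical outcomes of the 100 race pairs per participant
# (1 = white face preferred, 0 = black face preferred).
n_participants, n_pairs = 53, 100
choices = rng.integers(0, 2, size=(n_participants, n_pairs))

# Preference score: proportion of pairs on which the white face was preferred.
race_pref = choices.mean(axis=1)

# One-sample t-test of the proportions against the indifference point 0.50.
t, p = stats.ttest_1samp(race_pref, 0.50)

# Split-half reliability as a simple stand-in for the bootstrap-inspired
# method of Study 1: correlate scores from two random halves of the pairs.
half = rng.permutation(n_pairs)
r, _ = stats.pearsonr(choices[:, half[:50]].mean(axis=1),
                      choices[:, half[50:]].mean(axis=1))
```

With real data, the simulated `choices` matrix would simply be replaced by the recorded pair outcomes; the age scores follow from the analogous 100 age pairs.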

3.2.2. Principal component analysis

The data supplied by the IMPRES were formatted into a matrix with rows for participants and columns for primes. The entries in the matrix denoted the proportion of times that a particular prime was preferred by a particular participant. The resulting data were subjected to the same methods as outlined in Study 1.
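A minimal sketch of this step, assuming a simulated participant-by-prime matrix in place of the actual data: PCA via the singular value decomposition of the column-centered proportions, as in Study 1.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical participant-by-prime matrix: entry (i, j) is the proportion
# of the 19 pairings in which participant i preferred prime j.
X = rng.integers(0, 20, size=(53, 20)) / 19.0

# PCA through the SVD of the column-centered data matrix.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

scores = U * s                      # participant scores per component
loadings = Vt.T                     # prime loadings per component
explained = s**2 / np.sum(s**2)     # variance proportions (scree plot input)
```

The `explained` vector supplies the scree plot, while `scores` and `loadings` feed the biplots and boxplot inspections described below.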

3.2.3. Multidimensional unfolding

The proportions in the data matrix were subtracted from 1 so that high preferences corresponded to small distances. The resulting matrix was subjected to similar models and analyses as in Study 1.

3.3. Results

3.3.1. Conventional analyses

The data supplied by the IMPRES showed that the proportion with which white faces were (implicitly) preferred over black faces did not differ significantly from 0.50, M = 0.49, s = 0.09, t(52) = −1.13, p = .267, whereas the proportion with which young faces were preferred over old faces did, M = 0.52, s = 0.08, t(52) = 2.18, p = .003. The reliability coefficients for both measures were reasonably high and similar for both race, r = .69, t(52) = 6.73, p < .0001, and age, r = .63, t(52) = 5.85, p < .0001.

Table 2 depicts the correlations of the race IMPRES scores and the age IMPRES scores with the other measures presented in the study. Whereas the race IMPRES scores correlate significantly with the race AMP scores, the same cannot be said for the age IMPRES scores and the age AMP scores. No other correlations attained significance. There was a marginal tendency for the race IMPRES scores to correlate with the age AMP scores and for the age IMPRES scores to correlate with the age rating scores, but these did not reach conventional levels of significance. Multivariate linear models suggested no associations between charity donations and the race IMPRES score, F(3,48) = 0.147, p = .93, η² = .01, or the age IMPRES score, F(3,48) = 0.291, p = .83, η² = .02. None of the univariate parameter estimates attained significance either, all ts < 0.556, all ps > .581. Both IMPRES scores also did not reveal any association with charity donations when explicit rating scores were already entered into the model, F(1,50) = 0.848, p = .47, and F(1,50) = 0.968, p = .42, for race and age, respectively. These results thus suggest insufficient evidence to reject the hypothesis that there is no association between the IMPRES scores and charity donating behavior.

Table 2. Correlations between the IMPRES scores and the other measures.

                    Race - IMPRES    Age - IMPRES
Race - AMP               .35*             .03
Race - Rating            .01             -.03
Modern Racism            .13              .07
Race - Charity           .02             -.03
Age - AMP                .26.             .15
Age - Rating             .19              .23.
Age - Charity            .05             -.08
Self - Charity           .05              .02
Race - IMPRES           1.00             -.03
Age - IMPRES            -.03             1.00

Note. . = p < .10, * = p < .05.

3.3.2. Principal component analysis

A PCA on the IMPRES data returned a solution with principal components of a similar magnitude to Study 1 (Figure 14). The first component and second component captured 16.0% and 13.2% of the variance respectively, after which the variance accounted for by the other components dropped gradually from 9.2% at component 3 to 0.0% at component 20. An attempt at cross-validation was met with the same result as in Study 1: no clear solution with stable components emerged from the data. The cross-validation plot does show that the R-values increase only moderately over the first 5 components, suggesting a solution that retains the first 5 components might render a good representation.

Figure 14. Screeplot (a) associated with the PCA on the IMPRES data and a cross-validation plot (b) of the R-values associated with solutions with increasing dimensionality.

Inspection of the loadings of the first three components revealed no intercept component in Study 2 (Figure 15). Since participants were forced to choose between two stimuli, a general tendency to emit “more pleasant” or “less pleasant” responses could not influence the data. The first component was characterized by positive loadings when white faces were presented and negative loadings when black faces were presented. The second component had positive loadings for old faces and negative loadings for young faces, whereas the third component did not reveal a clear loading pattern. Unlike Study 1, there were no clear components that were characterized by a single, large loading on one stimulus, suggesting the stimuli in this particular task might have been responded to more homogeneously.

After inspection of the loadings, the scores were further scrutinized with boxplots. Several participants were flagged as atypical (marked in red on the biplots in Figure 16), with some participants having extreme automatic preferences for white faces over black faces (40), extreme preferences for black faces over white faces (22, 41), extreme preferences for old faces over young faces (43), or extreme preferences for young faces over old faces (8, 18, 41).

Figure 15. Loadings observed on the first three principal components of the IMPRES data.

Figure 16. Biplots associated with principal component 1 and 2 (a), and principal component 3 and 4 (b).

Inspection of the biplots showed that participants tended to cluster together very tightly aside from a few outliers (Figure 16). It seems that participants were very homogeneous with regard to the principal components in the data. The loadings clearly differentiated between the four predefined categories in the data.

This clear separation was further attested by bias-reduced logistic regressions of race or age on the loadings of the first 5 principal components. Such models correctly classified race in 95% of all cases and age in 100% of all cases. The correct classifications were predominantly associated with principal component 1 in the case of race, z = 2.38, p = .017, and principal component 2 in the case of age, z = −2.36, p = .019. Stimulus WO2 was the stimulus whose race was misclassified.
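The classification step can be sketched as below, with two caveats: the loadings and labels are simulated stand-ins for the actual PCA output, and scikit-learn provides ordinary, not bias-reduced (Firth) logistic regression; the bias-reduced variant used in the text is available in R packages such as brglm2.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)

# Hypothetical loadings of 20 stimuli on the first 5 principal components;
# component 1 is made to separate the races, mimicking the reported pattern.
loadings = rng.normal(size=(20, 5))
race = np.repeat([0, 1], 10)                       # 0 = black, 1 = white
loadings[:, 0] += np.where(race == 1, 1.5, -1.5)

# Ordinary (not bias-reduced) logistic regression of race on the loadings.
model = LogisticRegression().fit(loadings, race)
accuracy = model.score(loadings, race)             # proportion correctly classified
```

Bias reduction matters here mainly because near-perfect separation of so few stimuli can make ordinary maximum-likelihood estimates diverge.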

3.3.3. Multidimensional unfolding

The data were further subjected to a multidimensional unfolding model. Figure 17 depicts the stress plot and the cross-validation plot associated with models of increasing dimensionalities. The overall stress level of the models is a lot lower than the stress observed in Study 1. The stress seems to decrease rather linearly until a fairly weak elbow is reached at 10 dimensions. The R-values increase with dimensionality until component 10, after which they stabilize. Similar to Study 1, a solution was opted for with a dimensionality that had a low R-value. A 3-dimensional solution was therefore preferred, which yielded a stress of .21 after 869 iterations and an R-value of 1.90.

Figure 17. A plot of stress against dimensionality of the MDS solution and the cross-validation plot of R against dimensionality.

The first two dimensions of the solution are plotted in Figure 18. Unlike the representations obtained in Study 1, the participants tended to cluster together tightly in the center with a few atypical observations dispersed in the surroundings. The stimuli surrounded the participant cluster in an ellipse and were clearly organized according to their categories. White faces were presented on the right and black faces on the left, whereas old faces were presented on the top and young faces on the bottom. Two stimuli, WY5 and BY2, are located closely to each other but on unexpected sides. These items might be atypical, although the small distance between them suggests this might have been a byproduct of the relative insensitivity of MDS to small distances as opposed to large distances.

Figure 18. First two dimensions of the 3-dimensional unfolding solution on the IMPRES data.

The clear separation of the categories was confirmed further by bias-reduced logistic regressions of race and age on the coordinates in the three dimensions. The regression of race on the coordinates suggested a large contribution of Dimension 1 to the separation of black faces from white faces, z = 2.50, p = .012, and a small contribution of Dimension 3, z = 1.30, p = .193. The separation of young faces from old faces was dominated by Dimension 2, z = −2.58, p = .009. Both models correctly classified the stimuli in 100% of all cases.

Inspection of the plots associated with Dimension 3 clearly showed that this dimension is sensitive to the race of the stimuli as well. A combination of Dimension 1 and Dimension 3 seemed to fully separate the black faces from the white faces. The aforementioned stimuli WY5 and BY2 that appeared close together in the first two dimensions are clearly far apart when the third dimension is taken into account. The plot of Dimension 2 against Dimension 3 suggests Dimension 3 aids in separating mainly the young faces by race compared to the old faces.

Figure 21. MDS plots of Dimension 1 against Dimension 3 (a) and Dimension 2 against Dimension 3 (b).

One might suspect that the tight clustering of the participants is a sign of a degenerate solution. Nevertheless, the differences between the participants in the representation, however small, might still reflect meaningful variation. Correlation analyses were performed to investigate whether the differences between the tightly clustered participants were associated with other measures used in the study. Scores for preferences for white faces over black faces and young faces over old faces were extracted from the solution by plugging the participant coordinates into the logistic regression equations that were used to draw the axes on the plots. The predicted log-odds from these regressions were used as preference scores and were correlated with the scores derived from the other measures in the experiment. These analyses suggested a weak tendency for the derived race preference scores to be associated with the race AMP scores, r = .24, p = .08. Similar associations with the derived age preference scores failed to attain significance or to reveal any tendency towards association with a score from another measure, max(r) = .22, min(p) = .11. The scores obtained from the conventional analyses of the IMPRES data, however, did show a strong association, suggesting that the differences captured by analyzing the raw data were reflected to some extent in the representation, r = .87, p < .0001, and r = .96, p < .0001, respectively for race and age.
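The extraction of these derived scores can be sketched as follows; the coordinates, category labels, and comparison scores are all simulated stand-ins for the actual 3-dimensional solution.

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)

# Hypothetical 3-dimensional unfolding coordinates for stimuli and participants.
stim_coords = rng.normal(size=(20, 3))
race = np.repeat([0, 1], 10)
stim_coords[:, 0] += np.where(race == 1, 1.5, -1.5)
part_coords = rng.normal(size=(53, 3))

# Fit the separating logistic regression on the stimuli, then plug the
# participant coordinates into the same equation; the predicted log-odds
# act as a derived race preference score per participant.
model = LogisticRegression().fit(stim_coords, race)
pref_scores = model.decision_function(part_coords)   # log-odds scale

# Correlate the derived scores with another (here simulated) measure.
amp_scores = rng.normal(size=53)
r, p = stats.pearsonr(pref_scores, amp_scores)
```

`decision_function` returns exactly the linear predictor of the fitted model, so no manual evaluation of the regression equation is needed.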

3.4. Discussion

Study 2 was run to address a concern raised in Study 1. The few presentations per stimulus made for a rather discrete proportion scale, and simply increasing the number of stimulus presentations would have made the procedure long-winded and demanding. An adaptation of the paradigm was therefore envisioned in which two stimuli were presented at the same time while the participant simply had to indicate which of the two targets was preferred. Although this paradigm introduced some dependence in the data, it greatly increased the number of stimulus presentations within a similar time frame and yielded a proportion scale that was less discrete.

Preference scores derived from this new measure were subjected to similar analyses as in Study 1. Correlation analyses did not show any strong associations with other measures or behavioral outcomes, nor did multivariate linear regression models.

Principal component analyses on the data revealed similar patterns, aside from the lack of an intercept component because the task precludes general response tendencies. A component was found that was sensitive to race, whereas another component was sensitive to age. The loading patterns of these components were unequivocal, especially with regard to age. In Study 1, the component related to age had a very crude loading pattern that did not clearly differentiate all faces with regard to age. The age component in Study 2, however, had clearly different loadings for young faces and old faces. Although this more pronounced pattern in Study 2 could have been caused by the more sensitive proportion scale, one should keep in mind that the stimuli were presented twice as long in Study 2 compared to Study 1 to compensate for the presentation of two stimuli.

Multidimensional unfolding models were run and a 3-dimensional solution was selected to investigate the data. This solution had a stress level that was much lower than the solution obtained in Study 1 and also suggested participants treated stimuli differently according to their race and age. The substantially lower stress might have been due to the design’s preclusion of the general tendency to emit a certain response. In Study 1, the MDS algorithm had some difficulties fitting such response tendencies to the data. This was not necessary in Study 2 and could have caused the significant drop in stress. Although the biplot in Figure 16 and the MDS plot in Figure 18 were very similar, it seemed that the MDS model had to call on a third dimension to adequately capture different responses to black faces and white faces. In general, the biplots and MDS plots seemed to show the participants as a tighter, more homogeneous cluster compared to the representations in Study 1. This could be due to a reduction of noise, but it might as well be the case that the IMPRES is less sensitive to inter-individual differences than the AMP. Correlation analyses revealed some, albeit weak, evidence that differences between the participants in the representation might be meaningful. A replication with a more powerful design and a more heterogeneous sample would be necessary to further elucidate this issue.

4

General Discussion

4.1. Findings

In the current report, the utility of multivariate reduction techniques was demonstrated for psychological research on implicit measures. The data obtained with these measures are often very complex and researchers frequently proceed with calculating simple preference scores rather than scrutinizing the data for anomalies beforehand. Since inspecting data can be a long and tedious process, techniques that provide for quick but extensive summaries of the data could aid such an endeavor substantially. Two studies were therefore reported in which principal component analysis (PCA) and multidimensional scaling (MDS) were applied to implicit measures data.

Both techniques returned interesting representations of the data and often led to similar conclusions. Components and dimensions were found that suggested participants acted differently regarding the manipulated stimulus dimensions in the experiment (i.e. race and age). The fact that these components or dimensions emerged spontaneously from the data attests further to the validity of the measure, as participants seem to truly respond differently to different races and ages. Furthermore, stimulus loadings (PCA) or locations (MDS) were used to assess which stimuli contribute most to such differential responding tendencies. Plots allowed for the quick detection of atypical participants as well as stimuli. Interestingly, the solutions can be scanned for patterns that might open up new avenues for research. The MDS representation in Study 1, for instance, suggested the differential responding towards different ages is stronger for white stimuli compared to black stimuli, although follow-up analyses did not further confirm this hypothesis.

A different paradigm was applied in Study 2 to make the scale of the proportions less discrete compared to Study 1. The investigation of the data obtained with this paradigm showed that differential responding tendencies towards race and age emerged more strongly in the principal components and dimensions. Moreover, the MDS representation in Study 2 had far lower stress in a lower dimensionality than the MDS representation in Study 1.

4.2. PCA or MDS

Both PCA and MDS returned quite similar results and are therefore similar in their utility. The two techniques have very different advantages and disadvantages, however.

PCA is a very straightforward technique that is easily applied and does not require one to take many considerations or much preprocessing into account when using it for explorative purposes. Component selection is relatively straightforward and the evaluation of solutions with different numbers of components does not require one to fit different models. The solutions are nested in the sense that the components remain the same no matter which dimensionality is eventually selected. The decomposition returns participant scores and stimulus loadings that are easily inspected numerically and visually. The biplots obtained from a PCA are somewhat harder to interpret, however. A biplot is essentially a representation of the different spaces associated with a matrix (i.e. the column space of the stimuli and the row space of the participants). Within the space defined by two components, the response of a participant to a stimulus cannot be derived from the simple distance between participant scores and stimulus loadings, but from the orthogonal projection of the participant score on the vector defined by the stimulus loadings.
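This projection interpretation can be checked numerically: in a rank-k PCA approximation, a participant's reconstructed response to a stimulus is the inner product of the score vector with the loading vector, not their distance. The data matrix below is simulated purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(53, 20))
Xc = X - X.mean(axis=0)

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 2
scores = (U * s)[:, :k]      # participant scores shown in the biplot
loadings = Vt[:k].T          # stimulus loadings shown in the biplot

# The rank-k reconstruction of an entry is the inner product of the
# participant's score vector and the stimulus' loading vector, i.e. the
# orthogonal projection of the score onto the loading direction, scaled
# by the loading length.
approx = scores @ loadings.T            # full rank-k reconstruction of Xc
entry = float(scores[0] @ loadings[3])  # participant 0's value on stimulus 3
```

Two points close together in the biplot can thus still correspond to a low reconstructed response if the score projects weakly onto the loading direction.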

MDS is a more intensive technique that requires some consideration before applying it to data. Considerations at the level of the data include the measurement level and whether the matrix should be considered row-conditional or not. MDS does not supply nested solutions, meaning solutions of different dimensionalities can return vastly different dimensions. Finding the correct dimensionality entails fitting many models of increasing dimensionality. Visual inspection of the plot should always be performed, as MDS can return degenerate solutions when care is not taken. Such solutions arise more easily when fewer constraints are imposed on the data, such as defining the data as ordinal, or when the data are sparse, as in the multidimensional unfolding model. When the data are handled carefully, MDS does allow for a good representation of the data, often in a lower dimensionality than PCA. Moreover, the solutions are more interpretable, as the preference of a subject for a stimulus is simply the distance between both points in the representation, not an orthogonal projection. MDS returns a visual representation only, however, making it difficult to perform more intensive checks for anomalies aside from a thorough visual scan.

Both techniques are rather complementary and the preference for one of the two can be determined by the context of the study or the considerations of the researcher. Conclusions based on these techniques should be handled with care, as both are mainly exploratory, descriptive tools that might guide future research but warrant caution when statistical inference is involved. Statistical inference requires assumptions about the data that are not all that evident for PCA and MDS. The main assumption of independent observations, for instance, does not hold when analyzing component scores or coordinates, as all of the data are used to fit the model. A person’s component scores or coordinates can be influenced by the inclusion of other persons or stimuli in the model, especially in smaller samples. It therefore makes little sense to run statistical tests on scores or coordinates for the aim of inference rather than description.

4.3. Other techniques

4.3.1. Non-linear decompositions of proportions

The main implicit measure investigated above was the AMP. This is a rather simple measure that produces data that can be represented easily in a matrix with participants as rows and stimuli as columns. The entries are simply the proportions of the “more pleasant” responses given to a stimulus by a participant. A PCA applied to these data represents the principal components as linear combinations of the original variables. Since the data in question are proportions retrieved from a repeated Bernoulli process (i.e. emitting one of two responses), the current application of PCA might have some issues. While proportions are bounded between 0 and 1, their reconstruction on the basis of a few principal components might yield estimates that do not respect these boundaries. Hence, it might be advisable to run the analyses on transformed data as used in more traditional analyses of proportions. The log of the odds (the logit) is perhaps the most popular such transformation and can be applied easily in the current context. The resulting model is very similar to the logistic regression model. The rank-k approximation of the matrix of log odds is given by the model,

\[
\log\left(\frac{\hat{x}_{ij}}{1-\hat{x}_{ij}}\right) = \sum_{l=1}^{k} a_{il}\, g_l\, b_{jl}.
\]

While the log odds are still modeled in terms of linear combinations, the odds are modeled in terms of a product of exponentials,

\[
\frac{\hat{x}_{ij}}{1-\hat{x}_{ij}} = \exp\left(\sum_{l=1}^{k} a_{il}\, g_l\, b_{jl}\right) = \prod_{l=1}^{k} \exp\left(a_{il}\, g_l\, b_{jl}\right).
\]

Such a model has desirable properties, as the modeled proportions are guaranteed not to exceed the boundaries of 0 and 1. Difficulties arise, however, when the proportions in the data are 0 or 1, as the log odds associated with these proportions are infinite.
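A sketch of such a logit decomposition, with an ad-hoc smoothing step to keep proportions of 0 and 1 away from the boundaries; the counts are simulated and the smoothing constant of 0.5 is an assumption, not part of the model above.

```python
import numpy as np

rng = np.random.default_rng(5)
n_trials = 6
counts = rng.integers(0, n_trials + 1, size=(53, 20))

# Smooth the observed counts so the logit stays finite at 0 and 1
# (an empirical-logit style correction; the 0.5 is an ad-hoc choice).
p = (counts + 0.5) / (n_trials + 1.0)
logits = np.log(p / (1 - p))

# Rank-k approximation of the logit matrix through an SVD; mapping the
# reconstruction back through the inverse logit guarantees fitted
# proportions strictly inside (0, 1).
U, s, Vt = np.linalg.svd(logits, full_matrices=False)
k = 3
logit_hat = (U[:, :k] * s[:k]) @ Vt[:k]
p_hat = 1.0 / (1.0 + np.exp(-logit_hat))
```

Whatever the rank of the approximation, the inverse-logit step keeps every fitted proportion inside the unit interval, unlike a PCA run directly on the raw proportions.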

Such analyses were not reported in the current report because a score on the AMP is calculated as the difference between two proportions and is therefore a linear combination (e.g. 1 · p_white − 1 · p_black, with p_white the proportion of “more pleasant” responses after a white face was presented and p_black the proportion of “more pleasant” responses after a black face was presented). A PCA on the raw proportions was therefore performed to keep some level of consistency with how the data produced by the measure are traditionally analyzed. Further research efforts could be directed at the specific scoring algorithm for the AMP and its shortcomings.

4.3.2. Correspondence analysis

Another method that might be more appropriate given the distribution of the data is correspondence analysis (CA; Greenacre, 2007). This method can be applied to counts or proportions and decomposes the variance of the deviations of the data from what would be expected if the general response tendencies of the participants were independent of the effect the stimuli have on these response tendencies. Suppose, for instance, that white stimuli generally evoke more “more pleasant” responses than black stimuli. Independence holds if this pattern is similar across participants after weighting for the general response tendencies of the participants. A CA is performed by subjecting the standardized deviations of the data from independence to a singular value decomposition. The corresponding solution can be inspected with biplots in a similar fashion (see Appendix). Because the data are converted to standardized deviations from independence, there is no intercept component anymore. Although this method might be more appropriate for proportion data, the standardization across participants and stimuli might cause one to overlook certain idiosyncrasies.
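The core computation of a CA can be sketched in a few lines: standardize the deviations from independence and take an SVD. The count matrix is simulated; see Greenacre (2007) for the full method, including the inertia decomposition used to choose the number of dimensions.

```python
import numpy as np

rng = np.random.default_rng(6)
counts = rng.integers(1, 7, size=(53, 20)).astype(float)  # "more pleasant" counts

P = counts / counts.sum()            # correspondence matrix
r = P.sum(axis=1, keepdims=True)     # row masses (participants)
c = P.sum(axis=0, keepdims=True)     # column masses (stimuli)

# Standardized deviations from independence: (observed - expected),
# scaled by the square root of the expected proportions, then an SVD.
S = (P - r @ c) / np.sqrt(r @ c)
U, s, Vt = np.linalg.svd(S, full_matrices=False)

inertia = s**2                       # variance captured per CA dimension
row_coords = (U * s) / np.sqrt(r)    # principal coordinates of participants
```

Because the expected matrix is subtracted before the decomposition, there is no intercept component: the first CA dimension already reflects deviations from the participants' general response tendencies.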

4.3.3. Factor analysis

A multivariate reduction technique that has not been implemented in the current research, though widely popular in psychological research, is factor analysis (FA; Kim & Mueller, 1978). Like PCA, this technique can capture many variables in a small set of other variables, although there are some notable differences. While PCA is a rather descriptive technique that summarizes the data into a few principal components that are linear combinations of the variables, FA is a theoretical model in which the variables of the data are linear combinations of underlying, latent factors and additional unique factors. FA also models only the covariance between the observed variables, whereas PCA tries to capture the total variance of the variables in the dataset. Since the purpose of this research was the description of data rather than the validation of a model of theoretical, unobservable factors, FA was dropped in favor of PCA. Moreover, FA has traditionally been known to have some issues that make its implementation a little cumbersome (e.g. Chatfield & Collins, 1980). First, FA is a theoretical model that requires some assumptions about the distribution of the data as well as the relationships between the latent factors and other parts of the model. Second, FA can be run with a multitude of different techniques and parameter settings, making it hard to easily find an appropriate solution. Third, FA is initially run on the correlation matrix of the variables, while the scores of the participants on the factors need to be calculated afterwards. PCA, in contrast, quickly returns component scores through its singular value decomposition. Hence, PCA was preferred for its descriptive approach and its simplicity compared to FA.

4.4. Other measures: multi-way data

The current demonstration was limited to the AMP and a simple variant thereof. This measure was selected for its popularity as well as its simplicity, but it is by no means the only implicit measure available. The most popular implicit measure, for instance, was developed in the 90s and is called the implicit association test (IAT; Greenwald, McGhee, & Schwartz, 1998). The procedure of the IAT involves a mixture of two categorization tasks presented to the participant. One task is a fairly simple evaluative categorization task wherein evaluative stimuli (e.g. the words “murder” or “party”) are to be categorized as “good” or “bad”. The second task is another categorization task wherein the stimuli of interest, such as pictures of black or white faces, are to be categorized according to their relevant category (e.g. “black” or “white”). Crucially, the same two buttons are used for both tasks, meaning the tasks can overlap in ways that influence performance. For instance, one possible overlap results in one button for “good” and “white” stimuli and another button for “bad” and “black” stimuli, whereas another possible overlap results in one button for “good” and “black” stimuli and the other button for “bad” and “white” stimuli.

Figure 22. Four possible trials of the implicit association test. The response labels of both tasks are presented in the top corners, their position corresponding to the position of the button for that category. The trials on the left have a task overlap in line with automatic preferences for whites over blacks, whereas the trials on the right do not.

Performance is generally better when the task overlap is in line (i.e. compatible) with somebody’s automatic preference, e.g. for white faces over black faces, compared to when the overlap is not in line with it (i.e. incompatible). Someone with a strong automatic preference for white faces over black faces will therefore perform better with an overlap of “white” with “good” and “black” with “bad” than with an overlap of “white” with “bad” and “black” with “good”. Participants perform the two tasks with both possible overlaps and the (standardized) difference in performance between the overlaps is used as a measure of automatic preference. Since a difference in performance can be measured through both reaction times and errors, several algorithms have been proposed that combine these two modalities into a single preference score (Greenwald, Nosek, & Banaji, 2003).
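The core of such a scoring algorithm can be illustrated with a minimal sketch. The snippet below computes a simplified D-score-style statistic (the mean latency difference between the incompatible and compatible overlaps, divided by the standard deviation over all trials); the simulated reaction times are hypothetical, and refinements from the published algorithms, such as error penalties and trial trimming, are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical reaction times (ms) for one participant: responses tend to
# be faster under the compatible task overlap.
rt_compatible = rng.normal(650, 100, size=40)
rt_incompatible = rng.normal(750, 110, size=40)

def d_score(rt_comp, rt_incomp):
    """Simplified D-score-style statistic: mean latency difference between
    the incompatible and compatible overlaps, divided by the standard
    deviation over all trials. Error penalties and trial trimming from
    the published algorithms are omitted for brevity."""
    sd_all = np.concatenate([rt_comp, rt_incomp]).std(ddof=1)
    return (rt_incomp.mean() - rt_comp.mean()) / sd_all

# A positive value indicates better performance under the compatible overlap.
print(f"D = {d_score(rt_compatible, rt_incompatible):.2f}")
```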

The data produced by such tasks cannot be represented in a two-way matrix without some form of aggregation, as they actually consist of four modes. Whereas the first and second mode correspond to the participants and the stimuli, respectively, the third mode corresponds to the task overlap (i.e. compatible vs. incompatible) and the fourth mode to the response modality (i.e. mean reaction time vs. proportion of errors). The simplest way to subject these data to a multivariate reduction technique would be to apply a scoring algorithm to each stimulus presented to a participant. The resulting scores can be represented in a two-way matrix, much like the AMP data presented in Study 1. Such an approach inevitably leads to a loss of possibly vital information, but can still provide some indications regarding the patterns in the data.
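As a minimal sketch of this collapsing step, assume hypothetical per-stimulus reaction times under both overlaps; taking the incompatible-minus-compatible difference per participant-stimulus cell yields a two-way matrix (the response-modality mode is ignored here for brevity).

```python
import numpy as np

rng = np.random.default_rng(4)
n_participants, n_stimuli = 30, 8

# Hypothetical per-stimulus mean reaction times (ms): a three-way array of
# participants x stimuli x task overlap (compatible, incompatible).
rt = rng.normal(700, 80, size=(n_participants, n_stimuli, 2))
rt[:, :, 1] += 60  # incompatible responses tend to be slower

# Collapse the overlap mode into one score per participant-stimulus cell:
# here simply the incompatible minus compatible reaction time.
scores = rt[:, :, 1] - rt[:, :, 0]

# The result is a two-way matrix that could be fed to PCA or MDS.
print(scores.shape)
```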

Methods have been developed, however, that allow for the decomposition of such data without having to resort to collapsing across modes (e.g. Kroonenberg, 2008). The four modes of the data can be represented in a four-way array, with each mode corresponding to a way of the array. The analysis of such multi-way data is somewhat elusive, as some concepts that have rather straightforward underpinnings in the two-way case (e.g. matrix rank) are poorly understood and under continuous discussion in the multi-way case. Multi-way decomposition methods are available that can be considered extensions of ordinary PCA, although some of its desirable properties break down when there are more than two ways to the data. Suppose, for instance, that a rank-S approximation of a two-way matrix with PCA, based on a singular value decomposition, yields the familiar aforementioned model,

x_{ij} = \sum_{s=1}^{S} a_{is} g_{ss} b_{js} + e_{ij}

with the values in A representing the left singular vectors, or scores, and the values in B the right singular vectors, or loadings. The diagonal matrix G with the singular values can be conceptualized as a core matrix that links the scores of the participant mode (i.e. matrix A) to the scores of the stimulus mode (i.e. matrix B). In the multi-way case, this model can be extended with an extra score matrix for each additional mode in the array, and the core matrix G is expanded into a core array with extra ways to link the scores of all modes. In the three-way case, the simplest model adhering to this structure is the parallel factor analysis (PARAFAC) model,

x_{ijk} = \sum_{s=1}^{S} a_{is} b_{js} c_{ks} g_{sss} + e_{ijk}

In this model, the core array G is a super-diagonal array, with all nonzero values on the main diagonal (i.e. the g_{sss} for s in 1, ..., S), that connects the vectors corresponding to the scores of a given component in each mode. This particular model presupposes parallel proportional profiles, which means that the data pattern associated with any two modes is proportional over the levels of the other mode. For implicit measures data, this implies that the participants’ reactions to the stimuli follow similar patterns in both conditions but differ in how strongly the pattern is pronounced. The PARAFAC model is perhaps the most restrictive multi-way decomposition model and might not yield solutions that adequately fit the data. Less restrictive approaches are available, with the TUCKER3 model being the most permissive,

x_{ijk} = \sum_{p=1}^{P} \sum_{q=1}^{Q} \sum_{r=1}^{R} a_{ip} b_{jq} c_{kr} g_{pqr} + e_{ijk}

Compared to the PARAFAC model, this model has many more parameters and therefore allows for better fits to the data. The core array G is no longer super-diagonal or cubic, which allows each mode to be summarized by a different set of components and allows different components to be combined to reconstruct the data.
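To make these structural models concrete, the following sketch reconstructs data arrays from given score matrices and core values with NumPy, for the two-way SVD model, the PARAFAC model, and the TUCKER3 model. All sizes and values are hypothetical and randomly generated; actually fitting such models to data requires dedicated (alternating least squares) algorithms not shown here.

```python
import numpy as np

rng = np.random.default_rng(1)

# --- Two-way case: rank-S approximation through the SVD ---
X2 = rng.normal(size=(20, 10))  # hypothetical participants x stimuli matrix
U, g, Vt = np.linalg.svd(X2, full_matrices=False)
S = 2
# x_ij = sum_s a_is g_ss b_js: scores, singular values, loadings
X2_hat = U[:, :S] @ np.diag(g[:S]) @ Vt[:S, :]

# --- Three-way PARAFAC: super-diagonal core ---
I, J, K = 6, 5, 4                 # hypothetical array sizes
A = rng.normal(size=(I, S))       # participant scores
B = rng.normal(size=(J, S))       # stimulus scores
C = rng.normal(size=(K, S))       # condition scores
gs = rng.normal(size=S)           # super-diagonal core values g_sss
# x_ijk = sum_s a_is b_js c_ks g_sss
X_parafac = np.einsum('is,js,ks,s->ijk', A, B, C, gs)

# --- Three-way TUCKER3: full core, per-mode numbers of components ---
P, Q, R = 3, 2, 2
Ap = rng.normal(size=(I, P))
Bq = rng.normal(size=(J, Q))
Cr = rng.normal(size=(K, R))
G = rng.normal(size=(P, Q, R))    # any combination of components may interact
# x_ijk = sum_pqr a_ip b_jq c_kr g_pqr
X_tucker = np.einsum('ip,jq,kr,pqr->ijk', Ap, Bq, Cr, G)
```

Note how the PARAFAC reconstruction is a sum of S rank-one arrays, whereas the TUCKER3 core lets every combination of components across modes contribute.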

Multi-way decomposition techniques could therefore provide an interesting means to investigate the data structure associated with implicit measures data of multiple modes. The extra complexity associated with multi-way arrays unfortunately translates into extra complexity associated with multi-way decomposition methods. Fitting such models often requires careful consideration with regard to preprocessing, model selection, and component selection. For instance, the models are very sensitive to the way the data are centered and normalized. Although some schemes are recommended, failing to properly preprocess the data can lead to degenerate solutions. Furthermore, the many parameters involved in such models, especially the TUCKER3 model, make model selection and component selection rather tedious. In the TUCKER3 model, every mode can be represented by a different set of components that can be linked in various manners through the core array. Finding an optimal set of components for each mode can therefore turn into a rather long-winded endeavor, especially because the models do not return nested solutions: a new model has to be fitted for each possible configuration of components, and there is no guarantee that the first components of two different solutions reflect similar data patterns. The application of such techniques is therefore considerably less straightforward than the PCA that was applied to the AMP data in the current report.

4.6. Conclusions

Because researchers in social psychology often overlook careful scrutiny of the data before proceeding with statistical analyses, two techniques were demonstrated that allow for quick yet insightful summaries of data obtained from implicit measures. Both PCA and MDS can be used to capture differences in the response tendencies emitted by participants and the response tendencies evoked by stimuli. The demonstration showed that the application of these techniques can provide valuable insights into the data. The current demonstration is limited to implicit measures data that are easily represented in a two-way matrix, however. Future efforts should be directed at applying multi-way decomposition methods to data obtained from more complex implicit measures.


References

Allport, G.W. (1935). Attitudes. In C. Murchison (Ed.), A handbook of social psychology (pp. 798-844). Worcester, MA: Clark University Press.

Borg, I., & Groenen, P. J. (2005). Modern multidimensional scaling: Theory and applications. Springer Science & Business Media.

Bornstein, R. F. (1989). Exposure and affect: Overview and meta-analysis of research, 1968–1987. Psychological Bulletin, 106, 265-289.

Chatfield, C., & Collins, A.J. (1980). Introduction to multivariate analysis. Springer Science+Business Media, B.V.

De Leeuw, J. (2005). Multidimensional Unfolding. In B.S. Everitt and D.C. Howell (Eds.). Encyclopedia of Statistics in Behavioral Science, 3, 1289-1294. New York, N.Y.: Wiley.

De Leeuw, J., & Mair, P. (2009). Multidimensional scaling using majorization: SMACOF in R. Journal of Statistical Software, 31, 1-30.

De Leeuw, J., & Meulman, J. (1986). A special jackknife for multidimensional scaling. Journal of Classification, 3, 97-112.

Eastment, H.T., & Krzanowski, W.J. (1982). Cross-validatory choice of the number of components from a principal component analysis. Technometrics, 24, 73-77.

Firth, D. (1993). Bias reduction of maximum likelihood estimates. Biometrika, 80, 27-38.

Gawronski, B., Cunningham, W.A., Lebel, E.P., & Deutsch, R. (2010). Attentional influences on affective priming: Does categorization influence spontaneous evaluations of multiply categorisable objects? Cognition and Emotion, 24, 1008-1025.

Gawronski, B., & Ye, Y. (2015). Prevention of intention invention in the affect misattribution procedure. Social Psychology and Personality Science, 6, 101-108.

Gower, J.C., & Dijksterhuis, G.B. (2004). Procrustes problems. Oxford University Press Inc., New York.

Greenacre, M.J. (2007). Correspondence analysis in practice. Taylor & Francis Group, LLC.

Greenacre, M. J. (2010). Biplots in practice. Fundacion BBVA.

Greenwald, A. G., McGhee, D. E., & Schwartz, J. L. (1998). Measuring individual differences in implicit cognition: The implicit association test. Journal of Personality and Social Psychology, 74, 1464-1480.

Greenwald, A.G., Nosek, B.A., & Banaji, M.R. (2003). Understanding and using the implicit association test: I. An improved scoring algorithm. Journal of Personality and Social Psychology, 85, 197-216.

Greenwald, A. G., Poehlman, T. A., Uhlmann, E. L., & Banaji, M. R. (2009). Understanding and using the Implicit Association Test: III. Meta-analysis of predictive validity. Journal of Personality and Social Psychology, 97, 17-41.

Jolliffe, I. (2002). Principal component analysis. John Wiley & Sons, Ltd.

Kim, J., & Mueller, C.W. (1978). Introduction to factor analysis. Sage publications, Inc.

Kroonenberg, P.M. (2008). Applied multiway analysis. John Wiley & Sons, Inc.

Kruskal, J.B. (1964). Nonmetric multidimensional scaling: A numerical method. Psychometrika, 29, 115-129.

Martin, I. & Levy, A. B. (1978). Evaluative conditioning. Advances in Behaviour Research and Therapy, 1, 57-101.

McConahay, J. B. (1986). Modern racism, ambivalence, and the modern racism scale. In J.F. Dovidio and S.L. Gaertner (Eds.), Prejudice, discrimination, and racism (pp. 91-125). San Diego, CA: Academic Press.

Payne, B.K., Cheng, C.M., Govorun, O., & Stewart, B.D. (2005). An inkblot for attitudes: Affect misattribution as implicit measurement. Journal of Personality and Social Psychology, 89, 277-293.

R Core Team (2015). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.

Spence, I., & Domoney, D. W. (1974). Single subject incomplete designs for nonmetric multidimensional scaling. Psychometrika, 39, 469–490.

Thompson, R.F., & Spencer, W.A. (1966). Habituation: A model phenomenon for the study of neuronal substrates of behavior. Psychological Review, 73, 16-43.

Torgerson, W.S. (1952). Multidimensional scaling: I. Theory and method. Psychometrika, 17, 401-419.

Wold, S. (1978). Cross-validatory estimation of the number of components in factor and principal component models. Technometrics, 20, 397-405.


Appendix

6.1. Checking assumptions for the difference in age AMP scores across race

A QQ-plot was used to assess deviations from normality of the difference between the age AMP scores for white faces and the age AMP scores for black faces (Figure 23). Barring the slightly heavy tails, the data look approximately normal. A paired Grambsch test for equality of variances in paired samples suggested no difference between the variance of the age AMP scores for white faces and that for black faces, Z = 0.73, p = 0.47.
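The logic of such a paired-variances test can be sketched as follows. The Grambsch test is a robust variant of the classical Pitman-Morgan approach, which exploits the fact that two paired variables have equal variances exactly when their sum and difference are uncorrelated. The snippet below illustrates that idea on simulated, purely hypothetical AMP scores, not the actual study data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n = 65  # hypothetical sample size

# Simulated paired AMP scores (proportions of pleasant responses),
# correlated within participants as paired measures typically are.
white = rng.normal(0.6, 0.15, size=n)
black = 0.9 * white + rng.normal(0.0, 0.10, size=n)

# Pitman-Morgan logic: in paired samples, Cov(X + Y, X - Y) equals
# Var(X) - Var(Y), so equality of variances can be tested through the
# correlation between the sums and the differences.
r, p = stats.pearsonr(white + black, white - black)
print(f"r = {r:.2f}, p = {p:.3f}")
```

The Grambsch version replaces the Pearson correlation with a more outlier-resistant statistic, but the sum-difference construction is the same.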

Figure 23. QQ-plot associated with the difference of the age AMP scores for white faces and black faces in Study 1.


6.2. Correspondence analysis biplot

Figure 24. Biplot associated with the first two dimensions of a correspondence analysis on the AMP data.