
Stereotype Threat and Differential Item Functioning: A Critical Assessment

Paulette Carien Flore

Colophon

Paulette Carien Flore, Stereotype Threat and Differential Item Functioning: A Critical Assessment, 246 pages.

PhD thesis, Tilburg University, Tilburg, the Netherlands (2018)

Graphic design cover and inside: Rachel van Esschoten, DivingDuck Design (www.divingduckdesign.nl) Printed by: Gildeprint Drukkerijen, Enschede (www.gildeprint.nl)

ISBN: 978-94-6233-880-7

Stereotype Threat and Differential Item Functioning: A Critical Assessment

Paulette Carien Flore

Dissertation for obtaining the degree of doctor at Tilburg University, under the authority of the Rector Magnificus, Prof. dr. E. H. L. Aarts, to be defended in public before a committee appointed by the Doctorate Board, in the aula of the University on Wednesday 7 March 2018 at 14:00, by Paulette Carien Flore, born in Heerjansdam.

Doctoral committee

Supervisors: Prof. dr. Jelte Wicherts, Prof. dr. Jeroen Vermunt

Other members: Prof. dr. Franca Agnoli, Prof. dr. Denny Borsboom, Prof. dr. Frans Oort, Dr. Kim De Roover

For Niek and Marjolein

Contents

Chapter 1  Introduction  9
Chapter 2  Does Stereotype Threat Influence Performance of Girls in Stereotyped Domains? A Meta-Analysis  21
Chapter 3  The influence of gender stereotype threat on mathematics test scores of Dutch high school students: A registered report  57
Chapter 4  Current and best practices in conducting and reporting DIF analyses  91
Chapter 5  The psychometrics of stereotype threat  131
Chapter 6  Discussion  175

Addendum
Appendix A: Extra tables - chapter three  188
Appendix B: Final model, psychometric and exploratory analyses - chapter three  192
Appendix C: Statistical DIF methods and purification techniques - chapter four  196
Appendix D: Detailed description of scales and groups - chapter four  204
Appendix E: Literature review - chapter five  205
Appendix F: Extra tables - chapter five  207
References  213
Dankwoord  244

Chapter 1

Introduction

1.1 Stereotype threat

Gender gaps favoring males in careers and in cognitive test performance in the fields of Science, Technology, Engineering and Mathematics (STEM fields) are controversial and widely debated (Halpern et al., 2007). Especially the gender gap in math testing has been extensively studied and discussed (e.g., Hyde, Fennema, Ryan, Frost, & Hopp, 1990; Lindberg, Hyde, Petersen, & Linn, 2010; Stoet & Geary, 2012). The gender gap in mathematics test performance remains controversial (Stoet & Geary, 2012), with some researchers claiming there are no differences between male and female performance on math testing (Hyde, Fennema, Ryan, Frost, & Hopp, 1990; Lindberg, Hyde, Petersen, & Linn, 2010), and other researchers claiming that the gender gap only exists at the high ends of the distribution (Ganley et al., 2013; Robinson & Lubienski, 2011). On the SAT-M, a widely used American math test that involves high stakes for the test takers, men have consistently been outperforming women, even though the effect sizes are small (Ball, Cribbie, & Steele, 2013). Moreover, women are underrepresented in STEM professions (Cheryan & Plaut, 2010; Schuster & Martiny, 2017; Shapiro & Williams, 2012). One of the explanations for these gender differences is the negative effect that gender stereotypes can have on female students' test performance. Social psychologists proposed that gender stereotypes can negatively affect female students' test scores, by a phenomenon denoted stereotype threat. Stereotype threat occurs when members of a negatively stereotyped group feel pressure to perform well on a test in the stereotyped domain (Steele, 1997). Such a situational threat increases worries about being judged on the basis of a negative stereotype, which in turn can lower the cognitive resources needed to perform well among female test-takers. Stereotype threat might explain (parts of) the gender gap in math tests and in related tests measuring traits such as spatial ability, about which negative stereotypes about female ability exist. Numerous experiments have indeed shown that gender stereotypes can lead to performance decrements on quantitative tests (Doyle & Voyer, 2016; Flore & Wicherts, 2015; Nguyen & Ryan, 2008; Picho, Rodriguez, & Finnie, 2013; Stoet & Geary, 2012; Walton & Cohen, 2003). In those experiments, female (and often also male) students are typically assigned to an experimental condition, in which students are confronted with their gender (e.g., by an explicit statement that the math test produces gender differences, or implicitly by circling one's gender before taking the test), or a control condition, in which stereotype threat is alleviated (e.g., by reassuring that the test does not produce gender differences, or by emphasizing the non-diagnostic nature of the test). In line with the notion that stereotype threat lowers female test performance, several studies found that women in the control conditions lacking in threat outperformed women in the experimental condition under threat on math tests or spatial ability tests (Good, Aronson, & Harder, 2008; Schmader, 2002; Spencer, Steele, & Quinn, 1999). These performance decrements are absent or reversed for male students (Walton & Cohen, 2003). One dominant explanation for this pattern of results is that female students who experienced stereotype threat performed below their actual ability because they had to cope with negative expectations arising from the negative stereotypes.

The number of stereotype threat experiments in the literature is quite large (for a recent overview, see Doyle & Voyer, 2016), and as such the effects of stereotype threat are often seen as robust and hence relevant for many (real-life) testing situations (S. J. Spencer, Logel, & Davies, 2016). Various studies attempted to uncover different potential underlying causal mechanisms (e.g., anxiety, worries and working memory, mere effort) and focused on important individual differences that might moderate it (e.g., domain identification, test anxiety, group identification, stigma consciousness). The diversity of studies speaks to the generality of the stereotype threat effect, as they have used different dependent variables and scoring methods (e.g., math tests and mental rotation tests, number of items correct and accuracy), different manipulations and control conditions (e.g., explicit and implicit inductions of threat), and different student populations from various countries (including the USA, Italy, the Netherlands, Sweden, Canada). Experimentally induced performance decrements have not only been found in the lab among college students, but also in children from elementary schools (e.g., Ambady, Shih, Kim, & Pittinsky, 2001; Muzzatti & Agnoli, 2007), middle schools (e.g., Ambady et al., 2001; Muzzatti & Agnoli, 2007) and high schools (e.g., Delgado & Prieto, 2008; Keller, 2002, 2007a; Moè, 2009). These studies provide some evidence that stereotype threat not only occurs in the superficial setting of a lab, but in real life school settings as well. The large number of studies inside and outside of the lab attests to the popularity of stereotype threat. Stereotype threat ranks among the most prominent phenomena of social psychology, and is featured in the majority of introduction to psychology textbooks (Ferguson, Brown, & Torres, 2016).

If stereotype threat contributes to performance decrements in high stakes testing, this would be a major threat to the fairness and validity of high stakes STEM tests. Test scores would not only measure the actual ability of female students, but also their sensitivity to gender stereotype threat. As such, scientists discussed and developed interventions, and argued in favor of gender friendly policies to counter stereotype threat effects (Logel, Walton, Spencer, Peach, & Mark, 2012; Walton, Spencer, & Erman, 2013). Suggested policy changes include exposure to role models like female teachers (Marx & Roman, 2002), affirmation writing exercises (Martens, Johns, Greenberg, & Schimel, 2006; Miyake et al., 2010), corrections of high stakes tests for the supposed effects of stereotype threat (Walton et al., 2013), and higher admissions rates for stereotyped students than expected based solely on traditional admission criteria (Logel et al., 2012). Psychologists and educational researchers recently used the stereotype threat literature in writing an amicus brief to inform the US Supreme Court in a case concerning affirmative action at the University of Texas at Austin. In these documents, the researchers stressed that experimental psychologists "confirmed the existence of stereotype threat and have measured its magnitude, both in laboratory experiments and in the real world." (Brief of Experimental Psychologists et al., 2012) and referred to stereotype threat as "the well-documented harm" (American Educational Research Association, 2012). Moreover, stereotype threat research is used to fuel feministic discussions (Prast, 2017). Thus, stereotype threat research has had a major impact on (legal) discussions on the use of high-stakes tests.

Studies on stereotype threat have been criticized as well. Several authors questioned whether stereotype threat generalizes to the real world and high stakes testing (Wax, 2009), as most (albeit not all) experiments have been carried out in low stakes settings (i.e., settings in which students were not rewarded or were minimally rewarded for good performance on the math test). Effects of stereotype threat appear absent in studies with high stakes tests (Stricker & Ward, 2004) or when financial incentives are handed out for correct answers (Fryer, Levitt, & List, 2008). Various authors have criticized the use of covariates in correcting for prior ability that is common in analyzing data from stereotype threat experiments (Stoet & Geary, 2012; Wicherts, 2005). Moreover, seemingly robust effects are contrasted by several recent failures to find a stereotype threat effect (e.g., Cherney & Campbell, 2011; Eriksson & Lindholm, 2007; Ganley et al., 2013). These issues and mixed results raise questions about the replicability and reproducibility of the effects of stereotype threat.

1.2 Problems of replication and reproducibility

Over the last few years several methodological issues concerning reproducibility (i.e., obtaining similar analytic results from the same data set) and replicability (i.e., obtaining similar research findings from a new random sample; Asendorpf et al., 2013) have received increasing attention in psychological science. Researchers sometimes engage in Questionable Research Practices (QRPs) in their pursuit of significant results. These QRPs are often denoted by the term p-hacking (Simonsohn, Nelson, & Simmons, 2014) and include cherry-picking findings based on different combinations of predictors, covariates, or outcome variables, adding extra participants or removing subsets of participants to find significant results, and applying different ways of removing outliers ad hoc to obtain significant findings (Bakker & Wicherts, 2014; John, Loewenstein, & Prelec, 2012; Wicherts et al., 2016). Using these QRPs can seriously inflate the appearance of false positive findings (Type I error rates; Simmons, Nelson, & Simonsohn, 2011) and inflate effect size estimates for genuine effects (Bakker, van Dijk, & Wicherts, 2012; Van Aert, Wicherts, & van Assen, 2016). Many, if not most, psychological studies lend themselves to a host of different analyses (Gelman & Loken, 2014; Simmons et al., 2011; Steegen, Tuerlinckx, Gelman, & Vanpaemel, 2016; Wicherts et al., 2016) that might be substantively defensible, yet can be selected opportunistically by researchers to find a significant effect. There is both indirect and direct evidence for selective reporting of results in research papers (Franco, Malhotra, & Simonovits, 2016; Lebel et al., 2013) and for the use of several QRPs in psychology (Agnoli, Wicherts, Veldkamp, Albiero, & Cubelli, 2017; John et al., 2012).

A second problem arises when researchers present post hoc hypotheses, based on patterns found in a given data set, as if these hypotheses had been a priori. This practice, known as Hypothesizing After Results are Known (HARKing; Kerr, 1998), violates the common format of reporting results based on a clear sequence of a priori hypotheses and subsequent empirical tests. HARKing, and the many explorations of data that it might entail in sufficiently rich data sets, could turn false positive results (Type I errors) and exaggerated findings into established theory. Yet these highly selected findings are not expected to be replicable in novel samples. Third, many authors have expressed a concern that whether or not a study is published depends on its results (Ferguson & Brannick, 2012; Ioannidis, 2005, 2008; Rosenthal, 1979). Such publication bias arises because researchers often do not submit non-significant results for publication (Cooper, Deneve, & Charlton, 1997; Coursol & Wagner, 1986; Franco, Malhotra, & Simonovits, 2014), and because editors or reviewers can also be opposed to publishing non-significant findings (Mahoney, 1977). Publication bias creates an overly positive picture of phenomena in the scientific literature (Ioannidis, 2005, 2008). Finally, sample sizes in psychological studies are often small (Fraley & Vazire, 2014; Marszalek, Barber, Kohlhart, & Holmes, 2011). Consequently, when studying small to medium effects, the statistical power of many studies will be too low (Bakker et al., 2012; Maxwell, 2004; Rossi, 1990). Combined with the problems of publication bias and QRPs in the analysis of data (both more likely to be used and yielding more severe biases when power is low; Bakker et al., 2012), underpowered studies lead to high proportions of Type I errors, substantial biases, and results that are often hard to replicate (Bakker et al., 2012). Not surprisingly, the findings of some classical psychological experiments have been difficult to replicate (e.g., Cheung et al., 2016; Open Science Collaboration, 2015; Wagenmakers et al., 2016). Combined, these four problems raise questions about the robustness of results published in the literature and the accuracy of meta-analyses used to summarize such results for particular research lines (Bakker et al., 2012; Ioannidis, 2008; Wagenmakers, Wetzels, Borsboom, van der Maas, & Kievit, 2012).
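To make the inflation concrete, the following simulation is a minimal sketch (with hypothetical data, not an analysis from this dissertation): two groups are drawn from the same null distribution, and a researcher "finds" an effect whenever any of several outcome variables happens to reach significance.

```python
# Minimal sketch: the false positive rate when a researcher tests several
# independent null outcome variables and reports whichever is significant.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

def one_study(n_per_group=20, n_outcomes=4):
    """Return True if any of n_outcomes t-tests reaches p < .05,
    even though every true effect is exactly zero."""
    for _ in range(n_outcomes):
        control = rng.normal(0.0, 1.0, n_per_group)
        threat = rng.normal(0.0, 1.0, n_per_group)   # no real difference
        if stats.ttest_ind(control, threat).pvalue < .05:
            return True
    return False

n_studies = 10_000
rate = sum(one_study() for _ in range(n_studies)) / n_studies
print(f"nominal alpha = .05, realized false positive rate = {rate:.3f}")
# With four independent outcomes this approaches 1 - 0.95**4 = 0.185.
```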

1.3 The generalizability of stereotype threat

The literature on stereotype threat might be influenced by these methodological problems as well. Stereotype threat meta-analyses (e.g., Doyle & Voyer, 2016) show that many studies have small sample sizes, with the vast majority of studies including 40 or fewer participants per condition (or design cell). Because the gender gap in mathematics is small (Ball et al., 2013; Lindberg et al., 2010), and the average size of performance decrements due to stereotype threat is most likely subtle (Doyle & Voyer, 2016; Nguyen & Ryan, 2008; Picho et al., 2013), we expect that many stereotype threat studies are underpowered.
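A back-of-the-envelope calculation illustrates the concern. The sketch below uses a normal approximation and illustrative numbers (a true effect of d = 0.22 and 40 participants per cell, in line with the figures cited in this dissertation); an exact power analysis would use the noncentral t distribution.

```python
# Approximate power of a two-sided, two-sample comparison via the normal
# approximation; d is the true standardized mean difference.
from scipy.stats import norm

def approx_power(d, n_per_group, alpha=0.05):
    se = (2.0 / n_per_group) ** 0.5     # standard error of d under H0
    z_crit = norm.ppf(1 - alpha / 2)
    shift = d / se                      # noncentrality on the z scale
    return (1 - norm.cdf(z_crit - shift)) + norm.cdf(-z_crit - shift)

print(f"power = {approx_power(0.22, 40):.2f}")   # roughly 0.17
```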

Several authors mentioned that publication bias could be an issue in stereotype threat research (Ganley et al., 2013; Stoet & Geary, 2012). When researchers in psychology were asked about their own use of QRPs, most have admitted to engaging in at least some QRPs in the analysis of data (Agnoli et al., 2017; John et al., 2012). It is possible that stereotype threat research has been affected by these practices as well. Stereotype threat experiments often involve the use of different manipulations, multiple dependent variables, measurement of (possibly several) potential moderators, the use of alternative scoring rules, and flexibility concerning the removal of outliers and participants. When underpowered studies entail many options to analyze the data and to find interesting results, biases creating overly positive outcomes are likely to occur.

Fortunately, several solutions have been proposed to counter these methodological problems (Asendorpf et al., 2013; Jussim, Crawford, Anglin, Stevens, & Duarte, 2016), like pre-registration, registered (replication) reports, and publication bias tests in meta-analysis. In a pre-registered study, authors commit to hypotheses, data cleaning procedures, power analyses, data collection plans, and data analyses by publishing them on a public forum (e.g., the Open Science Framework) before the study is conducted (Wagenmakers et al., 2012). Registered reports (Chambers, 2013; Nosek & Lakens, 2014) are publishing formats in which authors send an introduction and methods section to a journal for peer review, after which the study will be carried out. If the manuscript is "In Principle Accepted", the final study will be published regardless of the results of the study. Pre-registrations and registered reports prevent researchers from HARKing and from the use of several QRPs (e.g., removal of outliers to find significant results, selective reporting of results). Both techniques are a safeguard against publication bias, as the registered studies are traceable. For the published literature, various meta-analytic techniques can be used to study the likelihood and severity of publication bias (e.g., Ioannidis & Trikalinos, 2007; Sterne & Egger, 2005; Van Aert, Wicherts, & van Assen, 2016), some of which allow estimates of effects after correcting for publication bias (van Aert et al., 2016).
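As a concrete illustration of one such technique, the sketch below implements the core of Egger's regression test for funnel plot asymmetry (Sterne & Egger, 2005) on invented numbers; in practice one would use a dedicated meta-analysis package and formally test whether the intercept differs from zero.

```python
# Egger's test in miniature: regress standardized effects on precision;
# an intercept far from zero signals small-study effects / funnel asymmetry.
import numpy as np
from scipy import stats

d = np.array([-0.55, -0.48, -0.30, -0.25, -0.10, -0.05])  # invented effects
se = np.array([0.28, 0.25, 0.18, 0.15, 0.09, 0.07])       # invented SEs

z = d / se                 # standardized effect per study
precision = 1.0 / se
fit = stats.linregress(precision, z)
print(f"Egger intercept = {fit.intercept:.2f} (0 would indicate symmetry)")
```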


To study the generalizability of stereotype threat and tackle some of these common methodological problems, in this dissertation we focus on stereotype threat studies conducted among schoolgirls. These studies are especially interesting because they are carried out in more realistic settings than the lab, and give us an understanding of the age at which stereotype threat effects start to occur. Moreover, if stereotype threat effects indeed occur at an early age, it is useful to implement interventions as early as possible, before girls disengage with the topic of mathematics after having been chronically confronted with stereotype threat (Schmader, Johns, & Barquissau, 2004).

In Chapter 2, we give a summary of the existing literature on experimental stereotype threat effects among schoolgirls by means of a (pre-registered) meta-analysis. Our goals were to estimate an average stereotype threat effect, and to study whether the experiments differed in their outcomes depending on study characteristics that have been identified in stereotype threat theory to affect the severity of the effect (type of control group, whether boys were present during testing, cross-cultural equivalence, and test difficulty). Finally, we studied whether publication bias is likely in the literature on gender stereotype threat among schoolgirls.

In Chapter 3, we conducted a large-scale registered report to study whether stereotype threat effects on math performance are evident among Dutch high school students. Our aims were to document an average stereotype threat performance decrement on a math test among 13-14 year old girls, and to study whether theoretically relevant individual differences moderated the stereotype threat effect. Specifically, we tested whether students who felt anxious about math, who strongly identify with math, and who strongly identify with the female gender were more susceptible to stereotype threat. As this chapter was written in the form of a registered report, its full design was pre-registered and has been peer reviewed by stereotype threat experts. This design was specifically tailored to create a rigorous and powerful test of stereotype threat theory in a Dutch high school setting.

1.4 Differential item functioning

Because stereotype threat is seen to affect math test performance, studies on stereotype threat raise important questions about the fairness of high stakes tests, like selection tests or exams used in educational practice (Walton et al., 2013). Test takers should not be disadvantaged by tests in the sense that they perform lower than is expected based on the actual cognitive ability that the test purports to measure. In psychometrics, test fairness has traditionally been studied under the header of measurement invariance or measurement equivalence (Millsap, 2011). Within Item Response Theory (IRT), test fairness across groups is widely studied by considering whether individual items function equivalently across groups. In IRT, tests of Differential Item Functioning (DIF; Holland & Wainer, 1993) are widely seen as crucial to determine whether tests can be used fairly across different demographic groups (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education, 2014). DIF analyses allow researchers to check whether the item response functions of test items are equal over groups (Mellenbergh, 1982a; Millsap & Everson, 1993).

Gender DIF analyses have been popular in the assessment of the fairness of math tests. Tests on gender DIF have shown DIF in favor of males on geometry items (Doolittle & Cleary, 2017; Gamer & Engelhard, 1999; Harris & Carlton, 1993; Li, Cohen, & Ibarra, 2004; Ryan & Chiu, 2001; Taylor & Lee, 2012) and items that require spatial skills (Gierl, Bisanz, Bisanz, & Boughton, 2003), whereas DIF favoring females was found in (basic) algebraic items (Doolittle & Cleary, 2017; Harris & Carlton, 1993; Li et al., 2004). Word problems are typically easier for males (Doolittle & Cleary, 1987; Kalaycioglu & Berberoglu, 2011; Ryan & Chiu, 2001), whereas abstract operations appear easier for females (Bridgeman & Schmitt, 2006). It is interesting to determine whether stereotype threat could play a role in gender DIF. We aim to do so by testing whether the experimentally induced effects of stereotype threat produce differences in the psychometric properties of items. Although several studies have hinted at the possibility that stereotype threat creates DIF (Arbuthnot, 2005; Wicherts, Dolan, & Hessen, 2005), no studies have formally tested this hypothesis.
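To fix ideas, the following sketch shows what DIF means under a two-parameter logistic (2PL) IRT model: the same item is given a higher difficulty parameter in a hypothetical threat group than in a control group, so the probability of solving it differs between groups even at equal ability. All parameter values are invented for illustration.

```python
# 2PL item response function: P(correct) given ability theta,
# discrimination a, and difficulty b.
import numpy as np

def irf(theta, a, b):
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

theta = np.linspace(-2, 2, 5)              # ability levels
p_control = irf(theta, a=1.2, b=0.0)       # item as seen by control group
p_threat = irf(theta, a=1.2, b=0.5)        # same item, harder under threat

for t, pc, pt in zip(theta, p_control, p_threat):
    print(f"theta = {t:+.1f}:  P(control) = {pc:.2f}  P(threat) = {pt:.2f}")
# Equal item response functions across groups means no DIF; the gap printed
# here is (uniform) DIF: a group difference at every fixed ability level.
```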

In Chapter 4, we first focus on the wider DIF literature. Testing for DIF can be done in different ways and requires key choices that might affect the conclusions on the number of items showing DIF and the severity of DIF for individual items in a given test. For instance, which DIF method suits the research questions and data set best? How many participants are needed to have enough power? How many items are needed? When is the extent of DIF problematic? Several of the problems identified in the discussion on reproducibility and replicability in psychology are also relevant to the testing of DIF in given data sets. Because of its common use of significance testing, power is crucial to tests of DIF (e.g., Borsboom, 2006). Also, many potentially arbitrary choices in the analyses used to test for DIF might create biases in the analyses. The typical goal of DIF analyses often differs from the goal in experimental studies, in the sense that in the former one often hopes not to find a significant effect (DIF) whereas the goal of the latter is to find a significant effect. In order to render DIF results reproducible and replicable in novel samples, DIF analyses should be reported in sufficient detail. However, it is currently unknown whether authors of DIF studies indeed report the methods and results of their DIF analyses in a detailed manner. In Chapter 4 we first reviewed the methodological literature on what are considered best practices in DIF testing and offered guidelines for the conducting and reporting of DIF tests. Subsequently, we conducted an extensive review of 200 studies of DIF in the wider literature to see whether common practices and reporting practices in the DIF literature fit the recommendations made by DIF experts. The hope is that these results and guidelines can improve the future replicability and reproducibility of DIF studies.
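As one concrete example of the classical DIF procedures reviewed in that chapter, the sketch below computes the Mantel-Haenszel common odds ratio, in which examinees are first matched on total test score. The counts are invented, and a real analysis would add the associated chi-square test and purification steps.

```python
# Mantel-Haenszel DIF in miniature: one 2x2 table (group x item correct)
# per total-score stratum, combined into a common odds ratio.
import numpy as np

# Columns: correct_ref, wrong_ref, correct_focal, wrong_focal
strata = np.array([
    [10.0, 30.0,  6.0, 34.0],   # low total scores
    [25.0, 15.0, 18.0, 22.0],   # middle total scores
    [35.0,  5.0, 30.0, 10.0],   # high total scores
])

a, b, c, d = strata.T
n = a + b + c + d
alpha_mh = np.sum(a * d / n) / np.sum(b * c / n)   # common odds ratio
delta_mh = -2.35 * np.log(alpha_mh)                # ETS delta metric
print(f"alpha_MH = {alpha_mh:.2f}, delta_MH = {delta_mh:.2f}")
# |delta_MH| values beyond roughly 1.5 are conventionally flagged as
# sizeable DIF, given a significant accompanying test statistic.
```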

1.5 DIF and stereotype threat

Item analyses in the field of stereotype threat can make a valuable contribution to our understanding of how stereotype threat affects performance on individual items. Most stereotype threat researchers focus on average performance decrements on math tests of the experimental groups. As such, a lot of interesting information gets lost on whether stereotype threat could differentially affect performance on individual items. Yet there are reasons to expect stereotype threat to differentially affect item performance. For instance, O'Brien and Crandall (2003) studied item means in their stereotype threat data set, and found a negative association between item means and effect sizes: women in the control condition outperformed women in the stereotype threat condition on difficult items, whereas the difference virtually disappeared for easy items. This relationship between item difficulty and stereotype threat effects could be tested in a model-based manner by means of DIF analyses.
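The kind of item-level analysis described above can be emulated on simulated data. In the sketch below (all response data invented), the threat group loses more accuracy on hard items than on easy ones, which reproduces the negative association between item means and item-level effects that O'Brien and Crandall (2003) reported.

```python
# Simulate item responses where the threat decrement grows with difficulty,
# then correlate item means with item-level stereotype threat effects.
import numpy as np

rng = np.random.default_rng(7)
p_correct = np.linspace(0.90, 0.25, 12)            # easy -> hard items
p_threat = p_correct - 0.15 * (1 - p_correct)      # hard items suffer more

control = rng.binomial(1, p_correct, size=(1000, 12))
threat = rng.binomial(1, p_threat, size=(1000, 12))

item_means = control.mean(axis=0)                  # proportion correct
item_effects = item_means - threat.mean(axis=0)    # per-item decrement
r = np.corrcoef(item_means, item_effects)[0, 1]
print(f"item means vs. threat effects: r = {r:.2f}")  # negative by design
```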

Linking stereotype threat to studies of (gender) DIF raises many interesting questions. First, DIF analyses could be used to test the hypothesis that stereotype threat causes DIF. Specifically, we could test the commonly stated hypothesis (e.g., Spencer et al., 1999) that stereotype threat more severely, or perhaps even solely, affects performance on difficult and hence challenging tasks and items. This moderation of stereotype threat by item difficulty should be apparent from DIF analyses if item difficulty parameters are lower for female students in the control condition than for female students in the stereotype threat condition for particularly difficult items, but not for easier items. It is possible that stereotype threat even improves performance on (and hence difficulty parameters of) easier items, leading to a (partial) cancellation of stereotype threat effects on the sum score (in tests with an even distribution of easier and more difficult items). Second, we could test whether particular types of items (e.g., geometry items, word problems) show DIF due to stereotype threat. Knowledge of the effects of stereotype threat on particular types of items could shed light on the generalizability of these effects to various types of tests and items. For instance, if gender stereotype threat yields particular patterns of DIF in stereotype threat studies, we could relate those patterns to gender DIF patterns found in actual high stakes tests. If particular items are more susceptible to DIF due to stereotype threat and these types of items also show gender DIF in high stakes tests, stereotype threat could explain the latter DIF. Such knowledge on any potential item-specificity of stereotype threat might help the creation of tests that are less susceptible to stereotype threat and hence fairer for females suffering from it in various settings.

In Chapter 5 we studied the psychometrics of stereotype threat. Specifically, in the first study we investigated the effects of stereotype threat on item-level statistics drawn from classical test theory in ten stereotype threat experiments that were too small for more advanced DIF analyses based on IRT. In the second study reported in Chapter 5, we explored DIF by means of IRT models on our own large-scale stereotype threat experiment among Dutch high school girls that was reported in Chapter 3. Our goal was to verify whether we could find patterns of DIF due to stereotype threat.

In the final Chapter 6 of this dissertation, we reflect on the main findings in each of the four preceding chapters, and make recommendations for future research into the effect of stereotype threat on test performance in STEM fields. Introduction


Chapter 2

Does Stereotype Threat Influence Performance of Girls in Stereotyped Domains? A Meta-Analysis

This chapter is published as: Flore, P. C., & Wicherts, J. M. (2015). Does stereotype threat influence performance of girls in stereotyped domains? A meta-analysis. Journal of School Psychology.

Abstract

Although the effect of stereotype threat concerning women and mathematics has been subject to various systematic reviews, none of them have been performed on the sub-population of children and adolescents. In this meta-analysis, we estimated the effects of stereotype threat on the performance of girls on math, science and spatial skills (MSSS) tests. Moreover, we studied publication bias and four moderators: test difficulty, presence of boys, gender equality within countries, and the type of control group that was used in the studies. We selected study samples when the study included girls, samples had a mean age below 18 years, the design was (quasi-)experimental, the stereotype threat manipulation was administered between-subjects, and the dependent variable was a MSSS test related to a gender stereotype favoring boys. To analyze the 47 effect sizes, we used random effects and mixed effects models. The estimated mean effect size equaled -0.22 and significantly differed from 0. None of the moderator variables was significant; however, there were several signs for the presence of publication bias. We conclude that publication bias might seriously distort the literature on the effects of stereotype threat among schoolgirls. We propose a large replication study to provide a less biased effect size estimate.


2.1 Introduction

Spencer, Steele, and Quinn (1999) first suggested that women's performance on mathematics tests could be disrupted by the presence of a stereotype threat. This initial paper inspired many researchers to replicate the stereotype threat effect and expand the theory by introducing numerous moderator variables and various dependent variables related to negative gender stereotypes, such as tests of Mathematics, Science, and Spatial Skills (MSSS). This practice resulted in approximately one hundred research papers and five meta-analyses (Nguyen & Ryan, 2008; Picho et al., 2013; Stoet & Geary, 2012; Walton & Cohen, 2003; Walton & Spencer, 2009). Although four of these systematic reviews (Nguyen & Ryan, 2008; Picho et al., 2013; Walton & Cohen, 2003; Walton & Spencer, 2009) confirmed the existence of a robust mean stereotype threat effect, some ambiguities regarding this effect remain. For instance, it has been suggested (Ganley et al., 2013; Stoet & Geary, 2012) that the stereotype threat literature is subject to an excess of significant findings, which might be caused by publication bias (Ioannidis, 2005; Rosenthal, 1979), p-hacking (i.e., using questionable research practices to obtain a statistically significant effect; Simonsohn, Nelson, & Simmons, 2013), or both (Bakker et al., 2012). A less controversial but nevertheless interesting issue is the age at which stereotype threat begins to influence performance on MSSS tests: does stereotype threat already influence children's performance, or does this effect only emerge during early adulthood? Both of these issues are addressed in this chapter by means of a meta-analysis of the stereotype threat literature in the context of schoolgirls' MSSS test performance. We will introduce these topics by providing a general review of the literature on stereotype threat and the onset of gender differences in the domains of MSSS.

2.1.1 Stereotype Threat

The effect of stereotype threat refers to the ramifications of an activated negative stereotype or an emphasized social identity (Steele, 1997). Individuals who are members of a stigmatized group tend to perform worse on stereotype relevant tasks when confronted with that negative stereotype (Steele & Aronson, 1995). In their seminal paper, Steele and Aronson (1995) focused on ethnic minorities as a negatively stereotyped group. Later experiments showed similar effects for other stigmatized groups, including women in the quantitative domain (e.g., Ambady, Paik, Steele, Owen-Smith, & Mitchell, 2004; Brown & Josephs, 1999; Oswald & Harvey, 2001; Schmader & Johns, 2003; Spencer et al., 1999). In these experiments, women were either assigned to a stereotype threat condition, where they were exposed to a gender-related stereotype threat (e.g., a written statement that men perform better on mathematics tests than women), or to a control condition, where they were not exposed to such a threat. When participants subsequently completed a MSSS test (e.g., a mathematical test), women who were assigned to the stereotype threat condition averaged lower scores than women who were assigned to the control condition (Ambady et al., 2004; Brown & Josephs, 1999; Oswald & Harvey, 2001; Schmader & Johns, 2003; Spencer et al., 1999). The results of these studies were deemed important, because researchers suspected that stereotype threat could be a driving force behind the decision of women to leave the science, technology, engineering, and mathematics (STEM) fields (Cheryan & Plaut, 2010; Schmader et al., 2004). These developments led to an expansion of the stereotype threat literature, in which several moderator and mediator variables were studied.

Of all the studied moderator and mediator variables, we will summarize those variables that have been studied most frequently. Item difficulty appears to moderate the effects of stereotype threat, with difficult items leading to stronger effects (Campbell & Collaer, 2009; O'Brien & Crandall, 2003; S. J. Spencer et al., 1999; Wicherts et al., 2005). Test-takers who are strongly identified with the relevant domain, in this case the domain of mathematics, science or spatial skills, appear to show stronger stereotype threat effects (Cadinu, Maass, Frigerio, Impagliazzo, & Latinotti, 2003; Lesko & Corpus, 2006; Pronin, Steele, & Ross, 2004; J. R. Steinberg, Okun, & Aiken, 2012). Another theoretical moderator is gender identification; the effects of stereotype threat are generally more severe for women who are highly gender-identified (Kiefer & Sekaquaptewa, 2007; Rydell, McConnell, & Beilock, 2009; Schmader, 2002; Wout, Danso, Jackson, & Spencer, 2008). However, the latter results were contradicted in a Swedish study (Eriksson & Lindholm, 2007). Moreover, the effects of stereotype threat appear stronger within a threatening environment (e.g., in the presence of men, or when negatively stereotyped test-takers hold a minority status) compared to a safe environment (e.g., in the presence of women only, or when holding a majority status; Gneezy, Niederle, & Rustichini, 2003; Inzlicht, Aronson, Good, & McKay, 2006; Inzlicht & Ben-Zeev, 2003; Sekaquaptewa & Thompson, 2003). The presence of role models also appears to moderate the effect of stereotype threat, in such a way that role models that contradict the stereotype (i.e., women who are good in mathematics or men who lack mathematical skills) appear to protect females from the debilitating effects of stereotype threat on MSSS test performance (Elizaga & Markman, 2008; Marx & Roman, 2002; Marx & Ko, 2012; McIntyre, Paulson, Taylor, Morin, & Lord, 2011; Taylor, Lord, McIntyre, & Paulson, 2011). Finally, several researchers suggested that the stereotype threat effect is (partly) mediated by arousal (Ben-zeev, Fein, & Inzlicht, 2005), anxiety and worries (Brodish & Devine, 2009; Ford, Ferguson, Brooks, & Hagadone, 2004; Gerstenberg, Imhoff, & Schmitt, 2012; Osborne, 2001, 2007), or the occupation of working memory (Beilock, Rydell, & McConnell, 2007; Bonnot & Croizet, 2007; Rydell, Rydell, & Boucher, 2010; Schmader & Johns, 2003).

The literature on the effects of stereotype threat has been summarized by five meta-analyses that covered heterogeneous subsets of studies (Nguyen & Ryan, 2008; Picho et al., 2013; Stoet & Geary, 2012; Walton & Cohen, 2003; Walton & Spencer, 2009). These broad-stroke meta-analyses estimated a small to medium significant effect before moderators were taken into account, with standardized mean differences ranging from 0.24 (Picho et al., 2013) to 0.48 (Walton & Spencer, 2009). These findings seemed to confirm that the effect is rather stable, although most of these meta-analyses reported heterogeneity in effect sizes (Picho et al., 2013; Stoet & Geary, 2012; Walton & Cohen, 2003). In fact, the previous meta-analyses included diverse tests, settings, and stereotyped groups, which makes it hard to pinpoint exactly why some studies show larger effects than others. Although these large scale meta-analyses are interesting to portray an overall picture, a more homogeneous subset of studies is preferred when dealing with specific questions, like the degree to which the stereotype threat related to gender also influences MSSS performance in schools. Thus, we addressed this issue by selecting a specific stereotyped group and stereotype (i.e., women and their supposed inferior capacity of solving mathematical or spatial tasks) and a specific age group (i.e., those younger than 18 years), which should result in a less heterogeneous set of effect sizes. These design elements enabled us to describe the influence of stereotype threat on MSSS test performance for females in critical periods of human development, namely childhood and adolescence.
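For readers unfamiliar with the machinery behind such pooled estimates, the following is a minimal sketch of one common random effects pooling step (the DerSimonian-Laird estimator); the effect sizes and variances are invented for illustration.

```python
# DerSimonian-Laird random effects pooling of standardized mean differences.
import numpy as np

d = np.array([-0.45, -0.30, -0.22, -0.05, 0.10])    # study effect sizes
v = np.array([0.040, 0.035, 0.020, 0.015, 0.030])   # sampling variances

w = 1 / v
d_fe = np.sum(w * d) / np.sum(w)                    # fixed-effect mean
q = np.sum(w * (d - d_fe) ** 2)                     # Cochran's Q
c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
tau2 = max(0.0, (q - (len(d) - 1)) / c)             # between-study variance

w_re = 1 / (v + tau2)                               # random-effects weights
d_re = np.sum(w_re * d) / np.sum(w_re)
se_re = np.sqrt(1 / np.sum(w_re))
print(f"tau^2 = {tau2:.3f}, pooled d = {d_re:.2f} (SE {se_re:.2f})")
```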

2.1.2 Stereotype Threat and Children

Although the effects of stereotype threat on women were traditionally studied within adult populations (S. J. Spencer et al., 1999), multiple studies over the last 15 years have been carried out with children and adolescents as participants (e.g., Ambady, Shih, Kim, & Pittinsky, 2001; Keller & Dauenheimer, 2003). Studies on children and adolescents in schools contribute to the literature for at least three reasons: (1) to find out at which age the stereotype threat effect actually emerges, (2) to study the stereotype threat effect in the natural setting of the classroom instead of the laboratory setting, and (3) to address the question whether variables that moderate the stereotype threat effect in adult samples similarly moderate the stereotype threat effect among children.

The primary research on stereotype threat with children as participants (i.e., studies that we included in our meta-analysis) roughly shared a similar design, although the details of the designs varied somewhat. Typically, the studies were conducted by means of an experiment or a quasi-experiment involving a stereotype threat condition and a control condition as predictor variable, sometimes in combination with a third or fourth condition (Cherney & Campbell, 2011; Picho & Stephens, 2012). These conditions were typically designed as a between-subjects factor. Some variations exist in the implementation of the stereotype threat and control conditions. The stereotype threat manipulation was administered either explicitly or implicitly. The explicit stereotype threat manipulation usually involved a written or verbal statement that informed participants that the MSSS test they were about to complete produced gender differences, whereas the implicit threat manipulations triggered the gender stereotype without explicitly mentioning the gender gap. Further examples of the two types of stereotype threat manipulations are illustrated in Table 2.1.

Table 2.1 Types of stereotype threat manipulations.

| Manipulation condition | Manipulation | Example | Examples of papers |
|---|---|---|---|
| Explicit | Verbal or written statement that boys are superior to girls on the test | "It [the test] is comprised of a collection of questions which have been shown to produce gender differences in the past. Male participants outperformed female participants." | Cherney & Campbell (2011); Keller & Dauenheimer (2003) |
| Explicit | Verbal statement that boys are really good in the test | "Boys are really good at this game." | Cimpian, Mu, & Erickson (2012) |
| Implicit | Participants filling out their gender | -- | Stricker and Ward (2004) |
| Implicit | Visual depiction of a stereotypical situation | Showed pictures of male scientists/mathematicians | Good, Woodzicka, & Wingfield (2010); Muzzatti & Agnoli (2007) |
| Implicit | Priming female identity | The story described a girl using a number of traits that were stereotypically feminine in participants' cultural context (e.g., long blond hair, blue eyes, and colorful clothes). | Tomasetto, Alparone, & Cadinu (2011) |
| Implicit | Framing the question as a geometric problem | -- | Huguet & Régner (2007, 2009) |

The control condition was designed to either nullify or not nullify stereotype threat. In the nullified control condition the stereotype threat was actively removed, generally by a written or verbal statement which informed participants that the MSSS test they were about to complete did not produce gender differences, whereas in the non-nullified control condition no gender related information was provided. Further examples of the two types of control conditions are illustrated in Table 2.2.

The dependent variables in studies of stereotype threat among schoolgirls to date were MSSS tests; most studies involved a mathematical test properly adjusted to the age and ability level of the participants (e.g., Keller & Dauenheimer, 2003; Muzzatti & Agnoli, 2007). A few studies used the Mental Rotation Task (e.g., Moè & Pazzaglia, 2006; Neuburger, Jansen, Heil, & Quaiser-Pohl, 2012; Titze et al., 2010), which measured children's spatial abilities, a concept tightly linked to mathematics and gender stereotypes. Remaining dependent variables were the performance on a physics test (Marchand & Taasoobshirazi, 2013), a chemistry comprehension test (Good et al., 2010) or recall performance of a geometric figure (Huguet & Régner, 2009). These tests generally consisted of 10 to 40 questions.

Table 2.2 Types of control conditions.

| Control condition | Information | Example | Examples of papers |
|---|---|---|---|
| No threat | No information given with regards to the relationship between gender and performance on the test | -- | Delgado & Prieto (2008); Muzzatti & Agnoli (2007) |
| Nullified | Verbal or written statement that girls are superior to boys on the test | "It is comprised of a collection of questions which have been shown not to produce gender differences in the past. The average achievement of male participants was equal to the achievement of female participants." | Cherney & Campbell (2011) |
| Nullified | Verbal or written statement that girls and boys perform equally well on the test | "In such tasks, boys and girls are equally skilled. Both have an equal ability to imagine how such pictures and objects look when they are rotated. Therefore, such tasks are exactly equally difficult or easy for girls and boys." | Neuburger et al. (2012) |
| Nullified | Education about the stereotype threat effect | "Research has shown that men perform better than women in this test and obtain higher scores. This superiority is caused by a gender stereotype, i.e., by a common belief in male superiority in spatial tasks, and has nothing to do with lack of ability." | Moè (2009) |
| Nullified | Written description of a counter-stereotypical situation | "Marie was described as a successful student in math" | Bagès & Martinot (2011) |
| Nullified | Visual depiction of a counter-stereotypical situation | "Participants were randomly assigned to one of three experimental conditions by inviting them to color a picture, in which a girl correctly resolves the calculation whereas a boy fails to respond" | Galdi et al. (2013) |

2.1.3 Developmental aspects of Stereotype Threat

The onset and development of the effects of stereotype threat on girls in mathematics is an interesting issue; however, few solid conclusions have been reached (Aronson & Good, 2003; Jordan & Lovett, 2007). To explore possible theories on how age might influence stereotype threat, we recollect the most important moderators that were identified in the research on young adults and subsequently consider whether these could influence stereotype threat differently throughout the development of children. The most important moderators among adults are gender identification, domain identification, stigma consciousness, and beliefs about intelligence (Aronson & Good, 2003). Thus, women who strongly identify with both the academic domain of mathematics (Cadinu et al., 2003; Lesko & Corpus, 2006; Pronin et al., 2004; J. R. Steinberg et al., 2012) and the female gender (Kiefer & Sekaquaptewa, 2007; Rydell et al., 2009; Schmader, 2002; Wout et al., 2008) are expected to experience stronger performance decrements compared to women who less strongly identify with those domains. Additionally, women who believe that the stereotypes regarding women and mathematics are true (Schmader et al., 2004) and that mathematical ability is a stable and fixed characteristic (Aronson & Good, 2003) are purported to show stronger stereotype threat effects. The current knowledge about the development of these four traits can be used as guidance for the expectations of the impact of stereotype threat throughout different age groups (Aronson & Good, 2003).

Gender identification

Gender identification is present at an early age. At the age of 3 years, a majority of children are able to correctly label their own gender (Katz & Kofkin, 1997). A study on 3- to 5-year-olds (C. L. Martin & Little, 1990) showed that these children are not only able to correctly label their gender and distinguish men from women but also prefer sex-typed toys that correspond to their gender (i.e., boys preferring masculine sex-typed toys and girls preferring feminine sex-typed toys). When children reach the age of 6 to 7 years, they master the concept of gender constancy, and so understand that gender is stable over time and consistent (Bussey & Bandura, 1999). Based on these studies one could argue that because gender identity is already stable at a young age, even young children are potentially vulnerable to performance decrements caused by stereotype threat. However, Aronson and Good (2003) proclaimed that although children are already aware of their gender from an early age on, they do not form a coherent sense of the self until adolescence, which lowers younger children's vulnerability to stereotype threat.

Stigma consciousness

The studies on the development of awareness of the stereotype (stigma consciousness) have shown mixed results. Various studies showed that children believe that boys are either better in mathematics or are identified more strongly with the field of mathematics compared to girls, for ages 6 to 11 (Cvencek, Meltzoff, & Greenwald, 2011; Eccles, Wigfield, Harold, & Blumenfeld, 1993; Lummis & Stevenson, 1990) and ages 14 and 22 (Steffens & Jelenec, 2011). In Steffens and Jelenec (2011), older participants endorsed the stereotypes more strongly than the younger participants. A meta-analysis on affects and attitudes concerning mathematics showed that adolescents and young adults from different age groups (11 to 25 years old) all see mathematics more as a male domain (Hyde et al., 1990). These gender stereotypes are also present in the classroom; teachers tend to see boys as more competent in mathematics (Q. Li, 1999), they expect mathematics to be more difficult for girls (Tiedemann, 2000), and they expect that failure in mathematics for girls more likely originates from a lack of ability, whereas failure for boys originates from a lack of effort (Fennema, Peterson, Carpenter, & Lubinski, 1990; Tiedemann, 2000). However, counterintuitive evidence regarding stigma consciousness has also been found more recently: some studies failed to find convincing evidence that children explicitly believe in the traditional stereotype (Ambady et al., 2001; Kurtz-Costes, Rowley, Harris-Britt, & Woods, 2008), other studies found that children believe in non-traditional stereotypes (Martinot, Bagès, & Désert, 2012; Martinot & Désert, 2007), and another study found that teachers do not hold stereotypical beliefs (Leedy, LaLonde, & Runk, 2003). Additionally, a more recent study found that when it comes to overall academic competency, 6- to 10-year-olds hold the stereotype that girls outperform boys (Hartley & Sutton, 2013), and these children actually believe that adults hold those stereotypes as well. A stereotype threat manipulation addressing this stereotype actually negatively influenced the performance of boys on a test that included different domains, including mathematics. Moreover, a study showed that over different grades, teachers either rated the girls in their classes significantly higher in mathematical ability than boys, or rated girls and boys as roughly equivalent in mathematical ability, even when there was a significant gender gap in performance on a mathematics test favoring males (Robinson & Lubienski, 2011). Some argue that this evidence against the stereotype regarding mathematics and gender in recent studies might indicate that the gender stereotype as we know it is outdated (Martinot et al., 2012). Also, relatively little research has addressed whether gender stereotypes are comparable over time (e.g., during the 1980s vs. during the 2010s) or across different countries or smaller cultural units (as we addressed in the section Moderators).

Domain identification

Few studies have been conducted on the development of academic identification, or domain identification, in children (Aronson & Good, 2003). A study by Keller (2007) on 15-year-olds indicated that domain identification moderated the effect of stereotype threat on math performance. Specifically, girls in a stereotype threat condition who considered themselves as low identifiers in the mathematical domain performed better on difficult math items, whereas girls who considered themselves as high identifiers in the mathematical domain performed worse on difficult math items. Although little attention has been given to domain identification in the context of stereotype threat and development, research on affect and attitude of girls towards mathematics over different age groups could provide information on how domain identification might fluctuate. For instance, the gender gap of positive attitudes towards and self-confidence in mathematics is virtually non-existent for children between the ages of 5 to 10 years but grows wider in older age groups, with boys being more positive and self-confident than girls (Hyde et al., 1990). Thus, it seems that, generally, adolescent girls have less confidence in and fewer positive attitudes towards mathematics compared to boys of their age, which might be an indication that older girls also identify themselves less with the mathematical domain. In the context of stereotype threat, this pattern of findings would lead us to expect that adolescent girls are actually less vulnerable to the effects of stereotype threat compared to pre-teenage girls.

Beliefs about intelligence

The literature on beliefs about intelligence and academic ability describes rather straightforwardly how those beliefs change throughout the development of children. Children younger than 7 years do not yet comprehend that intelligence and ability are personal traits that are stable over time and that the role of effort in academic performance is limited (Droege & Stipek, 1993; Stipek & Daniels, 1990). At this age, children confuse intelligence and ability with social-moral qualities: a good or nice person equals a smart person and vice versa (Droege & Stipek, 1993; Heyman, Dweck, & Cain, 1992). Because young children do not yet see academic abilities as fixed traits, they tend to be overly optimistic about their performances and overestimate their position on academic performances relative to their classmates (Nicholls, 1979). When children reach the age of 7 or 8, their theories seem to shift, in such a way that older children believe in more temporally constant abilities (Kinlaw & Kurtz-Costes, 2003). At this age, the children predict more stable levels of intelligence (Dweck, 2002; Wigfield et al., 1997), and they believe less in the role of effort (Stipek & Daniels, 1990). Additionally, they are better able to distinguish ability from social or moral abilities (Droege & Stipek, 1993; Heyman et al., 1992; Stipek & Daniels, 1990). As a consequence, beginning at approximately age 7 to 8 years, children are less optimistic and more realistic about their future academic performances and their position within the classroom compared to their peers (Eccles et al., 1989; Nicholls, 1979). These findings imply that stereotype threat would only have an effect on children who are at least 7 to 8 years old. If indeed these notions about abilities are crucial for stereotype threat, younger children most likely do not even see mathematical ability as a fixed trait; hence, there would be little reason for them to feel threatened by stereotypes regarding mathematical competency. In contrast, older children would have the capacity to understand that effort will not necessarily compensate for a lack of ability and hence be susceptible to stereotype threat.


Although studies on the development of gender identity, stigma consciousness, and beliefs about intelligence seem to imply that children below the age of 8 or 10 will probably not be influenced by stereotype threat, the line of evidence concerning these potential age-related moderating variables we discussed here is indirect. That is, it is unclear whether moderators that were found to be relevant for stereotype threat among young adults also are relevant among schoolgirls. In addition, the conclusion that children below the age of 8 or 10 will probably not be influenced by stereotype threat is in contrast with the theory on domain identification, which would actually predict the opposite. It is therefore important to collate all the evidence that speaks to the ages at which stereotype threat effects among schoolgirls actually emerge. In our meta-analysis, we therefore (a) explored whether age is a moderator of the stereotype threat effect among schoolgirls and (b) studied the moderators (at the level of studies) that are implicated in stereotype threat theory as being relevant for stereotype threat.

2.1.4 Moderators of stereotype threat

Test difficulty

In our meta-analyses we considered, in addition to the exploratory moderator of age, four confirmatory moderators on the basis of theory and previous results (Nguyen & Ryan, 2008; Picho et al., 2013; Steele, 2010). The first moderator we hypothesized to have an influence on the effect of stereotype threat is test difficulty. Studies on the adult population showed that test difficulty is an important moderator (e.g., Nguyen & Ryan, 2008; Spencer et al., 1999). The moderation of test difficulty on the stereotype threat effect is often explained in terms of arousal (Ben-zeev et al., 2005), although psychometric reasons may also play a role (Wicherts et al., 2005). Studies showed that the stereotype threat effect appears to be mediated by arousal or anxiety (Ben-zeev et al., 2005; Delgado & Prieto, 2008; Gerstenberg et al., 2012; Osborne, 2001); thus, the more anxious or aroused participants are, the worse they will perform on a mathematical test. Relatively difficult items are more threatening than easy items; therefore, they lead to a higher state of arousal, which in turn will result in a larger gender gap in mathematical test performance (Delgado & Prieto, 2008; O'Brien & Crandall, 2003). These findings corresponded to traditional findings of social facilitation, which showed that arousal leads to diminished performance on a difficult task, whereas arousal leads to enhanced performance when the task is well learned (Markus, 1978; Zajonc, 1965). The moderating role of test anxiety might be explained by the fact that solving difficult questions requires a larger working memory capacity than solving easy questions (Beilock et al., 2007). When worrying thoughts provoked by stereotype threat occupy part of the working memory, solving a difficult question becomes problematic, whereas easy questions are still solvable because they do not require a large working memory capacity (Eysenck & Calvo, 1992). This mechanism leads to score reduction for difficult tests but not for easy tests. With the former in mind, we expected that the effect of stereotype threat would be stronger in studies that use a relatively difficult test compared to studies that use a relatively easy test. We defined difficulty here as the degree to which those in the sample answer items in the test correctly. Psychometrically advanced analyses that formally model the item difficulties are beyond the scope of this meta-analysis because they require the raw data (see Chapter 5).

Presence of boys

The second variable that we predicted to moderate the stereotype threat effect among schoolgirls is the absence or presence of boys during test-taking. Several studies showed that female students tend to underperform on negatively stereotyped tasks in the presence of male students who are working on the same task (Gneezy et al., 2003; Inzlicht & Ben-zeev, 2000; Inzlicht & Ben-Zeev, 2003; Picho et al., 2013; Sekaquaptewa & Thompson, 2003). This effect might be explained by the salience of gender identity; gender becomes more salient for women who are in the minority in a group than for women who are in a same-sex group (Cota & Dion, 1986; Mcguire, Mcguire, & Winton, 1979). In turn, the heightened salience of gender identity might lead to stronger effects of stereotype threat. People who hold a minority or token status within a group tend to suffer from cognitive deficits (C. G. Lord & Saenz, 1985), a phenomenon that is even registered when women simply watch a gender unbalanced video of a conference in a mathematical domain (Murphy, Steele, & Gross, 2007). The combination of both the activation of gender identity and reduced cognitive performance due to social pressure caused by a minority status then leads to worse performance for women confronted with stereotype threat in a mixed-gender setting. Thus, we predicted the stereotype threat effect among schoolgirls to be stronger in studies in which boys were present during test administration, compared to studies in which no boys were present during test administration.

countries (Else-Quest, Hyde, & Linn, 2010; Mullis, Martin, Foy, & Arora, 2012;- Organisation for Economic Co-operation and Development [OECD], 2010). In the cross-cultural study on 15-year-old students carried out by OECD (i.e., the Pro gramme for International Student Assessment or PISA) within 65 countries boys significantly outperformed girls on the mathematical test in 54% of the countries, Meta-analysis on stereotype threat

33

whereas in 8% of the countries girls outperformed boys. In 38% of the countries,- no significant difference between the two sex groups were found. Comparable are the Trends in International Mathematics and Science Study (TIMSS) stud ies (Mullis et al., 2012) on fourth graders (ages 9 to 10) within 50 countries, in which boys outperformed girls in 40% of the countries, girls outperformed boys in 8% of the countries, and no significant differences were found in 52% of the countries. However, the results of the TIMSS studies for eighth graders in 42 countries were different: in 31% of the countries, girls outperformed boys, while in only 17% of the countries, boys outperformed girls, and in 52% of the- cerningcountries the no gender significant gap in differences mathematics emerged. were proposed Overall, the to besex associated differences with for the 2 majority of countries were quite small. The differences between countries con gender equality and amount of stereotyping within countries (Else-Quest et al., 2010; Guiso, Monte, & Sapienza, 2008; Nosek et al., 2009). Some studies showed that gender equality is associated with the gender gap in mathematics for school- aged children (Else-Quest et al., 2010; Guiso et al., 2008). Gender equality also has as a negative relation with anxiety, and a positive relation with girls’ self-con predictedcept and self-efficacyby cross-national concerning differences the mathematicalin Implicit Association domain (Else-QuestTest-scores on et the al., 2010). In addition, the gender gap in mathematical test performance could be that the stereotype threat effect among schoolgirls would be stronger for studies gender-science relation (Nosek et al., 2009). Based on these results, we expected conducted in countries with low levels of gender equality compared to countries with high levels of gender equality. To operationalize this variable, we used the- Gender Gap Index (Hausmann, Tyson, & Zahidi, 2012), which is an index that incorporates economic participation, educational attainment, political empow usederment, before and ashealth moderator and survival variable of inwomen the meta-analysis relative to men. on stereotype Higher scores threat on andthe GGI indicate a higher degree of gender equality. Geographical regions have been regions within the United States of America. mathematical performance by Picho et al. (2012); however, they only studied Type of control condition The last moderator we studied concerned the type of control condition partic- ipants were assigned to. Stereotype threat experiments involve the use of two ranked by severity of stereotype threat. The condition that supposedly ranks or more conditions that differ in stereotype threat, such that conditions can be of a situation where participants do not receive any gender related information lowest on stereotype threat severity is the control condition, which exists either - (e.g., Delgado & Prieto, 2008; Muzzatti & Agnoli, 2007), or a so-called nullified control condition. This nullified control condition is designed to actively re Chapter 2

34

move the stereotype threat, usually by informing test-takers that girls perform equally well as boys or even that girls outperform boys on the mathematical test who(Cherney are assigned & Campbell, to a condition 2011; Neuburger in which etno al., additional 2012). There information are indications has been thatgiv- test-takers who are assigned to a nullified control condition outperform those-

areen (Campbell confronted & with Collaer, a MSSS 2009; test Smith their & gender White, identity 2002; Walton already & becomes Cohen, 2003; salient Wal by ton & Spencer, 2009). This effect is explained by the fact that whenever women-

the well-knowneffect of stereotype stereotype threat (Smith among & White, schoolgirls 2002); to giving be stronger no additional in studies informa that 2 tion would thus entail a form of implicit threat activation. Therefore, we expected condition without additional information. involved a nullified control condition compared to studies that involved a control

2.1.5 Publication Bias and p-Hacking Although the existence of the stereotype threat effect seems widely accept-

claimed to be. Based on recent published and unpublished studies that failed ed, there are some reasons to doubt whether the effect is as solid as it is often the literature on the stereotype threat effect in children might suffer from pub- to replicate the effects of stereotype threat, Ganley et al. (2013) suggested that

lication bias, a claim that had also been made for the wider stereotype threat literature involving females and mathematics (Stoet & Geary, 2012). Publication- bias refers to the practice of primarily publishing articles in which significant results are shown, thus leaving the so-called null results in the file drawer (Ioan nidis, 2005; Rosenthal, 1979; Sterling, 1959), a practice that can lead to serious inflations of estimated effect-sizes in meta-analyses (Bakker et al., 2012; Sutton, Duval, Tweedie, Abrams, & Jones, 2000). - According to Ioannidis (2005) a research field is particularly vulnerable to publication bias if the field (1) features studies with small sample sizes; (2) con cerns small effect sizes; (3) focuses on a large number of relations; (4) involves studies with a large flexibility in design, definitions, and outcomes; (5) is popular becauseand so features all six characteristics many studies, are and present (6) deals to somewith extenttopics relevantin stereotype to financial threat reor- political interest. The field of stereotype threat is susceptible to publication bias,-

search. For instance, most studies (39 out of the 47 studies) included in our cur rent meta-analysis had a total sample size smaller than 100; the averaged effect sizes found in the recent meta-analyses lie between 0.24 (Picho et al., 2013) and 0.45 (Walton & Spencer, 2009), which are classified as small to medium effect Meta-analysis on stereotype threat

35 sizes1 -

(J. Cohen, 1992); and the use of multiple dependent variables and covari ates is common practice (Stoet & Geary, 2012), despite problems associated with- covariate corrections (Wicherts, 2005). Furthermore, the design is often flexible with different kinds of manipulations, control conditions, and moderators. More over, the number of published studies attests to the popularity of the topic, and several stereotype threat researchers called for affirmitive action basedFisher on their vs. theresearch University (e.g., by means of a policy paper [Walton, Spencer, & Erman, 2013] or- licationthe Amicus bias Brief in our of meta-analytic Experimental data Psychologists, set. 2012, for the case of ). With the former in mind, we expected to find indications of pub we assume that the outcomes of the included studies are unbiased. Unfortunate- 2 If we want to draw conclusions based on the outcomes of a meta-analysis, ly the outcomes of some studies might be distorted due to questionable research practices (QRPs) in collection of data, analysis of data , and reporting of results. The term QRPs defines a broad set of decisions made by researchers that might- positively influence the outcome of their studies. Four examples of frequently lowersused QRPs the arep (1) failing to report all the dependent variables, p(2) collecting ex tra data when the test statistic is not significant yet, (3) excluding data when it- -value of the test statistic, and (4)p rounding down -values (John et Pal., 2012). The practice to use these QRPs with the purpose of obtaining a sta tistically significant effect is referred to as “ -hacking” (Simonsohn et al., 2013). -hacking can seriously distort the the scientific literature because it enlarges- the chance of a Type I errorp (Simmons et al., 2011), and it leads to inflated effect- sizes in meta-analyses (Bakker et al., 2012). If many researchers who work with thein the p same field invoke -hacking, then an effect that does not exist at the pop ulation level might become established. Simonsohn et al. (2013) have developed -curve: a tool aimed to distinguish pwhether a field is infected by selective- reporting,p or whether results are truthfully reported. When most researchers within a field truthfully reported correct p-values, a distribution of statistical sig pnificant-hack will -values be left should skewed. be right skewed (provided there is an actual effect in the population), whereas the distribution of -values for a field in which researchers

- d 1 Although widely used,d Cohen’sof 0.1 for rules the stereotype of thumb for threat small, effect medium, among and schoolgirls large effects could may be substantial not be entirely in the appro sense priate here. Set against the typical effect sizes for gender differences in mathematics (e.g., = 0.16, Hedges & lightNowell, of earlier1995), evenmeta-analyses a of the stereotype threat effect the same effect size estimate of d = 0.1 could be seenthat itas may small. then The explain core issue a substantial for understanding part of the the gender potential gap, effectall other of publication things being bias equal. is that When stereotype considered threat in

to underpowered studies. effects are small in relation to the sample sizes typical for (Bakker et al., 2012), leading Chapter 2

36

2.2 Method

Search strategies

- A literature search was conducted using the databases ABI/INFORM, PsycINFO,- calProQuest, and educational Web of Science literature. (searched The keywords in March that 2013), we usedand ERIC in the (searched literature in search Janu ary 2014). Combined, these five databases cover the majority of the psychologi-

(in conjunction with the phrase “stereotype threat”, which needed to be pres ent in the abstract) were “gender,” “math,” “performance,” or “mental rotation,” 2 and “children,” “girls,” “women,” or “high school.” This search strategy resulted- in several search strings that were connected by the search term “AND,” such as “ab("stereotype threat") AND children AND gender.” In addition two cited-refer ence searches on Web of Science were conducted; we targeted the oldest paper that we obtained from the first part of our literature search (Ambady et al., 2001) and the classical paper on stereotype threat and gender by Spencer et al. (1999). weAdditionally, obtained twowe performed extra articles. a more informal search on Google Scholar for which we usedAn importantthe same keywords part of a meta-analysisas our other database is the search searches. for unpublished With this strategy, studies -

or data (i.e., grey literature). We automatically searched parts of the grey litera ture by our search on Google Scholar and using databases PsycINFO, ERIC, and ProQuest; they do not only contain published papers but also dissertations and conference proceedings. Moreover, in order to find unpublished studies, we used three additional strategies. First, we e-mailed the first authors of the included screenedpublished the papers abstracts with ofthe poster question presentations whether they held possessed at the last any 10 unpublishedconferences data or were familiar with unpublished studies by other researchers. Second, we- - of the Society for Personality and Social Psychology (SPSP), selected those ab stracts that mentioned stereotype threat and children, and e-mailed the first au thor that worked on the project in question. Finally, we posted an open call for data on both the SPSP forum (www.spsp.org) and the Social Psychology Network forum (www.socialpsychology.org). We did not receive any papers through the- second and third strategies; however, we obtained seven responses through the- first strategy, which provided us with five additional studies. Five authors indi cated that they had no unpublished works. Ultimately, we included five effect siz es (11%) in the meta-analysis that were a product of unpublished studies. In our literature search, we obtained one Italian study (Tomasetto, Matteucci, & Pansu, 2010) that was translated by the first author. Meta-analysis on stereotype threat

37

Inclusion criteria studies in which schoolgirls were included in the sample and where the gender stereotypeWe included threat study was samples manipulated. based on We five excluded criteria. studiesFirst, we that selected focused only on those only boys or studies that concerned another negatively stereotyped group (e.g., ethnic minorities in other ability domains). Second, because we focused on studies with- dentschildren were and randomly adolescents, assigned we 2disregarded to the stereotype those threat studies condition for which or controlthe average con- dition.age within This the constraint sample meantwas above that 18.we Third,included we neither used experiments correlational in studies which stunor studies that failed to administer a viable stereotype threat. A viable threat was - 2 either accomplished using explicit cues that address the ramifications of the gen der stereotype (e.g., “Women perform worse on this mathematical test”) or using stereotypeimplicit cues threat that aremanipulation supposed to was activate treated gender as a stereotypesbetween-subjects (e.g., instructions factor and thusto circle excluded gender studies on a test in which form). this Fourth, variable we wasincluded treated only as studiesa within-subjects for which facthe- selected variables using the procedures described in the next section. tor. Fifth, the dependent variable had to be the score on a MSSS test. We coded the Coding Procedures The selection and coding of the independent and dependent variables was carried out following a number of rules. In some studies participants were assigned not only to a stereotype threat or control condition but also to an additional crossed factor. We treated the groups formed by the additional factor as different popu- lations when this factor was a between-subjects factor. Whenever the addition- 3 al factor was a within-subjects factor, we took only the level of the factor that, based on the existing theories of stereotype threat, would be expected to have the strongest effect. For instance, we selected a difficult over an easy test in one study

2 To correct for random assignment on the cluster level instead of the individual level, we used cluster correction for equal cluster sizes (Hedges, 2007), which was applied to five studies. Both corrected and uncorrected effect sizes are Y reportedT − Y C in Table2( n3.− We1)ρ based the adjustment of the effect size on the following formula: d =  •• ••  1− . T 2    ST  N − 2 The decision to use an intra-class correlation of in which calculations of the intra-class correlation for a large sample of schools showed an average of = .220. ρ = .2 was guided by the paperth of Hedges and Hedberg (2007), round this number down and use it in our analysis. ρ This number was rather stable across grades (kindergarten through the 12 grade); thus, we felt confident to In the experiment by Keller (2007), the factor domain identification was obtained by a median split based on the continuous variable domain identification that we were unable to duplicate. Therefore, we chose to calculate the effect 3 size over the entire sample pooled together, ignoring the variable domain identification. Chapter 2

38

control condition or a control condition in which no information had been given regarding(Neuville &gender Croizet, and 2007). performance. The control For conditionstudies that consisted involved of multipleeither a nullifiedtypes of -

control groups, we selected the control group in the following order: (1) a nul lified control condition which described that no differences in performance on- the mathematical test have been found, (2) a nullified control condition which described that girls perform better on the mathematical test condition, (3) a nulli- fied control condition in which test-takers were informed that the sex differences in performance on the mathematical test are due to stereotype threat, (4) a nulli- 2 tionfied controlhad been condition given. In that selecting entailed the a dependent description variable or visualization performance of a stereotypeon a MSSS inconsistent situation, and (5) a control condition in which no additional informa - test we used the following rules: we first selected a test administered after the threat manipulation over a test administered before the threat manipulation, sub sequently we selected published cognitive tests over self-constructed cognitive tests, and finally we selected math tests over other tests (i.e., spatial tests, physics usedtests, thegeometrical reported percentagerecall tests, of or correct chemistry answers tests). or alternativelyWe coded performance the average onsum a MSSS test via the official scorings rule for the test; if this rule was not reported, we - score (i.e., the raw mean number of correct answers per condition). In addition to the independent and the dependent variable, six other vari- cultables test were resulted coded. in Test a higher difficulty score was on codedthis moderator by 1 minus variable. the proportion We calculated of correct test answers within the control group of girls in the study sample; thus, a more diffi

difficulty using the data from the control group of girls only instead of the entire usesample the becausedata of girlssome in (but the not experimental all) studies group included because boys inthe their effect samples of stereotype and the test difficulty needed to be comparable across samples. Additionally, we did not with yes when boys were present during test administration or alternatively with nothreat when would boys wereprobably not present.distort the The actual type of difficulty. control group Presence was codedof boys with was nullified coded

control condition without such an active threat removal was coded as no informa- tionwhenever the control condition consisted of an active threat removal, whereas a

The . exploratoryCross-cultural variable gender type equality of manipulation in the country was where coded the by studyeither took explicit place or was im- plicitcoded as by indicated the country’s in Table score 2.1. on Age the was Gender coded Gap by using Index the (Hausmann mean age etin al.,the 2012). entire

- sample; however for papers that only reported an age range we took the midpoint of this range. Test difficulty, age, and cross-cultural gender equality were includ ed as continuous moderators in the analysis, whereas presence of boys, type of control group, and type of manipulation were included as categorical moderators. Meta-analysis on stereotype threat

39

- ditional information from the authors via email. We sent the authors one remind- er whenWhenever they failed the topapers respond. provided When insufficient we failed to information, obtain all information we requested needed ad

Missing pieces of information on moderator variables were treated as missing to calculate the effect size, we excluded the paper from that particular analysis. - velopedvalues, which a coding were sheet. excluded4 pairwise from the analysis. - To assure the coding procedure would be as objective as possible, we de The coding process was first carried out by the first au werethor. Torescored assess byinter-rater two independent agreement, raters five for variables all studies (type except of control for unpublished condition, presence of boys, cross-cultural gender equality, kage, and type of manipulation)- 2 studies that were not reported in paper form ( = 43). The inter-rater agree ment was assessed by calculating Fleiss’ exact kappa (Conger, 1980; Fleiss, 1971)- for categorical variables and the two-way, agreement, unit-measures intraclass measurescorrelation reached (Hallgren, satisfactory 2012; Shrout levels & of Fleiss, agreement 1979) for for the continuous nominal variables variables type us ing the R-package irr (Matthias Gamer, Lemon, Fellows, & Singh, 2012). Those of control condition (Fleiss’ exact κ = .76) and presence of boys (Fleiss’ exact κ- = .68) as well as for continuous variables cross-cultural gender equality (ICC = 1.00) and age (ICC = .96). Only the agreement for the variable type of manipula tion was lower (Fleiss’ exact κ = .10), indicating only slight agreement among the thisthree variable coders. is However, not overly as problematic. the type of manipulation Disagreements was in usedscoring as werean exploratory solved by selectingvariable in the this modal study response. and was, The therefore, dependent not our variable main “performancefocus; low agreement on a MSSS on coders because for these variables too much information was not reported in the originaltest” and articles the moderator and needed variable to be “testretrieved difficulty” by e-mailing were not the retrieved authors. by multiple

Statistical Methods We used Hedges’ g means of the following formula: (Hedges, 1981) as effect size estimator, which was calculated by exp erimental control Y•• − Y••  3  Hedges' g = × 1− . S pooled  4(n1 + n2 )− 9 

2 2 S is given by the following formula: (n1 −1)× s1 + (n2 −1)× s2 pooled S pooled = . n1 +n 2 −2 Thus,

4 A list of excluded studies and the coding sheet are available upon request, and available on OSF (https://osf.io/tlutg/). 40 Chapter 2

study samples with negative effect sizes denote the expected performance dec- -

rement due to stereotype threat, whereas positive effect sizes contradict our ex pectations. The model fitted to the data was the random effects model (for the multipleanalyses withoutmoderators moderators) as well as and to thegeneralize mixed effectsto the modelentire (forpopulation the analyses of studies with moderators) because we wanted both to explain systematic by adding automatically weighted by the inverse of the study’s variance. We have (Viechtbauer, 2010). A characteristic of these two methods is that effect sizes are

not weighted the effect sizes with regards to other quality indicators. We estimated 2 populationthese models level with effect the R-package sizes values metafor vary (Viechtbauer, and are normally 2010) distributed.in R version 3.0.2.In this When fitting the random effects model, we automatically assume that the calculate a credibility interval around the average effect size (g ) in addition to case, it is considered good practice (Hunter & Schmidt, 2004; Whitener, 1990) to

the more familiar confidence interval. We calculated the 95% credibility interval, ofwhich this isinterval an estimation are obtained of the using boundaries the standard in which deviation 95% of of values the distribution in the effect of size distribution are expected to fall (Hunter & Schmidt, 2004). The boundaries SDES SDES of g effect sizes ( ), or more specifically adding and subtracting 1.96 times the obtain the boundaries around a single value of g . In contrast, for the 95% confidence interval the standard error is used to the credibility interval gives an indication of the amount.The confidence of heterogeneity interval in gives the distributionan indication of of effect how sizes. the results can fluctuate due to sampling error, whereas We estimated the amount of heterogeneity τ 2 with the restricted maximum

- likelihood estimator, which is the default in metafor (Viechtbauer, 2010) and an- approximately unbiased estimator for the standardized mean difference (Viecht bauer, 2005). To address the issue of publication bias, we used several meth ods. First, we used three methods based on funnel plot asymmetry: the trim and- tionfill method of the three(Duval methods & Tweedie, is desirable 2000; Rothstein, to obtain 2007),robust theresults rank because correlation both testthe rank(Begg correlation & Mazumdar, test 1994), and Egger’s and Egger’s test have test low (Sterne power & whenEgger, the 2005). amount A combina of stud-

- ies in the analysis is small (Kepes, Banks, & Oh, 2012).To take tests into account- that are not based on the funnel plot, we conducted Ioannidis and Trikalinos’s ex ploratory test (2007), which compares the observedp -curveamount to of have significant an indication stud ies and the expected amount of significantp studies based on power calculations (see also Francis, 2013, 2014). Finally, we created a of the practice of specific types of -hacking within the field (Simonsohn et al., Meta-analysis on stereotype threat 41

A p p-values within a set of studies.5 So the p scores2013). of the-curve experimental consists ofgroup only and statistically the control significant group signifcantly differed from -curve tanalysis includes only thep-curve 15 studies resembles for which a right the skewed mean aeach left otherskewed (based curve on suggests a -test and that α some = .05). researchers If the have invoked p - curve, this finding suggests that our set of findings has evidential value, whereas We pre-registered the hypotheses and inclusion criteria of our-hacking meta-analy (Si- monsohn et al., 2013). sis via the Open Science Framework (https://osf.io/bwupt/). 2.3 Results 2

Our literature search and the call for data yielded 972 papers that were further- screened. Based on the inclusion criteria, 26 papers (i.e., studies) or unpublished reports were actually included in the meta-analysis, which resulted in 47 inde pendent effect sizes (i.e., study samples). Additional information concerning the process is listed in Figure 2.1. These 26 papers provided us with a wealth of new information because only 3 of these papers (12%) were also included in the most recent meta-analysis on this topic (Picho et al., 2013). The overlap withN =the four older meta-analysesn = is equal to or smaller than 12%. The total sample,- obtained by simply addingST all participants of the included studies, consisted of tion and nC = - 3760 girls, of which 1926 girls were assigned to the experimental condi 1834 girls were assigned to the control condition. The most import Overallant characteristics Effect of the included study samples are summarized in Table 2.3. - ples in the papers are considered independent. In accordance with our hypothe- To estimate the overall effect size, we used a random effects model in which sam g = z = - p < sis as well as the former literature, we found a small average standardized mean who have been exposed to a stereotype threat95 on average score lower on the MSSSdifference, tests compared-0.22, to girls3.63, who have .001, not CI been= -0.34; exposed -0.10, to indicating such a threat. that girlsFur- thermore, we found a significant amount of heterogeneity using the restricted

p p-values 5 Over the past years it turned out only specific QRPs (e.g., ad hoc outlier removal, sampling of participants until a significant effect is reached,p incorrect rounding of -values) lead to a bump in the distribution of near the .05 cutoff (Hartgerink,p- van Aert, Nuijten, Wicherts, & van Assen, 2016; Van Aert et al., 2016), that can pbe detected by techniques as -curve or p-uniform (Simonsohn et al., 2014; Van Aert et al., 2016). Other QRPs (e.g., selecting the smallest value when multiple conditions have been used) can actually lead to very small -values, and cannot be detected with the use of those techniques (Ulrich & Miller, 2015). 42 Chapter 2 Manipulation Implicit Implicit Implicit Implicit Implicit Implicit Explicit Explicit Explicit Explicit Implicit Explicit Explicit Explicit Explicit Explicit Explicit Implicit Implicit Implicit Implicit Implicit Explicit Explicit GGI .673 .673 .673 .673 .698 .698 .737 .737 .737 .727 .673 .737 .737 .737 .737 .737 .737 .737 .698 .698 .698 .698 .763 .763 Difficulty .636 .668 .594 .500 .508 .552 .500 .370 .458 .365 NA .620 .230 .360 .560 .550 .480 .782 .589 .538 .598 .578 .531 .705 Boys Yes Yes Yes Yes Yes Yes No Yes No Yes No Yes Yes Yes Yes Yes Yes Yes Yes No No Yes Yes Yes 2 CC No information No information No information No information Nullified Nullified Nullified Nullified No information No information Nullified Nullified No information No information No information No information No information No information No information No information No information No information Nullified Nullified

a g 0.199 0.028 -0.891 0.557 -0.705 -0.864 0.293 0.507 -0.656 -0.270 (-0.277) -0.620 0.137 0.276 -0.158 0.165 0.141 -0.268 -0.693 -0.867 -0.742 0.010 (0.010) -0.808 (-0.815) -0.457 0.040 N 38 59 41 49 63 59 124 135 48 168 80 110 115 99 29 65 76 34 92 20 136 87 35 55 Status Unpub. Unpub. Unpub. Unpub. Pub. Pub. Pub. Pub. Pub. Pub. Pub. Pub. Pub. Pub. Pub. Pub. Pub. Pub. Pub. Pub. Pub. Pub. Pub. Pub. Country Italy Italy Italy Italy France France USA USA USA Spain Italy USA USA USA USA USA USA USA France France France France Germany Germany Age 10.92 12.92 14.01 16.03 10.58 10.58 16.02 16.02 5.98 15.5 6.47 13.5 12.5 13.5 9.5 13.5 17.5 14.81 12 12 12 12 15.7 15.9 No. 1A of 1 1B of 1 1A of 1 1B of 1 1A of 1 1B of 1 1A of 1 1B of 1 2 of 1 of 1 1 of 3 2A of 3 2B of 3 3A of 3 3B of 3 3C of 3 1 of 1 1 of 2 2A of 2 2B of 2 1 of 1 of Year ------2011 2011 2011 2011 2012 2008 2013 2013 2013 2013 2013 2013 2013 2010 2009 2007 2007 2007 2003 2007 - Characteristics and statistics of studies included in the meta-analysis. in the included of studies and statistics Characteristics Study Authors & Muzzatti Agnoli, Altoè & Muzzatti Agnoli, Altoè & Pastro Agnoli, Altoè & Pastro Agnoli, Altoè Bagès & Martinot Bagès & Martinot & Campbell Cherney & Campbell Cherney per Cimpian, Mu, & Ericksonlowered performance impaired sistence, & Prieto Delgado Galdi, Cadinu, & Tomasetto et al. Ganley et al. Ganley et al. Ganley et al. Ganley et al. Ganley et al. Ganley Good et al. Huguet & Régner Huguet & Régner Huguet & Régner Huguet & Régner & Dauenheimer Keller Keller 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 Table 2.3 Table Meta-analysis on stereotype threat

43 Manipulation Explicit Explicit Explicit Explicit Explicit Implicit Implicit Implicit Implicit Implicit Implicit Implicit Explicit Implicit Explicit Explicit Implicit Explicit Implicit Implicit Implicit Implicit Implicit GGI .737 .673 .673 .673 .673 .673 .673 .673 .673 .673 .673 .673 .763 .698 .723 .723 .737 .763 .673 .673 .673 .673 .737 Difficulty .310 .572 .643 .554 .582 .509 .663 .610 .663 .364 .305 .325 .741 .200 .330 .390 .522 .272 .338 NA NA NA .730 Boys Yes Yes Yes Yes No Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes No Yes Yes Yes No No No No 2 CC Nullified Nullified Nullified Nullified Nullified No information No information No information No information No information No information No information Nullified No information No information No information No information Nullified Nullified No information No information No information No information

a g -0.576 (-0.581) -0.541 -0.497 -0.620 -0.266 0.047 0.230 0.132 -0.424 0.028 0.148 -1.197 -0.143 -0.639 -0.744 -0.135 -0.160 (-0.160) 0.273 -0.125 -0.652 -0.339 -0.322 -0.252 N 90 49 24 23 71 35 68 64 42 42 48 30 72 45 38 51 730 84 118 33 64 27 74 . CC= control condition. Boys = presence of boys (yes) or not (no). GGI = Status Pub. Pub. Pub. Pub. Pub. Pub. Pub. Pub. Pub. Pub. Pub. Pub. Pub. Pub. Pub. Pub. Pub. Pub. Pub. Pub. Pub. Pub. Unpub. control condition control + N + Country USA Italy Italy Italy Italy Italy Italy Italy Italy Italy Italy Italy Germany France Uganda Uganda USA Germany Italy Italy Italy Italy USA threat condition threat Age 16 15.5 17.97 17.97 17 7.2 8.4 9.4 10.4 8.2 10.2 13 10.18 7.3 15.5 15.5 17.5 10.47 15.59 5.43 6.05 7.47 11 No. 1 of 1 of 1A of 1 1B of 1 1 of 2 1A of 2 1B of 2 1C of 2 1D of 2 2A of 2 2B of 2 2C of 2 1 of 1 of 1A of 1 1B of 1 1 of 2 1 of 1 of 1A of 1 1B of 1 1C of 1 1 of Year 2012 2012 2009 2009 2006 2007 2007 2007 2007 2007 2007 2007 2012 2007 2012 2012 2004 2010 2010 2011 2011 2011 - Continued Study Authors & Taasoobshirazi Marchand Moè Moè Moè Moè & Pazzaglia Muzzatti & Agnoli Muzzatti & Agnoli Muzzatti & Agnoli Muzzatti & Agnoli Muzzatti & Agnoli Muzzatti & Agnoli Muzzatti & Agnoli et al. Neuburger & Croizet Neuville Picho & Stephens Picho & Stephens & Ward Stricker Titze et al. et al Tomasetto et al. Tomasetto et al. Tomasetto et al. Tomasetto Twamley Status = published versus unpublished papers. N = N = N unpublishedpapers. Statuspublished versus = The primary number is the corrected effect size; the number in parentheses is the uncorrected effect size. effect is the uncorrected size; the number in parentheses effect The primary number is the corrected 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 Table 2.3 Table Note. a cell with missing data. indicates NA Gender Gap Index. a 44 Chapter 2

Search by Databases Call for data Informal search: )

(n= 965 Google Scholar (n= 5) (n= 2)

Potential papers obtained by literature search

(n=972) Papers excluded Potential papers screened by title for (undoubled results) 2 retrieval (n=248)

(n=724) Papers excluded Potential papers screened by abstract for (other topic) retrieval (n=452)

(n=272) Papers excluded - Potentially appropriate papers to include in (other topic, adult sample, non meta-analysis experimental, wrong dv) (n=164) (n=108) Papers excluded - Papers included in the meta-analysis (other topic, adult sample, non experimental, wrong dv) (n=81) (n=27) Papers excluded Papers with useable information (insufficient information) (n=1) (n=26) Figure 2.1 Flow chart of the literature search.

τˆ 2 Q = p <

95 maximumsizes. This likelihoodestimated estimator,heterogeneity = accounts0.10, (46) for a 117.19,large share .001,of the CI total= 0.04;vari- 0.19, whichI2 = indicates there is variability among the underlying population effect

ability, 61.75%. The 95% credibility interval, an estimation of the boundaries in which 95% of the true effect sizes are expected to fall, lies between Meta-analysis on stereotype threat

45

which-0.85 each and effect 0.41 was (Viechtbauer, estimated. 2010). This range constitutes a wide interval. The forest plot (Figure 2.2) depicts the effect sizes against the precision with

Agnoli, Altoè & Muzzatti, NA 0.20 [−0.44, 0.84] Agnoli, Altoè & Muzzatti, NA 0.03 [−0.49, 0.55] Agnoli, Altoè & Pastro, NA −0.89 [−1.53, −0.25] Agnoli, Altoè & Pastro, NA 0.56 [−0.01, 1.13] Bagès & Martinot, 2011 −0.70 [−1.21, −0.20] Bagès & Martinot, 2011 −0.86 [−1.41, −0.32] Cherney & Campbell, 2011 0.29 [ 0.01, 0.58] Cherney & Campbell, 2011 0.51 [ 0.16, 0.85] Cimpian, Mu & Erickson, 2012 −0.66 [−1.24, −0.08] 2 Delgado & Prieto, 2008 −0.27 [−1.00, 0.46] Galdi, Cadinu & Tomasetto, 2014 −0.62 [−1.07, −0.17] Ganley et al., 2013 0.14 [−0.24, 0.51] Ganley et al., 2013 0.28 [−0.09, 0.64] Ganley et al., 2013 −0.16 [−0.55, 0.24] Ganley et al., 2013 0.16 [−0.56, 0.89] Ganley et al., 2013 0.14 [−0.35, 0.63] Ganley et al., 2013 −0.27 [−0.72, 0.18] Good, Woodzicka & Wingfield, 2010 −0.69 [−1.39, −0.00] Huguet & Régner, 2009 −0.87 [−1.30, −0.44] Huguet & Régner, 2007 −0.74 [−1.65, 0.16] Huguet & Régner, 2007 0.01 [−0.43, 0.45] Huguet & Régner, 2007 −0.81 [−1.37, −0.24] Keller & Dauenheimer, 2003 −0.46 [−1.13, 0.22] Keller, 2007 0.04 [−0.49, 0.57] Marchand & Taasoobshirazi, 2012 −0.58 [−1.17, 0.02] Moè, 2012 −0.54 [−1.11, 0.03] Moè, 2009 −0.50 [−1.31, 0.32] Moè, 2009 −0.62 [−1.46, 0.22] Moè & Pazzaglia, 2006 −0.27 [−0.73, 0.20] Muzzatti & Agnoli, 2007 0.05 [−0.65, 0.75] Muzzatti & Agnoli, 2007 0.23 [−0.25, 0.71] Muzzatti & Agnoli, 2007 0.13 [−0.36, 0.62] Muzzatti & Agnoli, 2007 −0.42 [−1.04, 0.19] Muzzatti & Agnoli, 2007 0.03 [−0.58, 0.64] Muzzatti & Agnoli, 2007 0.15 [−0.42, 0.72] Muzzatti & Agnoli, 2007 −1.20 [−1.98, −0.41] Neuburger et al., 2012 −0.14 [−0.61, 0.32] Neuville & Croizet, 2007 −0.64 [−1.24, −0.04] Picho & Stephens, 2012 −0.74 [−1.40, −0.09] Picho & Stephens, 2012 −0.14 [−0.69, 0.42] Stricker & Ward, 2004 −0.16 [−0.35, 0.03] Titze, Jansen & Heil, 2010 0.27 [−0.16, 0.70] Tomasetto, Matteucci & Pansu, 2010 −0.12 [−0.59, 0.34] Tomasetto, Alparone & Cadinu, 2011 −0.65 [−1.35, 0.05] Tomasetto, Alparone & Cadinu, 2011 −0.34 [−0.83, 0.16] Tomasetto, Alparone & Cadinu, 2011 −0.32 [−1.09, 0.44] Twamley, 2009 −0.25 [−0.71, 0.21]

RE Model −0.22 [−0.34, −0.10]

Figure 2.2 The forest plot of included effect sizes. −2 −1 0 1 2

Observed Outcome Chapter 2

46

Moderator Analyses We submitted the data to separate mixed effects meta-regressions for each of the four moderators and used the REML estimator to obtain the residual τˆ 2 - - gression analyses for each moderator variable separately are presented (i.e.,in Table un explained variance in underlying effect sizes). The results of the simple meta-re

2.4, where the variables presence of boys and control condition were treated as categorical variables, and the remaining variables were treated as continuous- variables. None of the moderators were statistically significant.2 Additionally, the QM p = τˆ = . QE = p

ploratoryfor systematic variable variety age, in the(1) magnitude= 0.65, = .42,of the effect= .10 ,sizes (45) due 112.80,to differences 001, in did not turn out be statistically significant, indicating that we found no evidence QM = p = 2 τˆ QE = p < . moderationage. Additionally either. the exploratory variable type of manipulation, (1) 3.16, .08, = .09, (45) 103.87, 001, did not result in a statistically significant Sensitivity Analyses

- To verify the robustness of our results (notably the estimated effect size), we ran several sensitivity analyses, as is recommended for meta-analyses (Green- house & Iyengar, 2009). Specifically, we verified the robustness of our results- ferentwith respect estimates to the of τuse2 of a different statistical meta-analytic model, an alterna tive heterogeneity estimator, re-analyses of the random effects model using dif , diagnostic tests, and different subsets of effect sizes. First, of g = z p < .001. Using the DerSimonian–Laird estimator yielded in a fixed effects model, we also6 found a statistically significant mean effect size g = -0.16,z = - = -4.35,p < a similar effect size estimate2 as the restricted maximum likelihood estimator, τˆ = 95 Q p < -0.22, 3.66, .001, CI = -0.34; -0.10, with roughly the same2 amount of also reran the original analysis with three different amounts for95 τˆ : the originally estimated heterogeneity,τˆ 2 the upper bound 0.10, around (46) τˆ 2 , = 117.19, .001, CI = 0.04; 0.19. We interval around τˆ 2 . and the lower bound of the confidence- The results of these analyses are summarized in Table 3.6. Although the estimated effect sizes varied slightly, they all were negative and dif fered significant from zero.

interpreting this result due to the heterogeneity we found amongst effect sizes. 6 Although we report this analysis for the sake of robustness of the estimated effect size, we would not advocate Meta-analysis on stereotype threat 47

Table 2.4 Results of the univariate mixed effects meta-regression per moderator.

2 2 2 Variable k N Intercept Slope SE z p 95% CI QE τ QM I R coefficient 47 -0.80 .07 Boys 47 -0.28 0.08 -0.21 0.10 0 GGI 3760 -2.23 2.83 1.85 1.53 .13 6.46 107.33* 0.09 2.34 60% 3760 0.15 0.54 .59 0.36 117.08* 0.29 62% 0.42 .28 1.28 0.10 1.18 .02 (factor) Control 47 .80 -0.22 0.10 0 Difficulty 43 3556 -0.43 0.45 1.09 -0.37 105.28* 63% 3760 -0.23 0.03 0.13 0.25 0.29 115. 17* 0.06 62% Note.(factor) *p < .001.

GGI = Gender Gap Index. Variable Boys present is scored as present=1, not present =0. Variable control condition is scored as nullified = 1, no information = 0. 2 Table 2.5 Results of the multivariate mixed effects meta-regression with four moderators included.

Variable Slope coefficient SE z p 95% CI Intercept -2.07 .17 2.10 .27 -1.82 1.52 -1.36 -5.06 0.91 0.18 -0.27 GGI 2.30 1.09 6.41 1.20 Boys (factor) -0.05 .79 -0.39 0.30 0.14 0.22 -0.24 Difficulty 0.52 0.43 .23 -0.33 1.37 Note.Control (factor) -0.03 .83 0.31

GGI = Gender Gap Index. Variable Boys present is scored as present=1, not present =0. Variable control condition is scored as nullified = 1, no information = 0.

Table 2.6 Sensitivity Analysis: estimating the effect using different amounts of heterogeneity.

g SE z p 0.0447𝟐𝟐𝟐𝟐 -0.20 <.001 0.1001𝛕𝛕𝛕𝛕� -0.22 <.001 0.05 -4.06 -0.24 0.08 .002 0.06 -3.63 0.1940 -3.10

- a studentizedWe also residual considered larger potential than 2. Running outliers, theby inspecting analysis without the studentized this study resid gave uals, and found that the second study of Cherney and Campbell (2011) displayed an estimated effect size of g = z = p - ated different subsets to see whether-0.24, the -4.05,effect is

p k = g = z p k = -4.15, < .001, 34, than samples gathered in the United States of America, -0.05, = -0.48, = .63, 13. Additionally we created subsets of young (younger g = z p k = than 13 years) and older (13 years or older) participants; the estimated effect size g = z p k = 22. Using an was larger in samples with younger students, -0.25, = -2.92, = .004, 25, than in samples with older students, -0.20, = -2.19, = .03, g = z p k = g = z p = alternativek = cut-off at the age of 10 yielded similar results (for younger students,- -0.24, = -2.06, = .04, 11, and for older students, -0.22, = -3.07, .002, 36). These subset analyses are exploratory analyses and should be inter Excesspreted asof such;Significance however, Results they might be an inspiration for future research. 2

We used several methods to test for the presence of publication bias. First, we ran several tests on the funnel plot (see Figure 2.3) to assess funnel plot asymmetry.- putedAccording on the right to sidethe estimationsof the funnel of plot. the Actual trim and imputation fill method of those (Duval missing & Tweedie, effect 2000), the funnel plot would be symmetric if 11 effect sizes would have been im g = z = p - sizes (Duval & Tweedie, 2000) reduced the estimated effect size to -0.07, cantly from zero whereas our original effect size estimation of g = -1.10, = .27, CI95 = -0.21; 0.06. Because this altered effect size did not differ signifi -0.22 did, this 0 p > .10 .10 > p > .05 .05 > p > .01 p < .01 0.116 0.231 Standard Error Standard Error Standard 0.347 0.462

−1.5 −1 −0.5 0 0.5 1 1.5 Observed Outcome

Figure 2.3 The contour-enhanced funnel plot ofObserved the included Outcome effect sizes. Meta-analysis on stereotype threat

49

z = p = - pattern is a first indication that our results might = -.27 be , distortedp = by publication bias. Both Egger’s test (Sterne & Egger, 2005; -3.25, .001) and Begg and Mazum dar's (1994) rank correlation test, Kendall’s τ .01, indicated funnel plot preciseasymmetry. study This samples. finding The indicates relation that between imprecise imprecise study samples samples (i.e., and study the effect samples siz- eswith is illustrateda larger standard in Figure error) 2.4 usingon average a cumulative contribute meta-analysis to a more negative sorted by effect the samthan- withpling thevariance smallest of the sampling samples variance (Borenstein, and proceeds Hedges, Higgins,adding the & Rothstein, study with 2009). smallest remainingThis cumulativesampling variance process andfirst re-analyzing carries out auntil “meta-analysis” all samples areon includedthe sample in the meta-analysis. The drifting trend of the estimated effect sizes visualizes the 2 effect that small imprecise study samples have on the estimations of the mean ef- N N fect. We created subsets to estimate the effects of large study samples ( ≥ 60) and g = z = - p k =small study samples ( < 60). We found a stronger effect in the subset of smaller study samples, -0.34, 3.76, < .001, CI95 = -0.52; -0.16, CrI95 = -0.96; 0.27, g = z = - p = k = 24, and a small and nonsignificant effect for the subset of larger study samples, 95 95 -0.13, 1.63, .10, CI = -0.29; 0.03, CrI = -0.75; 0.49, 23. - fects Finally,than would Ioannidis be expected and Trikalinos’s based on exploratorythe cumulative test power(Ioannidis of all & studyTrikalinos, sam- 2007) 2showed= thatp this= .004. meta-analysis7 contains more statistically significant ef- ples, χ (1) 8.50, The excess of statistically significant findings is an dueother to indicator the practice of publication of p-hacking bias we (Bakker created et a al.,p 2012; Francis, 2012). To check the alternative explanation that thep-curve excess depicts of statistically the theoretical significant distribution findings ofis p -curve (Figure 2.5) as described ofby p Simonsohn et al. (2013). The -values when there is no effect present (solid line),p-values the theoretical in our meta-analysis distribution -values when an effect is present and the tests have 33% power2 (dotted line),p < and the observed distribution of the significant of(dashed practices line). like The p-hacking. observed8 distributionp-curve was right-skewed,might not be sensitiveχ (30) = 62.87,to several .001, which indicated that there is an effect present that is not simply the result However, types of p-hacking (Van Aert et al., 2016). Overall, most publication bias tests indicate that the estimated effect size is likely to be inflated.

7 To calculate the cumulative power we used the estimated effect size obtained by the random effects model,- |g|= 0.2226. Although we detect a significant difference between the observed and expected significant study samples based on this effect size, the test is rather sensitive. For an effect size of 0.27, the test is no longer statis tically significant. Also note that this analysis assumes a common fixed effect size, which might affect the results (see Francis, 2013). χ2 p

8 The test for the left-skewed distribution is not statistically significant, (30) = 18.24, = .95. Chapter 2

50

Study 41 −0.16 [−0.35, 0.03] + Study 7 0.05 [−0.39, 0.50] + Study 8 0.19 [−0.20, 0.59] + Study 13 0.21 [−0.09, 0.50] + Study 12 0.19 [−0.05, 0.43] + Study 14 0.14 [−0.08, 0.36] + Study 19 0.01 [−0.31, 0.33] + Study 42 0.04 [−0.24, 0.33] + Study 21 0.04 [−0.21, 0.29] + Study 11 −0.02 [−0.28, 0.24] + Study 17 −0.04 [−0.28, 0.20] + Study 47 −0.06 [−0.28, 0.17] + Study 37 −0.06 [−0.27, 0.15] + Study 43 −0.06 [−0.26, 0.13] + Study 29 −0.08 [−0.26, 0.11] + Study 31 −0.06 [−0.23, 0.12] + Study 16 −0.05 [−0.21, 0.12] 2 + Study 32 −0.04 [−0.20, 0.12] + Study 45 −0.05 [−0.21, 0.10] + Study 5 −0.08 [−0.24, 0.08] + Study 2 −0.08 [−0.23, 0.08] + Study 24 −0.07 [−0.22, 0.08] + Study 6 −0.10 [−0.25, 0.05] + Study 40 −0.10 [−0.25, 0.05] + Study 22 −0.13 [−0.28, 0.02] + Study 35 −0.12 [−0.26, 0.03] + Study 4 −0.10 [−0.24, 0.05] + Study 26 −0.11 [−0.25, 0.03] + Study 9 −0.13 [−0.27, 0.02] + Study 25 −0.14 [−0.28, 0.00] + Study 38 −0.15 [−0.29, −0.01] + Study 34 −0.15 [−0.28, −0.01] + Study 33 −0.15 [−0.29, −0.02] + Study 1 −0.14 [−0.28, −0.01] + Study 3 −0.16 [−0.30, −0.03] + Study 39 −0.18 [−0.31, −0.04] + Study 23 −0.18 [−0.31, −0.05] + Study 18 −0.19 [−0.32, −0.06] + Study 30 −0.19 [−0.31, −0.06] + Study 44 −0.19 [−0.32, −0.07] + Study 10 −0.20 [−0.32, −0.07] + Study 15 −0.19 [−0.31, −0.07] + Study 46 −0.19 [−0.31, −0.07] + Study 36 −0.21 [−0.33, −0.08] + Study 27 −0.21 [−0.33, −0.09] + Study 28 −0.22 [−0.34, −0.10] + Study 20 −0.22 [−0.34, −0.10]

−0.4 −0.2 0 0.2 0.4 0.6

Overall Estimate

Figure 2.4 Cumulative meta-analysis sorted from smallest to largest sampling variance. Meta-analysis on stereotype threat

51

100

Observed

Null of 33% power

80 Null of zero effect

60 −values

40 2 Percentage of p

20

0

0.01 0.02 0.03 0.04 0.05 p−value Figure 2.5 The p-curve of the included studies.

2.4 Discussion

Analyzing 15 years of stereotype threat literature with children or adolescents as test-takers, we found indications that girls underperform on MSSS tests due- timatedto stereotype a small threat. effect Consistent of -0.22. Thewith estimations findings by Nguyenof heterogeneity and Ryan indicated (2008), Picho that thereet al. (2013),was a large Walton share and of Cohen heterogeneity (2003), and among Walton population and Spencer effect (2009), sizes. We we ranes

- multiple sensitivity analyses, and most of these tests indicated that the mean toeffect corroborate size is rather predictions robust drawnagainst from fluctuations stereotype due threat to alternative theory with decisions regards re to garding the analyses or the removal of influential studies. Yet our results failed - edthe the moderating effect of stereotypevariables. None threat. of theExploratory four variables analyses (difficulty, with moderators presence of as boys, age type of control group, and cross-cultural gender equality) significantly moderat or type of manipulation did not yield significant moderation either. However, we Chapter 2

52

stereotype threat. did find some strong indications that publication bias is present in the field of- - In future research, the exploratory moderators age and type of manipula- tion deserve more attention. With regards to the variable age, the effect of stereo type threat overall appears to be rather stable over different ages. However, sur withprisingly, older the children. subset Ananalyses additional indicated subset that analysis the estimated on our data effect using size only for samples with children younger than 13 was slightly larger than the effect size for samples- g = z p k = 7. This with early grade school children (i.e., younger than 8 years old) shows a rela 2 tivelypredict large that estimatedvery young mean children effect would size, not yet-0.48, be sensitive= -4.30, to detrimental< .001, effects ofoutcome stereotypes: is rather preadolescent counterintuitive, children because have notthree obtained theories a coherenton stereotype sense threatof the

self yet (Aronson & Good, 2003), young children fail to understand that effort will- not necessarily compensate for a lack of mathematical abilities (e.g., Droege &- ableStipek, type 1993; of manipulation Stipek & Daniels, also deserves1990), and extra older attention. children Although endorse gendertype of manipstereo- types more strongly than younger children (Steffens & Jelenec, 2011). Thep vari

ulation did not have a statistically significant effect on stereotype threat ( = .08), the intercoder agreement for this variable was suboptimal, and most likely the power for the test of this variable is low. In other words, the circumstances under which we measured this variable were not ideal, and future inspection of it might be valuable. Due to these issues, we conclude that the type of manipulation and- tionedage are by variables the real requiring likelihood more of publication attention in bias. the stereotypeAll three tests threat based literature. on funnel Unfortunately, the robustness of the stereotype threat effect can be ques

plot asymmetry—trim and fill (Duval & Tweedie, 2000), Egger’s test (Sterne & Egger, 2005), and Begg and Mazumdar's rank correlation test (Begg & Mazumdar, 1994) —indicated that publication bias was present. Additionally Ioannidis and- Trikalinos’s (2007) exploratory test highlighted an excess of significant findings, which can be due to publication bias. These findings might not be entirely reli able when heterogeneity between effect sizes is present (Ioannidis & Trikalinos, methods2007). However, of the selected this test studiesis deemed only appropriate vary in details (Francis, and 2013)the population if the included is re- experiments, used methods, and selected populations are similar. Because the

stricted to schoolgirls, we see no reason to disregard the results of Ioannidis and meanTrikalinos’s effect. exploratoryThat result test.is striking Moreover, because when smaller we compared studies arethe subsetsassociated of largewith lowerstudy samplespower to anddetect small an effect study and samples, more onlysampling the latter variability. obtained Taking a significant all afore-

mentioned tests into account, we conclude that the stereotype threat literature Meta-analysis on stereotype threat

53 among children and adolescents is subject to publication bias.

Currently, we have no good explanation for the large amount of heterogeneity- between study samples. None of the four confirmatory moderator variables did explain a significant amount of variance; of those,z = the variablep = cross-cultural gen estimatedder equality effects closest between approached subsets significance, of study samples with less conducted equality being inside predictive and out- sideof stronger the United stereotype States alsothreat indicates effects, thatB= 2.83, there are1.53, cross-cultural .13. The differences difference inof the estimated effects. This corresponds to the large cross-cultural gender gap

- searchdifferences attention in mathematical is being devoted performance to understanding (Else-Quest gender et al., differences 2010; Mullis in mathet al.,- 2 2012; OECD, 2010) technology, mathematics, and engineering, increasing re maintains that such gender differences are closely related to cultural variations inematics opportunity achievement, structures attitudes, for girls and and affect. women. The We gender meta-analyzed stratification 2 major hypothesis inter- national data sets, the 2003 Trends in International Mathematics and Science- Study and the Programme for International Student Assessment, representing 493,495 students 14-16 years of age, to estimate the magnitude of gender dif ferences in mathematics achievement, attitudes, and affect across 69 nations throughout the world. Consistent with the gender similarities hypothesis, all of simplethe mean meta-regression effect sizes in denotesmathematics that with achievement a power ofwere .44 verythe test small for (dthe < mod0.15.- A post hoc power analysis (Hedges A lack of& powerPigott, thus2004) might for thebe anomnibus alternative test ofexpla this- 9 erator has quite low power. fornation researchers for the nonsignificant who are planning effects future of this stereotype moderator. threat Unfortunately, meta-analyses power using is difficult to enforce when performing a meta-analysis, but it might be interesting because the studies in the adult population are more numerous and will lead to moreadult powerfulsamples to meta-analyses. consider cross-cultural gender equality as moderator variable A different explanation for the heterogeneity in our analysis is the presence for instance appears to be an important moderator for the stereotype threat ef- of moderator variables that we did not take into account. Domain identification fect, which has been found in adult samples (Cadinu et al., 2003; Lesko & Corpus, report2006; Pronin the degree et al., to 2004; which Steinberg students etidentify al., 2012) themselves as well aswith in samplesmathematics of children or the (Keller, 2007a). The difficulty with this moderator variable is that few studies

τˆ 2 9 We calculated the power using the method of Hedges and Pigott (2004) for the mixed-effects omnibus test, with β = 3 and = 0.6. Chapter 2

54

10

like, which makes it problematic to take the variable into account. In addition, publication bias could have played a role in our failure to find moderation of the publicationstereotype threatbias may effect. obscure Specifically, actual differences because the between effects theseof publication underlying bias effect are sizes.directly proportional to the size of the underlying effect (cf. Bakker et al., 2012),

Limitations - - lyThe interpret amount the of stereotypeunexplained threat heterogeneity effect and isthe the degree first toof whicha few limitationspublication conbias 2 cerning our meta-analysis. Due to this heterogeneity it is difficult to substantive

is a serious issue. Also, publication bias itself can have an effect on heterogeneity isof ateffects play in(Jackson, this literature. 2006). AnotherHowever, limitation with the ofmultiple this study sensitivity is the low analyses power and for thedifferent tests signsof the of moderators. publication Thisbias, limitationwe are rather is mainly confident due thatto the publication small sample bias N d thesizes meta-analysis. within the studies Although (only it is seven unfortunate studies thathad thean tests > 100 for therequired moderators to detect are with sufficient power a of .50) and the limited amount of studies included in

underpowered, it does not affect the conclusion that publication bias is a serious- sisissue with within a fair this amount line of of research.unpublished Finally, studies it would could have have been been informative a good estimator if the data set contained more unpublished studies, especially because a subset analy11 Un-

for the effect of stereotype threat that is not influenced by publication bias. fortunately an extensive grey literature search did not yield more than five effect sizes, which corresponds to a percentage of 11% unpublished effect sizes in our- meta-analysis. However, such a low percentage of grey literature papers within greypsychological literature meta-analyses search is that theseems amount rather of common details in even documents in top journals like conference (Fergu son & Brannick, 2012). The most important difficulties we encountered with the the study in the meta-analysis and authors were often unreachable. We want to abstracts or even doctoral dissertations was insufficient to successfully include author is of vital importance for more reliable future meta-analyses. stress that pre-registration of studies including contact information of the first

10

Studiesbest be modeledseldom indicated at the individual whether level participants for which were the specificallyraw data are selected needed. on this moderator variable. Moreover, identification with the domain and gender roles are both variables that consist of individual differences that can 11 The estimated effect of the unpublished subset was g = z p only on k -0.07 ( = -0.29, = .77), however this effect is based = 5 effect sizes. Meta-analysis on stereotype threat

55

Conclusion

To conclude, we estimated a small average effect of stereotype threat on the- MSSS test-performance of school-aged girls; however, the studies show large interpretingvariation in outcomes, the effects and of stereotypeit is likely thatthreat the on mean children effect and is inflated adolescents due to in pub the lication bias. This finding leads us to conclude that we should be cautious when

STEM realm. To be more explicit, based on the small average effect size in our mathematicalmeta-analysis, performancewhich is most of likelygirls ininflated a systematic due to waypublication or lead bias,women we towould stay not feel confident to proclaim that stereotype threat manipulations will harm 2 clear from occupations in the STEM domain. Of course, we do not challenge the fact that stereotypes might strongly influence a person’s life under unfortunate thatcircumstances; stereotype however,threat manipulations we want to avoid have theon instantunjustifiable test performance generalization within that stereotype threat, based on the evidence at hand (i.e., the average small effect thatthis meta-analysis),future research generallyis needed leadsto disentangle to lower maththe effects grades of and stereotype women threat leaving from the STEM field. Due to the scientific and societal importance of the topic, we urge- publication bias. As directions for future research we propose simple, large rep oflication the actual studies, effect preferably of stereotype administered threat among cross-culturally. schoolgirls. InA power our opinion, calculation only forstudies a one-tailed with large t sample sizes will contribute to acquiring an accurate picture-

-test indicated that, with an effect size of 0.223, roughly 250 par Openticipants Science are needed Framework per condition or the What to achieveWorks Clearinghouse a power of .80. to Inavoid addition, publication these biasstudies and should related be biases appropriately introduced registered during the (Wagenmakers analyses of the et data. al., 2012) via the

Chapter 3

The influence of gender stereotype threat on mathematics test scores of Dutch high school students: A registered report

This chapter was “In Principal Accepted” as Flore, P. C., Mulder, J., and Wicherts, J. M. (2017). The registered report. Comprehensive Results in Social Psychology. influence of gender stereotype threat on mathematics test scores of Dutch high school students: A Chapter 3

58

Abstract

The effects of gender stereotype threat on mathematical test performance in the classroom have been extensively studied in several cultural contexts. The- ory predicts that stereotype threat lowers girls’ performance on mathematics

N tests, while leaving boys’ math performance unaffected. We conducted a large-- scale stereotype threat experiment in Dutch high schools ( = 2,064) to study the generalizability of the effect. Specifically, we set out to replicate the over- all effect among female students and to study four core theoretical moderators, namely domain identification, gender identification, math anxiety and test dif- ficulty. Among the girls, we found neither an overall effect of stereotype threat on math performance, nor any moderated stereotype threat effects. Most vari ance in math performance was explained by gender, domain identification, and 3 math identification. We discuss several theoretical and statistical explanations for these findings. Stereotype threat and Dutch high school students

59

3.1 Introduction

- dressedSince the both first the studies generalizability on the negative of the effect effect of and stereotype important threat theoretical on women’s mod- math performance (Spencer, Steele, & Quinn, 1999), numerous studies have ad erators (Spencer, Logel, & Davies, 2016). Although several meta-analyses of published studies highlighted relatively robust effects (Nguyen & Ryan, 2008; Picho et al., 2013; Walton & Spencer, 2009), some researchers have voiced their potentiallyconcern about overestimated the improper effects use of stereotypecovariates thatthreat leads due to to inflatedpublication Type bias I error and relatedrates in biasingstereotype factors threat regarding studies how (Stoet researchers & Geary, 2012; analyze Wicherts, their data 2005), and presentand the impede our understanding of psychological phenomena like the effects of stereo- their results (Flore & Wicherts, 2015; Ganley et al., 2013). These problems can 3 type threat on test performance, and raise questions about the generalizability- matoryof the effect stereotype across threatcultural studies. settings and age groups. Such issues can be (partly) resolvedMost by of pre-registration the research on (see gender e.g., stereotypeWagenmakers threat et al.,in the 2012) math of domainlarge confir con- on high school students could potentially have a negative long-term impact on cerned college students, however it is clear that early effects of stereotype threat - girl’s identification with mathematics and hence their later performance in this domain and related domains (viz. Science, Technology, Engineering, and Mathe matics or STEM fields). Several studies have addressed stereotype threat effects- among schoolgirls in diverse cultural contexts (see Flore & Wicherts, 2015 for throwa review), important and the light results on the are generalizability somewhat mixed. of gender It is clear stereotype that studies threat in effects actu toal classmundane settings settings (instead that ofare lab relevant settings) for among pupils’ high later school academic populations careers. wouldMore- generalizability of stereotype threat effects in classroom environments that have hithertoover, a large-scale been studied study only in ina newa limited cultural number context of countries. adds to knowledge about the - mate of the effects of negative gender stereotypes on the mathematical test per- formanceIn this among registered Dutch report,high school we aimed students. to obtain Additionally a reliable we andaimed unbiased to replicate esti the moderating effects of variables domain identification gender identification , math anxiety and test difficulty in a large sample of Dutch high school(Keller, students. 2007a), (Schmader, 2002) (Delgado & Prieto, 2008) (Keller, 2007a) Chapter 3

60

3.1.1 Stereotype threat and underlying mechanisms

Stereotype threat theory predicts that members of a negatively stereotyped group will underperform when that stereotype is made salient or relevant for

teststhe task when at hand. reminded In their of theseminal negative paper stereotype on stereotype stating threat, that SteeleAfrican and Americans Aronson (1995) described how African Americans underperformed on cognitive ability-

have lower intellectual abilities than European Americans. Similarly, when con fronted with the negative stereotype concerning their in-group, women were- found to underperform on mathematics tests (e.g., O’Brien & Crandall, 2003; Spencer, Steele, & Quinn, 1999) and driving tests (Yeung & von Hippel, 2008), el derly were found to underperform on memory tests and cognitive tests (Lamont, Swift, & Abrams, 2015) and students from lower socio-economic backgrounds 3 were found to underperform on intelligence tests (Désert, Préaux, & Jund, 2009; Spencer & Castano, 2007). Based on theory, members of positively stereotyped- servedgroups that(e.g., members men or European of positively Americans) stereotyped are expectedgroups performed to remain slightly uninfluenced better onby the stereotype stereotype threat relevant manipulations. task when confronted However, with Walton negative and Cohen stereotypes (2003) about ob -

leadingan out-group, most researchersa phenomenon to notthey have named explicit stereotype predictions lift. However, on the effect this lift among phe non-stereotypednomenon is theoretically groups in not their as studies.well developed as the stereotype threat effect, Of the many negative stereotypes that have been studied in the context of ste-

onreotype this topic threat, have the produced stereotype similar that results:women the are estimated not as good averaged in mathematics effect size as ranges men (Spencer et al.,d 1999) is one of the dmost frequently studied. Multiple meta-analyses-

from small ( = 0.24) to medium ( = 0.48), indicating that women tend to under- perform when they are exposed to explicit or implicit stereotype threats (Doyle &- Voyer, 2016; Nguyen & Ryan, 2008; Picho et al., 2013; Walton & Cohen, 2003; Wal ton & Spencer, 2009). The effect sizes within these meta-analyses show a consider able amount of heterogeneity, indicating that the magnitude of the effect sizes varies across studies (Nguyen & Ryan, 2008; Picho et al., 2013), possibly due to moderators. 3.1.2 Moderators

mostSpencer, relevant Logel andindividual Davies (2016)characteristics and Inzlicht of female and Schmader test-takers (2012) that reviewed are thought the main moderators of the effects of stereotype threat. Here, we focus on the three important factor in determining whether tests are affected by stereotype threat. to moderate susceptibility to stereotype threat and consider test difficulty as an Stereotype threat and Dutch high school students

61

Domain identification Theory predicts that members of negatively stereotyped groups will only under- - perform on stereotype relevant tasks if they are highly identified with the con performancestruct that the for task women is supposed who consider to measure the subject(Keller, of2007a; mathematics Steele, 1997; to be Steeleimport &- Aronson, 1995). Notably, stereotype threat will only undermine mathematics test- becauseant to them. they For are women probably who less are interested weakly identified in good resultswith mathematics, in mathematics the negacom- paredtive stereotype to women will who not strongly trigger identifyanxiety orwith negative mathematics. thoughts This during theoretical test taking, pre- diction is supported by several studies showing that women with high domain

- identification under threat average larger performance decrements than women with low domain identification (Keller, 2007a; Lesko & Corpus, 2006; J. R. Stein berg et al., 2012). The meta-analytic evidence in favor of the moderating effect of- 3 typeddomain domain identification showed is larger somewhat stereotype mixed. threat Walton effects and Cohenthan studies (2003) that found did that not studies with samples consisting of highly identified participants in the stereo select samples of highly domain-identified group members. Yet Nguyen and Ryan (2008) found that samples of moderately math-identified women were more Genderstrongly identification influenced by stereotype threats than highly math-identified women. A second moderator that received attention in the stereotype threat literature - bership of the stereotyped group to be an important part of their self-identity is group identification, i.e., the degree to which the test-takers consider mem strongly(Schmader, identify 2002). with The their moderating gender have effect little of genderreason identificationto feel threatened follows by the negativesame logic female as the stereotype. moderating Several effect of studies domain have identification: shown that women indeed who math do per not- formance is generally less affected by stereotype threat for women who believed that gender was not an important part of their identity, compared to women for whom gender was an important part of their identity (Schmader, 2002; Wout et al., 2008). However, other studies failed to find moderating effects of gender- encedidentification by negative (Cadinu stereotypes et al., 2003; compared Eriksson to women & Lindholm, who were 2007), more or strongly even found gen- women having lower levels of gender identification to be more strongly influ

Mathder identified anxiety (Kiefer & Sekaquaptewa, 2007). A third construct implicated as both a moderator and a mediator of stereotype - threat is math anxiety. First, the gender differences in mathematical test perfor Chapter 3

62

-

mance could be partly mediated by state anxiety (Osborne, 2001) and state anx womeniety is sometimes not only scored (albeit lowernot always; on the Steele mathematics & Aronson, tests 1995; compared Schmader to men & Johns, and 2003) found to mediate the stereotype threat effect: under stereotype threat-

women in the control condition, but they also showed higher scores on physio genderlogical anxiety stereotypes measures to their like own skin perception conductance, of anxiety blood pressure more strongly and lower than scores wom- on skin temperature (Osborne, 2007). Women in threat conditions tend to link-

en in low threat conditions or men (Johns, Schmader, & Martens, 2005). Final- ly, state anxiety mediates the relationship between coping sense of humor and mathematics test performance for women (Ford et al., 2004). Instead of study- ing state anxiety as mediator, trait math anxiety can be treated as a moderator variable of the stereotype threat effect. Overall, there is a gender gap in report 3 ed math anxiety, with girls reporting a higher levels of math anxiety than boys math(Else-Quest anxiety et scores al., 2010). were Aassociated study on withSpanish stronger high decrementsschool students under showed stereotype that math anxiety moderated the stereotype threat effect, in the sense that higher

Testthreat difficulty (Delgado & Prieto, 2008).

Finally, studies have shown that gender stereotype threat is moderated by math test difficulty in both college samples (O’Brien & Crandall, 2003; Spencer et al., 1999) and school samples (Keller, 2007a; Neuville & Croizet, 2007). In most of these samples, stereotype threat effects were stronger for difficult tests than- for easier tests (Neuville & Croizet, 2007; Nguyen & Ryan, 2008; Spencer et al., 1999). Use of easy tests can actually lead to improved scores for girls under ste- reotype threat, probably due to heightened motivation and lower threat posed by such easier tests (O’Brien & Crandall, 2003; Spencer et al., 2016). Some re- sultingsearchers in suspectedlarger performance that students decrements who work under on difficult stereotype tests threat. might experienceA third ex- more physiological arousal (Ben-zeev et al., 2005; O’Brien & Crandall, 2003), re working memory than easier tests. Because working memory can be occupied byplanation suppression is that of more negative difficult thoughts tests require concerning more the controlled stereotypes attention or other as part situa of-

tional pressures (Beilock & Decaro, 2007; Beilock et al., 2007; Schmader & Johns, 2003), test takers under threat might experience greater difficulty solving the more difficult problems. This would result in larger performance decrements on the more difficult tests. Stereotype threat and Dutch high school students

63

3.1.3 Stereotype threat in school aged children

Although the theory of stereotype threat has been well established based on lab studies, the critique that these studies were limited in terms of generalizability anddrove middle stereotype schools threat showed researchers that the salience into the of classroom gender lowered (Aronson mathematical & Dee, 2012; test Wax, 2009). A first study in the United States on stereotype threat in elementary andperformance 10. Ambady of girls et al. (Ambady argued that et al., this 2001). might However, have been this due finding to the washigher limited degree to age groups of 5–7 and 11-13, and did not appear among students aged between 8 of chauvinism regarding gender in the latter age group, but this explanation has received little attention in further studies on stereotype threat. Nonetheless, the effects of stereotype threat for schoolgirls was also found in other countries, like- France (Bagès & Martinot, 2011), Germany (Keller, 2007a; Keller & Dauenheimer, 3 2003), Italy (Muzzatti & Agnoli, 2007), Spain (Delgado & Prieto, 2008), and Ugan- da (Picho & Stephens, 2012). However, in several similar experiments the null hypothesis was not rejected (e.g., Agnoli, Altoè, & Muzzatti, n.d.; Cherney & Camp bell, 2011; Ganley et al., 2013; Stricker & Ward, 2004). As with adult samples, the results of the experiments on schoolgirls are mixed; the estimated effect sizes of the simpleexpected effect direction (i.e., the to standardizeda medium effect mean in differencethe opposite of girls direction. in the Combiningstereotype thethreat information condition ofand all girls available in the stereotypecontrol condition) threat experimentsranged from fora large school effect aged in girls yielded an average estimated effect size of 0.22 in the expected direction, but also substantial heterogeneity in underlying effects (Flore & Wicherts, 2015). 3.1.4 Methodological considerations

- Three methodological and statistical issues in the replicability debate (Asendorpf et al., 2013) are particularly relevant for stereotype threat research: pre-registra- tion, a priori power analyses and multi-level analysis. First, pre-registration has received little attention in articles on stereotype threat (for exceptions see Fin nigan and Corker, 2016, Gibson, Losee, and Vitiello, 2014, and Moon and Roeder, 2014). There are several upsides to pre-registered studies. Notably, when a study is pre-registered it is easier to certify that statistically significant results were actually based on a priori hypotheses and pre-specified analyses thereof. This counters biases caused by hypothesizing after results are known (i.e., HARKing, Kerr, 1998) and ad hoc analyses of the data that are focused on finding desirable publication(usually significant) of results results regardless (Wagenmakers of the outcome. et al., 2012; Wicherts et al., 2016). Moreover, pre-registration ameliorates the effects of publication bias by assuring Chapter 3

64

- ples of schoolchildren gathered in stereotype threat experiments are relatively Second, it is crucial to conduct proper a priori power analyses. The sam

small and power analyses are not often reported (for exceptions, see Stricker and Ward, 2004; Titze, Jansen, and Heil, 2010). Because the average effect sizes in the field have consistently been shown to be small to medium, we suspect that many ofstereotype effect sizes threat under studies various reported scenarios in the with past publication were underpowered, bias. Prior power leading anal to- ysesinaccurate enable effect informed size estimatesdecisions regardingwithout publication the sample bias sizes and needed inflated for estimates studying relatively subtle effects.

schools in the analysis of the data from stereotype threat studies. An assumption Third, it is important to consider the clustered nature of data gathered in the independence of observations. If students from the same classroom are in- of common statistical techniques like AN(C)OVA or linear is 3 cluded in the analysis, this assumption is likely violated. Positive dependencies inflate Type I error rates if left uncorrected. Depending on the severity of the violation, the effective sample size of the study will be lower than the observed analyticsample size approach. (i.e., a larger intraclass correlation coefficient will lead to a smaller effective sample size). Thus, the nested structure of the data requires a multilevel- istered experiment is not designed to “prove” or “disprove” the general exis- In the present study, we incorporated these three improvements. Our reg common stereotype threat manipulation in the Dutch high school population in actualtence ofclassrooms. the stereotype The Dutchthreat arephenomenon, fairly regular but in rather terms to of study gender the stereotypes effects of a

contributes to much needed information about when and among which students (D. I. Miller, Eagly, & Linn, 2015) and studying stereotype threat in this context-

stereotype threat affects mathematics test performance. On top of that, we be lieve that the method we use (i.e., pre-registration, a priori power analysis, and multilevel analysis when observations are dependent) could solve some existing- monlyproblems used in inthe the field stereotype if adopted threat in future literature. stereotype We usedthreat an studies. experimental para- In our registered study, we used materials and procedures that are com - digm that involved both an explicit stereotype threat manipulation (Spencer et- al., 1999) and a control condition in which the negative stereotype was active ly nullified (Smith & White, 2002). We selected a sample of high-achieving stu dents, for which the effects of stereotype threat are expected to be strongest due- icsto higher test in levels regular of domainclassrooms. identification We did so (Steele, because 1997; the presenceJ. R. Steinberg of boys et al., has 2012). been foundMoreover, to yield in our larger study, decrements boys and girls in girls’ worked mathematics simultaneously test performance on the mathemat due to Stereotype threat and Dutch high school students

65

interaction effect between stereotype condition and gender on the number of stereotype threat (Huguet & Régner, 2007). Our main hypothesis was to find an correct questions on the math test. We expected a simple effect for girls, with higher performance for girls in the safe control condition. Based on theory, we had no specific expectation for the simple effects among boys. 3.2 Method

Participants The participants were students attending the second year of Dutch high school school system. We selected average to high achieving students by including class- es(typically from the 13 second to 14 year highest olds), education which is levelequivalent “Hoger to Algemeen the eighth Voorgezet grade in the Onder USA-

- 3 wijs” (i.e., senior general secondary level, or HAVO) and highest education level “Voorbereidend Wetenschappelijk Onderwijs” (i.e., pre-university secondary ed mixeducation, classes or VWO) of inpotential the Dutch HAVO high and school VWO system. students In our in pre-registeredthe Dutch provinces sampling of plan, we aimed to randomly select schools from a list of high schools offering- - Noord-Brabant, Utrecht and Zuid-Holland. However, in practice we had to de- viate from this plan, because a large portion of contacted schools (83.33%) de clined to participate. After consultation, the editors and we agreed to use a con venience sample at the level of schools, instead of the random sample of schools pre-registration.that we had hoped to select. Additionally, we included two schools outside of our target provinces. Besides these two changes, our sampling plan followed the - lowedPrincipals by another of theemail schools if needed. were Wheneverfirst contacted these by three e-mail. means In cases of contacting where we failed to receive a reply within a week, we contacted the schools by phone, fol were unfruitful, we contacted other schools. Additionally, some schools were parentscontacted and in studentsa more informal of HAVO/VWO manner, classes although in thewe schoolalways wereasked asked for permission a week in of the principal. Once the principals of the schools agreed to participate, both oradvance her schoolwork to object if during they did data not collection. want (their Participating child) to participate. students wereIf the asked student to completeand/or the the parents entire setobjected, of materials that student during regularwas allowed classes to in quietly regular work classrooms. on his

We planned to sample schools until we had at least 946 girls in our sample (see section Power for the specifics on this number). The committee of Tilburg School of Social and Behavioral Sciences approved our study (registration no. EC-2015.53). Chapter 3

66

Procedure

To heighten the chances of finding an effect, we chose an optimal implementation- of the experimental paradigm according to stereotype threat theory. Specifically, test-takingwe used an12 explicit threat manipulation, combined with a nullified threat con trol condition (Steele, 1997). Moreover, both boys and girls were present during (Inzlicht & Ben-Zeev, 2000; Sekaquaptewa & Thompson, 2003) and we selected classes consisting of average to high achieving students (Steele, 1997). Students received a bundle of materials in a closed envelope. The material consisted of two parts: the first part contained the mathematics test including an introduction in two versions that differed across conditions (an instruction heightening stereotype threat in the experimental condition and a nullification- eralsentence psychological in the control scales. condition). To assign students The second to conditions part of the we materials used a within-clus contained- background questions such as genderindividually and age, randomly the manipulation assigned to check, either and the sevste- 3 reotype threat condition or the control condition within their class. ter approach,A female i.e., experiment students leaderwere who was blind to the condition the students

13 were in instructed students to first read the introduction carefully, to solve the math problems and finally to fill out the questionnaire. We emphasized that it was important that students would complete all questions in the bundle, but that they could quit the experiment halfway by putting a mark on the first page. The Dutchstudents but were translated allowed here 20 inminutes English]: to finish“With thethis test, mathematics and 10 minutes test we towant finish to measurethe questionnaire. the ability The level introduction of high school started students. with Thisthe following test has been piece used of text in the[in past. It turned out that students with good grades on this test had on average

would like to know how well high school students in the Netherlands perform onhigher this gradestest.” In in the high stereotype school and threat had acondition better chance the introduction to pass their continued final exam. with We “The most recent study carried out four years ago showed that boys and girls do

average grade on the test between boys and girls.” A similar explicit manipulation not perform equally well on this mathematics test. There was a difference in the

has been successfully implemented in past studies (e.g., Delgado & Prieto, 2008; Keller & Dauenheimer, 2003; Picho & Stephens, 2012). In the control condition

12 This was the case for the majority of classrooms. We encountered one classroom solely consisting of girls.

Although some studies suggest that math performance of women will deteriorate to a stronger degree when

13 in effect sizes between studies run by female experiment leaders and studies run by male experiment leaders male experiment leaders run the study (Marx & Roman, 2002), a recent meta-analysis showed that differences experiment leader. are negligible (Doyle & Voyer, 2016). Based on this finding we felt confident to have our study run by a female Stereotype threat and Dutch high school students

67 the introduction continued with “The most recent study carried out four years

There was no difference in the average grade on the test between boys and girls.” ago showed that boys and girls perform equally well on this mathematics test.

A similar nullified control condition has been successfully implemented in past studies (e.g., Keller & Dauenheimer, 2003; Marchand & Taasoobshirazi, 2012; Neuburger, Jansen, Heil, & Quaiser-Pohl, 2012). All instructions and materials towere select in Dutch among and four are options available the on correct the OSF year (https://osf.io/yt83j/). in which the mathematics test had been Tostudied check beforewhether as the written students in the read introduction the introduction, of the we test. asked The the written students in- struction ended with a warning that students were not allowed to use a calcula- tor. Additionally, students were informed that wrong answers would be punished with a correction for guessing. This was done to induce a prevention focus, which has been found to yield stronger stereotype threat effects (Keller, 2007b; Keller & 3 Bless, 2008; Ståhl, Van Laar, & Ellemers, 2012). Moreover, correction for guessing- edwas to (until contribute recently) to creating routinely an implemented atmosphere insimilar high stakesto real-life testing high-stakes environments test- like GRE testing (Educational Testing Service, 2016), and as such was expect - ing. After all students finished reading the introduction and answered the check question, the experiment leader gave them a sign to start working on the math ematics test. Students who finished the mathematics test early were instructed to wait for a signal from the experiment leader, after which they were allowed to continue with the second part of the study. In the second part of the study, students first filled in their age, ethnicity (based on whether both parents were- born in the Netherlands or somewhere else), and gender. Subsequently they were asked to answer the following question as a manipulation check: “Previous ly boys and girls performed equally on this mathematics test”, which was an item in multiple-choice format that could be answered with either “yes, boys and girls performed equally on this test”, “no, boys and girls did not perform equally on whichthis test” the or students “I don’t could know”. answer This question by selecting was one followed of the byfollowing the item options: “who do “boys you think usually performs better on mathematics tests like these? Boys or girls?”, on get better grades on math tests”, “girls get better grades on math tests”, “boys and girls get equal grades on math tests” and “I don’t know”. After answering these- manipulation checks, students finished the post-test questionnaire consisting of theirfour scales: assessment gender enclosed identification, in the mathenvelope anxiety and andto wait two silentlyscales of until domain everyone iden tification. After finishing those questionnaires, students were asked to hand in was finished. Chapter 3

68

Materials The main dependent variable was the score on the mathematics test. We strived to construct a mathematics test with desirable psychometric properties. Spe- -

cifically, we included items with desirable item properties. To this end, we con structed a mathematics test consisting of 20 items selected from the 2003 TIMSS Netherlands.study (M. O. Martin,We used Mullis, reliably Gonzalez, estimated & itemChrostowski, parameters 2004). based This on TIMSSthis large study in- involved large samples of eighth grade students from 48 countries, including the - ternational data set (M. O. Martin et al., 2004) to construct a test with items that varied in difficulty and had relatively high discrimination parameters. The diffi andculty 12 parameters items in the of contentthe selected domain items Number. ranged Because from -0.174 of the to unavailability 1.157 in the overall of the TIMSS sample. Our test consisted of 8 items in the content domain Geometry

3 translate2003 version the items(Annemiek into Dutch. Punter, All Personal items were communication, multiple choice September items with 14, four2015), or we asked two Dutch mathematics teachers with excellent English proficiency to split the mathematics test in an easy test consisting of the ten items with the five answer categories. To examine the moderating effect of test difficulty, we

lowest item difficulty parameters, and a difficult test consisting of the ten items- with the highest item difficulty parameters (as estimated in the TIMSS sample). In addition to this mathematics test participants filled out two scales assess Theseing different four constructs dimensions are of considered domain identification as moderators (12 of items), the stereotype a scale measuring threat ef- gender identification (4 items), and a scale measuring math anxiety (10 items).-

fect among the girls. The first scale of domain identification measured the impor tance of mathematics according to the students (e.g., “I think mathematics will- help me in my daily life”). The second scale of domain identification measured positive affect with regards to mathematics (e.g., “I enjoy learning mathemat ics”). Both scales were retrieved from the 2003 TIMSS study (M. O. Martin et al., 2004). We slightly modified the gender identification scale used by Schmader (2002) to fit the population of high-school students. The scale consisted of four items (e.g., “being a girl/boy is an important part of my self-image”). Finally, we used the Math Anxiety scale (Prieto & Delgado, 2007) to measure math anxiety (e.g., “before taking a math exam I feel nausea”). Although this scale originally contained 18 items, we created a shorter version to deal with time constraintsdoes notby selecting apply to meten to items does with apply sufficient to me. The variance scales werein the translated item difficulty into Dutch parameters. by the Answers to all scales were given on a five points Likert-scale ranging from by the third author. first author, and those translations were checked for deviations from the original Stereotype threat and Dutch high school students

69

Pilot study To ensure that the materials were appropriate for the targeted population we conducted a pilot among 76 high school students from three classes of a school in the province of Zuid-Holland (21 girls, 54 boys, 1 gender unknown). With these pilot data, we checked whether floor or ceiling effects occurred, whether the- ulationitems had checks desirable were successful.psychometric For properties, the pilot study whether we carried the time out allotted the exact for pro the- ceduredifferent as parts described of the above study apartwas sufficient, from three and minor whether details. instructions14 Scale analyses and manip were

The mean number of correct items on the math test was M = 12.41 out of 20 conducted using R packages “CTT” (Willse, 2014) and “Scale” (Giallousis, 2015). checkitems (SDcorrectly. = 2.74), Scale with reliability individual of scores the four ranging psychological from 7 to scales 18. Of rangedthe 76 students, from ac- 96% answered the read check correctly and 74% answered the manipulation test anxiety 3 ceptable (Cronbach’sliking math α = .68 and Cronbach’simportance math αgender identification = .67) to good (Cronbach’s α = .82 and Cronbach’s α = .81). Three items of the test anxiety scale showed item-rest correlations smaller than .30, and showed Confirmatory Factor Analysis single factor loadings smaller than .30 (items 5,- 7 and 8). We decided to replace the test anxiety scale with a math anxiety scale, based on both psychometric arguments (i.e., reliability of the scale was some what low, some items showed low factor loadings) and theoretical arguments (i.e., the math anxiety scale is more likely to moderate stereotype threat than analysesthe test anxiety of the scale).latter three The item-rest scales showed correlations satisfactory for gender results identification we did not items alter thesewere allscales. .30 or higher, as were the standardized factor loadings. Because the scale - instructionsThe times in the allotted pilot. for the mathematics test (20 minutes) and the question naire (10 minutes) were both sufficient. We experienced no problems with the

14 - First, for the manipulation in the pilot we used the sentence “The most recent study carried out in 2012 showedPro- cedurethat boys and girls do not perform equally on this mathematics test”. To ensure the children read the manipu lation carefully, we altered the manipulation for the main study to the sentence mentioned in the section . Second, we originally planned 25 minutes for the mathematics test, but most children were finished before 20 minutes were up, and started to become restless. Therefore, we changed the amount of time for the- mathematics test to 20 minutes. Third, we used a test anxiety scale (Arvey, Strickland, Drauden, & Martin, 1990) in our pilot as potential moderator, but replaced it with a math anxiety scale for the main study (Prieto & Delga do, 2007). 70 Chapter 3

Statistical analysis

Main analysis - F-test to test for differences in mathematical performance betweenIn Figure the3.1 classes.we present If this an F overview-test showed of our a p planned analyses. For our main anal ysis, we first used an -value < .05 we planned to conduct a multilevel analysis with the observed individual scores as first level and the class level as the second level. Here we planned to use a random intercepts model, with fixed slopes for the main effects and the interaction effect. We also planned theto include classroom. two Forsecond-level individual predictor i in classroom variables: j gender of the teacher (GT) and class composition (CC), which was defined as the percentage of girls present in

Level 1: , we defined the model as: 1: = + + + ( × ) + .

3 𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿 𝑂𝑂𝑂𝑂𝑂𝑂𝑂𝑂𝑂𝑂𝑂𝑂𝐿𝐿𝐿𝐿𝑂𝑂𝑂𝑂𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝑂𝑂𝑂𝑂 𝑚𝑚𝑚𝑚𝑚𝑚𝑚𝑚𝑚𝑚𝑚𝑚ℎ𝑖𝑖𝑖𝑖𝑖𝑖𝑖𝑖 𝜋𝜋𝜋𝜋0𝑖𝑖𝑖𝑖 𝜋𝜋𝜋𝜋1𝑖𝑖𝑖𝑖�𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝑂𝑂𝑂𝑂𝐶𝐶𝐶𝐶𝑚𝑚𝑚𝑚𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝑖𝑖𝑖𝑖𝑖𝑖𝑖𝑖� 𝜋𝜋𝜋𝜋2𝑖𝑖𝑖𝑖�𝐺𝐺𝐺𝐺𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝑖𝑖𝑖𝑖𝑖𝑖𝑖𝑖� 𝜋𝜋𝜋𝜋3𝑖𝑖𝑖𝑖 𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝑂𝑂𝑂𝑂𝐶𝐶𝐶𝐶𝑚𝑚𝑚𝑚𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶 𝐺𝐺𝐺𝐺𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝑖𝑖𝑖𝑖𝑖𝑖𝑖𝑖 𝐿𝐿𝐿𝐿𝑖𝑖𝑖𝑖𝑖𝑖𝑖𝑖 1: = + + + ( 2 × ) + . 2: = + + + , ~ (0, ) 𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿 𝑂𝑂𝑂𝑂𝑂𝑂𝑂𝑂𝑂𝑂𝑂𝑂𝐿𝐿𝐿𝐿𝑂𝑂𝑂𝑂𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝑂𝑂𝑂𝑂 𝑚𝑚𝑚𝑚𝑚𝑚𝑚𝑚𝑚𝑚𝑚𝑚ℎ𝑖𝑖𝑖𝑖𝑖𝑖𝑖𝑖 𝜋𝜋𝜋𝜋0𝑖𝑖𝑖𝑖 𝜋𝜋𝜋𝜋1𝑖𝑖𝑖𝑖�𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝑂𝑂𝑂𝑂𝐶𝐶𝐶𝐶𝑚𝑚𝑚𝑚𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝑖𝑖𝑖𝑖𝑖𝑖𝑖𝑖� 𝜋𝜋𝜋𝜋2𝑖𝑖𝑖𝑖�𝐺𝐺𝐺𝐺𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝑖𝑖𝑖𝑖𝑖𝑖𝑖𝑖� 𝜋𝜋𝜋𝜋3𝑖𝑖𝑖𝑖 𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝑂𝑂𝑂𝑂𝐶𝐶𝐶𝐶𝑚𝑚𝑚𝑚𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶 𝐺𝐺𝐺𝐺𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝑖𝑖𝑖𝑖𝑖𝑖𝑖𝑖 𝐿𝐿𝐿𝐿𝑖𝑖𝑖𝑖𝑖𝑖𝑖𝑖 = 2 𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿 We𝜋𝜋𝜋𝜋0𝑖𝑖𝑖𝑖 assumed𝛽𝛽𝛽𝛽00 𝛽𝛽𝛽𝛽01� that𝐺𝐺𝐺𝐺𝐺𝐺𝐺𝐺𝑖𝑖𝑖𝑖� the𝛽𝛽𝛽𝛽02 scores�𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝑖𝑖𝑖𝑖� 𝑂𝑂𝑂𝑂 0are𝑖𝑖𝑖𝑖 mutually𝑂𝑂𝑂𝑂0𝑖𝑖𝑖𝑖 𝑁𝑁𝑁𝑁 independent𝜏𝜏𝜏𝜏𝜋𝜋𝜋𝜋0 N(0,σ ). On the second = 2: = + + + , ~ (0, ) 𝜋𝜋𝜋𝜋1𝑖𝑖𝑖𝑖 𝛽𝛽𝛽𝛽10level the model was defined as: = = 2 𝜋𝜋𝜋𝜋2𝑖𝑖𝑖𝑖 𝛽𝛽𝛽𝛽20𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿 𝜋𝜋𝜋𝜋0𝑖𝑖𝑖𝑖 𝛽𝛽𝛽𝛽00 𝛽𝛽𝛽𝛽01�𝐺𝐺𝐺𝐺𝐺𝐺𝐺𝐺𝑖𝑖𝑖𝑖� 𝛽𝛽𝛽𝛽02�𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝑖𝑖𝑖𝑖� 𝑂𝑂𝑂𝑂0𝑖𝑖𝑖𝑖 𝑂𝑂𝑂𝑂0𝑖𝑖𝑖𝑖 𝑁𝑁𝑁𝑁 𝜏𝜏𝜏𝜏𝜋𝜋𝜋𝜋0 = 𝜋𝜋𝜋𝜋3𝑖𝑖𝑖𝑖 𝛽𝛽𝛽𝛽30𝜋𝜋𝜋𝜋1𝑖𝑖𝑖𝑖 𝛽𝛽𝛽𝛽10 = 𝜋𝜋𝜋𝜋2𝑖𝑖𝑖𝑖 𝛽𝛽𝛽𝛽20

𝜋𝜋𝜋𝜋3𝑖𝑖𝑖𝑖 𝛽𝛽𝛽𝛽30 These= analyses1 + ( were1) run with the R-package lme4. In the case that the F-test

𝐷𝐷𝐷𝐷𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐷𝐷𝐷𝐷𝐶𝐶𝐶𝐶 𝐿𝐿𝐿𝐿for𝑒𝑒𝑒𝑒𝑒𝑒𝑒𝑒𝐿𝐿𝐿𝐿 the𝑒𝑒𝑒𝑒𝑚𝑚𝑚𝑚 class𝜌𝜌𝜌𝜌 𝐶𝐶𝐶𝐶effect𝐾𝐾𝐾𝐾 − would show a p - = 1 + ( 1)

𝐾𝐾𝐾𝐾 -value > .05, we planned to ignore the nest 𝐷𝐷𝐷𝐷𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐷𝐷𝐷𝐷𝐶𝐶𝐶𝐶 𝐿𝐿𝐿𝐿𝑒𝑒𝑒𝑒𝑒𝑒𝑒𝑒𝐿𝐿𝐿𝐿𝑒𝑒𝑒𝑒𝑚𝑚𝑚𝑚 𝜌𝜌𝜌𝜌 𝐶𝐶𝐶𝐶 − - 𝐶𝐶𝐶𝐶𝐾𝐾𝐾𝐾 ysesed structure, with the andguess to correctedconduct a score standard on the two-way complete ANOVA math instead test as theof a dependent multilevel | analysis.| As preregistered, all analyses were carried out thrice. First, we ran anal- 𝐶𝐶𝐶𝐶𝐾𝐾𝐾𝐾 > 2.24 𝑋𝑋𝑋𝑋 − 𝑀𝑀𝑀𝑀𝐿𝐿𝐿𝐿variable.|𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿 For| the second analysis, we ran the analysis with the ten easiest ques 𝑀𝑀𝑀𝑀𝑀𝑀𝑀𝑀𝐷𝐷𝐷𝐷𝐷𝐷𝐷𝐷 > 2.24 tions on the math test as dependent variable, and for the third analysis we used 𝑋𝑋𝑋𝑋 − 𝑀𝑀𝑀𝑀𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿 the dependent variable consisting of the ten most difficult questions. We used a- 𝜌𝜌𝜌𝜌 � 𝑀𝑀𝑀𝑀𝑀𝑀𝑀𝑀𝐷𝐷𝐷𝐷𝐷𝐷𝐷𝐷 guess correction based on formula scoring (Frary, 1988). We expected a significant interaction between the stereotype threat con 𝜌𝜌𝜌𝜌 � andition analysis and gender, of simple with effects. a smaller We hypothesized effect for the thateasy girls subtest in the than stereotype for the difficult threat

conditionsubtest. If wouldthis interaction score lower was on significant the mathematics at α= .05 test we than planned girls into theproceed control to

ascondition, exploratory. and planned to test this with a one-sided test at α= .05. We had no hypothesis for the simple effects analysis for boys, thus we treated this analysis Stereotype threat and Dutch high school students 71

Main analyses

F-test dependency

F-test non-significant F-test significant

Two-way Multilevel analysis ANOVA effects for

Interaction Interaction Interaction(random class) Interaction effect non- effect effect non- effect significant significant significant significant

Stop Simple effects Stop Simple effects

Moderator analyses 3

Significant three-way interaction?

No Yes

Stop

Two-way Two-way Two-way interaction interaction interaction significant? significant? Significant? Mean – Mean Mean

No ( 1SD)Yes No ( ) Yes No( + 1SD) Yes

Stop Simple Stop Simple Stop Simple effects effects effects

Final model

Include all moderator variables that showed a significant three- way interaction

Figure 3.1 Overview of the analyses. 72 Chapter 3

-

Additionally, we registered to test multiple competing inequality and equal ity constrained hypotheses using the Bayes factor (Jeffreys, 1961; Kass & Raftery, 1995). Bayes factors have the advantages that they can be straightforwardly used- for simultaneously testing multiple (i.e., more than two) non-nested hypotheses byand classical that they p allow one to quantify the evidence in the data in favor of a hypothe- sessis (e.g., of interest. the null) relative to another hypothesis. These properties are not shared -values. Table 3.1 presents our pre-registered competing hypothe

Table 3.1 Competing hypotheses Bayesian analysis.

Name Hypothesis Description

No stereotype H0: µthreat/girl = µcontrol/girl control/boy = µthreat/boy threat hypothesis No constraints on the gender mean differences. , µ Equality constraints on the means for conditions. Stereotype threat H1: µthreat/girl < µcontrol/girl threat/boy = µcontrol/boy For girls: mean in ST condition constrained to hypothesis be lower than in the control condition. For boys: , µ 3 No constraints on the gender mean differences. equality constraints on the means for conditions. Stereotype threat H2: µthreat/girl < µcontrol/girl threat/boy > µcontrol/boy For girls: mean in ST condition constrained to and stereotype lift be lower than in the control condition. For boys: hypothesis , µ mean in ST condition constrained to be higher than in the control condition. No constraints on the gender mean differences.

Complement HC: Not H H or H2 The complement of the hypotheses described hypothesis above 0, 1,

Table 3.2 Interpretation Bayes factors.

BFia Evidence against Ha Negligible Positive 1 to 3 Strong 3 to 20 Very strong 20 to 150 Note.> 150 hypothesis Ha. Ha = null or complement hypothesis. BFia = Bayes factor of inequality constrained hypotheses Hi against the null or complement

For the no stereotype threat hypothesis H0

for boys and girls to differ. This no stereotype we threat placed hypothesis equality constraints on the- lymeans be compared for the conditions, to the stereotype while allowing threat hypothesis the means H on mathematicalstereotype test scoresthreat 1 could subsequent and stereotype lift hypothesis H2 with the complement hypothesis H . To compare these hypotheses, and the we used default . CFinally, we compared all of these hypotheses

Bayes factor methodology of Mulder (2014), Gu, Mulder, Deković and Hoijtink (2014), and Gu, Mulder, and Hoijtink (in press). In this methodology, the data are implicitly split in a minimal fraction that is used for prior specification and a Stereotype threat and Dutch high school students

73

default Bayes factors can be used in an automatic fashion without needing to for- maximal fractional that is used for hypothesis testing (O’Hagan, 1995). Therefore Our pre-registered interpretation of Bayes factors follows guidelines presented mulate prior distributions for the anticipated effects (Berger & Pericchi, 1996).

Moderatorsin Kass and Raftery (1995) and is shown in Table 3.2. math anxiety as potential moderators. The moderators were added separately to theWe consideredmodel tested two in versionsthe section of domainmain analyses identification, gender identification, and - , which means we planned to test three models. The moderator variable, the three-way interaction term (i.e., Con- dition x Gender x Moderator) and subsequent second order interaction terms were added as first level predictors. All moderator variables were treated as con betinuous followed variables, by three and analyses were grand-mean to inspect thecentered. interaction of condition and gender 3 on theWe number pre-registered of correctly that answered a potential mathematics significant itemsthree-way separately interaction for students would - with low scores on the moderator (one standard deviation below the mean), av erage scores on the moderator (the mean), and high scores on the moderator (one standard deviation above the mean). In cases of a significant Condition x Gender interaction, we planned to proceed to simple effects to inspect the effect of condition for girls and boys separately. Finally, if more than one moderator variable would show a significant three-way interaction, we planned to run a Powerfinal model with all of those variables included. Because the main focus of this registered report is to replicate the stereotype

- threat effect, we conducted a power analysis for the interaction effect and the thesimple goal effect to obtain for girls. a power Moreover, of at least we conducted .80 for all aanalyses. power analysis for the moderat ing variables.For the interaction All power analyseseffect we were used carried the information out with G*Powerfrom the largest3.1.3 and stereo with-

2 interaction type threat study administered in high schools2 that we are familiar with (Stricker & Ward, 2004). In this sample, the effect size η was largerd than .05, but smaller than .10. A power analysis with η =.05 indicated that we would need aWe total selected sample this size effect of 152. size Subsequentlybecause we took to find precautions an effect tosize maximize of = 0.30 the ineffect the analysis of simple effects (one-sided) for girls we would need 278 participants. than(e.g., theselect averaged average effects to high of achieving the meta-analyses. participants, have members of the other sex present, construct a difficult test), leading us to expect a somewhat larger effect 74 Chapter 3

1 :1 : == ++ ++ ++ ( ( ×× ) +) + . .

𝑖𝑖𝑖𝑖𝑖𝑖𝑖𝑖 0𝑖𝑖𝑖𝑖 1𝑖𝑖𝑖𝑖 𝑖𝑖𝑖𝑖𝑖𝑖𝑖𝑖 2𝑖𝑖𝑖𝑖 𝑖𝑖𝑖𝑖𝑖𝑖𝑖𝑖 3𝑖𝑖𝑖𝑖 𝑖𝑖𝑖𝑖𝑖𝑖𝑖𝑖 𝑖𝑖𝑖𝑖𝑖𝑖𝑖𝑖 Due to the nested structure𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿 of𝑂𝑂𝑂𝑂𝑂𝑂𝑂𝑂𝑂𝑂𝑂𝑂 the𝑂𝑂𝑂𝑂𝑂𝑂𝑂𝑂𝑂𝑂𝑂𝑂𝐿𝐿𝐿𝐿𝑂𝑂𝑂𝑂𝐿𝐿𝐿𝐿 data𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝑂𝑂𝑂𝑂𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝑂𝑂𝑂𝑂𝑂𝑂𝑂𝑂𝑚𝑚𝑚𝑚𝑚𝑚𝑚𝑚𝑚𝑚𝑚𝑚 we𝑚𝑚𝑚𝑚𝑚𝑚𝑚𝑚𝑚𝑚𝑚𝑚ℎ 𝑖𝑖𝑖𝑖𝑖𝑖𝑖𝑖expectedℎ 𝜋𝜋𝜋𝜋0𝜋𝜋𝜋𝜋𝑖𝑖𝑖𝑖 𝜋𝜋𝜋𝜋 1the𝜋𝜋𝜋𝜋𝑖𝑖𝑖𝑖�𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶� 𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶observations𝑂𝑂𝑂𝑂𝐶𝐶𝐶𝐶𝑂𝑂𝑂𝑂𝑚𝑚𝑚𝑚𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝑚𝑚𝑚𝑚𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝑖𝑖𝑖𝑖𝑖𝑖𝑖𝑖� � 𝜋𝜋𝜋𝜋2𝜋𝜋𝜋𝜋 𝑖𝑖𝑖𝑖within�𝐺𝐺𝐺𝐺�𝐿𝐿𝐿𝐿𝐺𝐺𝐺𝐺𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿 𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝑖𝑖𝑖𝑖𝑖𝑖𝑖𝑖𝐿𝐿𝐿𝐿� � 𝜋𝜋𝜋𝜋3𝜋𝜋𝜋𝜋𝑖𝑖𝑖𝑖 𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝑂𝑂𝑂𝑂𝐶𝐶𝐶𝐶𝑂𝑂𝑂𝑂𝑚𝑚𝑚𝑚𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝑚𝑚𝑚𝑚𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶 𝐺𝐺𝐺𝐺𝐿𝐿𝐿𝐿𝐺𝐺𝐺𝐺𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝑖𝑖𝑖𝑖𝑖𝑖𝑖𝑖𝐿𝐿𝐿𝐿 𝐿𝐿𝐿𝐿𝑖𝑖𝑖𝑖𝑖𝑖𝑖𝑖𝐿𝐿𝐿𝐿

- 2 :2 : == ++ ++ ++ , , ~~(0(,0, ) ) ses are too liberal. We corrected= for this dependency by multiplying the needed2 2 = 0𝑖𝑖𝑖𝑖0𝑖𝑖𝑖𝑖 0000 0101 𝑖𝑖𝑖𝑖 𝑖𝑖𝑖𝑖 0202 𝑖𝑖𝑖𝑖 𝑖𝑖𝑖𝑖 0𝑖𝑖𝑖𝑖0𝑖𝑖𝑖𝑖 0𝑖𝑖𝑖𝑖0𝑖𝑖𝑖𝑖 𝜋𝜋𝜋𝜋0𝜋𝜋𝜋𝜋0 sampleclasses notsize to under be completely the assumption𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿 𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿independent,𝜋𝜋𝜋𝜋 of𝜋𝜋𝜋𝜋 independent𝛽𝛽𝛽𝛽𝛽𝛽𝛽𝛽 which𝛽𝛽𝛽𝛽𝛽𝛽𝛽𝛽�𝐺𝐺𝐺𝐺𝐺𝐺𝐺𝐺�𝐺𝐺𝐺𝐺𝐺𝐺𝐺𝐺 meant� observations� 𝛽𝛽𝛽𝛽𝛽𝛽𝛽𝛽 �that𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶�𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶� these� 𝑂𝑂𝑂𝑂 𝑂𝑂𝑂𝑂with power the𝑂𝑂𝑂𝑂 𝑂𝑂𝑂𝑂 design𝑁𝑁𝑁𝑁analy𝑁𝑁𝑁𝑁 𝜏𝜏𝜏𝜏 𝜏𝜏𝜏𝜏 1=𝑖𝑖𝑖𝑖 = 10 𝜋𝜋𝜋𝜋1𝜋𝜋𝜋𝜋𝑖𝑖𝑖𝑖 𝛽𝛽𝛽𝛽10𝛽𝛽𝛽𝛽 2=𝑖𝑖𝑖𝑖 = 20 K 𝜋𝜋𝜋𝜋2𝜋𝜋𝜋𝜋𝑖𝑖𝑖𝑖 𝛽𝛽𝛽𝛽20𝛽𝛽𝛽𝛽 3 𝑖𝑖𝑖𝑖3𝑖𝑖𝑖𝑖 3030 K effect. To calculate the design𝜋𝜋𝜋𝜋𝜋𝜋𝜋𝜋 effect,𝛽𝛽𝛽𝛽𝛽𝛽𝛽𝛽 we used the following formula in which is the number of classes, is the number of children within class and ρ is the intraclass correlation (ICC). ==1 1++( ( 1)1 )

𝐾𝐾𝐾𝐾 𝐷𝐷𝐷𝐷𝐷𝐷𝐷𝐷𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐷𝐷𝐷𝐷𝐶𝐶𝐶𝐶𝐷𝐷𝐷𝐷𝐶𝐶𝐶𝐶𝐿𝐿𝐿𝐿𝑒𝑒𝑒𝑒𝑒𝑒𝑒𝑒𝐿𝐿𝐿𝐿𝑒𝑒𝑒𝑒𝑒𝑒𝑒𝑒𝐿𝐿𝐿𝐿𝑒𝑒𝑒𝑒𝐿𝐿𝐿𝐿𝑚𝑚𝑚𝑚𝑒𝑒𝑒𝑒𝑚𝑚𝑚𝑚 𝜌𝜌𝜌𝜌𝜌𝜌𝜌𝜌𝐶𝐶𝐶𝐶𝐾𝐾𝐾𝐾𝐶𝐶𝐶𝐶−− -

𝐾𝐾𝐾𝐾 We assumed that ρ=.10 and 𝐶𝐶𝐶𝐶 𝐾𝐾𝐾𝐾 𝐶𝐶𝐶𝐶 =25. This will lead to a design effect of 3.4. There fore, to obtain enough power| | for the simple| | effects analysis we multiplied the girls. Because we did not expect a difference>>2.224 .in24 mathematics scores between the calculated sample size (i.e. 278 girls) by 3.4, leading to a required sample of 946 𝑋𝑋𝑋𝑋 𝑋𝑋𝑋𝑋−−𝑀𝑀𝑀𝑀𝑀𝑀𝑀𝑀𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿 3 𝑀𝑀𝑀𝑀𝑀𝑀𝑀𝑀𝑀𝑀𝑀𝑀𝑀𝑀𝑀𝑀𝐷𝐷𝐷𝐷𝐷𝐷𝐷𝐷𝐷𝐷𝐷𝐷 experimental and control conditions for boys, there was no need to conduct a power analysis for these simple effects. Hence, we simply sampled schools until 𝜌𝜌𝜌𝜌 �𝜌𝜌𝜌𝜌 � duringwe obtained the testing enough of thegirls girls. in our sample, while also measuring boys because the theory stipulates no effect for them, and because it is crucial to have boys present to test the three-way interactions by means of a F-test in the context of multiple We also calculated total required sample sizes (i.e., girls and boys together) anxiety. A power analysis for the three-way interaction of moderator variable linear regression for the2 moderator variables domain identification and math R change

domain identification ( = .05, retrieved from Steinberg2 = .02 etretrieved al., 2012) from showed Del- that 152 students were required, whereas a powerpartial analysis for the three-way interaction of moderator variable math anxiety (η gado & Prieto, 2008) showed that 387 students were required. Taking the nested data into account, we found the need for a maximum of 1,316 students (i.e., 387 students times 3.4). Because we planned to sample schools until we acquired- 946 girls in our sample, we expected to end up with a total sample size larger than 1,316. This guaranteed adequate power for the tests of the three-way in teraction for variables domain identification and math anxiety. For the variable problematic.gender identification We assumed we could the effect not find size a ofuseful the three-way effect size interactionestimate of for the gender three- way interaction in the literature, which rendered a well-informed power analysis

identification to not be much smaller than the three-way interactions of domain identification and math anxiety, which meant the power of this particular test- typewould threat be sufficient experiment with in a sampleclass settings consisting to date. of 946 girls and a similar number of boys. Taken together, this made our registered study the largest gender stereo Stereotype threat and Dutch high school students

75

Handling missing data - missingAs pre-registered, values do missingnot give dataus any were information handled as about follows. the First,mathematics we removed ability par of ticipants list-wise who quit the experiment partway through, because those the participants. Second, we wanted to mirror a regular testing session, thus if a participant failed to fill in a (few) item(s) on the mathematics test those items encounteredwould be classified missing as values a wrong on the answer covariates for thatwe removed participant. participants Participants from who the skipped more than 30% of the mathematics test were removed list-wise. If we plannedanalyses toof thatdrop particular classes in moderating which the studentsvariable. wereMoreover, making we noiseanticipated during three test circumstances in which data from specific classes would be worthless. First, we administration, based on an assessment that the majority of students in a class- were talking for more than 2 minutes during test administration. Second, we 3 planned to drop classes in which more 1: than 50% of= the students+ failed to+ com + ( × ) + . this class or the students collectively failed to make a serious effort to complete plete the entire set of materials,𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿 because𝑂𝑂𝑂𝑂𝑂𝑂𝑂𝑂𝑂𝑂𝑂𝑂 𝐿𝐿𝐿𝐿either𝑂𝑂𝑂𝑂𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝑂𝑂𝑂𝑂 𝑚𝑚𝑚𝑚𝑚𝑚𝑚𝑚𝑚𝑚𝑚𝑚 theℎ𝑖𝑖𝑖𝑖𝑖𝑖𝑖𝑖 material𝜋𝜋𝜋𝜋0𝑖𝑖𝑖𝑖 𝜋𝜋𝜋𝜋 1was𝑖𝑖𝑖𝑖�𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶 too𝑂𝑂𝑂𝑂𝐶𝐶𝐶𝐶𝑚𝑚𝑚𝑚𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶 difficult𝑖𝑖𝑖𝑖𝑖𝑖𝑖𝑖� 𝜋𝜋𝜋𝜋2 𝑖𝑖𝑖𝑖for�𝐺𝐺𝐺𝐺𝐿𝐿𝐿𝐿 𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝑖𝑖𝑖𝑖𝑖𝑖𝑖𝑖� 𝜋𝜋𝜋𝜋3𝑖𝑖𝑖𝑖 𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝑂𝑂𝑂𝑂𝐶𝐶𝐶𝐶𝑚𝑚𝑚𝑚𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶 𝐺𝐺𝐺𝐺𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝑖𝑖𝑖𝑖𝑖𝑖𝑖𝑖 𝐿𝐿𝐿𝐿𝑖𝑖𝑖𝑖𝑖𝑖𝑖𝑖

2: = + + + , ~ (0, ) the materials. Third, we planned =not to take data into account of students who 2 𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿 𝜋𝜋𝜋𝜋0𝑖𝑖𝑖𝑖 𝛽𝛽𝛽𝛽00 𝛽𝛽𝛽𝛽01�𝐺𝐺𝐺𝐺𝐺𝐺𝐺𝐺𝑖𝑖𝑖𝑖� 𝛽𝛽𝛽𝛽02�𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝑖𝑖𝑖𝑖� 𝑂𝑂𝑂𝑂0𝑖𝑖𝑖𝑖 𝑂𝑂𝑂𝑂0𝑖𝑖𝑖𝑖 𝑁𝑁𝑁𝑁 𝜏𝜏𝜏𝜏𝜋𝜋𝜋𝜋0 entered the class more than five minutes= late, because they then would need to 𝜋𝜋𝜋𝜋1𝑖𝑖𝑖𝑖 𝛽𝛽𝛽𝛽10 rush through the material, giving them= a disadvantage on the mathematics test. Handling outliers and sensitivity𝜋𝜋𝜋𝜋 2analyses𝑖𝑖𝑖𝑖 𝛽𝛽𝛽𝛽20

We planned to carry out a set of𝜋𝜋𝜋𝜋 sensitivity3𝑖𝑖𝑖𝑖 𝛽𝛽𝛽𝛽30 analyses to be included in Appendix A. -

First, we checked for robustness by removing = 1 +outliers( 1based) on the Median Ab solute Deviation (MAD)-median𝐷𝐷𝐷𝐷𝐿𝐿𝐿𝐿 rule𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐷𝐷𝐷𝐷𝐶𝐶𝐶𝐶 𝐿𝐿𝐿𝐿(Wilcox,𝑒𝑒𝑒𝑒𝑒𝑒𝑒𝑒𝐿𝐿𝐿𝐿𝑒𝑒𝑒𝑒𝑚𝑚𝑚𝑚 2012).𝜌𝜌𝜌𝜌 𝐶𝐶𝐶𝐶𝐾𝐾𝐾𝐾 − We subtracted the median score of all observations, to obtain the median of those new scores (MAD). The MADN was then calculated by dividing the MAD by 0.6745. An observation then 𝐶𝐶𝐶𝐶𝐾𝐾𝐾𝐾 was flagged as an outlier if it exceeded the following cut-off rule: | | > 2.24 𝑋𝑋𝑋𝑋 − 𝑀𝑀𝑀𝑀𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿 𝑀𝑀𝑀𝑀𝑀𝑀𝑀𝑀𝐷𝐷𝐷𝐷𝐷𝐷𝐷𝐷 - sitivity analyses. Because all of our important variables are based on sum scores Observations flagged as outliers𝜌𝜌𝜌𝜌 � were removed from the data set only for the sen In our second set of registered sensitivity analyses aimed at checking for of scales we did not anticipate many outliers (Bakker & Wicherts, 2014). - robustness, we removed all participants who incorrectly answered the manipu lation check and/or the read check, and reanalyzed the remaining data. Chapter 3

76

3.3 Results

Participants

Data were gathered between SeptemberM 30, = 2016, andSD= March). Due 28, to 2017 a low at re 21- Dutch high schools. The data were from 86 classes and included a total of 2,126 students, typically aged either 13 or 14 ( 13.39 0.62 asponse convenience rate at sample. the level The of schoolsschools (16.67%we visited of were the original situated sample in the ofprovinces schools participated), we deviated from our registered sampling strategy and collected

of Zuid-Holland (4 schools), Noord-Brabant (12 schools), Utrecht (3 schools), Gelderland (1 school) and Overijssel (1 school). We visited 35 VWO classes (the threehighest months. level of These education changes in the in samplingNetherlands), strategy 41 HAVOwere neededclasses, toand obtain 10 HAVO/ a suf- VWO mixed classes. Gathering of the data took six months instead of the planned 3 ficiently large data set. Changes were discussed and approved by the editor of CRSP. In the discussion section, we will consider how these alterations in design datacould on have the influencedmath test. Thisthe results. left us with data from N As decided a priori, we removed students having more than 30% missing set consisted of N = 2,067 students. Three more students were removed because they did not mark their gender, so our final data = 2,064 students. Because students were usually quiet during sometest administration students appeared and classes not to were take neverthe study late, seriouslywe did not by need looks to removeof their entirebook- classes. Some classes were somewhat noisy or appeared less concentrated, and

lets (e.g., showing very clear aberrant answering patterns on the math test like- portaaaaa9aaaaaaaaaaaaaa, results after removing or making data from remarks these studentsin the comment and classes. section that implied they did not take the test seriously). In the section exploratory analyses, we re Descriptives - viations and sample sizes for the main dependent variable guess corrected math For boys and girls in both conditions, Table 3.3 provides the means, standard de - performance, and for sum scores on the moderators math anxiety (scale ranging from 10-50), domain identification (scale ranging from 12-60) and gender iden tification (scale ranging from 4-20). Moreover, this table includes the number- viewcorrect, of maththe number test performance. of items unanswered Note that scores on the on math the mathtest, and anxiety accuracy scale scorewere (the number correct divided by the number attempted) to give a complete over were below the midpoint of the scale as well. low on average, and positively skewed. Scores on the domain identification scale Stereotype threat and Dutch high school students 77

Table 3.3 Averages and standard deviations for math performance (scored in several ways), missing values, and scales Math anxiety, Domain identification and Gender identification.

Guess Number Accuracy Missing M.A. D.I. G.I. corrected correct Mean N Mean N Mean N Mean N Mean N Mean N Mean N (SD) (SD) (SD) (SD) (SD) (SD) (SD) 11.41 0.87

Girls-ST 9.10 510 510 0.60 510 510 19.45 505 33.90 493 12.73 494 0.70 (3.91) (3.10) (0.15) (1.59) (8.41) (8.88) (2.72) Girls-C 9.31 526 11.61 526 0.60 526 526 18.98 522 34.12 509 12.86 511 Boys-ST (3.93) (3.12) (0.15) (1.45) (8.27) (8.50) (2.80) 10.63 519 12.67 519 0.65 519 0.51 519 17.09 513 35.96 503 13.67 503 Boys-C 10.71 12.72 (3.99) (3.13) (0.16) (1.21) (7.82) (8.86) (2.93) 509 509 0.65 509 0.58 509 16.74 503 35.64 496 13.23 485 Note. (3.97) (3.13) (0.15) (1.28) (7.73) (9.24) (2.91)

ST = Stereotype Threat condition, C = Control condition, M.A. = Math anxiety, D.I. = Domain identification, G.I. = Gender identification. 3

Table 3.4 Proportions of stereotypes held by boys and girls.

Which group usually performs better on math tests? Group Boys are better Girls are better Boys and girls are equally good Don’t know/missing Boys .27 .28 .14 .32 .13 Girls .19 .32 .34

Table 3.5 Cohen’s d, Cronbach’s α, greatest lower bound, skewness and kurtosis.

Guess Number Accuracy M.A. D.I. G.I. corrected correct Cohen’s d -0.07

STgirls-Cgirls (95% C.I.) -0.05 -0.03 0.06 -0.03 -0.05 (-0.18;0,07) (-0.19;0.06) ( -0.15; 0.09) (-0.07; 0.18) (-0.15;0.10) (-0.17; 0.07) Cohen’s d -0.02 -0.02

STboys-Cboys (95% C.I.) -0.03 0.05 0.03 0.15 (-0.14; 0.10) (-0.14;0.11) ( -0.15; 0.09) (-0.08; 0.17) (-0.09;0.16) (0.02; 0.27) Cohen’s d -0.20 Girls-Boys (95% C.I.) -0.37 -0.38 -0.35 0.29 -0.23 (-0.46;-0.28) (-0.48;-0.29) (-0.44;-0.35) (0.20;0.37) (-0.29;-0.11) (-0.14; -0.32) - - - - Glb .66 .93 .91 .67 Skewness -0.24 0.07 Cronb. α .59 .92 .86 .55 Kurtosis -0.19 -0.19 1.33 -0.06 Note. 2.58 2.55 2.61 4.50 2.60 3.32 lower M.A. bound = Math with anxiety, package D.I. “psych” = Domain in R. identification, G.I. = Gender identification. The greatest lower bound (glb) is calculated as the maximum value of three different estimation methods of the greatest 78 Chapter 3

- low the midpoints of the relevant scales are also common for Dutch students However, the large scale TIMSS 2003 survey showed that such scores be

in TIMMS (Martin, Gonzalez, & Chrostowski, 2003). As such, low scores on the- current domain identification scale are not out of the ordinary. Table 3.4 reports the proportions of gender stereotypes held by boys and girls, pooled over exper imental conditions. For boys, the option “boys are better” was most popular, but the proportions for “girls are better” and “equally good” were selected almost as better”.often. For Cronbach’s girls, the mostalpha popular for all scales statement and thewas math “equally test good” are reported closely followedin Table by “girls are better”, whereas a muchd smallerto illustrate group differences of girls selected between “boys groups. are - 3.5, together with effect size Cohen’s Reliabilities for the scales were acceptable (gender identification) to high (do- main identification, math anxiety). The lower reliability estimate of the scale gender 3 identification is probably due to the (short) length of the scale. Moreover, a consid erable number of students indicated that they found the gender identification scale- somewhat confusing, so we will be cautious with the interpretation of results with- abilitythis scale. of the In Appendix math test A, might we fitted be compromised a graded response due to model the relative to the three homogeneity psycholog of ical scales to assess the psychometric qualities of those scales in more detail.A more Reli de- 15 the sample (as we tried to select a group of highly identified students). Maintailed analysesstudy of the psychometric qualities of the math test is reported in Chapter 5.

Manipulation check

Overall, 91% of the students answered the read check correctly (“In what year was this mathematics test studied before?”), indicating that a large majority of the students read the introduction to the math test. Moreover, 84% of all students answered the manipulation check correctly (“Did boys and girls performN equally on the math test?”). The option “yes,N there were differences between boys and- ferencesgirls” was between selected boys more and often girls” by was students selected in morethe ST often condition by students ( = 834)in the thancon- students in the control condition ( = 41), and the option “no, there were no dif

1 We can calculate a disattenuated effect size taking this low reliability estimate of the test into account (Hedges & Olkin, 1985), comparing math performance of girls in the stereotype threat condition to performance of girls in the control condition. This would lead to a disattenuated stereotype threat effect 15 Wein the can control calculate condition. a disattenuated This would effect lead size to ataking disattenuated this low reliabilitystereotype estimate threat effect of the size test of. into account (Hedges

& Olkin, 1985), comparing math performance of girls in the stereotype threat condition to performance of girls d − 0.07 size of d = = = −0.09 . This does not change our conclusion that the stereotype ρ(y, y') .59 . This does not change our conclusion that the stereotype threat effect in our

threat sampleeffect in is our very sample small. is very small.

Stereotype threat and Dutch high school students

1: 79 = + + + ( × ) + .

𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿 𝑂𝑂𝑂𝑂𝑂𝑂𝑂𝑂𝑂𝑂𝑂𝑂𝐿𝐿𝐿𝐿𝑂𝑂𝑂𝑂𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝑂𝑂𝑂𝑂 𝑚𝑚𝑚𝑚𝑚𝑚𝑚𝑚𝑚𝑚𝑚𝑚ℎ𝑖𝑖𝑖𝑖𝑖𝑖𝑖𝑖 𝜋𝜋𝜋𝜋0𝑖𝑖𝑖𝑖 𝜋𝜋𝜋𝜋1𝑖𝑖𝑖𝑖�𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝑂𝑂𝑂𝑂𝐶𝐶𝐶𝐶𝑚𝑚𝑚𝑚𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝑖𝑖𝑖𝑖𝑖𝑖𝑖𝑖� 𝜋𝜋𝜋𝜋2𝑖𝑖𝑖𝑖�𝐺𝐺𝐺𝐺𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝑖𝑖𝑖𝑖𝑖𝑖𝑖𝑖� 𝜋𝜋𝜋𝜋3𝑖𝑖𝑖𝑖 𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝑂𝑂𝑂𝑂𝐶𝐶𝐶𝐶𝑚𝑚𝑚𝑚𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶 𝐺𝐺𝐺𝐺𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝑖𝑖𝑖𝑖𝑖𝑖𝑖𝑖 𝐿𝐿𝐿𝐿𝑖𝑖𝑖𝑖𝑖𝑖𝑖𝑖 N N 2 2: = + + + , ~ (0, ) p N = 2 trol condition ( = 898) than students in the ST condition ( = 72, χ 𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿(1)=1,418.4,𝜋𝜋𝜋𝜋0𝑖𝑖𝑖𝑖 𝛽𝛽𝛽𝛽 00 𝛽𝛽𝛽𝛽01�𝐺𝐺𝐺𝐺𝐺𝐺𝐺𝐺𝑖𝑖𝑖𝑖� 𝛽𝛽𝛽𝛽02�𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝑖𝑖𝑖𝑖� 𝑂𝑂𝑂𝑂0𝑖𝑖𝑖𝑖 𝑂𝑂𝑂𝑂0𝑖𝑖𝑖𝑖 𝑁𝑁𝑁𝑁 𝜏𝜏𝜏𝜏𝜋𝜋𝜋𝜋0 N sensitivity= < .001; students who answered “Don’t know” ( = 205) or failed𝜋𝜋𝜋𝜋1𝑖𝑖𝑖𝑖 to 𝛽𝛽𝛽𝛽answer10 analyses = this question ( =14) were excluded from this analysis). In the section𝜋𝜋𝜋𝜋2𝑖𝑖𝑖𝑖 𝛽𝛽𝛽𝛽20 who incorrectly answered the read check and/or the manipulation check. we consider the influence on our main results after removing𝜋𝜋𝜋𝜋3𝑖𝑖𝑖𝑖 students𝛽𝛽𝛽𝛽30 Frequentist approach -= 1 + ( 1) F = , p 𝐾𝐾𝐾𝐾 A first analysis showed that there are significant differences between𝐷𝐷𝐷𝐷𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐷𝐷𝐷𝐷𝐶𝐶𝐶𝐶 𝐿𝐿𝐿𝐿 class𝑒𝑒𝑒𝑒𝑒𝑒𝑒𝑒𝐿𝐿𝐿𝐿𝑒𝑒𝑒𝑒𝑚𝑚𝑚𝑚 𝜌𝜌𝜌𝜌 𝐶𝐶𝐶𝐶 − analysises in guess instead corrected of a standard math performance 2x2-ANOVA. ( (85, 1,978) 6.847 < .001). Because of these differences (and following our pre-registration) we used 𝐶𝐶𝐶𝐶 𝐾𝐾𝐾𝐾 multi-level | | We carried out a sequential multilevel regression analysis, in which we> 2.24 added (clusters of) variables in a stepwise fashion. The model that 𝑋𝑋𝑋𝑋 includes− 𝑀𝑀𝑀𝑀𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿 all The random intercept model highlights considerable variation due to differences𝑀𝑀𝑀𝑀𝑀𝑀𝑀𝑀𝐷𝐷𝐷𝐷𝐷𝐷𝐷𝐷 variables equals the model we pre-registered. The results are given in Table 3.6. 3 Adding gender as a predictor variable resulted in a better model compared to 𝜌𝜌𝜌𝜌 � between classes, with a sizable intraclass correlation coefficient of = .192. - the random intercept model, pointing to a significant gender gap with boys outscor ing girls. Adding the main effect of stereotype threat (Model 2), the interaction effect of gender and stereotype threat (Model 3), and the class level variables gender of the present teacher and proportion of boys in the classroom (Model 4) did not result in a significant improvement in model fit. Fit criteria AIC and BIC were lowest for Model 2, thereby confirming that the model with only gender showed the best fit. To see whether students performed differently on the difficult or easy items, we ran the same models using the (guess corrected) easiest ten items, and the most difficult ten items (guess corrected). We observed the same pattern of beresults found when in Table we solely B1 in analyzedAppendix the B. easy items, and when we solely analyzed the difficult items, i.e. Model 2 showed the best fit. The results of these analyses can Bayesian approach - - We calculated approximated adjusted fractional Bayes factors to quantify the ev Noidence other for variables the four werecompeting included hypotheses in this model. in Table Approximated 3.1. Parameters adjusted were fractional estimat ed in R package “lme4”, taking the multilevel structure of the data into account.

Bayes factors were calculated in software package BaIn (Gu et al., in press), and. they are reported in Table 3.7. Note that BaIn provides Bayes factors for each of theu four hypotheses against an unconstrained (reference) hypothesis, denoted by H were used to compute the Bayes factors between key hypotheses H0 1 2 HSubsequently using the transitivity property of the Bayes factor, these that Bayes a stereotype factors c 0 , H , H , and . We found most evidence for the specified null hypothesis H 80 Chapter 3

Table 3.6 Main analyses: Fit measures, deviance, unstandardized regression coefficients and vari- ance components for models without moderators.

Fixed effect Random part Deviance AIC BIC

(Dp-Dc) (df)

t Var. comp. Coefficient M0 Intercept 48.74 Level 2 variance (S.E.) 9.97 3.04 11312.1 11318.1 11335.0 Level 1 variance (0.20) M1 Intercept 48.70 Level 2 variance 13.00

10.73 3.07 11226.8 11234.8 11257.4 (0.22) (85.3)* Level 1 variance (1) Gender -1.52 -9.33 12.45 M2 Intercept Level 2 variance (0.16)

10.76 45.96 3.07 11226.7 11236.7 11264.8 (0.23) (0.1) Level 1 variance (1) Gender -1.52 -9.34 12.45 ST 3 (0.16) -0.06 -0.39 Intercept 10.80 Level 2 variance 11272.2 (0.16)

M3 43.63 3.07 11226.4 11238.4 (0.25) (0.3) -7.07 Level 1 variance 12.44 (1) Gender -1.60 ST -0.14 (0.23) -0.64

(0.22) STxGender 0.16 0.51 M4 Intercept 10.10 14.18 Level 2 variance 11224.8 11242.8 (0.31)

3.01 11293.5 (0.71) (1.6) -7.11 Level 1 variance 12.44 (3) Gender -1.61 ST -0.14 (0.23) -0.64

(0.22) STxGender 0.16 0.51 Prop. gender (0.31) 0.92 0.73

(1.26) Gender teacher.d1 0.38 0.90

(0.42) Gender teacher.d2 0.95 0.68 Note. (1.41)

AIC = Akaike Information Criterium. BIC = Bayesian Information Criterium, ST = Stereotype Threat, Var. comp. = Variance component, M0 = Model 0. Gender is dummy coded with males being teachersthe reference and dummygroup. ST 2 foris dummy both female coded and with male the teachers.control group The difference being the inreference deviance group. between Gender the of the teacher is dummy coded, with male teachers being the reference group,2 dummy 1 for female p c distributed. Models are

previous model (D ) and the current model (D ) is given in brackets, and is χ fit with Maximum Likelihood estimation. Stereotype threat and Dutch high school students 81

Table 3.7 Bayes factors for competing hypotheses.

Hu H0 H1 H2 HC

H0 0 u - 0 1 0 2 0 C 28.177 BF(H , H )= BF(H , H ) = BF(H , H ) = BF(H , H ) = H - (No1 threat hypothesis) 563.0801 u 1 0 1,144.4721 2 481.6771 C BF(H , H )= BF(H , H ) = BF(H , H ) = BF(H , H ) = H - (Stereotype2 threat hypothesis) 19.9842 u 0.0352 0 2 1 40.618 17.0952 C - 0.001 0.421 BF(H , H )= BF(H , H ) = BF(H , H ) = BF(H , H ) = (Stereotype threat and stereo 0.492 0.025 H - typeC lift hypothesis) C u C 0 C 1 C 2 0.002 BF(H , H )= BF(H , H ) = BF(H , H ) = BF(H , H ) = Note.(Complement BF = Bayes hypothesis) Factor. 1.169 0.058 2.376

threat does not exist. Comparing H0 to the competing hypotheses H1 2 c showed clear support for the former hypothesis. There is strong evidence for H , H , and H0 1 -

0 against H2 (i.e., the null hypothesis of no threat effect)against against H H (i.e., the stereotype threat hy 3 pothesis) and very strong evidence for0 H c (i.e., the stereotype threat and stereotype lift hypothesis) and for H (i.e., the complement hypothesis). P 0 P Assuming P equal prior probabilitiesP for the hypotheses (i.e., hypotheses are equally1 likely a2 priori), we calculatedc posterior probabilities: (H |x) = .963, (H |x) = .034, (H |x) = .001, and (H |x) = .002, which can be interpreted as the nullprobabilities hypothesis that of ano hypothesis stereotype is threat true after effect observing in these data.the data. Similarly, as with the Bayes factors, the posterior probabilities show strong evidence in favor of the Moderators - - For all three moderators (math anxiety, domain identification, and gender identi- fication) we carried out a series of multilevel analyses, starting with a simple ran dom intercept model, to which we added the following terms in a stepwise fash ion: (Model1) the moderating variable, (Model2) gender, (Model3) experimental condition, (Model4) two-way interaction effect ST x gender, (Model5) three-way interaction ST x gender x moderator, including all possible two-way interactions, (Model6) gender of the teacher and proportion of girls in the classroom. Table- 3.8 provides model comparison and fit indices. Table 3.8 shows that adding math anxiety to the model improved fit. Subse- quently adding gender to the model improved fit as well. Adding more variables such as the experimental condition or the interactions did not improve fit. In Ta ble 3.9 we report regression parameters for the best fitting model per moderator variable. We still see a negative effect of gender, indicating that (controlled for onmath math anxiety) anxiety girls were performed associated worse with on lower the math scores test on than the mathboys, test.and aThe negative same linear effect of math anxiety indicating that (controlled for gender) higher scores pattern emerged for domain identification; adding domain identification to the 82 Chapter 3

Table 3.8 Main analyses: Fit statistics and model comparison for moderating variables and stereo- type threat.

Math Domain Gender Anxiety identifi- identifi- cation cation χ2 p AIC BIC χ2 p AIC BIC χ2 p AIC BIC (df) (df) (df) Model 0 ------

11199 11216 10953 10970 10933 10950 Model 1 <.001 11112 <.001 10770 .18 (random intercept) 89.07 11134 185.10 10792 1.79 10933 10956 Model 2 70.00 <.001 11044 11072 <.001 82.00 <.001 10881 (moderator) (1) (1) (1) 66.28 10705 10733 10853 10707 10740 0.01 (Gender) (1) (1) (1) Model 3 0.03 .86 11046 11079 0.56 .45 .92 10855 10889 Model 4 11047 10708 10748 0.12 (ST condition) (1) (1) (1) 0.49 .49 11086 0.36 .55 .73 10857 10896

(STxGender) (1) (1) (1)

Model 5 3.60 .31 11050 11106 5.66 .13 10709 10765 3.15 .37 10860 10916 (STxGenderx- (3) (3) (3) 2.18 10714 moderator) 3 Model 6 .54 11053 11126 1.29 .73 10786 1.66 .64 10864 10937 (class-level (3) (3) (3) Note.predictors) AIC = Akaike Information Criterium. BIC = Bayesian Information Criterium.

Table 3.9 Unstandardized regression coefficients for models with moderators estimated with ML.

Math Domain Gender anxiety ident. ident. Fixed Random Fixed Random Fixed Random effect effect effect effect effect effect Coef. t Variance Coef. t Variance Coef. t Variance (S.E.) comp (S.E.) comp (S.E.) comp Intercept Level 2 Level 2 10.74 Level 2

10.63 48.55 3.04 10.63 50.11 2.83 48.54 3.05 Moder. -8.44 Level 1 0.12 Level 1 0.01 Level 1 (0.22) (0.21) (0.22) -0.09 11.96 13.32 11.30 -9.15 12.45 -8.44 -8.21 (0.01) (0.01) (0.03) Gender -1.36 -1.30 -1.52 0.29 Note. (0.16) (0.16) (0.17)

Moder. = moderator, Coef. = unstandardized regression coefficient, Domain ident. = Domain identification, Gender ident. = Gender Identification, Lvl =Level.

random intercept improved fit, and subsequently adding gender to the model improved fit as well. In this model, gender continued to be a significant predictor,- indicating that (controlled for domain identification) girls performed worse on the math test than boys, and a positive linear effect of domain identification indi- cating that (controlled for gender) higher scores on domain identification were associated with lower scores on the math test. For the variable gender identifi cation, the pattern was different: including gender identification did not improve fit, whereas adding gender to the model did increase model fit. Stereotype threat and Dutch high school students

83

Because none of the interaction effects of the moderators with the experi- mental condition and gender were significant, this concludes the main analyses theiras we interaction described themterms in as our predictor pre-registration. variables. In To Appendix ensure valid A, we inferences present a fromfinal model in which we included math anxiety, domain identification and gender and this model, we checked and reported results on model assumptions as described by Snijders and Bosker (2012) in Appendix A as well. In sum, these moderator analyses offered no clear evidence that the effects of stereotype threat were moderated by domain identification, math anxiety, or Sensitivitygender identification. analyses

- In the first round of sensitivity analyses we removed all students who either answered the read check or the manipulation check incorrectly. In total 1,596 stu 3 moderatordents remained analyses. in this The analysis. results ofWe the re-analyzed main analysis the weremain unchangedanalyses (i.e., in thisfitting sensi the- four models to test the overall effect of ST with all items analyzed), and the three - edtivity data analysis. set corroborated Specifically, results we still from found the a regulargender gapmoderator favoring analyses males, andfor allModel three 2 turned out to fit the data the best. Results of this sensitivity analysis using this adjust For the second set of sensitivity analyses we calculated outlying scores for all the moderators (Tables with model comparison statistics are included in Appendix B). methodsscales we section.used as Wemoderator repeated variables the moderator (i.e., math analyses anxiety, without domain outlying identification scores and on gender identification) according to the MAD-Median rule as we pre-specified in the that particular moderator. Again, those analyses corroborated the results from the main analyses (Tables with model comparison statistics are included in Appendix B). In registered reports, researchers make decisions regarding the analyses a- videpriori, these but unanticipatedresults in Appendix issues A.might Including emerge these during variables the study. or altering We explored variables the influence of several variables we did not include in our pre-registration, and pro

(e.g., education level, type of class, presence of the teacher, different scoring of the- tiondomain level identification of the class predicted scale, different math scoringperformance. rules for Since the thesemath test,analyses linear capitalize effect of time) did not yield novel important insights. Unsurprisingly, we found that educa- tory analyses. We do believe these analyses are useful to demonstrate the robust- on chance, their results do not carry the same weight as those from the confirma

16 ness of the results. We shared all used scripts on OSF (https://osf.io/yt83j/).

16 Because of privacy issues, we were not allowed to publish the full data. These data are available upon request. 84 Chapter 3

3.4 Discussion

In this - - high-powered stereotype threat study, we investigated whether a com concludemon stereotype that our threat data show manipulation no evidence influenced of performance the mathematical decrements test due perfor to the stereotypemance of girls threat and manipulation. boys in Dutch A high series schools. of sensitivity Through analyses a series supports of analyses, the rowe- -

bustness of our findings. Based on approximated adjusted fractional Bayes fac tors we conclude that we find strong evidence in favor of the null hypothesis of no stereotype threat when compared to the stereotype threat hypothesis, the stereotype threat/stereotype lift hypothesis, and the complement hypothesis. We found sizeable variation in performance between classes, partly due to the fact that we tested classes from the highest educational level (VWO), the second 3 highest educational level (HAVO) and mixed educational levels (HAVO/VWO). Furthermore, we found that variables domain identification and math anxiety inwere Appendix all significant A describes predictors the interaction of math effectsability. betweenAdditionally, the three we found predictors. a gender Be- gap on the on math test, with boys outperforming girls. A final model included

cause we did not preregister this model, and the model goes beyond the scope- of this chapter (i.e., studying stereotype threat effects), we did not discuss it in more detail. Although individual differences in domain identification, math anx stereotypeiety, and gender threat identification effects in the currentwere expected data. by theory to affect susceptibility to stereotypeThere are threat, several we potential failed to explanations find evidence for that the these lack of variables a stereotype moderated threat

- effect in our sample. We now discuss several potential explanations for this, based on whether effects generalize over units (participants), treatment varia- tativetions, outcomeof the wider measures, population and settings of high-performing (e.g., Shadish, high Cook, school & Campbell, students 2002). in the Netherlands.First, our Because current circumstances sample of high forced school us studentsto use convenience might not besampling represen in-

- ulationstead of as random all HAVO/VWO sampling, students our sample from might schools not with be completelymixed HAVO/VWO representative classes of the population of students we wanted to study (we defined our original pop

in the provinces Utrecht, Zuid-Holland and Noord-Brabant). For instance, 11 of- the schools were situated in villages, and only 10 were situated in (overall small to medium sized) cities. Because large cities are underrepresented in our sam ple, and schools situated in cities probably educate students with more diverse- (ethnic) backgrounds, this might have led to selection bias. However, in gender- stereotype threat studies, students from a minority backgrounds are often re moved from the analyses, using the argument that the gender gap in mathemat Stereotype threat and Dutch high school students

85

lack of diversity should boost a stereotype threat effect instead of suppressing it. ics appears only for Caucasian students (e.g., Johns et al., 2005). If anything, the weWe usedsampled a reasonably from a range broad of schoolssample fromthat does different attest parts to the of generalizabilitythe country. Given of the stereotyperelative homogeneity threat effect of acrossquality the and Netherlands. curricula across schools in the Netherlands,

- Second, it is possible that the students in our sample lack characteristics ofthat students are needed in our for sample stereotype did notthreat believe to occur, the stereotypeincluding the that belief boys in are gender typically ste reotypes or identification with the math domain. It might be that a large share- better in mathematics than girls. When we inquired whether boys or girls usu- ally performed better on math tasks, only a small portion of the girls answered pastthat boysresearch appeared showed to bethat better. even However, in the absence re-analyzing of explicit the data stereotypical for girls who beliefs be lieved that boys usually outperform girls did not change the results. Moreover, 3 amongst 13-year-old students, stereotype threat effects can be found (Muzzatti & Agnoli, 2007). Steele (1997) remarked that students do not need to believe the stereotype themselves for stereotype threat to occur. Additionally, although we selected high-performing high school students, not all students might have- idencebeen highly for a strongeridentified stereotype with the math threat domain. effect for Yet, students when we that added scored a three-wayhigher on interaction (gender x stereotype threat x domain identification) we found no ev the domain identification scale. Moreover, re-analyzing a subset of students that were highly math identified did not result in a stereotype threat effect either (see- Appendix A). - Third, our chosen manipulation of stereotype threat could have been in effective. However, we used a manipulation that had been commonly (and suc showedcessfully) that used most in previous students stereotype read and threatremembered studies the(e.g., description Keller & Dauenheimer, of the math 2003; Picho & Stephens, 2012; Spencer et al., 1999). Our manipulation check- totest, doubt and thewhen effectiveness we removed of studentsthe manipulation. that answered the manipulation check in correctly the results did not change substantively. As such, we have little reason

Fourth, there might be issues with outcome measure used in our study. It could be that the selected math test did not elicit any threat, for instance because usedthe wrong before types in stereotype of items werethreat used testing or because in which the stereotype test was threattoo easy. effects However, were we selected math items from TIMSS 2003, which is a math test that has been geometry items on purpose because women tend to underperform in this topic. found (Keller, 2007a; Keller & Dauenheimer, 2003). We carefully selected a set of

Group averages of the items answered correctly ranged between 57% (for girls Chapter 3

86

in the stereotype threat condition) and 64% (for boys in the control condition),- which admittedly is not the most difficult test, but does reflect a realistic testing Itemsituation. Response Moreover, Theory we modeling did not find and a Differential stereotype Itemthreat Functioning effect when analyses we re-ana we lyzed the data with a subtest of the ten most difficult items. With item analysis,

could describe the influence of stereotype manipulation on an item level in more detail, but these analytic techniques are beyond the scope of this chapter (see Chapter 5). Finally, reliability of the math test was somewhat low, which might be caused by the relative homogeneity of the sample (as we tried to select a group of highly identified students). Controlling for disattenuation did not change our conclusions with regards to the stereotype threat effect (see footnote 16). - Fifth, the setting could have been insufficiently threatening for stereotype threat effects to occur, while the control condition might not have been sufficient 3 performancely safe (i.e., devoid between of threat) the stereotype for girls threatto perform condition well. andSpecifically, the control if stereotype condition threat is not sufficiently removed in the control condition, no differences in math - sentedare expected the mathematics because both test groups as gender will experiencefair: a safe threatcondition (Spencer that has et al.,been 2016). suc- To avoid this problem, we selected a control condition in which we clearly pre

cessfully implemented in the past (C. Good et al., 2008; Keller, 2007a; Keller & shouldDauenheimer, have successfully 2003). We notealleviated that our the manipulation effects of negative check gender provided stereotypes. reassurance that most students in the control condition recalled the test as gender fair, which

Furthermore, there is a possibility that students did not feel motivated to studentsperform wellmight on not the have math tried test, as because hard as theythe stakes would were on a regular not high math enough exam. for Even the students. Because the math test was not graded as part of the regular curriculum, studies are rarely carried out in high stakes environments because of ethical im- though this explanation might sound plausible, experimental stereotype threat to study effects of stereotype threat in a high stakes testing context by placing a plications and practical constraints (Sackett, 2003). A handful of studies tried

fairly subtle manipulation before taking actual placement tests (Stricker & Ward, 2004), or by offering financial rewards for correctly answered items (Fryer et al., 2008). In those studies stereotype threat effects were absent or negligible. Some authors argued that stereotype threat effects did not occur in those settings, or the effects in those settings were not as large compared to lab studies, because it- is (theoretically) impossible to create a stereotype threat safe condition on high stakes tests. This might have caused all girls to underperform, regardless of con dition (Aronson & Dee, 2012; Spencer et al., 2016; Steele, Spencer, & Aronson, 2002). Other authors responded it is just as plausible that women in stereotype threat conditions might be less motivated to perform well on a low stakes test, Stereotype threat and Dutch high school students 87

whereas they are able to overcome this motivational effect on high stakes tests

(Sackett & Ryan, 2012). Because high stakes tests have not shown convincing- reotypestereotype threat threat effect effects, in our and current a substantial study is causednumber by of the low absence stakes oftest high did stakes yield attachedevidence tofor test stereotype performance. threat effects, we are not convinced that the lack of a ste -

Finally, it might be possible that the stereotype threat manipulation sim ply does not influence Dutch children. Even though stereotype threat effects have been found among Dutch college students (Marx, Stapel, & Muller, 2005;- Wicherts, Dolan, & Hessen, 2005) and among students aged 12-16 in Italy, France, Uganda, Spain and Germany (Delgado & Prieto, 2008; Huguet & Régn- er, 2007, 2009; Keller & Dauenheimer, 2003; Muzzatti & Agnoli, 2007; Picho & Stephens, 2012), there is a possibility that our studied population is not suffi- ciently affected by stereotype threat. For the discrepancy with past results, we 3 can think of potential cross-cultural explanations (i.e., in Dutch society this gen der stereotype has little influence on test performance), statistical explanations- retical(i.e., a Type explanations II error occurring),that should generational be tested in later explanations meta-analyses (i.e., this and generation randomized of students is no longer sensitive to stereotype threat) or other yet unknown theo We are convinced that we carried out a powerful and well-designed experiment. Ourexperiments. experiment Post mirrors hoc, it many is difficult of the to past judge stereotype which explanation threat studies is the with right positive one. our study is clearly superior to those earlier studies in terms of statistical power. results in terms of setting, type of test, and stereotype threat manipulation, and- ies of stereotype threat in classroom settings. Results of past studies have been Our findings are not surprising given diverging results of earlier stud heterogeneous (see Flore & Wicherts, 2015 for an overview), with some studies finding large effects for specific groups (e.g., Muzzatti & Agnoli, 2007) and others finding no stereotype threat effect at all (e.g., Cherney & Campbell, 2011; Ganley et al., 2013). Because the divergence in earlier findings is not readily explainable havein terms suggested of theoretically that publication driven moderators, bias and other but does related match biases the patternaffect the expected litera- from publication bias in meta-analyses (Flore & Wicherts, 2015), several authors ture on stereotype threat (Flore & Wicherts, 2015; Ganley et al., 2013; Stoet & Geary, 2012). Because of the severity of biases due to the flexibility in analyzing availablerelatively stereotypesmall experiments threat studies (e.g., see fail Bakker to paint et an al., accurate 2012) and picture a common of the generfailure- alizabilityto report at of leaststereotype some experimentalthreat among results,schoolgirls. meta-analyses based on currently Now that we have a rich theoretical background of stereotype threat

(Inzlicht & Schmader, 2012; Schmader, Johns, & Forbes, 2008; Spencer et al., 88 Chapter 3

2016), it might be time to rigorously study effects of stereotype threat in future confirmatory studies. Direct replications in several contexts, with proper prior threatpower onanalysis math andperformance. a pre-registered With registered methods sectionreports andand analysesother pre-registered specified in advance, will give us a better understanding of the actual influence of stereotype- ditions of stereotype threat: for what type of students do stereotype threat ef- studies we can systematically answer questions concerning the boundary con

fects emerge, in which cultures, in which age groups, and on what topics do the- effects occur? Once the boundary conditions in those studies are clear (e.g., if only extremely high domain identified women underperform on extremely diffi- cult tests) we might wonder whether gender stereotype threat is as important as currentpreviously large-scale claimed, study and reconsider does show whether that the weeffects should of stereotype implement threat general on intermath testventions performance to counter should it (Jordan not be & Lovett,overgeneralized. 2007; Walton et al., 2013). Either way, the 3 With this study we started an effort to testing stereotype threat effects in

a confirmatory fashion using a meticulous design. Other efforts to improve the replicability of stereotype threat studies, like high powered studies (Smeding, Dumas, Loose, & Régner, 2013; Stricker & Ward, 2004), additional pre-registered- replication studies (Finnigan & Corker, 2016; Gibson, Losee, & Vitiello, 2014; Moon & Roeder, 2014) are now starting to appear. We hope this trend will con onlytinue are in collaborationsthe future, and useful might to extend design to studies other withexciting combined formats input like ofadversarial research- collaborations to replicate some of the original stereotype threat findings. Not - ers with different kinds of expertise, they additionally simplify the work because multiple parties need to gather data, sharing the burden of acquiring a large sam- ple. The advantages of large multi-lab (replication) studies are numerous: results acrossare often labs more and robustcultures than can results be studied from systematically.a small study, powerSuch efforts to find shed a signifi light oncant the stereotype nature of threat stereotype is higher, threat and and generalizability can help ameliorate of stereotype its potential threat effects

negative stereotypes. on women’s academic performance in fields in which they are still faced with Stereotype threat and Dutch high school students

89

3

Chapter 4

Current and best practices in conducting and reporting DIF analyses

practices in conducting and reporting DIF analyses. This chapter will be submitted as Flore, P. C., Oberski, D. L., and Wicherts, J. M. Current and best Chapter 4

92

Abstract

groups is a common part of evaluations of psychometric properties of scales and Testing Differential Item Functioning (DIF) with respect to (demographic)

is widely considered crucial for fair use of these scales in clinical, educational, and professional practice. To obtain meaningful assessments of (or the lack of)- dardsDIF in werescales, met it is in equally the current crucial literature that the DIFby performing analyses are a adequatelysystematic reviewexecuted, of currentwell reported, practices and in reproducible. DIF analysis andWe reporting.evaluated Codingthe extent a random to which sample these of stan 200 articles from the empirical DIF literature on a large number of characteristics

revealed that, overall, analysis practices were adequate, and sample sizes were- mostly large enough for adequate power. Reporting practices, however, were clearly suboptimal in a majority of studies, with many DIF studies failing to re ofport analysis details methods on statistical and reporting. results, flagging rules, and DIF effect sizes. Based on our findings, we provide guidelines to improve applied DIF researchers’ choice

4 Reporting DIF analyses

93

4.1 Introduction

- Before comparing groups of people’s scores on psychological, educational, or- clinical scales, it is crucial to rule out the possibility that apparent, observed dif ferences are simply caused by differences in measurement (American Education al Research Association, American Psychological Association, & National Council- on Measurement in Education, 2014). Scales that are measurement invariant (Mellenbergh, 1989; Meredith, 1993; Steenkamp & Baumgartner, 1998) or mea surement equivalent (Davidov, Meuleman, Cieciuch, Schmidt, & Billiet, 2014), do not show bias over groups. With (dichotomous) item scores, MI/ME is typically aimstudied to establish by using the model-based presence or tests absence of Differential of item measurement Item Functioning differences (Holland across & Wainer, 1993). The term DIF may also refer to a suite of statistical practices that- indemographic test fairness groups - especially such as in age, high-stakes race, or gender testing or – across studying groups DIF based when ondesign diag- nostic criteria (e.g., depressed vs. non-depressed adults). Because of its key role widespread practice. ing, refining, and validating psychological tests and questionnaires has become a 4 To study DIF, a number of different statistical methods have been developed (e.g., Holland & Wainer 1993; Davidov et al. 2014). Although all of these methods aim to detect or remove potential DIF, their results might vary (Borsboom, 2006). This divergence naturally leads researchers to question the best course of action when analyzing scales. Should we use stratified chi-square (“Mantel-Haenzel”) tests or latent variable models? Is purification of the scale (i.e., removing items displaying DIF in subsequent tests of DIF on remaining items) important? What- is an adequate sample size to detect DIF with a given method? Which criteria do we use to flag an item as showing DIF? Answers to such questions (respective ly: “latent variable models”, “yes”, and twice “it depends”) can be found in the large methodological literature around DIF (e.g., Clauser & Mazor,. Whether 1998; Guilera, these recommendationsGómez-Benito, Hidalgo, on how & Sánchez-Meca, to best perform 2013; DIF Hambleton, analyses are 2006; implemented Navas-Ara in& Gómez-Benito, 2002; Teresi & Fleishman, 2007; Tay et al., 2015) - practice is an important indicator of the quality of the DIF literature,reproducibility and, conse quently of the quality of scale comparisons in psychologyreplicability and related fields. - Of further importance are the core scientific principles of - (similar research findings from the same data set) and (similar re search findings from a new random sample; Asendorpf et al., 2013). If research baseders fail on to thoseinclude analyses enough could information remain tosubject ensure to reproducibilitydebate among ofscientists their study, and readers cannot check the validity of DIF analyses. Consequently the conclusions practitioners using a particular scale. Similarly, without the option to replicate Chapter 4

94

specific DIF results, readers can never verify by administering the same scale- to novel samples whether an earlier finding of (or a lack of) DIF in that scale might be a potential false positive (“Type I error”) or false negative (“Type II er ror”). For example, if the research finding that a particular high stakes testtrans lacks- DIF items is neither reproducible nor replicable, fairness of the test will remain- ducibilitycontentious. and A replicability second indicator of DIF of tests the depend quality onof transparencythe DIF literature of data is thecollection parency and comprehensiveness of its reporting practices, because both repro

procedures,In spite clarity of the on clear (choices importance made in)of selectingthe used soundstatistical DIF method(s)methods and to testde- DIF, and the specificity of DIF test results. - tailed reporting of DIF methods and results, little is known about these practices- parisonsin the literature. across groups. This is problematicIf suboptimal for methods several arereasons. being First, used muchto test of scales the psy for chological literature relies on scale scores and their (explicit or implicit) com - ableDIF, the“researcher conclusions degrees from ofpsychological freedom”. This studies freedom based to on choose them maybetween be at differstake.- Second, a diversity of methods and the many choices therein create consider 4 ent methods and specifics of the analysis can unintentionally lead to incorrect- or debatable inferences, particularly when the meandering paths taken before arriving at the study’s conclusions are obfuscated by inadequate reporting (Gel man & Loken, 2014). Third, reporting practices, aside from their importance to- erreproducibility researchers cannotand replicability, use this information affect the potentialif they would usefulness like to of use DIF the studies. same For example, if authors fail to report which items of a scale showed DIF, oth therefore crucial to understand common practices in DIF methods and reporting. scale in future research or develop a refined scale for a similar population. It is any other type of statistical analysis using the Null Hypothesis Statistical Testing Moreover, since DIF analyses are prone to Type I errors and Type II errors like - (NHST) framework, it is necessary that researchers report all characteristics of their study that can influence the false positive and false negative rate (e.g., sam ofple how size, researchers DIF effect sizes, use and number report of DIFitems analyses tested, inalpha the level).peer-reviewed literature. To shed light on current DIF practices, we conducted a systematic review

Specifically, we reviewed a large random sample of DIF studies to answer three- key questions concerning how DIF analyses are typically conducted and reported. First, we documented the most popular types of statistical methods in the empir- ical DIF literature. Second, we determined whether methodological choices are optimal for the chosen statistical method (e.g., do authors select enough partici pants; do they use effect sizes as cut-off scores in addition to significance tests; do they use purification?). Third, we documented what relevant information authors Reporting DIF analyses

95

- startreported, with ora discussionfailed to report of relevant in their methodological DIF articles. Furthermore, literature on weDIF aid analysis. future Were usesearch this by existing providing body improved of methodological guidelines forDIF DIF literature analysis. to In construct this chapter, clear we guide first- methods and results of our systematic review in the second half of this chapter. lines to improve future quality and reproducibility of DIF analyses. We report the

4.1.1 Differential Item Functioning An item’s measurement can be considered invariant when the expected value of 17 - that item, controlled for a latent trait of interest, is equal over groups (Mel- lylenbergh, and generally 1989). as: Let X be the tested dichotomous item, θ the latent trait and G a grouping variable. Mellenbergh (1989) defined measurement invariance formal f (X | g,θ ) = f (X |θ ) in which f (X | g,θ ) gives for the all distribution g and θ, of item X given g f (X | g,(1)θ ) the distribution of item X to X. If an and item θ, lacksand measure- 4 given θ, based on the “item response function” (IRF) implicated in a psychometric model that relates θ - nationment invariance, of differing it goals: shows DIF. Thus, DIF occurs when items have different IRFs for1. different groups (Holland & Wainer, 1993). DIF analyses might have a combi 2. To establish whether Equation 1 holds in a population (hypothesis testing); If Equation 1 does not hold, to establish the size of differences in IRFs across- groups (item-level effect size estimation); - 3. If Equation 1 does not hold, to establish the effect that cross-group differ ences in IRFs have on latent trait estimation (trait-level effect size estima tion). - To accomplish these goals, participants are matched on their ability level, or a proxy of it, the matching criterion. This matching criterion can be based on ob- served or unobserved variables (Millsap & Everson, 1993) and so either uses an observed total score on the test under investigation or an estimation of the (un observed) latent trait in a model-based manner (i.e., it is not required to estimate the dataindividuals’ and the latenttype of scores). DIF they DIF are analyses able to uncover. can vary in the number of groups that are compared, the measurement level of the items, multidimensionality of

17 educational research. In this chapter the terms are used interchangeably. Depending on the field the term latent trait is sometimes substituted with the term latent ability, for instance in Chapter 4

96

It is useful to distinguish uniform and non-uniform DIF (Mellenbergh,- 1982). Uniform DIF occurs when the probability of correctly answering an item anis not interaction equal for between the studied the groups group variableand constant and theover latent all scores or observed of the latent trait or score ob served trait (Hanson, 1998; Mellenbergh, 1982). Non-uniform DIF occurs when

is needed to explain the data properly (Mellenbergh, 1982; Millsap & Everson, 1993). Figure 4.1 displays a hypothetical example of uniform and non-uniform- DIF with gender as grouping variable. In the situation of non-uniform DIF, the- IRFs for the studied groups are usually not parallel, however there are some sit uations in which uniform DIF occurs but DIF is not parallel and vice versa (Han Uniform DIFUniform DIF Non−uniformNon−uniform DIF DIF son, 1998), so the two terms are not interchangeable. Not all DIF methods are equally goodUniform atUniform flagging DIFUniform DIFUniform items DIF for DIF both kinds of DIF, as we willNon−uniform discussNon−uniformNon−uniform later. DIFNon−uniform DIF DIF DIF

Uniform DIFUniform DIF Non−uniformNon−uniform DIF DIF 1.0 1.0 1.0 1.0 Boys Boys Boys Boys

1.0 1.0 1.0 Girls1.0 Girls 1.0 1.0 1.0 Girls1.0 Girls 0.8 0.8 0.8 0.8 Boys Boys Boys Boys Boys Boys Boys Boys

θ ) θ ) Girls Girls Girls Girls θ ) θ ) Girls Girls Girls Girls 0.8 0.8 0.8 0.8 0.8 0.8 0.8 0.8 0.6 0.6 0.6 0.6 1.0 1.0 1.0 1.0 Boys Boys Boys Boys θ ) θ ) θ ) θ ) θ ) θ ) θ ) θ )

0.6 0.6 0.6 Girls0.6 Girls 0.6 0.6 0.6 Girls0.6 Girls 0.4 0.4 0.4 0.4 0.8 0.8 0.8 0.8

P(Y=1| 4 P(Y=1| P(Y=1| P(Y=1| θ ) θ ) θ ) θ ) 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.2 0.2 0.2 0.2 0.6 0.6 0.6 0.6 P(Y=1| P(Y=1| P(Y=1| P(Y=1| P(Y=1| P(Y=1| P(Y=1| P(Y=1| 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.0 0.0 0.0 0.0 0.4 0.4 0.4 0.4 P(Y=1| P(Y=1| P(Y=1| P(Y=1| 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.2 0.2 0.2 0.2 −10 −5−10 0 −5 5 0 10 5 10 −10 −5−10 0 −5 5 0 10 5 10 0.0 0.0 0.0 0.0 −10 −10−5−10−50 −10−5 05 −50 105 50 10 105 10 −10 −10−5−10−50 −10−5 05 −50 105 50 10 105 10

−10 −5−10 θ0 −5 5 θ0 10 5 10 −10 −5−10 θ0 −5 5 θ0 10 5 10

Figure 4.1 Examplesθ ofθ uniformθ andθ non-uniform DIF. θ θ θ θ

θ θ θ θ

4.1.2 Methodological considerations in DIF studies

- A DIF study involves many decisions that can be summarized in four categories; the statistical methods used, the rules used to flag items as exhibiting DIF, the hy Wepotheses now discussused to thesetest the considerations items, and whether as a foundation purification for is our used proposed to prevent guide bias- lines.on the matching criterion. Ultimately these decisions affect quality of DIF tests. Reporting DIF analyses

97

Statistical Method Methods for testing DIF can be categorized into observed score methods and un- - isobserved used to score gauge methods the ability (Millsap of the & participants. Everson, 1993). Observed For unobserved score methods score methlack a formalods a formal model measurement and typically usemodel the hastotal been test specifiedscore as a and proxy thus of thea latent participants’ variable ability. Due to the large number of statistical methods in DIF testing we limit ourselves to methods and models used to analyze scales with dichotomous indi- cators. Most of these methods have extensions for polytomous indicators as well

(Tay et al., 2015). In the literature, continuous indicators are usually considered in a structural equation modelling framework with latent variables, which is not the focusThere of are our several review reasons (see N. to Schmitt select anand unobserved Kuljanin, (2008), score methodSteenkamp over and an observedBaumgartner score (1998), method and or Vandenberg vice versa. Many and Lance, authors (2000) select for observed a review). score meth- ods because they are relatively easy to implement, do not require model fit, and offer clear guidelines for DIF effect size (Sireci & Rios, 2013). However, the use of an observed score as proxy for ability can lead to biased results, especially with 4 isscales more shorter accurate than because 20 items measurement (Millsap & Everson, error is 1993).taken intoThe accountmain advantage in a mod of- unobserved score methods, particularly for shorter scales, is that the matching el-based manner (Millsap & Everson, 1993; Woods, Oltmanns, & Turkheimer, Observed2009). Score Methods -

Although a variety of observed score methods exist, three observed score meth ods that have received most attention in the literature are the MH test, logistic regression, and SIBTEST (these and other DIF methods are described in more detail in Appendix C). The MH2 test uses contingency tables to test for DIF, usually- monwith theodds observed ratio and total the scoresmore popular of the participants counterpart as MH matching D-DIF cancriterion be used (Dorans as ef- & Holland, 1993). With a χ test, items can be tested for significant DIF. The com fect size measures for the MH method (Dorans, 1989). Although the MH method Logisticwas originally regression developed analysis to detect is another uniform popular DIF, an observed extension score has beenmethod developed to test that can detect non-uniform DIF as well (Mazor, Clauser, & Hambleton, 1994). for DIF (Swaminathan & Rogers, 1990). Significance tests for logistic regressions can be operationalized by means of either the WaldR test,2 likelihood ratio test or the score test (Paek, 2012), although the Wald test is the most commonly used- (Swaminathan & Rogers, 1990). Either Nagelkerke (Zumbo, 1999; Zumbo & Thomas, 1997) or (delta) log (Fidalgo, Alavi, & Amirian, 2014; Mo Chapter 4

98

- sure. Logistic regression has more power to detect non-uniform DIF than uni- nahan, McHorney, Stump, & Perkins, 2007) can be used as an effect size mea

statisticallyform DIF (Hidalgo tests the & López-Pina,weighted mean 2004). differences A third popular between observed groups forscore a separate method is SIBTEST (Shealy & Stout, 1993). SIBTEST is a non-parametric technique that ˆ means of a regression correction. The parameter βUNI itemwith ora z a bundle of items, corrected for differences in the ability distributions by- reflects DIF, can be tested -test, and can be used as a DIF effect size estimate. SIBTEST was original ly designed to detect uniform DIF, but the development of CROSSING-DIF (Li &- Stout, 1996) also enabled tests of non-uniform DIF in SIBTEST. Several other observed score methods to test for DIF exists, including; stan dardization (Dorans & Kulick, 1986), log-linear methods (Kok, Mellenbergh, &- Van Der Flier, 1985), the Breslow-Day and the combined decision rule method- (Penfield, 2003; Prieto-Marañón, Aguerri, Galibert, & Attorresi, 2012), the logis oftentic regression observed lasso score DIF methods method are (Magis used etin al,the 2014), DIF literature. and Angoff’s Delta Plot (An goff & Ford, 1973; David Magis & Facon, 2012). Our review will document how Unobserved Score Methods 4 Unobserved score methods estimate DIF as part of a latent variable model. We

distinguish two types of unobserved score methods (Millsap & Everson, 1993); unobserved score methods with categorical indicators (that we focus on in our systematic review) and unobserved score methods with continuous indicators- observed(that have score been reviewedmethods within the categorical literature onindicators invariance; are N.described Schmitt &within Kuljanin, the 2008; Steenkamp & Baumgartner, 1998; Vandenberg & Lance, 2000). Often, un model18 framework of item response theory (IRT) modeling. With the use of the Rasch , the two-parameter logistic model (2-PL) or the three-parameter logistic

model (3-PL), tests can be implemented2 for testing DIF in dichotomous items. A plethora of procedures are available to test for DIF in the IRT framework, but three popular procedures2 are Lord’s χ , IRT-LR-DIF and area measures (Millsap,- 2011).

In Lord’s χ procedure, item parameters are estimated for the2 groups sep arately (F. M. Lord, 1980) using a linking procedure to ensure that both groups item parameter bˆ and discrimination item parameter aˆFi are placed on the Fi same metric (Kim & Cohen, 1995). With Lord’s χ the difficulty for the first group can

18 Over the years some controversy arose between researchers who considered the Rasch model as a special case of IRT modeling and researchers who considered the Rasch model as fundamentally different. For an overview

of this controversy, see Andrich (2004). We acknowledge the discussion, however from a simplicity perspective we will discuss the Rasch model and the (other) IRT models simultaneously. Reporting DIF analyses

99 be compared with the corresponding parameters for the second group. A second - - widely used IRT method to test for DIF is IRT-LR-DIF (Thissen, Steinberg, & Wain er, 1993). This approach involves a likelihood ratio test of DIF based on the differ areence freely in fit estimated.between a Acompact third approach model wherein to DIF testing item parameters uses area measuresfor both groups of the are constrained to be equal, and an augmented model wherein item parameters- tion to determine the area between the item response functions of both groups. distance between IRFs in two groups (Raju, 1988). This approach uses integra-

Area measures reflecting DIF can be considered as item-level effect size esti mates in the context of IRT (Meade, 2010), which is especially useful in the 2-PL or 3-PL model. Another effect size estimate is to simply interpret the difference in difficulty estimates (uniform DIF) or discrimination parameters (non-uniform- DIF) between groups as an effect size (L. Steinberg & Thissen, 2006). Other approaches to IRT DIF are available, such as loglinear IRT-LR (Kel derman, 1989), limited information IRT-LR (Muthén & Lehman, 1985), score tests or Lagrange Multiplier tests (Glas, 1998), mixture models (Cohen & Bolt, 2005; De Ayala, Kim, Stapleton, & Dayton, 2002, Frederickx et al, 2010), logistic- mixed models/random item IRT models (Van den Noortgate & De Boeck, 2005; 4 De Boeck, 2008), Rasch Trees (Strobl, Kopf, & Zeileis, 2015), the IRT-C proce dure (Tay, Vermunt, & Wang, 2013), ideal pointAlthough models we (e.g.,only discussGeneralized unobserved Graded Unfolding Model, Seybert, Stark, & Chernyshenko,19 2013), and Langer-Improved Wald tests (Woods, Cai & Wang, 2012). score methods with categorical indicators in an IRT framework, similar results can be obtained using a confirmatory factor analysis (CFA) framework. Stark,- minologyChernyshenko, and practicesand Drasgow in both (2006), frameworks. Glockner-Rist A CFA & approach Hoijtink (2003)would certainlyand Tay, beMeade, appropriate and Cao when (2015) indicators discuss are similarities continuous and and differences the underlying between factors the terare

- multi-dimensional. Well-known techniques are multi-group CFA (French & Finch, 2008; Meredith, 1993; Widaman & Reise, 1997), Restricted Factor Analysis (Bar endse, Oort, & Garst, 2010), or analyses using the Multiple Indicators Multiple Causes (MIMIC) model, with the MIMIC model being especially popular due to its smaller sample size requirements (Woods, 2009). Originally MIMIC modeling only allowed for uniform DIF testing, but recently methods have been extended to include interaction terms, which enable researchers to test for non-uniform DIF as well (Woods & Grimm, 2011).

Some hybrids of observed score methods and unobserved score methods are developed for DIF analyses as

19 classify as observed score method in our review. well (e.g., ANOVA of estimated residuals based on the Rasch model, Andrich, Sheridan, & Luo, 2010), which we 100 Chapter 4

one underlyingMost IRT-based latent DIF trait. or Violationsobserved scoreof unidimensionality methods require increase unidimensionality the Type I of the scale (Millsap, 2011). Unidimensionality holds when a scale measures only-

error for DIF tests in several methods (Mazor, Hambleton, & Clauser, 1998; Mill- antsap, that 2011). essential Because unidimensionality unidimensional holds IRT models when an are IRT relatively approach robust is used to for minor DIF violations of unidimensionality (Drasgow & Parsons, 1983), it is more import that consists of small subfactors. Several extensions of unidimensional DIF meth- analysis (Tay et al., 2015), i.e., data with a general underlying dominant factor

ods have been developed to take intentional multidimensionality into account,- tidimensionalitylike MULTISIB (Stout, readily Li, Nandakumar,into account. &We Bolt, recommend 1997) and checking multidimensional for essential IRT (Camilli, 1992). Methods based on CFA are generally flexible and can take mul-

unidimensionality when a unidimensional approach is used. Moreover, IRT ap proaches usually require that the model fit adequately. It is therefore important to report model fit indices as well (Tay et al., 2015). Once an appropriate method Flaggingis selected, for it isDIF important to consider when items will be flagged for DIF. 4

There are several ways to flag an item for DIF. One can decide to test an item Issuesbased withon significance NHST testing, an effect size cut-off score, or both.

- Null Hypothesis Significance Testing (NHST) is a dominant approach in flagging DIF. Since DIF analyses usually entail multiple tests (e.g., a test for each item sep arately or when multiple group comparisons are made) many statistical methods suffer from an inflated overall Type I error rate, which means that the chance of Thefinding familywise at least Typeone statistically I error rate significant can be controlled DIF item by in a multiplea scale when testing in correcfact no- real underlying DIF is present grows larger than the nominal error rate of 5%. - tion, like the Bonferroni correction (Bland & Altman, 1995), Holm’s procedure (Holm, 1979), or by the Benjamini-Hochberg’s false discovery approach (Ben jamini & Hochberg, 1995; Raykov, Marcoulides, Lee, & Chang, 2013). Observed showedscore methods that under seem most to benefit circumstances most from the theBenjamini-Hochberg Benjamini-Hochberg procedure correction has or Holm’s procedure (Kim & Oshima, 2012). Williams, Jones, and Tukey (1999)- ods are successful in controlling the family-wise Type I error rate and the false discoverymore power rate to when detect the DIF family than of the comparisons Bonferroni is correction, not too large. whereas The other both sidemeth of the coin is that applying a multiple testing correction ultimately leads to a loss of -

power to detect actual DIF (Kim & Oshima, 2012; Penfield, 2001). Under subopti Reporting DIF analyses 101

mal conditions (e.g., small sample sizes and small DIF effect sizes) the Bonferroni correction can become conservative, leading to unacceptably low levels of power to detect DIF (Kim & Oshima, 2012). Solely focusing on statistical significance can be tricky, because a large enough sample size will ultimately lead to flagging of items with negligible DIF effect sizes, whereas a small sample size will lead to Type II errors (Hambleton,- 2006). The challenge in testing DIF is to balance the costs and benefits of both types of error. Some researchers advocate flagging items for DIF based on signif icance tests of multiple DIF methods (Hambleton, 2006; S. Kim & Cohen, 1995). Depending on whether a conservative or liberal flagging criterion is used, this- practice tends to inflate either the Type I error rate or the Type II error rate, especially when the sample size is small and groups have unequal ability distri Powerbutions (Fidalgo, Ferreres, & Muñiz, 2004).

- Discussing the role of power is crucial in DIF testing, because some researchers might be motivated to find as few items with statistically significant DIF as pos sible (Borsboom, 2006; Hambleton, 2006), which is facilitated by low-powered- 4 DIF tests. Calculating the required sample sizes for a desired power rate is not straightforward for DIF analyses, because power depends on many characteris tics of the data (e.g., reliability of the scale, ability distributions of the groups, a priori expected DIF effect size). Recently, power formulas have been developed for several observed score methods (Li, 2014a, 2014b, 2015). Among other things, power of DIF tests are strongly related to sample size, DIF effect size, the length of the scale, and item reliability. A meta-analysis of 3,774 simulation conditions of the MH procedure showed that sample size is an important factor in power rates for the MH method (Guilera, Gómez-Benito, Hidalgo, & Sánchez-Meca, 2013). The 124 simulation conditions with sample sizes smaller than 500 averaged a power of .375, indicating that the sample sizes should be quite large when using the MH test. Moreover, the tests are most powerful when the sizes of the groups are balanced; unbalanced group sizes reduce the power to detect DIF. Similarly, simulation studies showed that observed score DIF methods like logistic regression and SIBTEST require sample sizes of 250/300 per group (Lei & Li, 2013; Narayanan & Swaminathan, 1994) to 500 per group (Bolt & Gierl, 2006; Gierl, Gotzmann, & Boughton, 2004; Herrera- & Gómez, 2008; Narayanan & Swaminathan, 1996; Rogers & Swaminathan, 1993; Swaminathan & Rogers, 1990), under the most favorable circumstances (e.g., bal anced group sizes, equal ability distributions, balanced DIF, which means that 50% of DIF items favor one group and 50% of DIF items favor the other group), whereas some researchers find that 300 participants per group does not lead to sufficient power (Güler & Penfield, 2009). 102 Chapter 4

Overall, IRT methods are known to require large sample sizes compared to observed score methods (Teresi, 2006a). However, DIF analyses under the Rasch- model can handle sample sizes as small as 200 to 300 subjects per group under favorable circumstances (Paek & Wilson, 2011). Even the IRT-LR DIF method im plemented with the 2PL model shows sufficient power with 100 participants in- the first group and 500 participants in the second group (Finch, 2005), although a second study suggested 200 participants or more per group is a required mini Undermum (Woods, many circumstances 2009). These simulationthe MIMIC studies model showedis as powerful that IRT as methods the observed do not necessarily require much larger sample size than the observed score methods.

selectscore methodsan observed (Finch, score 2005), method and for equally a DIF oranalysis more powerfulover an unobserved than IRT-LR score DIF (Woods, 2009). As such, a small sample size does not have to be the reason to

method. Multi-group CFA on dichotomous indicators does require large sample sizes of over 500 participants per group (French & Finch, 2006). Scale length seems to influence power for the MH procedure, with most power found for either short (20 item) scales or long (40+ item) scales, although 4 the Type I error rate with MH was relatively high for short scales (Guilera et al., 2013). In simulation studies, scale length hardly influenced the power of the test in the context of IRT or logistic regression (Finch, 2005; Frederickx, Tuerlinckx, inDe mostBoeck, of &those Magis, simulation 2010; J. Kim studies & Oshima, the shortest 2012; Rogersscales usually & Swaminathan, consisted 1993;of 20 Swaminathan & Rogers, 1990) and SIBTEST (Klockars & Lee, 2008). We note that-

items. If scales shorter than 20 items are used, which is not uncommon in psy chological testing (Tay et al., 2015), those scales might be less reliable than their longer counterparts, and lower reliability is expected to lower the power. For- some techniques authors explicitly call for a minimum scale length, for instance- the MH test and SIBTEST require a matching criterion of 20 items or larger (Mill sap & Everson, 1993; Shealy & Stout, 1993). For IRT methods scales are recom mended to contain at least 10 items (Tay et al., 2015). - cientlyWhen powered NHST tests. is used A disadvantage as (part of) ofthe the flagging focus on rule NHST for DIF,in DIF authors testing need is that to smallerconsider sample whether sizes the will selected lead tonumber a higher of probabilityparticipants of and making items Type lead IIto errors suffi

whereas a large sample size can lead to flagged DIF items with negligible effect sizes and therefore negligible practical significance (Hambleton, 2006). In our review of empirical DIF studies, we documented the number of items and sample sizes in order to check whether DIF tests were sufficiently powerful. Reporting DIF analyses

103

Effect sizes and cut-off scores effectEstimates sizes of are DIF necessary effect sizes for are researchers useful for threewho want reasons to carry (L. Steinberg out meta-analyses & Thissen, summarizing2006). First, effect DIF results sizes can or whobe used want as to input compare for future DIF across power different analysis. samplesSecond, -

(e.g., original vs. replication). Third, effect sizes are an important part of report ing, as they allow readers not only to inspect statistical significance, but practical- significance as well. - To avoid flagging items with trivial DIF effect sizes, various authors recom Effectmended size using cut-off a minimal scores have effect been size proposed as (part of)for severalthe DIF statisticalflagging rule methods. (Hamble For ton, 2006; Monahan et al., 2007; Sireci & Rios, 2013; Teresi & Fleishman,MH D-DIF 2007). cut- instance, in theˆ context of the MH test the ETS guidelines consider SIBTEST a βUNI offs at 1.0 and 1.5 (Dorans & Holland, 1993;∆ Zieky,R2 1993), and in the context of > .059 has been proposed as the cut-off value (Roussos & ∆Stout,R2 > 1996). In the context of logistic regression a > .13 has been proposed as the cut-off value for DIF (Zumbo, 1999) and later a less conservativeβˆ one of .035 (Jodoin & Gierl, 2001). Alternatively, when using the log odds correlation 4 coefficientThere in is theless context agreement of logistic on the regression,best guidelines cut-off for valueseffect sizes of in> 0.426IRT or have CFA been suggested (Fidalgo et al., 2014; Monahan et al., 2007). -

(Sireci & Rios, 2013). In IRT, some psychometricians recommend using the dif ference in item difficulty parameters for both groups as the effect size estimate (L. Steinberg & Thissen, 2006), whereas others advocated the use of one of the proposed area measures (Meade, 2010) based on various cut-off scores to avoid flagging DIF items with a negligible DIF effect size (Raju, 1995). When authors flag for DIF using the difference between parameter estimates as the effect size measure, the cut-off scores vary usually between 0.25 and 1.00 logits (amongst others, Kahler, Strong, Read, Palfai, & Wood, 2004; McAlinden, Pesudovs, & Moore, 2010; Srisurapanont et al., 2012; Weinstock, Strong, Uebelacker, & Miller,- 2009), with a cut-off score of .50 being slightly more popular than other values sizes.(e.g., Lai, In theCella, context Chang, of Bode, the MIMIC & Heinemann, model the 2003; effect Strong, size Kahler,estimate Greene, MIMIC-ES & Schin can ka, 2005; Winke, 2011). DeMars (2011) and Meade (2010) reviewed DIF effect be used (Jin, Myers, Ahn, & Penfield, 2012). Although the reporting of effect sizes is encouraged in both the IRT and CFA contexts by several researchers (DeMars, 2011; French & Maller, 2007; Meade, 2010; Millsap, 2011; Tay, Meade, & Cao, 2015), few suggestions have been made as a effect size estimate to use or which cut-off to use as (part of) a flagging rule. Millsap rightly noted that one-size fits all cut-off scores are not always useful across fields, and their usefulness in IRT is further limited (Millsap, 2011). 104 Chapter 4

Several authors argued that while the use of effect sizes can safeguard

against flagging too many items with trivial DIF effect∆R2 sizes as DIF items, it often comes at the cost of unacceptable power rates (French & Maller, 2007; Jodoin & Gierl, 2001; Zwick, 2012). Especially the use of in the flagging rule using the otherpopular hand cut-off several score authors of .13 advocate appears thatto be effect an overly sizes shouldconservative be used approach because (French & Maller, 2007; Hidalgo & López-Pina, 2004; Jodoin & Gierl, 2001). On

a statistically significant DIF does not necessarily imply practical significance (Hidalgo, Gómez-Benito, & Zumbo, 2014; Meade, 2010). Preferably researchers should not rigidly follow rules of thumb, but either carefully select a cut-off score for effect sizes appropriate for their type of research (e.g., based on the costs of a practically insignificant DIF item over a Type II error in the field) or simply- report effect sizes for descriptive purposes. Gold standard cut-off scores for DIF effect sizes can be problematic, for instance because items with equal DIF ef fect size will have a larger impact on short scales than on longer scales (Millsap, 2011). Instead of solely using DIF effect sizes, some authors propose to check the- impact of a DIF item on the entire scale (Borsboom, 2006) or to investigate the 4 effect of the DIF item on selection accuracy (Millsap & Kwok, 2004), to get an∆ Rin2 dication how the DIF item influences conclusions or decision making. In general, βˆ we recommend that authors report effect sizes for all items (e.g., MH-D-DIF, , , item parameters for both groups, area measures), and that authors use effect and/orsizes as on part effect of theirsize estimates.flagging rule. In our review, we studied whether authors report effect sizes, and whether they base their flagging rule on significance tests Hypotheses and generalizability

useful if researchers had clear hypotheses on which items are expected to show The use of NHST implicates the use of hypotheses. In DIF testing, it would be

DIF. Specific hypotheses concerning DIF are rarely tested (for an exception, see Scheuneman, 1987). Many researchers use DIF testing in a more exploratory way, for instance as a tool to screen scales and clear them of DIF. Given the nature of DIF testing (i.e., often comparing naturally occurring groups instead of groups lackrandomly clear assigneda priori hypotheses. to an experimental Potential condition), hypotheses and are the sometimes typical purpose formulated of DIF testing (i.e., to clear scales of DIF) it is hardly surprising that many DIF studies

topost Type hoc, I errors.a practice Whereas known in as other HARKing types (Kerr, of research 1998). theseThis is Type problematic I errors couldbecause be detectedDIF analysis by replicationoften involves studies a large in novelnumber samples of significance or by practices tests, whichlike cross-vali will lead-

often get removed from scales right away without much replication or cross-val- dation (Hambleton, 2006; Teresi & Fleishman, 2007), in DIF testing these items Reporting DIF analyses

105 idation. This common approach impedes obtaining generalizable knowledge of

- DIF for specific scales, items or groups. Simply throwing away items does not help us understand why DIF occurred. Instead, it would be useful to cross-vali date DIF results in fresh parts of the data (if sample sizes allow this; Hambleton,- 2006) or to replicate them in novel samples (Sireci & Rios, 2013). Using these strategies, we will be more confident that scales actually im- prove by the removal of DIF items, and sound theory concerning DIF can be built. We recommend either formulating theory based a priori hypotheses, or sup crossplying validation cross validation or replication. or replication studies. In our review of DIF studies, we checked whether authors (a) reported a priori hypotheses, and (b) mentioned Purification

The idea underlying purification is that the matching criterion used for testing DIF might itself display DIF; if participants are ordered by ability based on a scale- in which some items might be biased, we expect bias in the total score as well, matchingthereby creating criteria. a Severalcircularity procedures problem (Doranshave been & Holland,developed 1993; to purify Navas-Ara the match & Gó- mez-Benito, 2002). Purification is a mechanism developed to avoid such biased 4 ing criterion, or in IRT terminology to empirically select the anchor (described in Appendix C). In many instances scale purification seems to be beneficial to the power rates of IRT methods (Navas-Ara & Gómez-Benito, 2002), MH (e.g., Clauser et- al., 1993; Fidalgo, Mellenbergh, & Muñiz, 2000; Guilera et al., 2013; Wang & Su,- 2004), and logistic regression (Navas-Ara & Gómez-Benito, 2002), with the clear est benefits from purification with larger effect sizes of DIF and more items show ing DIF (M. D. Miller & Oshima, 1992). However some authors argued that gains- by purification in terms of power are so slim that it might not be worth the extra- effort (in the context of logistic regression; French & Maller, 2007), or that puri fication strategies are even counterproductive (Magis & Facon, 2012). Purifica tion also reduces Type I error rates in IRT methods (González-Betanzos & Abad, 2012; Navas-Ara & Gómez-Benito, 2002), MH (Clauser et al., 1993; Fidalgo et al., 2000; Guilera et al., 2013), and logistic regression (Navas-Ara & Gómez-Benito,- 2002), although some simulation studies found purification strategies to inflate Type I error rates (French & Maller, 2007; Kopf et al., 2015a). Although techni cally the MIMIC approach does not require a purification technique (Jones, 2006; Teresi, 2006a), results of a simulation studies suggested that the use of a purified anchor does improve the performance of that method (Wang et al., 2009). - Even though the usefulness of purification methods have been challenged by some studies, overall there seems to be ample evidence that in many circum stances purification is either beneficial to DIF detection methods, or at least not Chapter 4

106

harmful to DIF detection methods. Especially for observed score methods we -

recommend the use of purification. We attempted to give an overview of puri fication practices in the empirical DIF literature, however it turned out this was difficult to code reliably (see Results section). 4.1.3 Guidelines

highlightsOur preceding the many review decisions of the methodological researchers need literature to make on when the type carrying of DIF out method, a DIF analysis.flagging rules, Opinions power, regarding effect sizes, what hypotheses constitutes and good generalizability, practice considering and purification, statistical

improveDIF methods reproducibility might differ and depending replicability on the for authors, all types characteristics of DIF analyses. of the Table data and4.1 the chosen type of DIF analysis. However, we offer ten basic guidelines that will - provides an overview of the most popular DIF methods and their requirements. - Based on the existing methodological literature and reviews on DIF, we pro vide a general checklist consisting of ten guidelines (GL) that ensure good prac 4 tice and sufficient reporting of DIF analyses. GL1. theDetermine key rationale the mean for studying group differences DIF and they on can the affect latent Type trait I and(Sireci Type & Rios, 2013), because the very existence of group differences is often

II error rates (DIF tests for groups with unequal ability distributions show inflated Type I errors (Guilera et al., 2013) and lower power rates (Jodoin & Gierl, 2001; Li, 2014)). GL2. Preferably, use an unobserved score method (Millsap & Everson,- 1993). Especially when scales are shorter than 20 items, observed score methods can give biased estimates of ability (Millsap & Ever- son, 1993; Teresi & Fleishman, 2007; Tay et al, 2015). - GL3. Check the appropriateness of the selected model by assessing mod el fit (Teresi & Fleishman, 2007; Tay et al, 2015). When using uni dimensional approaches, assess essential unidimensionality of the- scale for each group separately (Teresi & Fleishman, 2007; Sireci & Rios, 2013; Tay et al, 2015) to avoid enhanced Type I error rates. Vio lations of essential unidimensionality or poor model fit could be due to misspecification of the model (e.g., lack of second factor, selection of 1PL when 3PL is needed), and in this case the use of the selected model might not be justified. - GL4. Ensure a sufficiently large sample for the selected method and test- (See Table 4.1). With larger sample sizes, estimation accuracy will in crease (Tay et al, 2015) and power of the DIF analysis will rise (Ham bleton, 2006). Reporting DIF analyses 107

GL5. Construct and report a reproducible flagging rule based on both a significance test (preferably with FDR correction; Thissen, Steinberg, and Kuang, 2002) and an effect size cut-off score (Hambleton, 2006; dueMonahan to multiple et al., 2007; comparisons Sireci & Rios,and that 2013; items Teresi with & Fleishman,negligible DIF2007). ef- This practice ensures that the Type I error rate will not be inflated

fect sizes will not be flagged. GL6. Deal with issues of NHST by explicating DIF hypotheses to strengthen- egiesinferences, attempt by tousing control cross erroneous validation decisions (Hambleton, based 2006), on a single or by DIFcarrying test. out a replication in a new sample (Sireci & Rios, 2013). All of these strat

GL7. Use a purification procedure appropriate for the selected method- (Teresi & Fleishman, 2007; Sireci & Rios, 2013; Hambleton, 2006; Zwick, 2012). Purification can help to avoid bias in the matching cri allterion. items. Report This theadds type to theof implemented reproducibility purification of DIF analyses procedure. and is es- GL8. Report the results of significance tests and effect size estimates for- ta-analyses and reviews of results of multiple DIF studies. sential for assessing practical significance of DIF and for future me 4

GL9. Report psychometric characteristics of the scale, like the number of reproducibilityitems, estimated and item helps parameters, to get a sense and estimated of the power reliability of the coefficientDIF tests. (CTT), person separation index, or test information (IRT). This adds to - GL10. Investigate and report the impact of DIF on the (latent) group ability scores (Borsboom, 2006; Sireci & Rios, 2013; Tay et al., 2015; Tere- ferentsi & Fleishman, implications 2007; for Oberskipractice 2014),and for because our understanding the influence of ofgroup the differencesflagged DIF anditems the could role beof DIFtrivial therein. or large, leading to completely dif

4.1.4 A review of current practice Although the advantages and disadvantages of different statistical methods and pro- - knowledgecedures have whether been studied researchers extensively follow thesethrough best simulation practices. studies,In a systematic and the review meth odological literature provides several guidelines on DIF analyses, we have limited- of 51 DIF studies in organizational research, Tay et al. (2015) found that overall sam- ple sizes were appropriate for IRT DIF analyses, that the median number of items tested was 27 for cognitive scales and 10 for psychological scales, and that the major ity of researchers failed to discuss model fit. In another review of 17 studies for the literature on disabled students (Buzick & Stone, 2011), the majority of analyses were 108 Chapter 4

>.059 >.088

> .13 > .035 2 2 ˆ ˆ β β ES cut-off Medium DIF MH D-DIF > 1.0 DIF Large MH D-DIF > 1.5 & Holland, 1993) (Dorans Log odds > 0.426 (Fidalgo et al., 2014; Monahan et al., 2007), or R 1999) (Zumbo, R (Jodoin 2001) & Gierl, Medium DIF: DIF: Large & Stout, 1996) (Roussos -

2

ˆ β ES MH D-DIF 1989) (Dorans, (delta) log odds (Fidalgo et al, 2014; Mona han et al, 2007), or R Nagelkerke & 1999; Zumbo (Zumbo, Thomas, 1997) (Shealy & Stout, 1993) (Shealy

Minimal number of items 20 (Millsap & Everson, 1993; Millsap, 2011) 20 (Millsap & Everson, 1993) 20 (Millsap & Everson, 1993) 20 (Millsap & Everson, 1993) 20 & Stout, (Shealy 1993) 10 et al, 2015) (Tay

4

Minimal sample size > 250 per group et al, 2013) (Guilera 500 per group & Gómez, 2008; (Herrera & Swaminathan, Narayanan & Swaminathan, 1996; Rogers 2003) 1993; Penfield, 500 per group 2007) & Maller, (French N.A. 250 per group (Lei & Li, 2013) 300 per group & Swaminathan, (Narayanan 1994) 500 per group 2003) (Penfield, 500 per group (Allan S Cohen & Kim, 1993)

Power Power formula Yes (Li, 2015) Yes (Li, 2014) Yes (Li, 2014) N.A. Yes (Li, 2014) N.A. -

Purification beneficial Yes et al., (Guilera 2013) Yes & (Navas-Ara Gómez-Benito, 2002) Barely Barely & Maller, (French 2007) N.A. N.A. Yes & Lauten ( Park 1990) schlager,

    ) ˆ

ˆ β

β (

ˆ 2 σ χ

   

2 Sig test Sig test MH- χ & Holland, (Dorans 1993) test Wald & (Swaminathan 1990) Rogers, Likelihood ratio test ratio Likelihood 2012) (Paek, test Score 2012) (Paek, z-test & Stout, (Shealy 1993) Lord’s 1980) (Lord, Popular DIF methods for dichotomous items. dichotomous for DIF methods Popular MH Logistic regression SIBTEST Method methods score Observed Unobserved score methods score Unobserved IRT Table 4.1 Table Reporting DIF analyses

109 -

ES cut-off N.A. NCDIF & CDIF < .006 der Linden, & van (Raju, 1995) Fleer, odds > 2.0 Proportional or < .05 & Jones, 2007) (Yang 0.3 (small DIF) 0.5 (medium DIF) DIF) 0.7 (large (Jin, Ahn, & Pen Myers, field, 2012) -

-

ES param between Difference estimates eter & Thissen, (Steinberg 2006) E.g., CDIF NCDIF, der Linden, & van ((Raju, 1995) Fleer, coefficients Regression et al., 2009) (Woods odds /proportional & Jones, 2007) (Yang MIMIC-ES (Jin, Ahn, & Pen Myers, field, 2012)

Minimal number of items 10 et al, 2015) (Tay 20 (Khalid & Glas, 2014) 10 et al, 2015) (Tay 10 et al, 2015) (Tay 10 et al, 2015) (Tay

4

Minimal sample size 200-300 (1PL) & Wilson, 2011) (Paek 500 (ref) – 100 (focal) (2PL/3PL) (Finch, 2005) 500 (ref) – 200 (focal) (2PL) 2009) (Woods, 500 per group et al, 2015) (Tay 400 per group (Khalid & Glas, 2014) (2PL/3PL) (Khalid & Glas, 2014) >200 per group (1PL) (2PL) >500 per group >1000 per group (3PL) (Oshima & Morris, 2008) 500 (ref) – 100 (focal) (Finch, 2005) 500 (ref) – 200 (focal) 2009) (Woods, Power Power formula N.A. N.A. N.A. N.A. N.A. -

Purification beneficial Yes ( Finch, 2005; González-Betan 2012) zos & Abad, Yes (Khalid & Glas, 2014) Yes (Oshima & Morris, 2008) Yes et al., (Wang 2009) N.A.

Sig test Sig test test ratio Likelihood (IRT-LR-DIF) (Thissen, Steinberg, 1993) & Wainer, test Score (Glas, 1998) for Significance test measures area van 1988; Raju, (Raju, der Linden, & Fleer, 1995) test Wald test ratio Likelihood et al., 2009) (Woods Continued MH= Mantel-Haenszel test. LR = Logistic Regression. IRT = Item Response Theory. MIMIC= Multiple Indicators Multiple Causes Model. ES= Effect Size. Multiple Model. ES= Effect Causes MIMIC= Multiple Indicators Theory. Response = Item IRT test. Regression. LR = Logistic MH= Mantel-Haenszel MIMIC Method Table 4.1 Table Note. because or information, this to relating literature of unaware are we because applicable, Not = N.A. group. per size sample minimal the = Size Sample Minimal group. = Focal Focal group. = Reference the particular method. Ref is not necessary for the information 110 Chapter 4

carried out with observed score methods like MH, logistic regression or SIBTEST, and smallshowed number a relatively of DIF low studies percentage from a oflimited DIF items section per of study the wider (i.e., most literature studies on showed DIF. 15% DIF items or less). We note that both earlier reviews only covered a relatively- vide a clear and generalizable overview of current practices and decisions made in DIFWe testing. reviewed We randomlyarticles from selected all scientific 200 articles fields reporting that use DIFDIF analysis,tests and to coded pro

thesewhether guidelines seven of because the proposed they are guidelines applicable (i.e., to GL1,both GL2,observed GL4, GL5,score GL6, methods GL8, GL9) were followed in this large sample from the DIF literature. We selected

and unobserved score methods (unlike GL3), and can be readily coded (unlike GL7 and GL10, as discussed in the section on intercoder reliability). In addition items.to these The coded information guidelines, gained we fromretrieved these severalvariables relevant not only characteristics gives an overview of DIF of thestudies, decisions including made the when number researchers of items carry studied out DIFand analyses the number and anof overviewflagged DIF of

used as input for future simulation studies or power estimates. the amount of DIF items flagged in the empirical DIF literature, but can also be 4 4.2 Method

Selecting DIF articles - 20 FigureTo create 4.2. a poolThese of seminal DIF articles publications we first performedconcern both cited observed reference and searches unobserved of in fluential methodological DIF publications in ISI Web of Science , as described in

DIF methods, and were used to define the population of subsequent DIF articles listed in ISI’s database. The pooled list of 2,137 citations yielded 1,212 unique articles. Subsequently, we used the following inclusion criteria: (1) the articles needed to contain athe corresponding phrase “differential DIF analysis item functioning” on that empirical or “item data bias”, set. During(2) the aarticles screening had tobased contain on title empirical we discarded data from articles a human that population, failed to meet and (3)our the selection article

incriteria, our review. like reviews, Articles discussion in the sample articles, that didmeta-analyses not meet our or selectionsimulation criteria studies. were Of replacedthe 825 remaining by randomly articles, drawn we articles. selected Several a random selected sample articles of 200 articlesfeatured to multiple include

DIF comparisons from the same data (e.g., for different types of groups, or for different scales). In those articles, we selected only one DIF comparison (i.e., one

20 th th 2014.

Literature search was conducted between March 30 and April 15 Reporting DIF analyses 111 Wainer (1993): (2003): Penfield 17 papers 647 papers 647 Holland & & Holland ) title papers: by papers: & Ford & Lehman 925) 387) - - ( (1985): (1973): ( 48 papers 54papers Angoff (double papers)(double Muthen Removing Removing (screening (screening Kullick & Finch (2005): (1986): 57papers 92papers Dorans

4 & Stout papers 825 2137 2137 1212 (1993): (1986): potentially 189 189 potentially 134 papers 134 potentially Thissen et al. Stealy relevant papers relevant relevant papers relevant relevant papers relevant (1993) (1995): Kim Kim al.et 21 papers 189 papers 189 Thissen et al. & papers (1990) Rogers (2001): Penfield 20 papers 277 277 Swaminathan Thayer Flowchart of literature search. of literature Flowchart papers Raju (1988) (1990): 67papers 325 325 Holland & & Holland Figure 4.2 Figure 112 Chapter 4

type of group, for one type of scale) based on several selection rules as detailed in CodingAppendix D, and coded the papers for that chosen comparison.

method test used to flag items reported statis- ticsWe wereat the interested item level in variousgrouping characteristics variable of DIF analyses: (1) the type of DIF reporting, (2) of thethe typedecision of rule for deciding whether, (3) whether item authors the sample size , (4)number the of items against which DIFnumber was tested, of items (5) sAlpha are influenced by DIF, (6)- ing of the DIF effect, (7) sizethe studied for DIF, (8) theeffect size of ability flagged for DIF per method,hypothesis (9) the significance level ( ) used, (10) reportpower of the test was discussed, (11) reportingactual and power magnitude of the purification tech-, (12) reporting of the reliability regarding of the scale DIF, (13) whether the issue of , (14) the of the test, (15) ICC/TCC niques, (16) reporting of test information, (17) whether authors displayed DIF- erthrough Cross-validation Item Characteristic ReplicationCurves or Test Characteristic Curves ( ), and (18) whether authors visually displayed . Finally, we coded wheth (19) and (20) were mentioned in the articles. Table 4.2 Categories for variables.

4 Variable Categories Description Method used Unobs. con. invar. model

Cat./ord. indicators (UCIM): e.g., IRT-LR-DIF, MIMIC model with categorical indicators, Continuous indicators GRM, PCM e.g., CFA techniques, MIMIC model with continuous indicators Mantel-Haenszel test Obs. con. invar. model (OCIM): Log./ord. regression SIBTEST Other Test Authors used NHST to decide whether items displayed DIF e.g., ANOVA of Rasch residuals (software package RUMM) Effect size Authors used effect size estimates to decide whether items Significance test displayed DIF Fit indices Statistics reported Complete Authors used fit indices to decide whether items displayed DIF for all items Authors reported significance tests and effect size estimates None estimates Authors did not report any significance tests or effect size Partial Dec. rule Yes Authors did describe a decision rule Authors did report some test statistics or effect size estimates, No Authors did not describe a decision rule Sample size Continuous Number of participants in DIF analysis Number of items Continuous Number of items studied in DIF analysis Hypothesis Yes Authors did specify a hypothesis a priori No Authors did not specify a hypothesis a priori Reporting DIF analyses

113

Table 4.2 Continued

Variable Categories Description Mult. test. cor. Yes No No multiple testing correction used e.g., Bonferroni correction or Benjamini-Hochberg correction Effect size ability Unobserved Authors reported an ability effect size based on latent scores Observed Authors reported an ability effect size based on observed scores No Authors did not report an ability effect size Effect size DIF Unobserved Authors reported a DIF effect size retrieved from an UCIV Observed Authors reported a DIF effect size retrieved from an OCIV No Authors did not report a DIF effect size ICC/TCC Yes Authors included an ICC or TCC No Authors did not include an ICC or TCC Yes Authors mention study was underpowered/small sample size

Power (under) No Authors did not mention study was underpowered or that the might have influenced results

Yes Authors mention study was overpowered/large sample size small sample size might have influenced the results Power (over) No Authors did not mention study is overpowered or that the might have influenced results

Yes large sample size might have influenced the results No Purification Authors did use purification techniques Reliability Cronbach’s alpha Authors reported Cronbach’s alpha Authors did not use purification techniques Other 4 No reliability Authors reported another reliability coefficient Test inf. Yes Authors displayed a visual test information function Authors did not report any reliability coefficient No Authors did not display a visual test information function

Note.

UCIM = Unobserved Conditional Invariance Model, OCIM = Unobserved Conditional Invariance inf.Model, = Test ICC Information. = Item Characteric Curve, TCC = Test Characteristic Curve, GRM = Graded Response Model, PCM = Partial Credit Model, Mult. test. cor. = Multiple testing correction, Dec. rule = Decision rule, Test The majority of these variables are categorical. Table 4.2 summarizes the categories used. For the variable method we selected the category that described the statistical method used in the article. When multiple statistical methods were variable test used to assess DIF we coded all of them, with a maximum of 5 methods. With the we addressedstatistics whether reported authors indicated used significance whether the tests, authors effect reported sizes or fit indices as a flagging rule (as coded separately for each method in a given DIF samplearticle). size The variable - pantssignificance who had tests missing and effect values sizes for thefor allselected items DIFin the analysis. DIF analysis. To document We also the coded deci- sion rule for of items the groupswe registered included whether in the selected the researchers DIF analysis, mentioned excluding the criteriapartici coded the number of items that were studied for DIF and the number of items that wereon which flagged they for decide DIF. We whether coded an whether item or a a DIF bundle effect of size items is influenced by DIF. We

was reported in the article, 114 Chapter 4

and whether this effect size was based on an observed or an unobserved condi- tional variance method. If authors reported separate item parameters for both DIF

- groups, we counted this as DIF effect size present. Additionally, we computed the effect size (Cohen’s d) of the difference between groups in ability based on the re ported difference between the two groups on the total scale (i.e., observed ability) effector the size. latent When ability an effect(actual size ability). other thanWe also Cohen’s coded d whether authors themselves thatreported effect (an size observed to Cohen’s or an d unobserved). For the variable effect hypothesis size and retrieved we scored the whethervalue of thatthe was reported, we transformed- Replication and Cross-validation by checking viaresearchers text search had whether an a priori the articles expectation contained regarding either the the DIF word effect “replication” for specific or “repvari- ables. Moreover, we coded variables 21 We considered power by coding whether the authors discussed if the DIF licate(d)”, and whether they contained the term “Cross-validation”.

retrievedtest (1) had any little estimates power of or the too power small fromsample the sizes, articles. and We (2) also had coded enough whether power the or researcherssuch large power used purification that items with trivial DIF effect sizes could be flagged. We also 4 to acquire a criterion with as little bias as possible. We andalso whetherretrieved they the reportedprovided reliability item characteristic coefficient curves for the or relevant test characteristic scale. Moreover, curves. we coded whether the researchers provided the test information function graphically, - The entire sample was coded by the first author on the basis of a coding- sheet, available on the Open Science Framework (https://osf.io/ty496/). Addi andtionally, had the secondlist of 200 and selected third authors DIF articles independently is available code through 14 articles this each.link. ToSeven as sess inter-coder agreement, we drew a random sample of the 200 DIF articles - plearticles of 21 were randomly both codeddrawn by articles. the second In this and way third we couldauthor, calculate and subsequently the inter-rater the agreementsecond and over third all author three coderseach coded simultaneously seven unique and articles, over pairs which of coders. led to a sam

4.3 Results

Inter-coder reliability -

We used Krippendorff’s α as a reliability measure, because it can deal with nomi nal as well as interval variables, multiple coders, and missing data. Additionally, we

21

Obviouslywas not featured it is possible in the thatanalyses authors of inter-coder used cross-validation reliabilities. or replication studies, but phrased it differently. Note that these variables were coded only by the first authors after the main round of coding. Therefore, this variable Reporting DIF analyses

115

calculated pairwise Fleiss’ Kappa coefficients (for nominal variables) or two-way agreement single-measures Intraclass Correlation Coefficient (ICC, for continuous- variables), and pairwise simple agreement. All coefficients were calculated with- R-package IRR (Gamer et al., 2012). The reliability coefficients are reported in Ta ble 4.3. For several coded variables, reliabilities ranged from acceptable (hypoth esis testing, alpha, method, test information, DIF items, items tested, sample size) to excellent (decision rule bundle testing, multiple testing correction). For other- variables the reliabilities were relatively low (statistics reported, test, decision rule item, ability category, influence DIF, underpowered, overpowered, name ef fect size, purification, reliability, ICC). We removed the variables “influence DIF on theta” and “purification” from our results because inter-coder reliability was too low and it was difficult to reach agreement on the coding of these variables. For some variables there was little variance in the coded categories, which is referred- to as the problem (Hallgren, 2012). Skewed marginal distributions- can lead to underestimations of the true reliability by coefficients such as Krip pendorf’s alpha and Fleiss’ kappa (Hallgren, 2012; Lacy, Watson, Riffe, & Love joy, 2015; Lombard, Snyder-Duch, & Bracken, 2002). The reliability coefficients of variables “decision rule item”. “underpowered”, “overpowered”, “hypothesis”, and 4 “test information” might be affected due to this problem, as those variables have- little variance in the coded categories, show relatively low reliability coefficients but high simple agreement. The disagreement for the remaining variables, “statis parttics reported”, of the disagreement and “decision arose rule due item”, to wereambiguous readily reporting solved after in discussion.the DIF articles. Even though intercoder reliability was low for some variables, we like to point out that

Moreover, we note that previousAlpha DIF DIFreviews effect refrained size fromTest informationreporting intercoder- reliability (e.g., Buzick & Stone, 2011; Tay et al., 2015). theseFor variables. three variables22 (i.e., , , and ), we de cided to refine the scoring , and the first author recoded the values of

22

For the variable “Alpha reported” the original scoring rule was too strict (i.e., first author originally only coded- Alpha’s that were explicitly mentioned in relation to the DIF tests). In a second round of coding, if there was a significance level mentioned for any of the statistical tests in the paper, we assumed authors used that signifi cance level for all tests. For the variable “DIF effect size” the first author coded too strictly, excluding mean effect onlysizes, counting categorized visually effect depicted sizes and test effect information sizes that but were not visually visually displayed. depicted standard In the second errors. round As the this relationship variable was of testrecoded, information including and these standard types errors of effect is 1:1sizes. all Forauthors the variable agreed plotted“Test information” standard errors the first could author also countwas strict, as visu by- ally plotted test information. In a second round of coding visually depicted standard errors were also included in the variable Test information. Chapter 4

116

Table 4.3 Intercoder reliability.

Variable Krip- Kappa/ICC Kappa/ICC Kappa/ICC Simple Simple Simple pen-dorf’s Rater 1- Rater 1- Rater 2- agreement agreement agreement alpha Rater 2 Rater 3 Rater 3 Rater Rater Rater (k = 3) (k = 2) (k = 2) (k = 2) 1-Rater 2 1-Rater 3 2-Rater 3 (k = 2) (k = 2) (k = 2) Methods .80 .12 .87 n n n n n n n .65 .50 .73 .63 Test .14 .87 (nominal) ( = 22) ( = 15) ( = 15) ( = 8) ( = 15) ( = 15) ( = 8) n n n n n n n .34 .66 .33 .53 .75 Statistics reported .71 (nominal) ( = 22) ( = 15) ( = 15) ( = 8) ( = 15) ( = 15) ( = 8) n n n n n n n .46 .56 .30 .75 .50 .86 Decision rule item 1.00 1.00 (nominal) ( = 21) ( = 14) ( = 14) ( = 7) ( = 14) ( = 14) ( = 7) n n n n n n n .54 .43 .76 .79 .93 Decision rule bundle 1.00 NA NA NA 1.00 1.00 1.00 (nominal) ( = 21) ( = 14) ( = 14) ( = 7) ( = 14) ( = 14) ( = 7) n n n n Multiple testing 1.00 1.00 1.00 1.00 1.00 1.00 1.00 (nominal) ( = 21) ( = 14) ( = 14) ( = 7) correction n n n n n n n

( = 21) ( = 14) ( = 14) ( = 7) ( = 14) ( = 14) ( = 7) Alpha .78 1.00 1.00 (nominal) n n n n n n n .73 .65 .79 .86 Sample size .80 .12 .71 (nominal) ( = 21) ( = 14) ( = 14) ( = 7) ( = 14) ( = 14) ( = 7) n n n n n n n .79 .98 .69 .43 Items tested .714 (count) ( = 21) ( = 14) ( = 13) ( = 7) ( = 14) ( = 13) ( = 7) n n n n n n n .75 .92 .45 .98 .79 .36 .88 (count) ( = 21) ( = 14) ( = 13) ( = 7) ( = 14) ( = 13) ( = 7) 4 n n n n n n n Items flagged .94 .98 .91 .97 .79 .69 Ability category .48 1.00 .71 .71 1.00 (count) ( = 22) ( = 14) ( = 13) ( = 8) ( = 14) ( = 13) ( = 8) n n n n n n n .59 .54 1.00 .04 -.10 1.00 (nominal) ( = 21) ( = 14) ( = 14) ( = 7) ( = 14) ( = 14) ( = 7) n n n n n n n Influence theta hat .35 .43 .43 Hypothesis 1.00 -.04 -.08 1.00 (nominal) ( = 21) ( = 14) ( = 14) ( = 7) ( = 14) ( = 14) ( = 7) n n n n n n n .73 .93 .86 Underpowered .44 .27 (nominal) ( = 21) ( = 14) ( = 14) ( = 7) ( = 14) ( = 14) ( = 7) n n n n n n n .63 .58 .93 .79 .86 Overpowered -.02 -.08 NA -.27 1.00 .71 (nominal) ( = 21) ( = 14) ( = 14) ( = 7) ( = 14) ( = 14) ( = 7) n n n n n n .86 Name DIF effect size .48 .71 (nominal) ( = 21) ( = 14) ( = 7) ( = 14) ( = 14) ( = 7) n n n n n n n .45 .46 .33 .64 .57 .47 .27 .24 (nominal) ( = 21) ( = 14) ( = 14) ( = 7) ( = 14) ( = 14) ( = 7) n n n n n n n Purification .65 .86 .57 .57 Reliability .44 .71 .71 .71 (nominal) ( = 21) ( = 14) ( = 14) ( = 7) ( = 14) ( = 14) ( = 7) n n n n n n n .51 .49 .45 Test information 1.00 1.00 .71 (nominal) ( = 21) ( = 14) ( = 14) ( = 7) ( = 14) ( = 14) ( = 7) n n n n n n n .73 .58 .30 .86 ICC .47 .42 .71 .71 (nominal) ( = 21) ( = 14) ( = 14) ( = 7) ( = 14) ( = 14) ( = 7) n n n n n n n .53 .31 .79 Note.(nominal) Missing values( of= Kappa21) are( =introduced 14) ( =due 14) to a lack( = 7)of variation( = 14) in categories.( = 14) In case( = of 7) nom- - k n available.inal variables Fleiss’s κ is reported, in case of continuous two-way agreement single-measures intra class correlation coefficient (ICC) is reported. = amount of coders, = amount of articles, NA = not Reporting DIF analyses 117

Results related to guidelines In Table 4.4 we present the results of the systematic review of the variables relat- ed to the guidelines we described above. Although we selected 200 articles in our review, in 11 % of those articles the authors used multiple statistical methods which resulted in a total of 239 DIF analyses scored on all variables. Table 4.4 Practice of guidelines in the field.

Variable N % Mean group differences: effect size 200 28 14.0 28.0 Reported (latent) Not reported Reported (observed) 56 Method used 116 58.0 Unobserved: cat/ord indicators 239 Unobserved: continuous indicators 119 50.0 Observed: Mantel-Haenszel test 20 8.4 6 2.5 Observed: Logistic/ordinal regression Observed: SIBTEST 38 15.9 Other 42 13 5.4 1 0.4 17.6 Sample size Not specified 4 Reported 238 Not reported 2.1 233 97.9 Flagging rule 5 Test

239 Effect size 27 Significance test 130 54.4 11.3 10 4.2 Significance + Effect size 55 23.0 1 0.4 Significance + Fit Indices Other 1 0.4 Significance + Effect size + Fit Indices

Multiple testing correction 200 Not specified 16 6.7 Yes No 147 54 26.5 Alpha 200 73.5 10

α < 0.01 5.0 0.01 ≤ α < 0.05 29 14.5 4 2.0 α = 0.05 126 63.0 α > 0.05 200 Not specified 31 15.5 Yes Report flagging rule in detail No 27.0 146 73.0 54 118 Chapter 4

Table 4.4 Continued

Variable N % Hypothesis reported Yes 22 11.0 No 178 Report results of DIF tests 89.0 Statistics reported 200 Yes 27.0 Partial 110 54 No 18.0 55.0 Effect size DIF 200 36 Unobserved 70 Observed 35.0 Both 1 37 18.5 No 0.5 DIF display: ICC/TCC 200 92 46.0 Yes 42 21.0 No 24 12.0 134 67.0 Reliability 200 ICC/TCC, but no DIF display Cronbach’s alpha Other 85 42.5 4 51 25.5 No reliability (e.g., TIF, person separation index) Note. 64 32.0 - tion. ICC = Item Characteristic curve, TCC = Test Characteristic Curve, TIF = Test Information Func

Notwithstanding the relevance of knowing the size of the group differences

N in test scores for theoretical and statistical reasons (e.g., power), lessd than half of the articles ( = 84; 42%) reported how much the groups differed in ability or- trait, by either reporting an effect size for the groups (e.g., Cohen’s , regression coefficient) or reporting enough information to calculate those effect sizes. Ab insolute terms values of effect of latent sizes andwere observed small to meanmedium. differences However on a minoritythe tests of(in DIF Cohen’s articles d) are shown in Figure 4.3. Overall, the reported mean differences between groups estimation of group differences in the overall DIF literature. actually reported such effects, so it is unclear whether this number is a correct-

One hundred and twenty-five of the 239 DIF analyses (52.3%) involved un- observed score methods in a frequentist framework. Observed score methods like logistic regression, SIBTEST and the MH test were relatively popular, togeth er accountingN for 29.7% of the reported methods. The category “other” entails Nanalyses that did not fit aforementioned categories,N e.g. other non-parametric- techniques ( =3), analysis of variance of Rasch residuals (with software RUMM, = 24), or a Bayesian unobserved score method ( =1). Our finding that unob Reporting DIF analyses

119

served score methods are most popular corresponds with earlier reviews (Tay et al., 2015). Figure 4.4 displays the distribution of the sample sizes used across the 200 DIF articles. As indicated in several simulation studies (e.g., Guilera et al., 2013; Paek & Wilson, 2011; Penfield, 2003; Woods, 2009), DIF analyses require- at least 500 to 1000 participants (i.e., 250-500 participants per group). Overall the reported sample sizes seemed sufficiently powerful for DIF analyses; the me dian sample size equaled 1,042. Thus, in line with results by Tay et al. (2015),- most DIF studies use adequate sample sizes. The majority of analyses (N = 130; 54.4%) relied solely on significance test- ing when deciding whether an item displayed DIF, while in more than a tenth of the DIF analyses authors used only effect size estimates for flagging DIF. Simi larly, 112 of the 200 articles (56.0%) solely depended on significance testing or on a combination of significance testing and the use of fit indices. In 10 articles (5.0%), it remained unclear how the authors tested for DIF. In about a quarter of- the 239 reported analyses, authors combined significance testing with the use of- effect sizes or fit indices to flag an item for DIF. A little over Na quarter of the ar ticles involved the use of multiple testing corrections (e.g., Bonferroni, Benjami ni-Hochberg) when testing for DIF. The majority of articles ( = 126; 63%) used- 4 ing.a significance One could level argue of this .05. practiceIn 35 of thecan 39 be instancesviewed as in an which alternative the significance way to control level was smaller than .05 authors failed to use an explicit correction for multiple test N for inflated familywise Type I errors when the overall alpha is set at .05. In most articles ( = 111; 55.5%) the authors neither lowered the nominal significance level nor corrected for multipleN testing. In 144 articles (73%) a decision rule was reported in such detail that we could replicate this part of their method, whereas in the remaining articles ( = 56) we were unable to glean what decision rule had been used. So in a sizeable portion of DIF articles, the decision rules remainedN = unclear. Additionally, only 22 articlesN (11%) reported a hypothesis concerning DIF. In the majority of articles, authors did not mention replication studies ( 152, 76%) or cross validation ( = 184; 92%). Upon closer inspection of the 48 thatarticles did (24%)mention that cross did validationmention replication, showed that we only found six thatarticles only involved six DIF articlesa cross featured a replication (3% of total sample). Closer inspection of the 16 articles - validation (3% of the total sample). Thus, the common use of significance testing in DIF assessments is seldom associated with checks for generalizability, speci- fication of specific hypotheses related to DIF, or corrections for multiple testing. In 27% of the articles, authors reported both effect sizes and some indica ptor of statistical significance for all items includedp in the DIF analysis. Preferably- authors would report a test statistic with degrees of freedom and subsequent -value. However, many authors solely reported -values, asterisks or a clear de scription in the text of the items that did or did not display significant DIF, which 120 Chapter 4

Reported latent mean group differences 6 5 4 3 Frequency 2 1 0

0.0 0.5 1.0 1.5 2.0

Latent mean difference (Cohen's d; absolute value) 4

Reported observed mean group differences 12 10 8 6 Frequency 4 2 0

0.0 0.5 1.0 1.5 2.0 2.5

Observed mean difference (Cohen's d; absolute value) Figure 4.3 Histogram of the latent and observed mean group differences. Reporting DIF analyses 121

Sample size of studies (studies with N>15.000 removed) Sample size of studies (studies with N>15.000 removed) 60

60 50

50 40

40 30

Frequency 30 20 Frequency

20 10

10 0

0 0 2000 4000 6000 8000 10000 12000 14000

0 2000 4000 6000 8000 10000 12000 14000 Sample size 4

Figure 4.4 Histogram of samples sizes reported forSample 200 DIFsize tests.

- N we also accepted as indicators of statistical significance. Almost a fifth of the ar ticles ( = 36) reported neither an effect size nor an indicator of significance for halfany ofof thethe items. articles Although reported many no DIFresearchers effect sizes. in the Of field the 108of DIF articles advocated that thedid usere- of DIF effect sizes (Hambleton, 2006; Hidalgo et al., 2014; Meade, 2010), almost - port effect sizes, 34.3% reported effect sizes based on observed score methods, 1264.8% articles reported displayed effect DIF sizes effect based sizes on visuallyobserved by score means methods of an Item and Characteristicone article re ported both. Of the 92 articles that failed to report any numerical DIF effect sizes, neither exact effect sizes nor visually displayed effect sizes were reported. So Curve or on a test level with a Test Characteristic Curve. In 80 articles (40%) generally, reporting of effect sizes is not uncommon, but there is still a large share of DIF articles that did not report effect sizes and results from significance tests. A reliability coefficient of the scale was reported in more than two thirds of- erthe reliability articles, with estimate the majority like the reportingperson separation at least the index traditional or a test reliability information estimate func- Cronbach’s alpha. In the remaining 25.5% of the cases the authors reported anoth 122 Chapter 4

Reported reliability coefficients

Reported reliability coefficients 25 25 20 20 15 15 Frequency 10 Frequency 10 5 5 0 0 0.0 0.2 0.4 0.6 0.8 1.0

0.0 0.2 0.4 0.6 0.8 1.0 Reliability coefficient 4 Figure 4.5 Histogram of the reported reliabilityReliability coefficients coefficient (Cronbach’s alpha).

tion. Some authors did not clearly specify which kind of reliability they reported.

The distribution of reported values of Cronbach’s alpha are plotted in Figure 4.5. - Overall, reported reliabilities were high: 71.8% of the reported Cronbach’s α were equal to or higher than .80. This finding corresponds to the results of a re view on reliability and shortened test scales (Kruyen, Emons, & Sijtsma, 2013), in Otherwhich 71.5%findings of reported unaltered scales showed reliability estimates above .80.

Figure 4.6 displays the distribution of the number of items that were checked for DIF in the 200 DIF articles. The average number of items tested for DIF was 19.82, wewhile could the calculatemedian number the percentage of items of was DIF 14.50. items Infor 134 214 of analyses. the articles This (67.0%) percentage the authors checked 20 or fewer items. Of the 239 DIF analyses that were reported, of items that were tested for DIF. We calculated the percentages of DIF items per analysisconsists andof the display number the of distribution items that werein Figure flagged 4.7. for DIF divided by the number

The average percentage of tested items showing DIF was 23.15, and the median was 15.59. In 23.0% of the DIF analyses no items were flagged for DIF. Reporting DIF analyses Number of items tested Number of items tested 123

50 Number of items tested 50 50 40 40 40 30 30 Frequency 30 20 Frequency 20 Frequency 20 10 10 0 10 0

0 20 40 60 80 100 120 140 0 0 20 40 60 80 100 120 140

Number of items tested 0 20 40 60 80 100 120 140 Figure 4.6 Histogram of the amount ofPercentage testedNumber items. of of DIF items items tested found 4 Percentage of DIF items found Number of items tested Percentage of DIF items found 100 100 80 100 80 80 60 60 Frequency 60 40 Frequency 40 Frequency 40 20 20 0 20 0

0 20 40 60 80 100 0 0 20 40 60 80 100 Percentage DIF items 0 20 40 60 80 100 Figure 4.7 Histogram of percentage of DIF items.Percentage DIF items

Percentage DIF items 124 Chapter 4

Table 4.5 Convergence of multiple methods: the number of reported DIF items.

Article Number of Method 1 Method 2 Method 3 Method 4 Method 5 items tested 0 0 11 0 1 2 Maller (2000) 30 22 1 1 1 1 2 Chuah et al (2006) 41 1 2 Earlywine (2006) 0 1 2 2 Crane et al (2004) 11 0 4 Teresi et al (2000) 15 3 0 Ellis et al (2000) 20 2 4 Cooke et al (2001) 13 5 N.A. 4 4 7 Behuniak et al (1996) 0 20 Escorial et al (2007) 3 3 2 10 Kalaycioglu et al (2011) 30 6 10 Martiniello (2008) 39 4 17 Gierl et al (1999) 50 6 6 10 12 Cook et al (2007) 23 21 Taylor et al (2012) 30 17 20 21 Waller et al (2005) 15 19 Acar et al (2011) 25

4 A few analyses showed23 very high percentages of DIF items, with 2 analyses even- reaching 100%. Thus, most scales scrutinized for DIF indeed contained items displaying DIF, thereby highlighting the importance of testing for DIF. In 22 ar ticles (11%) authors reported more than one method to analyze DIF. Of those 22 articles, 16 actually reported results for these multiple methods, as reported in Table 4.5. Of those articles only one found an identical amount of DIF items across methods. For some of the other articles, the number of items flagged for- DIF deviated slightly depending on the method (e.g., one DIF item extra in one method Earlywine, 2006). For other articles, different DIF methods failed to con verge leading to different assessments of DIF for a large number of items (e.g., 13 items for Cook et al, 2007). Of the 200 articles in this review, 24.5% discussed the concept of power sampleor the accuracy size was of too the small statistical for reliable method results. with theThe selected median sample size.size ofIn thesetotal, 15.5% authors mentioned that their study might be underpowered, or that the

studies was 583, which indeed might be too small to reliably detect DIF with most methods. On the other hand, in 9% of the articles the authors indicated that

or carried out on only a small number of items for which the criterion entailed a larger number of items. If we 23 These DIF analyses were carried out on a bundle level, such that only a small number of bundles were tested,

remove the analyses that tested less items/bundles than were included in the criterion, the average (21.9) and median (14.29) percentage of DIF items for these analyses slightly change, and the percentage of analyses in which no items were flagged for DIF remains the same (i.e., 23.0%). Reporting DIF analyses

125

due to the large sample size. These articles showed a median sample size of their studies were too powerful, leading to items with trivial DIF being flagged although in four articles authors claimed that the power of the DIF analysis was 5,646. In most articles the actual power of the DIF analysis was not computed,

.80. These results highlight that the majority of DIF studies, despite using NHST, fail to adequately address the key issue of power. 4.4 Discussion of DIF analyses in the literature. In our opinion the reporting practices of DIF analysesWe reviewed are suboptimal.the statistical The methods, majority common of DIF practices,articles did and not reporting report DIF practices effect

- sizes and test statistics for all tested items. Moreover, a substantial number of articles did not report any statistics, thereby preventing readers from fully com prehending the results, learning from the outcomes, and checking for potential perrors. Especially the common lack of reporting of DIF effect sizes is problematic, because it is impossible to judge the influence of a DIF item solely based on a 4 -value. Moreover, in several articles it was unclear how the authors tested for- DIF, and which significance level they used. Numerous articles failed to report the reliability of the scale tested for DIF. Furthermore, a minority of articles re ported the use of a multiple testing corrections. For over half of the DIF articles,- ityit was of DIF difficult analyses to judge we believe whether it is the necessary authors to used report any the corrections statistical for methods multiple in moretesting, detail or whether than is theycurrently just failed common to report such corrections. To judge the qual such that readers can reproduce and replicate the statistical analyses. Even if the DIF analysis is not the main subject of the: the article flagging and rulethe number should ofbe pages formulated in the results of statistical tests in an Appendix or supplemental material on the jour- main article is limited, in this day and age there is little reason not to include the nal’s website or on data repositories such as Dataverse, Figshare, Open Science Framework, or Dryad. - Authors studying DIF rely heavily on significance testing; in more than half of the DIF analyses the flagging rule was solely based on significance testing. Sig nificance testing always depends on sample size, meaning that trivial DIF will- be significant when sample sizes are large enough, while substantial DIF will be missed if sample sizes are too small (Hambleton, 2006). Effect sizes are import ant in our understanding and assessments of DIF (Hambleton, 2006; Hidalgo et al., 2014; Meade, 2010; Stark, Chernyshenko, & Drasgow, 2004; Tay et al., 2015; Teresi & Fleishman, 2007) and so future DIF studies should always attempt to report them. Ideally, researchers should investigate the influence of DIF items Chapter 4

126

testingon a scale could level come as well,from especiallyDIF analyses if the carried goal ofout the in aarticle Bayesian is to framework. compare group Sev- means. A different solution to the problem that sample sizes pose in significance - eral analyses have been proposed in which Bayesian statistics play a role (Soares, Gonçalves, & Gamerman, 2009; Zwick, Thayer, & Lewis, 1999; Muthen & Aspar ouhov 2013); to date they are not often used, however. Another advantage of the testing.Bayesian Because framework the goal is that of themany null DIF hypothesis analyses (i.e.,is actually no DIF to for arrive certain at scalesitems) that can be accepted, which is not possible in the frequentist framework of significance - are free of DIF, null hypothesis significance testing alone often does not do the wouldjob. Relatedly, be valuable because to focus most moreDIF analyses on cross-validation involve multiple and testingreplication issues, in itassess is im- portant to always consider effect sizes alongside other statistical results, and it A third problem we observed is the diversity of results when multiple DIF ments of DIF (Hambleton, 2006). - es are substantial. Only one of the 22 articles in which multiple methods were usedmethods showed were convergence used (see also in Borsboom, terms of the 2006). number For a offew items studies showing these differencDIF. This - 4 outcome is not completely surprising; other studies have shown that DIF analy ses on the same data set by different authors led to different results (Borsboom,- 2006; Millsap, 2006), and DIF analyses on the same data set with three different andIRT softwarefurther stresses packages the led need to differentfor reproducible results asand well replicable (Ong, Kim, assessments Cohen, & ofCram DIF. er, 2015). This variation in results complicates the interpretation of DIF analyses

giveBecause the mostthe number valid and of reliablestatistical results methods for the to detecttype of DIF data is they growing, want itto becomes analyze. Itmore may and be timemore for difficult DIF statisticians for applied andresearchers methodologists to select to the focus method on comparing that will existing DIF methods in a systematic way under different circumstances. In the

willcurrent be most statistical reliable DIF in literature, particular authors circumstances often compare over all newexisting methods DIF methods. to one or two existing methods. This makes it difficult to judge which statistical method-

Fortunately, we can make several positive remarks about the current prac totice carry in DIF out testing DIF analyses as well. Ininstead our sample of observed (based scoreon a population methods. Theof studies majority citing of seminal DIF publications) many researchers selected unobserved score methods statistical power that have been discussed at length in the debate on replication reported sample sizes were sufficient to large, indicating that the problems of - (Hambleton, 2006) at first sight do not apply to assessments of DIF. Power of DIF tests is of course also dependent on other characteristics, like test length, differ ence in ability distributions between groups, type of multiple testing correction Reporting DIF analyses 127

used, and DIF effect size. Even though sample sizes are large, this is no guarantee the DIF test has sufficient power. When reliability estimates (based on CTT) of the scales were reported, they were generally high. We also note some limitations of our review. First, some DIF methods were a little more difficult to categorize because they share characteristics of both observed and unobserved score methods, like SIBTEST (13 out of 239 DIF methods) or ANOVA of estimated residuals of a Rasch model (24 out of 239 DIF somewhatmethods). Weif we chose decided to classify to classify these these methods methods as observed as unobserved score methods, score meth but- the classification is not entirely clear-cut. Conclusions would have been altered withinods. Second, the measurement because we invariancefocused on framework DIF in both that our is literature popular with search continuous and our selection criteria, we probably included relatively few studies that were analyzed indicators (see Schmitt and Kuljanin, 2008 and Vandenberg and Lance, 2000 for a review). Ignoring this field means that we cannot generalize outside the context of differential item functioning. For an overview, some reviews on measurement usinginvariance a cited testing reference have searchbeen carried to seminal out inpublications the past (Schmitt and hence & Kuljanin, limited in 2008; that Vandenberg & Lance, 2000). Third, our population of DIF articles was defined by 4 in ISI Web of Science or were indexed there but failed to refer to these particular regard. Hence, we cannot generalize to other DIF studies that were not indexed

(high quality) DIF publications. Although future reviews could study practices in DIF assessments in other populations, we do consider our large, broad, and highrandom as we sample would to have reflect liked. quite Therefore well how we DIF should analyses especially are typically be careful conducted to inter in- pretthe wider variables literature. “statistics Fourth, reported” for some and items “name the effect inter-coder size estimate”. reliability Although was not the as issue of low intercoder-reliability of several variables can be attributed to vari-

- ous causes (e.g., simple disagreement, inconsistent coding), some of these coding problems arose from insufficient or ambiguous reporting in the articles them selves, thereby reflecting the problem of ambiguity in reporting of DIF analyses. We discarded some items because they proved to be difficult code (e.g., whether authors reported if DIF had an impact on the (latent) ability scores of the groups, whether authors used purification). Finally, we stressed the importance of DIF flagging rules, and as such a dichotomous cut-off for DIF items (i.e., DIF is present- versus DIF is absent), because we believe many authors use DIF analyses purely to distinguish DIF items from non-DIF items, often to remove biased items. If au thors view DIF as a continuous phenomenon (De Boeck, 2008), DIF flagging rules are less important. However, several of the variables we addressed, like reporting of effect sizes, is important even if DIF is considered a continuous event. 128 Chapter 4

means of this systematic review. The most popular statistical methods to test To conclude, we return to the three questions we wanted to answer by-

for DIF, at least in our population of articles, are unobserved score methods, al though the type of unobserved score method varies quite a bit (with IRT-LR DIF- being the dominant approach). Second, we see that in several regards the typical practice in DIF studies is quite good in that the majority of studies are sufficient ly powerful and use scales with sufficient to good reliability. Third, at the same time, our review highlighted several very common suboptimal practices in DIF assessments, such as an overly strong focus on significance testing without much- attention for issues of replication, cross-validation, or corrections for multiple- testing. Also, many DIF studies failed to report effect sizes and information need ofed thefor methodologicalreproducibility andliterature replicability. could certainly Thus, the add reporting in that practiceseffort to improve of DIF anal the yses are in need of improvement, and the guidelines we proposed on the basis

quality of assessments of DIF in future research.

4 Reporting DIF analyses

129

4

Chapter 5

The psychometrics of stereotype threat

threat. This chapter will be submitted as Flore, P. C., and Wicherts, J. M. The psychometrics of stereotype Chapter 5

132

Abstract

Whereas the literature on stereotype threat is rich in theoretical insights and -

causedempirical by results an environment concerning of mean stereotype effects, threat.it has little With attention item-level for analyses the psycho we metric qualities of the tests used to assess the typical performance decrements threat effects because such items are more threatening and demand more cog- nitivecan study resources the hypothesis that might that be partlymore difficulttaken up items by having will show to deal larger with stereotype increased

worries about performance. In our first study, we computed item-level statistics retrievedbased on fromclassical ten testpreviously theory published(CTT), to studystereotype the amount threat articlesof missing about responses, the per- formanceitem-level ofmeans, females item-rest and girls correlations, on math or andspatial reliability tests. We coefficients found prevalent on datasets miss-

Comparing item means across conditions differing in stereotype threat did not ing responses, especially in data sets showing large stereotype threat effects.

withshow respect systematic to stereotype patterns inthreat which conditions effects were using moderated a large stereotype by item difficulty. threat data In the second study, we used a formal test of Differential Item Functioning (DIF) systematic DIF that would indicate that stereotype threat on math performance set from a registered study at Dutch high schools. We did not find evidence for DIF analyses were found to be fairly low. 5 is moderated by item difficulty. However, in a simulation, power rates for these The psychometrics of stereotype threat

133

5.1 Introduction

onAfter math more or spatial than two decades of stereotype threat research, many hypotheses regarding the negative influence of gender stereotype threat on the performance tests have been tested in studies involving college students (S. J. Spencer et al., 1999), and school aged children (Ambady et al., 2001). Various- studies have identified important moderators and mediators of this debilitating effect among negatively stereotyped female test-takers, like domain identifica- tion, math anxiety, and test difficulty (see S. J. Spencer et al., 2016 for a review). The effect of negative stereotypes concerning women’s ability in various quan titative domains has been studied using various types of ability tests, including math tests, spatial ability tests, chemistry tests, engineering exams, and physics tests (see Doyle and Voyer, 2016 for a recent meta-analysis). In contrast to the many different variables that have been included in stereotype threat research, and the relatively well developed theory (e.g., Inzlicht & Schmader, 2012) about what strengthens or weakens stereotype threat effects, the statistical tools used- pareto study groups stereotype on these threat performance effects are variables rather limited. to see Typically,whether researchersthe mean effects used standard analysis of (co) variance (AN(C)OVA) or regression analyses to com or specific moderations of the effect align with stereotype threat theory (e.g., C. Good et al., 2008; Keller, 2007a; Schmader, 2002). For instance, Delgado and Prieto (2008) studied the effects of stereotype threat on the performance of a 5 math test, and used an ANCOVA to show that the stereotype threat effect was particularly clear for women reporting higher levels of math anxiety. Similarly, Spencer et al, (1999) and O’Brien and Crandall (2003) used ANOVA to show that- the effects of stereotype threat were only apparent on the more difficult math items, but not on easier math items. Despite the availability of many psychomet theric tools psychometric to study stereotypelevel. threat data, and the added information obtained by the useWhereas of such studying tools, very main little effects is known and interactions about how providestereotype us with threat some works inter at- - esting information, extending analyses to model effects at the item-level is valu able for two reasons. First, item-level analyses provide a more detailed picture of- the effects of stereotype threat: they show which items are difficult or easy (e.g., as indexed by item means or difficulty parameters) and how well items discrimi nate between different levels of the underlying math ability (e.g., by reporting the- item-rest correlation or estimated discrimination parameters), they provide an theyapproximation might offer of insight the reliability into whether of the testthe time(e.g., limitby reporting used in atest lower administration bound esti mate of reliability like Cronbach’s alpha or a model based reliability index), and might have been too strict (e.g., by noticing many skipped items at the end of the Chapter 5

134

test). All this information is lost when one simply focusesmath and on spatialdifferences tests in used group in means based on the number of correctly answered items (sum score). Therefore, we currently know little about the quality of the stereotype threat literature. Second, with item-level statistics we can test which items show stereotype threat effects, and whether stereotype threat somehow alters the measurement properties of math or spatial skills tests. For instance, psychometric techniques allow us to study whether women under stereotype visiblethreat showby considering larger performance only sum decrementsscores and arefor difficultexpected items to lead and to smaller Differential or no performance decrements on easy items. Such item-specific effects are not readily

Item Functioning (DIF; Holland & Wainer, 1993) in stereotype threat data sets (Wicherts et al., 2005). - In the context of stereotype threat, it is quite possible that particular items are more difficult or discriminate less for female students who experience ste reotype threat, compared to students who do not experience stereotype threat- (Millsap, 2011; Wicherts et al., 2005). Consequently, and because stereotype ofthreat women effects who are do generally or do not seen experience as a measurement stereotype problem threat. So(e.g., stereotype Walton & threatSpen cer, 2009), tests are not expected to be measurement invariant between groups - ternsmight followwell cause theoretical DIF. With predictions Item Response derived Theory from (IRT)stereotype models threat we can theory. test forAs measurement invariance by means of DIF analyses, and see whether DIF pat 5 alsothis theoryexplain suggests DIF with that respect women to gender(but not that men) is commonly might underperform found in high-stake on high- stakes test due to threat effects (Aronson & Dee, 2012), stereotype threat might on word problems testing. For instance, stereotype threat might explain why women underperform (Doolittle & Cleary, 1987; Kalaycioglu & Berberoglu, 2011; Ryan & Chiu, 2001), spatial ability items (Gierl et al., 2003) or geometry items- (Doolittle & Cleary, 2017; Gamer & Engelhard, 1999; Harris & Carlton, 1993; Li,- tentiallyCohen, & explain Ibarra, at2004; least Ryan part &of Chiu, the gender 2001; gapTaylor in mathematics& Lee, 2012) .and As such,spatial stereo skills type threat can be seen as a nuisance variable that creates DIF, which could po to compare DIF patterns with gender DIF in high stakes mathematics tests. Spe- tests. If DIF is (consistently) present in stereotype threat data, we would be able

high-stakescifically, we couldsettings check wherein whether experiments the patterns are ofproblematic DIF caused for by logistical stereotype and threat eth- as studied in low stakes settings (often psychological labs) are also evident in- yses should enable us to study generalizability of the stereotype threat effect in ical reasons. Thus, if stereotype threat causes certain patterns of DIF, DIF anal

real-life tests. In addition, knowing which types of items are most susceptible to stereotype threat effects, could aid the development of tests that are less affected The psychometrics of stereotype threat

135 by the performance impediments caused by stereotype threat. - reotype threat experiments and used several ways to study item-level hypoth- In this chapter, we studied the psychometric quality of ability tests used in ste- eses concerning stereotype threat effects. In the first study we re-analyzed ex perimental stereotype threat data with techniques from Classical Test Theory- perimental(CTT; F. M. Lordstereotype & Novick, threat 1968). data setIn the gathered second at study, Dutch we high illustrated schools. Finallythe use we of useunidimensional simulated data IRT tomodels study andpower DIF and analysis Type (Holland I error rates & Wainer, of DIF 1993) analyses on angiven ex the characteristics of our data set.

5.1.1 Stereotype threat Stereotype threat occurs when a member of a negatively stereotyped group is - afraid to be judged according to that stereotype, which causes performance dis ruptions in the stereotyped domain (Steele, 1997). In the case of mathematics, ontheory tests states than thattheir girls ability (or women)allows. Such who afear situational to confirm pressure the stereotype can be athat validity girls are less capable in mathematics than boys (or men) will actually perform worse- stereotype threat threat to mathematical testing, because girls’ (or women’s) ability will be struc turally underestimated (Delgado & Prieto, 2008). Studies of have been mostly carried out in the lab in samples of college students (e.g., 5 Brown & Josephs, 1999; Johns, Schmader, & Martens, 2005), but also in middle school aged children and high school aged children (e.g., Delgado & Prieto, 2008; Ganley et al., 2013; Keller & Dauenheimer, 2003; Muzzatti & Agnoli, 2007; Pichoste- reotype& Stephens, threat 2012), conditions and occasionally perform worse in highthan stakesgirls in testing the control contexts conditions (Stricker in which& Ward, stereotype 2004). On threat average, these experiments show that girls assigned to

is usually nullified or removed (e.g., by stating that the oftest stereotype is not diagnostic threat on for test math performance ability or byoperates stressing through that the a complex test is gender interaction fair) (Nguyen & Ryan, 2008; Picho et al., 2013; Walton & Cohen, 2003). The influence of motivational, cognitive, affective, and physiological processes (Schmader et al., 2008),The which effects include of stereotype variables threatlike a physiological stress response to the threat and deteriorated working memory efficiency. might actually become more motivated should in a stereotype be most prominent threat condition when difficultand ob- tests are used (Keller, 2007a; S. J. Spencer et al., 1999). If tests are easy, girls stereotypetain higher threatscores than girls in a control condition (O’Brien & Crandall, 2003).- In contrast, difficult tests will trigger worries or arousal when girls experience , which will strain working memory. Working memory is a re Chapter 5

136

- formancequirement decrements for successful in situations performance where on students(difficult) experience math problems. stereotype As a threatresult,. compromised working memory triggered by difficult tests will lead to larger per - Another proposed mechanism is the mere effort account. This theory finds its roots in social facilitation theory (Zajonc, 1965), which suggests that the presste- reotypeence of anthreat audience facilitates well-learned (or mastered) task, while it impairs- novel (or not yet mastered) tasks. Although girls (or women) experiencing means that girls will will be most motivated likely toperform disconfirm well theon well-learnednegative stereotype, or easy thetests situ or ational threat is expected to trigger a pre-potent (i.e., dominant) response. This-

items wherein such dominant responses are often correct, whereas their perfor mance will decrease for the not yet mastered, difficult tests or items for which the dominant response is incorrect (Jamieson & Harkins, 2009; S. J. Spencer et al., 2016). In practice, items in high stakes math tests, exams or selection tests will not all be equally difficult. Tests need to be suited for the population of students that are tested. A test will be most informative when on average students score 50%- tionsof the should items correctly. be larger Itemthan means0 and theshould test ideallyshould not be reliable.be too high In testsor too with low these (e.g., larger than 0.30 and smaller than .70; Mellenbergh, 2011),stereotype item-rest threat correla will

properties, it would make sense that girls who experience 5 mostly underperform on difficult items (O’Brien & Crandall, 2003), whilestereo at the- typesame threat time performingcondition might quite be well very or motivated perhaps evento answer better the on easy easier items items. correctly After all, easier items will elicit less anxiety than difficult items, and girls in the

because they feel the need to disconfirm the negative stereotype. Put differently, for easy items the prepotent response results in correctly solved items, whereas this is not the case for difficult items. Preferably, these types of questions are- addressed using DIF analyses in an IRT framework. However, DIF analyses are demanding, and in small data sets it is often not feasible to test for DIF (see Chap framework.ter 4 and references In terms oftherein). CTT the To item inspect means some of the basic stereotype item-level threat statistics group andcan beto comparedsee whether to datathe item can bemeans modeled, of the one control could group. use more basic analyses in the CTT To illustrate the possible effects of stereotype threat on an item-level we

- simulated data under four scenarios, with a large number of observations and- items. We simulated data under the Rasch model, with a wide range in item diffi threatculty parameters in what we across call a items.PP plot. We The display four scenariosthe proportions differ incorrect the amount (often denotof DIF anded P) impact. for the Impact two groups refers that to group reflect differences two conditions in the with mean or of without underlying stereotype latent The psychometrics of stereotype threat

137

1.00 1.00

0.75 0.75

0.50 0.50 P stereotype threat 0.25 P stereotype threat 0.25 No mean ST effect − DIF No mean ST effect No mean ST effect − invariant − invariant No mean ST effect 0.00 0.00

0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00 P control P control

1.00 1.00

0.75 0.75

0.50 0.50 P stereotype threat 0.25 P stereotype threat 0.25 Mean ST effect − DIF Mean ST effect Mean ST effect − invariant − invariant Mean ST effect

0.00 0.00

0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00 P control P control 5 Figure 5.1 Four scenario’s displaying DIF and impact.

- - math ability θ (i.e., participants of one group are just not as good in math as par ticipants in the other group), whereas DIF occurs when, controlled for θ, particu welar itemssimulated are more a data difficult set in whichfor one there of the was groups. no average In IRT terminology,stereotype threat DIF means effect that the item parameters differ across groups. In Scenario 1 (upper left panel) on latent ability, and the test was invariant (i.e., no impact and no DIF). As we plot the item means (simply the proportion correct or P) for both groups in this scenario in Figure 5.1, we can see that all items fall on the diagonal, indicating stereotypethat all items threat show effect similar of Cohen’s means (proportiond correct or P) across= N both groups. In the second scenario (bottom left panel), we assumed that thereST was a sizeable and C = N = .80 on the latent trait (θ (-0.80,1) and- trol groupθ (0,1)),than for but the the stereotype test was still threat invariant (i.e., impact, but no DIF). In this scenario we see that item means (proportions correct) were higher for the con group, indicating that the items were Chapter 5

138

easier for the control group than for the stereotype threat group. This difference

is largest for items with medium difficulty. This is a well-known result from CTT (F. M. Lord, 1980) and shows that even if the item parameters are identical across- ancegroups of inCTT the item measurement statistics when sense groups (the IRTdiffer difficulty in ability parameters is well known are andidentical gave riseacross to groups),the development impact leads of IRT to divergingmodels that proportions do allow correct.comparisons The lack of measureof invari-

averagement parameters stereotype across threat groups (Embretson & Reise, 2000). Instereotype the third threatscenario (upper right panel) we study the PP-plot without an stereotype threateffect on the latent trait, but with a gradual DIF effect.ste- reotypeHere, threat affects the item parameters depending on how difficult- they are, with effects being positive for easier items, and from a large effect effects in favor being of negativethe stereotype for more threat difficult group items to a large (with effect lower in P-val favor ues in both groups). In this third scenario there is no impact but DIF that ranges a continuous moderation of stereotype threat - of the control group as we consider more difficult items. This scenario represents effects by item difficulty param eters in line with results that were studied by O’Brien and Crandall (2003) by- creating two types of tests (a difficult test and an easy test).and This the averagescenario stereo of DIF- typecreates threat an s-shape across the diagonal. Finally we created a fourth scenario (bot tom right panel) in which we both included gradual DIF stereotype threat 5 effect on the effect latent on level the and latent therefore ability is(i.e., somewhat impact and less DIF).clear. In The this patterns scenario in Figthis- s-shape seems to appear as well, but is pushed down by the -

andure 5.1arguably provide overly implications restrictive for modelhow proportions that might correct not apply will to differ most under actual differ cog- nitiveent scenarios tests data and used is based in stereotype on the unidimensional threat Rasch model, which is a basic

research (e.g., where the variation in item difficulty parameters might be less ideal and the discrimination parameters- iosmight to considerdiffer across PP-plots items). from Also, actual the patternsstereotype of threatDIF and experiments impact should to get be some seen as theoretically driven. Nonetheless, these patterns can be used as ideal scenar

psychometricsense of the item-specific characteristics effects to seeemerging whether from these stereotype idealized threat. patterns In Study would 1 webe visible.studied empirical PP-plots of various stereotype threat data sets, as well as other

5.1.2 ST, psychometrics and the current literature

stereotype threat To get an intuition of the occurrence of psychometric analyses (e.g., item-level analyses with CTT, IRT or Factor Analysis, FA) in the current The psychometrics of stereotype threat

139

ste- reotype threat stereotype threatliterature we performed a systematic review of several characteristics of 79 articles studying a college student population and 23 tests. Results articles are studying given in a Appendixschool aged E. Inpopulation, total 128 stereotypeconcerning threat gender studies stereotype were threat experiments and performance on Math, Spatial Skills, and Science (MSSS) - reported (101 studies in the articles of college students, and 27 studies were- reported in the articles of school aged students). These articles rarely report ed item level statistics (e.g., item means, item-rest correlations, IRT or FA esti mates). Specifically, at least some of those item-level statistics stereotypewere reported threat in only four of the 128 studies (3%). In nine of the 128 studies (7%) authors split- the items into an easy and a difficult subset to study whether - effects differed across difficulty levels. The remaining 115 studies neither ad dependentdressed stereotype variable threatin the atsurveyed the item stereotype level nor compared threat studies. easy andReliability difficult of sub the sets of items (90%). Moreover, reliability estimates werestereotype rarely reported threat studies. for the Overall there is some variety in how the researchers scored the ability tests. The MSSS tests was only reported in in 12 (9%) of the 128 themajority number of studies of items simply correct reported divided the by number the number of correctly of items answered attempted items by (i.e., the sum score) (58% of 128 studies), whereas other studies reported accuracy (i.e., students; 11%), guess corrected math scores (i.e., guess correction or formula scoring; 10%), or a combination of those three measures (21%). Only 33 of the- 5 studies (24%) reported the amount of missing responses on the MSSS test for conditions separately or tested whether missing responses differed significant ly between conditions. Interestingly, a majority of experimentsstereotype (77% ofthreat the 128 re- search.studies) Even used though time limits most when stereotype administrating threat articles the cognitive seem to abilityreport tests.tests that are Therefore, speededness of tests might be an issue in known as power tests (consisting of challenging items that students have ample stereotypetime to complete) threat research instead areof speeded often accompanied test (consisting by stringent of extremely time limits easy mightitems that no student is able to complete within the allotted time), the fact that tests in - mean we should regard them as partially speeded tests (F. M. Lord & Novick, 1968). Speededness can compromise the reliability of tests (Attali, 2005), or in flate estimated reliability of tests (Cronbach & Warrington, 1951; Lu & Sireci, 2007; Wise & Demars, 2009), and as such is seen as an undesirable property.- However, in the past differential speededness effects have been found in the SATs for minorities and females (A. P. Schmitt & Dorans, 1990; A. P. Schmitt, Dor contextans, Crone, of stereotype & Maneckshana, threat 1991). Even though speededness in tests should be- warded off in testing, it might reflect a realistic stressful testing situation. In the an interesting question arises concerning speeded 140 Chapter 5

stereotype threat have a larger num- ber of unanswered math items due to time constrains than their counterparts for whomness, namely stereotype whether threat girls is removed.who experience Such a pattern of increased missingness un- stereotype threat on the working speed of the students on a test. der threat might reflect the influence of stereotype threat studies to compare several CTT statistics across stereotype threat condi- In our first study, we obtained raw data from several published stereo- type threat tions. Specifically, we compared girl’s or female’s item performance in conditions and control (stereotype safe) conditions on the amount of missing item responses, item-level statistics, and reliability coefficients, to get an overview of the psychometric qualities of the scales.checked From whether a theoretical the data point sets wereof view, suitable we explored for latent whether variable differences modeling. in item difficulties between the groups were larger for more difficult items. Finally, we

5.2 Study 1

In stereotype threat articles

psychometricthis study, wecharacteristics re-analyzed ofdata math from or 10spatial published skills tests in stereotype threat research.(see Table We 5.1 use for references)CTT to study reporting whether a psychometric total of 13 data characteristics sets to get a sense of math of the or 5 spatial skills tests differ over the experimental groups. Of the 13 data sets, four data sets were conducted in a laboratory setting among college students, and nine data sets were gathered among middle or high school students. In this study, we focus on the female test takers in the two conditions (stereotype threat vs. control) and ignore the male test takers. 5.2.1 Method Data sets

We contacted approximately 30 stereotype threat researchers via e-mail for data setsets from of stereotype the Open Science threat experimentsFramework. We (including used all allobtained corresponding data sets authors of studies of published studies included in Flore & Wicherts, 2015) and gathered another data

that have been published (provided that they shared data that included scores on the item level), except for one study in which the sample sizes were too small- for meaningful computations (n < 10 per cell) and one study that focused on very young elementary school children. Of the 10 articles we used, for five arti cles data were collected in the United States (including three lab studies and two high school studies; (Brown & Pinel, 2003; Cherney & Campbell, 2011; Ganley et The psychometrics of stereotype threat 141

al., 2013; Gibson et al., 2014; S. J. Spencer et al., 1999), for one article data were gathered in Spanish high schools (Delgado & Prieto, 2008), for three articles data- were collected amongst German high school students (Keller, 2007a; Keller & Dauenheimer, 2003; Neuburger et al., 2012), and for one article data were collect selecteded at a Dutch those university groups for (Wicherts which we et expected al., 2005). the When largest data effects sets included based on multiple theory. groups (e.g., a control group, a nullified group and a stereotype threat group) we - For Wicherts et al. (2005) we included the nullified condition and the stereotype studiesthreat condition, there were and only for two Gibson conditions et al. (2014) available. we included The studies the femaleused somewhat identity con dif- dition (i.e., a stereotype threat condition) and the control condition. For all other tests.ferent For tests: the two article studies of Delgado involved and mental Prieto rotation we included tests (Delgado both the & math Prieto, test 2008; and Neuburger et al., 2012), the remaining studies involved several types of math the mental rotation test. The data of Ganley et al (2013) were split up in three separate data sets based on age, because age groups received different math tests. For most studies items were scored as correct, incorrect or missing, except for- egorythe studies included of Delgado both missed and Prieto responses (2008), and Neuburger attempted et items al. (2012) answered and incorrectly.Brown and Pinel (2003) for which items were scored correct or incorrect, where the last cat Statistical analyses We only included data from women and girls from the stereotype threat experi- ments. All statistics were calculated for the stereotype threat group and the con- 5 trol group separately. For data sets that included information on whether stu- per group. Additionally we plotted the percentage of missing responses per item. dents skipped an item or not, we calculated the percentage of missing responses - ationsAfter this, and we effect rescored sizes. theWhereas skipped some items authors as wrong used items different to obtain scoring reliabilities, methods item means, item-rest correlations, sum score means, sum score standard devi

Becauseto calculate studying means, the standard number deviations, correct is theand most effect popular sizes (e.g., measure formula in stereotype scoring or threataccuracy), we wanted to keep the scoring constant over the different data sets. use. We calculated the sum score for both groups based on the number of items literature, we deemedd as measure this would of effect be thesize most over reasonable the sum score. scoring We rule calcu to- correct, and Cohen’s lated Cronbach’s alpha (i.e., KR-20) and a 95% confidence interval using Feldt’s method (Feldt, 1965). Per item, we calculated the proportion of students that answered the item correctly, and item-rest correlations. We noted the amount of items without any variance (i.e., students in that group either all answered those questions correctly, or all answered incorrectly). 142 Chapter 5

5.2.2 Results

Stereotype threat effect and missing responses

stereotype threat - malesIn Table in 5.1the westereotype displayed threat several condition statistics underperforming for the thirteen compareddata sets. Ato negative females ineffect the sizecontrol (i.e., condition. Cohen’s d) The indicates data sets a classical vary in sample size. Standardized effect, with effect fe - - sizes in terms of the sum score of items showed mixed results, with some nega tive effect sizes, and some effect sizes close to zero, in line with earlier meta-an alytic results (Flore & Wicherts, 2015). The prevalence of missing responses varied across studies. For the five studies with the largest frequency of missing responses (i.e., data sets with averaged 20% of the items left unanswered for at least oneThe oflargest the groups) stereotype we plotted threat effectthe proportion sizes over of the missing sum scores item responses were seen for in the two conditions in Figure 5.2. -

the lab studies of Spencer et al. (1999), Wicherts et al. (2005) and the field ex periment of Keller and Dauenheimer (2003). Interestingly, missing responses for these three studies were more frequent in the stereotype threat condition than in the control condition, and the difference in number of missing responses is- especially large for the studies of Keller and Dauenheimer (2003), and Wicherts et al. (2005). For the latter two studies the missingness patterns appear to ex 5 plain a sizeable part of the effect size on the sum score. Figure 5.2 shows that for all three studies most of the missing responses were located (unsurprisingly) at the end of the test, which probably means that many students did not reach the items at the end, and subsequently that manystereotype students threat did effectnot have size enough on the timesum to finish the entire test. betweenFor itemsother thatstudies were with skipped a (modest) and items that were incorrect. For those studies score (Brown & Pinel, 2003; Delgado & Prieto, 2008) we could not distinguish

we plotted the item means (i.e., P-values) in Figure 5.3 for both groups separately.- Figure 5.3 shows that the tests of Brown and Pinel (2003), and the MRT of limitDelgado on theand mental Prieto (2008)rotation show test aand clear we declining have no informationtrend of item regarding means as skipped allocat ed time progressed. Because Delgado and Prieto (2008) imposed a stringent time

items, the decline might be a sign that many students failed to reach many of the later items, or it might indicate that the most difficult items were placed at the end of test (a practice that is typically recommended in the case of speeded tests; Oshima, 1994). When eyeballing the item means of the Delgado and Prieto (2008) data set, we see that performance differences between conditions were clearer for the later items in the test, whereas we do not see the same pattern in item means in the data of Brown and Pinel (2003). Finally, for some studies missing responses The psychometrics of stereotype threat

143 0 0 1 0 0 0 0 0 3 4 1 3 4 Items Items without variance

0 6 0 4 2 0 2 1 1 5 0 6 10 cor rest rest Neg. item-

12 20 10 30 24 12 12 12 26 26 16 20 20 (8) (4) (6) (20) (15) (10) (25) (10) (10) (12) (20) (20) (20) items (time) Numb. of Numb.

- - - - 1.46 2.25 1.07 0.30 0.41 6.06 4.31 8.13 6.75 (SD) (2.07) (1.69) (1.44) (0.98) (0.77) (4.41) (4.71) (4.36) (4.47) MeanC mis. resp.

, =Cronbach’s , mean mis. resp. = mean of the missing the of mean = resp. mis. mean , =Cronbach’s , d - - - - 1.09 2.18 1.27 0.29 0.64 5.04 7.95 7.13 (SD) 10.11 (1.76) (1.82) (2.05) (0.94) (1.13) (3.46) (4.29) (4.50) (4.88) MeanST mis. resp. = Cohen’s Cohen’s =

d

C α 0.74 0.59 0.37 0.71 0.63 0.83 0.56 0.70 0.83 0.87 0.87 0.60 -0.03 []95% [0.62; 0.83] [0.30; 0.80] [0.21; 0.51] [0.62; 0.79] [0.51; 0.73] [0.67; 0.93] [0.30; 0.75] [0.54; 0.82] [0.73; 0.91] [0.76; 0.95] [0.80; 0.92] [0.41; 0.75] [-0.82; 0.52]

ST α 0.67 0.63 0.50 0.70 0.81 0.88 0.73 0.57 0.73 0.75 0.87 0.56 0.35 []95% [0.52; 0.79] [0.38; 0.81] [0.38; 0.60] [0.60; 0.79] [0.74; 0.87] [0.77; 0.95] [0.57; 0.85] [0.33; 0.75] [0.55; 0.86] [0.55; 0.89] [0.80; 0.92] [0.35; 0.72] [-0.22; 0.73]

5 d 0.16 0.04 0.13 0.13 -0.05 -0.35 -0.01 -0.28 -0.29 -0.47 -0.14 -0.59 -0.92 []95% [-0.43;0.33] [-0.93;0.23] [-0.05;0.38] [-0.58;0.03] [-0.68;0.75] [-0.35;0.61] [-0.74;0.16] [-0.40;0.67] [-1.14;0.21] [-0.61;0.32] [-0.31; 0.30] [-1.01;-0.18] [-1.61;-0.22] 52 22 89 89 15 33 41 31 16 36 47 20 NC 162

C X 5.87 9.77 4.40 8.78 6.73 6.61 5.85 7.11 5.05 (SD) 22.38 14.32 13.81 11.86 (2.82) (3.01) (1.62) (3.83) (3.45) (3.43) (2.34) (2.84) (4.85) (5.64) (3.58) (2.35) (1.82) 54 24 79 79 15 34 36 24 19 36 47 16 177 NST

ST X 5.72 8.67 4.72 7.71 6.87 6.94 5.08 5.62 3.31 (SD) 22.35 14.92 11.63 11.33 (2.57) (3.27) (2.30) (3.88) (4.20) (3.94) (2.86) (2.37) (3.93) (3.67) (3.73) (2.66) (1.96)

grade grade th grade grade th th Sum scores and SDs, Cronbach’s alpha, missing responses and item statistics for the dependent variable in stereotype threat experiments. threat in stereotype dependent variable the for statistics and item alpha, missing responses and SDs, Cronbach’s scores Sum

st = stereotype threat group, c = control group. X = sum score, SD= standard deviation, deviation, standard SD= score, sum = X group. control = c group, threat stereotype = st Study Gibson et al. (2014) & Pinel (2003) Brown & Campbell Cherney (2011) - math & Prieto Delgado (2008)* - MRT & Prieto Delgado (2008)* et al. - 4 Ganley (2013) et al. - 8 Ganley (2013) et al. - 12 Ganley (2013) (2007) Keller & Dauenheimer Keller (2003) et al. (2012) Neuburger Wicherts (2005) Spencer et al (1999) Table 5.1 Table Note. responses, Numb of items = total amount of test items, Time = amount of time to complete the test in minutes, Neg. cor item-rest = number item- of negative groups. the of one least at for responses incorrect or correct only either with items the = variance without Items groups, the of one least at for correlations rest 144 Chapter 5

Keller & Dauenheimer (2003): Missing responses

0.75 shape

0.50 Girls C

Missing 0.25 Girls ST (in proportions) 0.00 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 Items

Keller (2007): Missing responses

0.75 shape 0.50 Girls C

Missing 0.25 Girls ST (in proportions) 0.00 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 Items

Cherney & Campbell (2011): Missing responses

shape 0.50 Girls C 0.25 Missing Girls ST (in proportions) 0.00 1 2 3 4 5 6 7 8 9 10 5 Items Wicherts et al (2005): Missing responses

0.75 shape

0.50 Girls C

Missing 0.25 Girls ST (in proportions) 0.00 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 Items

Spencer et al (1999): Missing responses

0.75 shape

0.50 Girls C

Missing 0.25 Girls ST (in proportions) 0.00 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 Items

Figure 5.2 Missing responses for five stereotype threat studies. The psychometrics of stereotype threat

145

Brown & Pinel (2003): Item means

0.75

shape

0.50 Girls C

Item means Girls ST

0.25

0.00 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 Items

Delgado & Prieto (2008) MRT: Item means

0.75

shape 0.50 Girls C

Item means Girls ST 0.25

0.00 5 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 Items

Figure 5.3 Item means for two studies with a (modest) stereotype threat effect.

- did not follow a clear pattern (Cherney & Campbell, 2011; Ganley et al., 2013; Gib son et al., 2014; Keller, 2007a). The amount of missing responses is high in several studies; for five out of the nine data sets that contained missing responses at least Item-levelone of the groups statistics skipped on average 20% of the items or more. We will now zoom in on item means to inspect patterns of performance differ- ences between conditions across the items. Because the samples of the data sets the PP-plot should be interpreted with caution. For data sets of which we had are small, we could not carry out formal DIF tests, and conclusions drawn from stereotype threat group against the item means of the control group in a com- information on the missing data, we plotted the item means (i.e., P values) of the Chapter 5

146

bined PP-plot (Figure 5.4). Items with more than 30% of missing responses for the twoWe groups see that combined for many are studies illustrated the pattern with an of openitems dot, in the items PP-plot with resembleless than Scenario30% of missing 1 most: responses no average are stereotype illustrated threatwith a effectclosed on dot. the latent trait and no

th DIF (i.e., invariance). For instance, this is the casestereotype for the data threat sets effects of Cherney show and Campbell (2012), Ganley et al. (2013) – 8 grade,stereotype Gibson et threat al. (2014) effect and on Keller (2007). The three studies with the largest stereotype threat on the patterns that most resemble Scenario 2 (an average the latent trait but no DIF) or scenario 4 (an average latent trait but some DIF), however it is difficult to distinguish the two scenarios from each other, especially in small samples with short tests (thestereotype small samples threat effectpreclude is caused any formal by items test on of whichDIF here). the prevalence of missing responses is high. For Nonetheless, we can see that a large share of the average

instance, if we would remove all items with more than 30% of missing stereotyperesponses threatin the data set of Keller & Dauenheimer (2003), the remaining items would be distributed evenly across the diagonal, indicating a very small or no effect on the latent trait and no clear pattern of DIF. Moreover, we can see that the test of Spencer et al. (1999) was extremely stereotypedifficult, and threat that foreffect half would of the items the number of missing responses was sizeable. Similarly, the test of Wicherts 5 et al (2005) was clearly a speeded test, and the th grade follow a somewhat largely disappear if the speeded items (those with many missing responses) were removed. The item means of Ganleyth et al. (2013) – 4 stereotype threat peculiar pattern, perhaps due to the small number of items in the test. The item meansThe of GanleyPP-plots et foral. (2013) the data – 12 sets grade without show information a clear, but concerningsmall missing re- effect, consistent with an effect at the latent level, but no systematic pattern of DIF. stereotype threat - sponses are shown in Figure 5.5. For Brown and Pinel (2003), item effects of seemed to appear mostly for easy items and less so for diffi thatcult items,indicates which stereotype goes against threat theoretical effects on the predictions. latent trait The nor test a lack of Neuburgerof invariance. et al. (2012) was clearly a speeded test, but the PP-plot neither showed a pattern patterns of DIF due to stereotype threat. The mental rotation test of Delgado and The math test of Delgado and Prieto (2008) was very easy and showed no clear- cause we have no information on the missing responses in this data set we do notPrieto know (2008) whether differences the stereotype in P-values threat mostly effect occurred is caused for by the a lack difficult of responses items. Be or

patterns of lack of invariance in the analyzed stereotype threat data sets. Because sampleby a larger sizes share are too of itemssmall weanswered were unable incorrectly. to test Overall, these patterns we could formally. not see clear The psychometrics of stereotype threat 147

1.00 1.00

0.75 mis > 0.3 0.75 mis > 0.3

FALSE FALSE 0.50 0.50

TRUE TRUE 0.25 0.25 P stereotype threat P stereotype threat Wicherts et al (2005) Wicherts

Cherney & Campbell (2011) Cherney 0.00 d = 0.16 0.00 d = −0.59

0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00 P control P control

1.00 1.00

0.75 mis > 0.3 0.75 mis > 0.3 FALSE 0.50 0.50 FALSE TRUE 0.25 0.25 P stereotype threat P stereotype threat

0.00 d = 0.04 0.00 d = 0.13 Ganley et al − 4th grade (2013) Ganley et al − 8th grade (2013) Ganley 0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00 P control P control

1.00 1.00

0.75 mis > 0.3 0.75 mis > 0.3 FALSE 0.50 0.50 FALSE TRUE 0.25 0.25 P stereotype threat P stereotype threat

d = −0.47 d = −0.29

Keller & Dauenheimer (2003) Keller 0.00 0.00 Ganley et al − 12th grade (2013) Ganley 0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00 P control P control 5 1.00 1.00

0.75 mis > 0.3 0.75 mis > 0.3

FALSE FALSE 0.50 0.50

TRUE TRUE Keller (2007) Keller 0.25 0.25 P stereotype threat P stereotype threat Spencer et al (1999)

0.00 d = 0.13 0.00 d = −0.92

0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00 P control P control

1.00

0.75 mis > 0.3

FALSE 0.50

TRUE 0.25 P stereotype threat Gibson et al. (2014)

0.00 d = −0.05

0.00 0.25 0.50 0.75 1.00 P control Figure 5.4 PP-plots for stereotype threat data sets with information on missing responses. 148 Chapter 5

1.00 1.00

0.75 0.75

0.50 0.50 P stereotype threat P stereotype threat

Brown and Pinel (2003) Brown 0.25 0.25 Delgado & Prieto (2008) − math Delgado

d = −0.35 d = −0.01 0.00 0.00

0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00 P control P control

1.00 1.00

0.75 0.75

0.50 0.50 P stereotype threat P stereotype threat

Neuburger et al (2009) Neuburger 0.25 0.25 Delgado & Prieto (2008) − MRT & Prieto (2008) − MRT Delgado

d = −0.14 d = −0.28 0.00 0.00

0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00 5 P control P control Figure 5.5 PP-plots for stereotype threat data sets without information on missing responses.

Quality of the scales

negativeFinally, Table item-rest 5.1 shows correlations that the appeared psychometric for at qualitiesleast one of of the the math groups. and Perhaps spatial ability tests were not always ideal. Specifically, in quite a few data sets multiple

correlationsnot coincidentally, indicate these that item-restthe item is correlations not good at mostlydistinguishing appear skilled for data from sets un in- skilledwhich studentsstudents. had many missing responses. Usually, low or negative item-rest

students in a condition answered the item correctly or incorrectly. In the PP-plot Furthermore, a few tests included some items without variance because all

stakeswe saw test that wherein overall thesome test tests design where would extremely ideally difficult,be such thator really the mean easy. percentAs such,- these tests will carry little information, and might not be representative of a high

age correct across items would be close to 50%. The psychometrics of stereotype threat

149

Although we also reported Cronbach’s alpha for all tests in Table 5.1, we- should be careful interpreting reliability coefficients for (partially) speeded tests, stereotypebecause for threatthose tests reliability coefficients tend to get inflated or deflated, de pending on the guessing strategy of the students. Reliability coefficients in these these differences did experiments not seem to varied be consistently from extremely favor the low stereotype (even negative) threat or to controlhigh. We conditions. see some differences in reliability coefficients between conditions, but

5.2.3 Discussion

Three things stood out from the results of our item-level analyses of 13 data sets from 10 stereotype threat articles. First, in five studies many responses were missing, and for all these studies most missing responses occurred at the end of the test. For two out of three studies with a stereotype threat effect (d̂ > .40) a substantial part of the group differences in sum scores could be attributed to group differences in missingness. There are several explanations for these differences in missing responses, with the tendency of girls and women in the stereotype threat conditions to respond to fewer items (albeit not all data sets showed this pattern). For instance, it might be possible that girls in the stereotype threat condition work slower because they are cautious in their answering, and consequently were forced to leave more items open at the end of the test. Additionally, it might be possible that girls leave more questions open in the stereotype threat condition because they do not know the answer to these questions. This explanation might be more likely in studies that use formula scoring, and in which students are encouraged not to guess whenever they did not know the answer (for instance, in the study of Spencer et al., 1999). Finally, perhaps girls and women in the stereotype threat condition might have been more hesitant to guess. Either way, many tests were speeded to a certain degree, and so speed could be a factor in creating stereotype threat effects.

The presence of speededness would not be surprising in light of the time pressure used in stereotype threat experiments. With the exception of one study, students had one minute or less to answer each item on average. Depending on the type of questions, the time pressure might be somewhat higher than in regular high stakes math tests (i.e., exams and selection tests). For instance, the quantitative section of the paper and pencil version of the GRE consists of two sets of 25 items that students each need to finish in 40 minutes (Educational Testing Service, 2016); in the SAT MATH students need to finish 58 items in 80 minutes (CollegeBoard, 2015); and a statistics exam on regression analysis for bachelor students at Tilburg University allows students to work 180 minutes on 40 items.

Note that speededness of tests can lead to DIF (Lu & Sireci, 2007), for instance if the different groups do not work at the same speed or if the groups have different guessing strategies.

A second main result of Study 1 was that psychometric characteristics of the ability tests strongly differed across stereotype threat studies. Whereas some tests had desirable item properties (e.g., 4th grade Ganley et al., 2013), others showed extremely difficult items (e.g., Spencer et al., 1999), badly discriminating items (e.g., Bell et al., 2003), little to no variance in certain items (e.g., Keller & Dauenheimer, 2003), and high item difficulty due to extreme time pressure (e.g., Wicherts et al., 2005). Although the test length of the math and spatial tests was reasonable for most tests (i.e., ten items or more), the amount of information actually carried by the items might be less than expected based on the reported test length. Especially items with little variance or extremely low

or negative item-rest correlations do not add much to the test. Moreover, we saw that reliability estimates were quite unstable in several of the studies. Because many of the experiments turned out to be speeded to a certain degree, and because many studies deliberately use homogenous samples, we should not take reliability estimations too seriously. Because reliability estimates can be overestimated or underestimated when tests are speeded, the coefficients for most studies were probably affected: either by the common occurrence of missing responses, or by guessing of the students when time ran out.

Because stereotype threat researchers only rarely mention psychometric qualities of the studied tests (of 128 studies only 3% reported some form of item-level analyses, 9% reported a reliability coefficient; as described in a literature review in Appendix E), it is clearly not common to discuss the influence of psychometric characteristics of stereotype threat tests on a study's result. As such, to get an intuition of the psychometric qualities of the tests it might be helpful if researchers in the future would report item-level statistics in addition to averages. Whereas we now focused on CTT item-level statistics, factor analyses or IRT statistics would fulfil that goal as well. Our sample of studies from which we obtained raw data for reanalysis is broad, but it is not clear how representative these data sets are for the wider literature on stereotype threat.

The third main result from re-analyzing 10 stereotype threat experiments was based on patterns of item means across conditions in PP-plots. Here we failed to see very clear patterns of a lack of invariance or DIF. However, the scenarios we illustrated in the introduction by means of simulated data were created with a large number of observations and a large number of items. The smaller the sample and the shorter the test, the more difficult it will become to discover patterns of invariance by eyeballing PP-plots. Additionally, restriction of range might also obscure patterns of lack of invariance (DIF). For instance, the math test of Spencer et al. (1999) was so difficult that there was little variance in item difficulty, and as such there was little opportunity for a lack of invariance, related to item difficulty, to emerge.

Preferably, we would have carried out formal DIF tests for these data sets using an IRT model to test stereotype threat hypotheses on an item level. Unfortunately, downsides of IRT modeling are the large sample size needed and some quality requirements concerning the items (e.g., having positive discrimination parameters, lack of speededness). Most ST experiments are simply too small for IRT modelling, which requires approximately 100 students per group at the very least. For powerful DIF analyses the groups need to be even larger. We attempted to model DIF for two of the data sets with the largest sample sizes (i.e., Cherney & Campbell and Delgado & Prieto), but due to the overall small sample sizes and some other unfortunate psychometric qualities we failed to successfully fit DIF models to the data. As such, we turned to a large data set we recently collected amongst Dutch high school students. On this empirical data set we illustrate the use of IRT models on stereotype threat data and explore the role of DIF in the context of stereotype threat.

5.3.1 Item Response Theory Models

IRT models are a family of latent variable models that assume a causal relationship from the latent trait θ (i.e., math ability in stereotype threat research) to the indicator variables (i.e., the math items in stereotype threat research). A well-known IRT model that can be fit to tests with dichotomous outcome variables is the two parameter logistic model (2PL). This model specifies Item Response Functions (IRFs) based on latent ability, item difficulty and item discrimination. The item response function for an item based on a 2PL model (Birnbaum, 1968) is given by the following equation, in which the probability of a correct response to item i (xi = 1) is a function of latent ability θ, discrimination parameter αi, and difficulty parameter βi:

P(xi = 1 | θ, αi, βi) = 1 / (1 + exp[−αi(θ − βi)]). (1)

Pseudo-guessing parameters can be added to form the 3PL model, however these models are more demanding in terms of sample size. These IRT models are accompanied by a set of assumptions. The assumption of unidimensionality means that we expect the observed item responses to be a function of only one (continuous) latent variable. Because complete unidimensionality is unlikely to hold in actual data, the presence of essential unidimensionality (i.e., one dominant latent variable with several subfactors; Tay, Meade, & Cao, 2015) is sufficient for adequate parameter estimation (Drasgow & Parsons, 1983). The assumption of local independence indicates that conditional on person ability, item sets are independent, i.e., there is no remaining statistical relationship between particular item sets when controlled for ability. Finally, the assumption of monotonicity requires that students with higher ability have a higher probability of answering an item correctly, as is implicit in the functional form stipulated in Equation 1.

In IRT several parameterizations can be used. For instance, one can use Equation 1, in which the discrimination parameter and difficulty parameter are implemented. Alternatively, one can use the intercept-slope parameterization, in which intercept γ and discrimination parameter α are used. This parameterization has computational advantages. The difficulty parameter can later be calculated using the following equation:

β = −γ / α. (2)

There are several ways to estimate item parameters in IRT. A classical technique is Joint Maximum Likelihood Estimation (JMLE), in which both person and item parameters are estimated. Marginal Maximum Likelihood Estimation (MMLE) is an increasingly popular technique in which person parameters are integrated out and only item parameters are estimated (Bock & Aitkin, 1981). The supplemented EM algorithm (SEM) was developed to solve computational difficulties concerning the information matrix when the ability test consists of a large number of items, and provides more accurate standard errors for item parameters than more classical approaches (Cai, 2008). Currently MMLE and SEM have been implemented in the software package Flexmirt, which we use throughout this chapter.
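As a concrete illustration of Equations 1 and 2, a minimal R sketch (the function names are ours):

# Equation 1: probability of a correct response under the 2PL model.
irf_2pl <- function(theta, alpha, beta) {
  1 / (1 + exp(-alpha * (theta - beta)))
}
irf_2pl(theta = 0, alpha = 1.2, beta = -0.5)  # average ability, easy item

# Equation 2: recover the difficulty parameter from the
# intercept-slope parameterization (gamma = intercept, alpha = slope).
beta_from_intercept <- function(gamma, alpha) -gamma / alpha
beta_from_intercept(gamma = 0.6, alpha = 1.2)  # returns beta = -0.5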

5.3.2 Differential item functioning

Differential item functioning (DIF) occurs when certain items have different difficulty or discrimination parameters for members of the experimental group compared to members of the control group (Holland & Wainer, 1993). Two different types of DIF can be distinguished: uniform DIF and non-uniform DIF. Uniform DIF occurs when particular items are more difficult for students who belong to a certain group than for students who belong to another group, controlled for the ability of the students. Non-uniform DIF occurs when particular items discriminate less for students in one of the groups, controlled for the ability of the students. Examples of item characteristic curves for dichotomous items with uniform and non-uniform DIF are shown in Figure 5.6. In the left graph the plotted item is more difficult for group 1 than for group 2 with the same ability level θ. In the right graph the item discriminates less for group 1 than for group 2, again conditioning on θ. To ensure sufficient power rates and stable parameter estimates, relatively large sample sizes are required for DIF analyses. Most methods fare relatively well with a minimum of 500 subjects per group (Tay et al., 2015).

[Figure 5.6 Examples of an item with uniform and non-uniform DIF. Both panels plot P(Y = 1 | θ) against θ for Group 1 and Group 2: under uniform DIF the curves are shifted in difficulty; under non-uniform DIF they differ in discrimination.]
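A minimal R sketch that draws the two panels of Figure 5.6, with illustrative parameter values (not estimates from any of the data sets):

# Item characteristic curves for the two DIF types.
irf <- function(theta, a, b) 1 / (1 + exp(-a * (theta - b)))
theta <- seq(-4, 4, length.out = 200)

op <- par(mfrow = c(1, 2))
# Uniform DIF: same discrimination, shifted difficulty for group 1.
plot(theta, irf(theta, a = 1.5, b = 0.5), type = "l", ylim = c(0, 1),
     xlab = expression(theta), ylab = expression(P(Y == 1 ~ "|" ~ theta)),
     main = "Uniform DIF")
lines(theta, irf(theta, a = 1.5, b = -0.5), lty = 2)
# Non-uniform DIF: same difficulty, lower discrimination for group 1.
plot(theta, irf(theta, a = 0.5, b = 0), type = "l", ylim = c(0, 1),
     xlab = expression(theta), ylab = "", main = "Non-uniform DIF")
lines(theta, irf(theta, a = 1.5, b = 0), lty = 2)
legend("topleft", legend = c("Group 1", "Group 2"), lty = 1:2, bty = "n")
par(op)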

There are several statistical methods to test for DIF. For dichotomous outcome variables, 1PL and 2PL IRT models are commonly used for DIF analysis. There are several statistical methods and software packages to estimate DIF in 1PL or 2PL IRT models. We will discuss three of them; for a more complete overview see Teresi (2006), Millsap (2011), or Tay et al. (2015).

First, we discuss the DIF sweep or Langer-improved Wald-2 tests implemented in software package Flexmirt version 3.5 (Cai, 2017).24 The DIF sweep first estimates group ability averages and standard deviations assuming invariant items (Woods, Cai, & Wang, 2012). Those latent mean and standard deviation estimates are used in a second analysis, in which DIF parameters are estimated. DIF tests are then carried out by a Wald χ2 test. A combined test shows whether either uniform or non-uniform DIF occurs. Then, separate tests are carried out to specify whether the DIF is uniform or non-uniform. An upside of this approach is that all items can be tested for DIF at once, and as such there is no need to create a DIF-free anchor. A downside of this approach is that ability estimates might be biased by DIF. This could result in inflated Type I errors (Woods et al., 2012). Therefore, this method is useful as a screening tool for DIF, but is not optimal as a final DIF model.

A second approach to testing DIF, which is also implemented in Flexmirt, is the Langer-improved Wald-1 test. For this method, it is necessary to select an invariant anchor, i.e., a selection of items for which researchers expect the absence of DIF.

24 Note that the latent mean and standard deviation of one of the groups needs to be fixed for identification purposes. The latent mean and standard deviation of the second group can then be estimated in the same metric.

There are several strategies to select anchor items. For instance, a researcher could select one or several items based on theoretical grounds. In the case of stereotype threat, it might be useful to refrain from selecting difficult items in the anchor, because those items are theoretically expected to show DIF. Another strategy is to select an anchor on statistical grounds, for instance by iterative purification (e.g., Candell & Drasgow, 1988; Khalid & Glas, 2014; Park & Lautenschlager, 1990), the iterative forward approach (Kopf et al., 2015a; Kopf, Zeileis, & Strobl, 2015b) or by first using the DIF sweep in Flexmirt. The statistical strategies are numerous, but purification methods can lead to contaminated anchors, which in turn can lead to severely inflated Type I error rates (Kopf et al., 2015a). Whenever possible it might be wise to select the anchor based on theoretical grounds. Advice on the number of anchor items needed differs, with one (González-Betanzos & Abad, 2012; Kopf et al., 2015a), three (Thissen et al., 1993) or four anchor items (Kopf et al., 2015a; Thissen et al., 1993) as a suggested minimum. In Flexmirt, analyses can easily be extended to multi-level IRT models. In these models the nested structure of the data can be taken into account. For stereotype threat research this can be useful if data were gathered at multiple schools, classes, or institutions.

A third and popular approach to DIF testing is IRT-LR-DIF (Thissen, 2002), in which a compact and an augmented model are compared by means of likelihood ratio (LR) tests. The program IRT-LR-DIF (Thissen, 2002) is a basic open source program in FORTRAN enabling likelihood ratio tests. IRT-LR-DIF uses either an all-other-items-as-anchor approach or an invariant anchor comprised of several items.

Several checks and practices need to be considered when carrying out DIF tests. Notably, it is important to check whether the IRT model fits the data well. Moreover, the IRT model assumption of unidimensionality needs to be checked, i.e., whether one underlying factor predicts item responses, or whether a multidimensional model including several factors would be more appropriate (Tay et al., 2015). For DIF testing usually multiple statistical tests are needed, a number which often equals the test length. It is important to control the family-wise Type I error rate or the false discovery rate, by either using the Bonferroni correction (Bland & Altman, 1995), which is straightforward and easy to use, or the Benjamini-Hochberg correction (Benjamini & Hochberg, 1995), which is less conservative and as such leads to more powerful tests compared to the Bonferroni correction. We used these methods to test for DIF in stereotype threat data.
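A minimal R sketch of this multiplicity control with base R's p.adjust (the p-values below are invented for illustration):

# Bonferroni (family-wise error) vs. Benjamini-Hochberg (false discovery
# rate) over one p-value per item of a short test.
p_values <- c(0.001, 0.004, 0.019, 0.030, 0.041, 0.120, 0.450, 0.880)

p_bonf <- p.adjust(p_values, method = "bonferroni")
p_bh   <- p.adjust(p_values, method = "BH")

# BH typically flags more items than Bonferroni at the same alpha = .05,
# which is why it yields more powerful DIF tests.
data.frame(p = p_values,
           bonferroni = p_bonf < .05,
           benjamini_hochberg = p_bh < .05)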

DIF and stereotype threat

Predictions about stereotype threat and DIF are best described based on the Yerkes-Dodson law (Keller, 2007a; Yerkes & Dodson, 1908). This law describes a quadratic relationship between situational stress and task performance, in which task performance first increases under mild situational stress, but ultimately decreases when the amount of situational stress continues to rise. For difficult tasks, performance decrements start to occur at lower amounts of situational stress than for easier tasks. If we apply this to stereotype threat research, we would expect most performance decrements on difficult items and less performance decrements or even improvements in performance on easy items. Such differential stereotype threat effects will manifest themselves as a lack of measurement invariance (DIF) over experimental groups. We would expect that difficult items show higher difficulty parameters for girls and women experiencing stereotype threat than for girls and women who do not experience stereotype threat. For easy items, we would expect a reverse pattern.

A second type of DIF could occur if certain item types appear more threatening than other types of items. For instance, studies on gender DIF showed that women with equal math ability often underperform on geometry items compared to men (Alexeev, 2008; Gamer & Engelhard, 1999; Harris & Carlton, 1993; Y. Li et al., 2004; C. S. Taylor & Lee, 2012). It might be possible that geometry items are less problematic for girls who do not experience stereotype threat (i.e., girls in the control condition). Moreover, a third source of DIF could be speededness. If girls under stereotype threat work slower, for instance because they want to avoid making mistakes, DIF caused by speededness might occur. In this case, girls in stereotype threat conditions will most likely underperform on difficult items or on items placed at the end of the test.

Stereotype threat has been linked to DIF or measurement invariance in only a few studies (Arbuthnot, 2005; A. S. Cohen & Ibarra, 2005; Wicherts, 2007). Gender stereotype threat has been associated with a lack of several types of measurement invariance at the scale level, a concept closely related to DIF, in a number series subtest and an arithmetic subtest (Wicherts et al., 2005). Moreover, high achieving black students in a high stereotype threat condition performed less well than high achieving black students in a low stereotype threat condition on a subset of items that previously showed ethnicity DIF, whereas they did not underperform on a subset of non-DIF items (Arbuthnot, 2005).

Although these former studies show a presumed link between stereotype threat and DIF, direct tests of DIF with respect to stereotype threat conditions can give a surplus of information for stereotype threat researchers compared to the standard way of analyzing stereotype threat data. One of the explanations for a lack of DIF testing in experimental stereotype threat data might be the large

sample size requirements for DIF analyses. For larger data sets, IRT modeling and DIF testing might be a good alternative to the more classical statistical approach that has been used in the past decade. To our knowledge, this is the first attempt to model stereotype threat DIF. In this study we take an exploratory approach and refrained from pre-registration. We carried out DIF analyses with experimental condition as the grouping variable, only focusing on female students. We did have a priori expectations concerning uniform DIF based on stereotype threat theory.

H1: More difficult items will show larger uniform DIF effect sizes than less difficult items, in favor of girls in the control condition.

We will inspect this hypothesis visually by plotting DIF effect sizes. It is also possible that DIF due to stereotype threat is non-uniform. Stereotype threat effects might be moderated by domain identification (Keller, 2007a; Schmader et al., 2008; Steele, 1997), which is positively associated with ability. If susceptibility to stereotype threat effects is stronger among higher-ability levels, this would lead to a suppression of the discrimination parameter. Therefore, we formulated a second hypothesis.

H2: More difficult items will show larger non-uniform DIF effect sizes, with lower discrimination parameters for girls in the stereotype threat condition.

5.4 Study 2

Ideally, we would test these hypotheses in a data set that showed a sizeable average stereotype threat effect. Unfortunately we are not in possession of a stereotype threat data set with both a sizeable stereotype threat effect and a sample size large enough for IRT DIF testing. However, it is possible that there are items that display DIF, even if no average differences show up on the sum score. For instance, non-uniform DIF can be cancelled out, and the effect of several positively uniform DIF items (i.e., items favoring girls in the control condition) can cancel out the effect of several negatively uniform DIF items (i.e., items favoring girls in the stereotype threat condition). Therefore we illustrate and explore the use of DIF analyses in stereotype threat research on our Dutch stereotype threat data set (Flore, Mulder, & Wicherts, n.d.; Chapter 3).

5.4.1 Method

Participants and materials

In this study, we investigated the difference between girls in a stereotype threat condition and a control condition, in a Dutch sample of 13-14 year old high school children (Flore, Mulder & Wicherts, n.d.). Data were collected between September 2016 and March 2017 in 21 schools throughout the Netherlands. Children were randomly assigned either to a stereotype threat condition or to a control condition. We removed children with a large number of missing responses (more than 30% of the items left unanswered), or clear aberrant patterns of response (i.e., aaaaa9aaaaaaaaaaaaaa), resulting in a sample size of N = 1,033 girls. Whereas gender DIF would be interesting to assess as well (i.e., comparing the difference in item estimates for boys and girls), it is beyond the scope of this paper, and as such we removed boys from the data set.

Children completed 20 MC math questions selected from TIMSS 2003, including 12 items on the topic "Number" and 8 items on the topic "Geometry". We purposely selected those items because of their variation in item difficulty parameters (average to difficult items) and high discrimination parameters in the large-scale TIMSS data set (M. O. Martin et al., 2003). Students were allowed to work on the test for 20 minutes.

Statistical methods

We started with the CTT analyses, to get a similar overview of our own data compared to the data sets we inspected in Study 1, after which we turned to IRT modeling. We fitted a 1PL, 2PL and 3PL version to the data, for the two groups separately and once for groups pooled together. We used M2 (Maydeu-Olivares & Joe, 2005) and RMSEA to assess model fit, S-χ2 to assess item fit (Orlando & Thissen, 2003), and local dependence statistics (Chen & Thissen, 1997) to inspect local dependencies between items for several models. Item fit statistic S-χ2 is a goodness of fit measure, and good item fit is indicated by the lack of statistically significant tests of S-χ2. Local dependency is usually not considered problematic if the local dependence indices are smaller than 3.0 (Houts & Cai, 2016). To study relative model fit we used AIC and BIC fit statistics. After selecting an appropriate model we started with a multi-group IRT model without DIF effects to study the latent group mean differences without modelling DIF.

Subsequently we fitted three different DIF analyses to investigate the amount of DIF in our sample. As discussed in the introduction section, we started with the DIF sweep Langer-improved Wald-2 test implemented in Flexmirt, then we carried out a DIF test using the likelihood ratio tests in IRT-LR-DIF, and finally we applied the Langer-improved Wald-1 test implemented in Flexmirt with anchor items. Anchor items were selected partly on theoretical grounds and partly on statistical grounds (i.e., we used four items that are relatively easy and which showed little DIF in the first DIF analysis). We controlled the false discovery rate by using the Benjamini-Hochberg procedure (Benjamini & Hochberg, 1995; Thissen et al., 2002), using a significance level of α = .05 to start with. With this method, we compared p-values to individually computed critical values. We reported the analyses as stipulated in Flore, Oberski, and Wicherts (n.d.; Chapter 4). To check whether dependency of observations influenced the results we carried out an additional DIF analysis in which we included the multilevel structure of the data (i.e., inserting the class level), and compared and contrasted the results.
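The analyses above were run in Flexmirt and IRT-LR-DIF, which we cannot reproduce here; as an illustrative substitute, a minimal R sketch of the same multi-group 2PL and item-wise DIF workflow using the free mirt package (object and option names follow mirt, not Flexmirt, and the built-in example data stand in for our math test):

library(mirt)

# 'dat': an N x k matrix of 0/1 responses; 'condition': group labels.
dat <- expand.table(LSAT7)                     # built-in example data
condition <- rep(c("ST", "control"), length.out = nrow(dat))

# Multi-group 2PL with invariant items; latent means/variances free.
mod <- multipleGroup(dat, model = 1, group = condition,
                     invariance = c("slopes", "intercepts",
                                    "free_means", "free_var"))
coef(mod, simplify = TRUE)

# Test each item for DIF by freeing its slope (a1) and intercept (d),
# then apply the Benjamini-Hochberg correction to the p-values.
dif <- DIF(mod, which.par = c("a1", "d"), scheme = "drop")
p.adjust(dif$p, method = "BH")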


5.4.2 Results

CTT

To see how our data set compares to the data sets in Study 1, we started the analyses by plotting the PP-plot for our data in Figure 5.7. We see some variation in difficulties for the two groups, however the pattern of items does not seem to follow patterns of DIF, neither did it show a mean stereotype threat effect on the latent level.

As reported in Flore, Mulder and Wicherts (n.d.), reliability coefficients of the scale were Cronbach's α = .58 for the stereotype threat group and Cronbach's α = .59 for the control group; group means and standard deviations on the test sum scores were XST = 11.40 and SDST = 3.10 for the experimental group, and XC = 11.62 and SDC = 3.12 for the control group, Cohen's d = −.07. There were no items with negative item-rest correlations or items without variance.

[Figure 5.7 PP-plot and number of missing responses for Flore et al. (n.d.). Top panel: proportion correct per item in the control condition (P control) against the stereotype threat condition (P stereotype threat). Bottom panel: number of missing responses for items 1-20, separately for girls in the control (C) and stereotype threat (ST) conditions.]

Patterns of missing responses were also plotted in Figure 5.7, showing an increase in missing responses at the end of the test. Note that we removed students from the data set with a large amount of missing responses (> 30%) under the assumption that they did not take the test seriously, as pre-registered in Flore, Mulder & Wicherts (n.d.).

Model data-fit and unidimensionality

To assess model data-fit and unidimensionality of the math scale we carried out several unidimensional IRT analyses, once for the girls in the stereotype threat condition and girls in the control condition separately, and once for both groups together. Model fit indices are given in Table 5.2. We see that the 3PL model fitted the data best when all groups were pooled together. However, when we fitted the models for both groups separately, we saw that for the 3PL the EM algorithm did not converge satisfactorily (i.e., some of the item parameters and their standard errors were extremely large). The 2PL model did converge and showed sufficient model-data fit (i.e., RMSEA < .05 and a non-significant M2). An inspection of the item fit statistics showed that overall, for both groups, most items fitted reasonably well. Only item 7 in the stereotype threat group, and items 1 and 18 in the control group showed poorer fit. Item fit statistics for both groups can be found in Table F1 and Table F2 of Appendix F. Furthermore, we checked standardized local dependency of the items. Local dependence indices for both groups can be found in Table F3 of Appendix F. Although a few item pairs showed local dependencies larger than 3.0 (three pairs for the stereotype threat group and two pairs for the control group), this number was negligible and we could not find a clear pattern that suggested the assumption of unidimensionality was violated. Based on these analyses we decided to use a 2PL model to commence our DIF analyses.

Multi-group IRT

Before turning to DIF analyses, we fit a multi-group IRT 2PL model, where the item parameters for both groups were the same, but latent means and SDs were allowed to differ. First, we checked for model fit and item fit. We saw that the model-data fit seemed to be sufficient (M2(378) = 411.21, p = .12, RMSEA = .03). In this model, we did see some more badly fitting items and item dependency between several item pairs. There could be several explanations for this, such as having more power to actually find item misfit and item dependency. Alternatively, it could also be a sign of DIF. Item fit statistics are reported in Table F4 in Appendix F. Item parameters are reported in Table 5.3. Overall the items were relatively easy. Items 3, 5, 13 and 18 had low discrimination parameters (i.e., α̂ < 0.40). Consequently, some of the standard errors of the β̂'s were larger than desired (i.e., S.E. > 0.50).


Table 5.2 Joined and separate unidimensional IRT models for girls in the stereotype threat condition and the control condition.

Group      Model          -2LL       AIC        BIC        M2 (df)        p        RMSEA
All girls  Model 1 (1PL)  23318.92   23360.92   23464.67   418.49 (189)   < .001   0.03
           Model 2 (2PL)  23195.45   23275.45   23473.06   248.32 (170)   < .001   0.02
           Model 3 (3PL)  23178.90   23298.90   23473.06   193.76 (150)   .001     0.02
ST         Model 1 (1PL)  11717.13   11759.13   11848.01   252.37 (189)   .001     0.03
           Model 2 (2PL)  11665.82   11745.82   11915.12   163.61 (170)   .624     0.00
           Model 3 (3PL)  n.c.       n.c.       n.c.       n.c.           n.c.     n.c.
Control    Model 1 (1PL)  12061.06   12103.06   12192.55   256.74 (189)   .001     0.03
           Model 2 (2PL)  12014.10   12094.10   12264.56   189.62 (170)   .144     0.01
           Model 3 (3PL)  n.c.       n.c.       n.c.       n.c.           n.c.     n.c.

Note. AIC = Akaike Information Criterium. BIC = Bayesian Information Criterium. ST = girls in stereotype threat condition, Control = girls in control condition. n.c. = model that failed to converge properly.

Table 5.3 Estimated item parameters for the multi-group 2PL IRT model.
[For each of the 20 items: discrimination (α̂), intercept (γ̂), and difficulty (β̂) estimates with standard errors. As discussed in the text, the items were overall relatively easy, items 3, 5, 13 and 18 had discrimination estimates below 0.40, and some of the difficulty estimates had standard errors above 0.50.]

161

[Figure 5.8 Test information functions for the ST group (a, group 1) and the control group (b, group 2): information plotted against θ.]

Inspecting latent ability we see very small differences in both group averages (µST = 0.00 and µC = 0.10) and standard deviations (SDST = 1.00 and SDC = 0.99). We plotted the information function for both groups in Figure 5.8a and Figure 5.8b.
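For reference, the test information plotted in Figure 5.8 follows directly from the 2PL: I(θ) is the sum over items of αi2 Pi(θ)(1 − Pi(θ)). A minimal R sketch with illustrative (not estimated) parameter values:

# Test information function for a 2PL test.
test_information <- function(theta, alpha, beta) {
  sapply(theta, function(th) {
    p <- 1 / (1 + exp(-alpha * (th - beta)))
    sum(alpha^2 * p * (1 - p))
  })
}

alpha <- runif(20, 0.3, 1.3)   # mostly low discriminations
beta  <- runif(20, -4, 2.5)    # mostly easy items
theta <- seq(-4, 4, length.out = 100)
plot(theta, test_information(theta, alpha, beta), type = "l",
     xlab = expression(theta), ylab = "Information")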

DIF Analyses

In Table 5.4 we reported all statistical tests, p-values, Benjamini-Hochberg critical values, and the estimated discrimination and difficulty parameters for the DIF sweep model fitted in Flexmirt. We did not find any statistically significant DIF in our items for the two groups. For some items the difference in estimated difficulty parameters was substantial (e.g., item 3, item 5), however for those items the standard errors were quite large as well. In Table 5.5 we show the second DIF analysis, in which we used IRT-LR-DIF (Thissen, 2002) to see whether our findings derived from the first DIF analysis are robust. Again, we saw no significant DIF items in our data set. However, there is a possibility that the methods we used (i.e., the iterative all-other-items as anchor and fixing the latent means) were biased due to present DIF. Therefore we additionally fitted a model to the data in which the DIF analyses were carried out using an invariant anchor. We selected this invariant anchor partly on the basis of theory (i.e., selecting some of the easiest items for the anchor) combined with some statistical knowledge as well (i.e., selecting the items with small differences in estimated item parameters). As such, we selected items 6, 9, 11 and 16 as anchor items. The DIF model with these items as invariant anchor items did not influence the outcomes: again we found no statistically significant DIF; results are reported in Table 5.6.


Table 5.4 DIF analyses 1: Flexmirt DIF sweep analyses.
[For items 1-20: combined Wald χ2(2), non-uniform Wald χ2(1), and uniform Wald χ2(1) tests, each with p-values and Benjamini-Hochberg critical values, followed by the estimated α̂ST, α̂C, β̂ST, and β̂C (with standard errors).]
Note. B-H c.v. = Benjamini-Hochberg critical values. P-values can be considered significant if the reported p-value is lower than the BH-critical value.


Table 5.5 DIF analyses 2: IRT-LR-DIF analyses.
[For items 1-20: combined LR χ2(2), non-uniform LR χ2(1), and uniform LR χ2(1) tests with p-values and Benjamini-Hochberg critical values, followed by the estimated α̂ST, α̂C, β̂ST, and β̂C for the items tested.]
Note. B-H c.v. = Benjamini-Hochberg critical values. P-values can be considered significant if the reported p-value is lower than the BH-critical value. P-values are calculated from reported χ2-values.

For the sake of completeness, we fitted the DIF model one more time, taking the multilevel structure of the data into account. For the multilevel variant of the IRT DIF model only intercepts and slopes are reported, alongside DIF tests, in Table F5 of the Appendix. The multilevel DIF tests again showed similar results to the first three DIF approaches. In Figure 5.9 we plotted the intercept parameters for the DIF sweep model and the DIF sweep multilevel model of both groups against each other. It is clear that the DIF sweep with and without a multilevel structure yielded similar results: there is little DIF in this data set and the amount of DIF

appears not appreciably different for items differing in difficulty.

Power

We failed to find significant DIF with respect to stereotype threat conditions in our data set. To investigate a possible explanation for this lack of DIF we carried out a simulation study to check how much power we had to detect a medium and a large DIF effect (i.e., Δb = (−).50 and Δb = (−).75) given the characteristics of our study.


[Figure 5.9 Intercepts (c) retrieved from the unidimensional IRT DIF model (top) and the multilevel IRT DIF model (bottom), plotted for the ST group against the control group.]


Table 5.6 DIF analyses 3: Flexmirt DIF anchor analyses.
[For all items except anchor items 6, 9, 11 and 16: total Wald χ2(2), non-uniform Wald χ2(1), and uniform Wald χ2(1) tests with p-values and Benjamini-Hochberg critical values, followed by the estimated α̂ST, α̂C, β̂ST, and β̂C (with standard errors).]
Note. B-H c.v. = Benjamini-Hochberg critical values. P-values can be considered significant if the reported p-value is lower than the BH-critical value.

Data sets were simulated in R and DIF models were estimated in Flexmirt with the DIF sweep approach. We generated data using item α and β parameters equal to the parameters we found in the multi-group IRT model to study the Type I error rate. To study power, we changed the difficulty parameters of six items of the math test. For three difficult items we increased the β parameters for the stereotype threat group, for three easy items we decreased the β parameters for the stereotype threat group. The item parameters used to generate data are reported in Table F6 in Appendix F. Latent means were set to µST = −0.10 and µC = 0.00, and both standard deviations to 1.00. Power rates and Type I error rates for the six items are stated in Table 5.7, for the total DIF test (i.e., testing the significance of non-uniform and uniform DIF at once) and the uniform DIF test.
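To illustrate the data-generating step of this simulation (the estimation itself was done in Flexmirt), a minimal R sketch of one replication with illustrative parameter values, not the Table F6 values:

# Simulate 2PL responses for n students with latent mean mu.
simulate_2pl <- function(n, alpha, beta, mu) {
  theta <- rnorm(n, mean = mu, sd = 1)
  eta <- sweep(outer(theta, beta, "-"), 2, alpha, "*")  # alpha_j * (theta_i - beta_j)
  p <- 1 / (1 + exp(-eta))
  matrix(rbinom(length(p), 1, p), nrow = n)
}

alpha <- runif(20, 0.3, 1.3)
beta  <- seq(-4, 2.5, length.out = 20)
delta_b <- 0.50                        # medium uniform DIF on item 20
beta_st <- beta; beta_st[20] <- beta_st[20] + delta_b

dat_st <- simulate_2pl(509, alpha, beta_st, mu = -0.10)
dat_c  <- simulate_2pl(524, alpha, beta,    mu =  0.00)
# Each such pair of data sets was then analyzed with the DIF sweep; the
# rejection rate across replications gives the power (or, for non-DIF
# items, the Type I error rate).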

Table 5.7 Power analysis and Type I error rate for total DIF test and uniform DIF test.

Total DIF test
                                 Positive DIF               Negative DIF
                             Item5  Item17  Item20     Item2  Item3  Item19
Nst=509; Nc=524
  No DIF                      .07    .05     .04        .03    .03    .01
  Medium DIF (Δb = (−).50)    .10    .25     .11        .20    .14    .52
  Large DIF (Δb = (−).75)     .23    .50     .26        .41    .30    .73
Nst=1,000; Nc=1,000
  No DIF                      .07    .03     .00        .04    .03    .02
  Medium DIF                  .24    .46     .29        .44    .24    .65
  Large DIF                   .36    .84     .55        .71    .56    .95
Nst=2,500; Nc=2,500
  No DIF                      .05    .03     .09        .03    .06    .04
  Medium DIF                  .45    .86     .67        .74    .52    .97
  Large DIF                   .85   1.00     .94        .99    .92   1.00

Uniform DIF test
                                 Positive DIF               Negative DIF
                             Item5  Item17  Item20     Item2  Item3  Item19
Nst=509; Nc=524
  No DIF                      .07    .05     .05        .04    .02    .00
  Medium DIF (Δb = (−).50)    .13    .33     .23        .32    .19    .55
  Large DIF (Δb = (−).75)     .32    .60     .33        .58    .39    .80
Nst=1,000; Nc=1,000
  No DIF                      .05    .03     .04        .01    .05    .05
  Medium DIF                  .30    .60     .38        .56    .28    .83
  Large DIF                   .44    .87     .66        .82    .67    .97
Nst=2,500; Nc=2,500
  No DIF                      .03    .08     .05        .05    .06    .01
  Medium DIF                  .51    .91     .72        .86    .63    .97
  Large DIF                   .91   1.00     .80        .98    .95   1.00

Note. ST = stereotype threat, C = control. DIF = Differential Item Functioning.

Negative DIF Negative 5 .00 .62 .98 .02 .56 .90 Item4 Combined test Combined .03 .62 .98 .00 .64 .94 Item3 .01 .62 .96 .00 .62 .98 Item2 Positive DIF Positive .01 .58 .96 .01 .59 .87 Item1

No DIF No DIF Power analysis and Type I error rate for combined test and uniform DIF test – ideal test parameters. – ideal test DIF test and uniform test combined for rate I error analysis and Type Power =(-) .50 =(-) .75 Large DIF Large DIF Large Δb Δb Medium DIF Medium DIF ST = stereotype threat, C = control. DIF =Differential Item Functioning. Item DIF =Differential threat, C = control. = stereotype ST = N(0,1); = N(-0.1,1); = N(0,1) = N(0,1) θst θc θst θc Table 5.8 Table Note. Chapter 5


We carried out the power analysis without BH-correction, and found that power rates were reasonably low even without correcting for multiple testing. We additionally studied for which sample size we would be able to detect medium and large DIF. Large DIF is already detected with a sample size of 1,000 per group with the DIF sweep, whereas power rates to detect medium DIF are still somewhat low with this sample size. With 2,500 observations per group the DIF sweep had ample power to detect medium and large DIF. Overall, Type I error rates appeared to be stable, only exceeding the nominal Type I error rate for item 5 in the medium DIF condition and the medium or small sample sizes. This might be due to the small discrimination parameter of item 5. The relatively low level of power might be partially caused by the items with low discrimination parameters and/or the few items with extremely low difficulty parameters, notwithstanding our selection of items with medium to high estimated difficulty parameters and overall high discrimination parameters from the 2003 TIMSS survey (M. O. Martin et al., 2003).

We analyzed another round of simulated data sets to study power, using discrimination parameters randomly drawn from a uniform distribution ranging

from 0.8 to 1.5. Difficulty parameters were evenly dispersed ranging from −1.0 to 2.0, creating a difficult, but diverse test. Large or medium negative DIF effects were placed on the easiest items (i.e., favoring the stereotype threat condition), whereas large or medium positive DIF effects were placed on the most difficult items (i.e., favoring the control condition). Sample sizes equaled the sample sizes used in our high school study. We added a condition without impact (i.e., µST = 0.00 and µC = 0.00, and both variances 1.00), and kept the condition with small impact as well (i.e., µST = −0.10 and µC = 0.00, and both variances 1.00). Results are stated in Table 5.8 for the combined DIF test and the uniform DIF test. We see that overall power rates to find DIF effects are a lot higher when we use tests with more desirable item properties. This is most prominent in the uniform DIF test. Type I error rates did not exceed the alpha level of .05. Recall that we did not correct for familywise Type I error rates in this simulation study, and that using a correction like the Benjamini-Hochberg correction or the Bonferroni correction would compromise power even further.

5.4.3 Discussion

In this data set, we explored whether we could find DIF due to stereotype threat using three IRT DIF methods. The results of the several methods converged, yielding no evidence of DIF in either of the methods. If we would relax our stringent statistical criteria a little and ignore the BH-correction that controls family-wise Type I errors, we found one item displaying DIF.


This item was relatively easy, and "showed DIF" in favor of the stereotype threat group. We did not find patterns of DIF that were expected based on stereotype threat theory, namely DIF favoring girls in the control condition for difficult items and DIF favoring girls in the stereotype threat condition for easy items.

There are several reasons for this lack of DIF. First, we found that power to find DIF was relatively low. This lack of power mostly occurred due to the relatively small sample size (for DIF analyses at least) and some undesirable item characteristics (i.e., low discrimination parameters and items with very high or low difficulty parameters). The undesirable item characteristics are unfortunate, as we carefully selected items from the TIMSS 2003 survey with overall high but varying estimated item difficulties, and overall high estimated discrimination parameters (M. O. Martin et al., 2003). Nevertheless, as our study is one of the largest stereotype threat studies to date, we doubt whether there are currently data sets that do enable powerful DIF analyses. Second, it might be possible that the circumstances in which we tested were not sufficiently threatening for the students in our sample, and as such no stereotype threat effects occurred at all (see discussion in Chapter 3). Third, it is possible that DIF just does not occur in stereotype threat data sets, and stereotype threat only emerges through impact and hence acts via actual differences in latent ability.

There are several limitations to this study. A first limitation is that we ignored missing responses. There is a good chance that the missing responses are not Missing Completely At Random (MCAR). Even though we would not like to see person and item parameters biased by the missing responses, in this study it was interesting to see whether the pattern of missing responses led to DIF. This issue touches upon the speededness of the test as well. The amount of time allotted for this test might have been a little too short, and as such some students might have left questions unanswered or guessed when they noticed time was running out. This speededness could have led to DIF, which would have been interesting for stereotype threat theory (and certainly not atypical for stereotype threat studies given the results from Study 1).

Second, the estimated discrimination parameters for most of the items were somewhat low. This might be caused by a restriction of range, because we only selected students from the two highest education levels in the Netherlands. This leads to a small amount of variance in the groups, which translates to low discrimination parameters when variances are fixed to 1. Even though this is not uncommon, as we saw in Study 1 in which many studies actually showed several negative item-rest correlations, it might explain our lack of DIF findings as well. Again, this might be caused by restriction of range, and again this appears to happen more often in stereotype threat research. Hopefully, the psychometric qualities of the ability tests in stereotype threat research will receive more attention in the future.

Third, we ignored the fact that the math test consisted of two types of math questions: geometry items and number items. A two-dimensional model could perhaps have been more appropriate a priori. However, we did not find large departures from unidimensionality when we checked local dependency indices. These indices are technically null hypothesis significance tests as well, and as such also sensitive to sample sizes. It is not impossible that power to find departures from unidimensionality might have been limited.

Fourth, the data sets that we simulated to study power were simulated under a somewhat simplified scenario compared to our original data set. For instance, we did not simulate data including the dependency one would expect in multilevel data of a set that comprises students from several classrooms. We simulated the data using a 2PL model, however one might expect some students to guess, in which case a 3PL model might be more appropriate. We did not simulate missing responses. The test consisted of two different types of math questions, yet we simulated a unidimensional scale. However, if anything, these differences probably led to an even larger loss of power, leaving intact our conclusion that the DIF tests in our sample were probably underpowered.

5.5 General discussion

In this chapter, we argued that stereotype threat research could benefit from item-level analyses. Based on a review of the literature we saw that item-level analyses of the ability tests in stereotype threat studies are typically neglected, which is unfortunate for two reasons. First, item-level analyses are essential to assess the quality of a mathematical test. Based on our re-analyses of 13 data sets in Study 1, we conclude there is much left desired when it comes to the psychometric qualities of math tests used in stereotype threat research. We found clear patterns of speededness, negative item-rest correlations, unstable reliability coefficients, and extremely difficult or easy items and tests. The association between stereotype threat effects and missingness due to speededness of the tests clearly warrants more attention in future studies. Second, the field of psychometrics can provide social psychologists with tools to study hypotheses beyond the standard comparisons of means or correlations with sum scores. The relationship between stereotype threat effects on the item level and item difficulty can be studied in the framework of measurement invariance. With CTT we can eyeball patterns of (lack of) measurement invariance by means of PP-plots of items, under strong assumptions. A more sophisticated approach is to formally test DIF using IRT models. In this chapter, we illustrated how a 2PL IRT DIF model can be used to test stereotype threat hypotheses by means of a data set of Dutch high school students. Unfortunately we could not find patterns of DIF in our stereotype threat data set.

We limited ourselves in the first study to data sets that were available to us, and so our set of studies obviously represents a convenience sample. As such, we cannot make justified generalizations concerning the psychometric quality of ability tests for all stereotype threat research. However, we did study how often researchers reported reliability coefficients or item-level statistics in a large share of the published stereotype threat studies (as described in detail in Appendix E). As such, we do feel confident to conclude that either authors do not know much about the psychometric qualities of the ability tests, or find them insufficiently relevant to discuss in their papers. We hope this will change in the future, because (the lack of) stereotype threat effects can be caused by very different patterns of item-level statistics. Again, we limited ourselves by focusing on average stereotype threat effects, ignoring moderators included in the original study (e.g., domain identification in Keller, 2007). We ultimately made this decision because we wanted to treat data sets homogenously, we did not want to cut sample sizes even more by selecting subsets of students, and stereotype threat effects are frequently found without inclusion of moderators in samples of both college students (e.g., Spencer, Steele & Quinn, 1999) and school students (e.g., Keller & Dauenheimer, 2003).

In this chapter we also limited our discussion and analyses to unidimensional IRT models with continuous latent ability distributions. However, when it comes to stereotype threat a multitude of IRT models are available and potentially useful. We can extend the model to include missingness, for instance by means of an IRTree model (DeBeer, Janssen, & De Boeck, 2017). This is especially interesting for large data sets with a considerate amount of missing responses. There are other models that can deal with nonignorable missingness, for instance a multidimensional IRT model that deals with missing data at the end of the test (Pimentel & Glas, 2008), or a multidimensional IRT model that deals with skipped items (Holman & Glas, 2005). Alternatives to deal with speededness of tests are mixture models (Bolt, Cohen, & Wollack, 2002), which could be useful in modeling stereotype threat data as well. Another promising approach is the use of explanatory IRT models, like random-weights differential item functioning models and (random-weights) differential facet functioning models (Meulders & Xie, 2004). With these models, item properties can be taken into account and DIF (e.g., types of item) or differential facet functioning-interaction effects can be made random over students to allow for individual differences. Additionally, person covariates like math anxiety or domain identification could be added to these models to study specific hypotheses regarding the moderation of stereotype threat effects by certain individual differences implicated in stereotype threat theory. Another approach for stereotype threat data sets that include a measure of (state) test anxiety is a bi-factor model that models a math factor and an anxiety factor, in which the anxiety factor loads on the math items as well.

We tried out several aforementioned models (e.g., the IRTree model, the explanatory IRT model with person predictors such as domain identification and math test anxiety included, the bi-factor model), but the models either did not fit well or did not provide information that went beyond information drawn from the standard IRT DIF model. Unfortunately, whereas the use of IRT models opens up a whole new range of opportunities for analyzing stereotype threat effects, at present we find ourselves not quite ready to use the advanced statistical techniques given the characteristics of current stereotype threat data sets. As relatively simple stereotype threat DIF models already require larger groups than are currently available in stereotype threat research, more complex models will require even larger sample sizes. High quality, large scale studies across multiple labs employing math tests with realistic and desirable item properties would provide a good starting point for more advanced modeling of stereotype threat effects.

On a more substantive note, it might be interesting to start a discussion on the psychometric properties of math and spatial skill tests in stereotype threat research, and their implication for the generalizability of stereotype threat effects. For instance, if we only find stereotype threat effects in studies that are carried out under extreme circumstances (e.g., very difficult tests, a high amount of time pressure) we might start to wonder how often girls and women are affected by stereotype threat in real life test settings for which such circumstances are unlikely to occur. Of course, the fact that stereotype threat occurs under these stringent conditions is theoretically very interesting, but if we do not actually test students under these stringent conditions we cannot be sure whether it is an actual societal problem that justifies changes in policy (Logel et al., 2012; Walton et al., 2013). For instance, in the standards of Educational and Psychological Testing it is advised to avoid speeded tests unless speededness is crucial for the construct of interest (American Educational Research Association et al., 2014). Moreover, if we do test students under these stringent conditions, the problem can potentially be solved in a straightforward way, for instance by allotting more time on high stakes tests to reduce situational pressure caused by stereotype threat. We suggest to implement policy only when we consistently find stereotype threat effects on realistic tests that convincingly mimic real life situations.


Chapter 6

Discussion

6.1 Discussion

The goal of this PhD project was to find answers to several vexing issues concerning Stereotype Threat (ST) and Differential Item Functioning (DIF), such as "Do gender ST effects occur among Dutch high school students?" and "Do ST data sets show predictable patterns of DIF?". Our ambition was that such answers would provide us with useful knowledge on the generalizability of the ST effect to real-life testing and would aid in improving high stakes testing situations for girls.

Unfortunately, the results of the four studies reported in this dissertation offer more questions than answers. In this section, I will reflect on the challenges I faced during the projects and discuss the answers that I can give on these and related questions. I will conclude with a short discussion of possible future studies in ST research.

6.2 Mixed results and publication bias in stereotype threat

In Chapter 2, we studied ST effects among school aged girls in the published literature by means of a meta-analysis. We included studies that tested the effect of ST on Math, Science and Spatial Skills (MSSS) tests of schoolgirls in experiments in various countries. Averaged over all studies we saw a small ST effect, ĝ = -0.22. Obviously, this average effect size estimate is based on a diverse set of studies, with varying conditions, using different materials, and consisting of students from various backgrounds. As a result, we observed a large amount of heterogeneity in effect sizes.

Based on theoretical rationales, we included four potential moderating variables that we tested in a confirmatory way (based on pre-registration), namely test difficulty, presence of boys during testing, the type of control group, and an index of cross-cultural gender equivalence across the different samples used to study the ST effect. We found that none of these key moderator variables explained a significant amount of variance in effect sizes in our meta-analysis. Adding other moderators (e.g., age and type of manipulation) as part of the exploratory analyses did not explain much of the variance in study outcomes either. It is difficult to reach definite conclusions on why we failed to find significant moderator effects. One of the explanations is that we had little power to find effects. A lack of power might be caused by restriction of range of the moderating variables. For instance, if many authors decided to adopt a difficult test, there would be little variance to gauge the relationship between test difficulty and effect size. Another explanation for the lack of significant predictors of the ST effect could be that the presence of publication bias obscured potential moderating effects. Publication bias could have complex effects on heterogeneity

(Augusteijn, van Aert, & van Assen, 2017). The only significant predictor of effect sizes was the study's precision (as measured by the standard error of the effect size), with less precise studies showing larger effect sizes. This small study effect is often considered a sign of publication bias (e.g., Sterne & Egger, 2005), particularly when no other substantive or methodological characteristics explain it. Other publication bias tests, with the exception of p-curve, pointed to the presence of publication bias as well. We concluded that publication bias is likely to have occurred in ST research, as it has in many other research lines. Publication bias creates uncertainty regarding the size of the ST effect and the moderation thereof by particular important design features, types of tests, settings and samples. Other authors also drew attention to the fact that publication bias might have distorted our view of ST (Ganley et al., 2013; Stoet & Geary, 2012; Zigerell, 2017), a bias which might be amplified by the sensitive nature of the topic. Personal communications with several ST researchers highlighted common shared worries that it is difficult to publish non-significant ST studies. For this reason, we initiated two projects for which publication is guaranteed regardless of outcomes: the first is the registered report in Chapter 3, and another project is a registered replication report that we will carry out in the following years.
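To give a concrete flavor of such a small-study test, the sketch below regresses effect sizes on their standard errors, weighted by inverse variance, in the spirit of Egger-type regression (Sterne & Egger, 2005). The effect sizes and standard errors are hypothetical placeholders; this is an illustration, not the meta-analytic code or data of Chapter 2.

```python
# Minimal sketch of an Egger-type small-study regression: regress observed
# effect sizes on their standard errors, weighting by inverse variance.
# The toy effect sizes and standard errors below are hypothetical.
import numpy as np
import statsmodels.api as sm

g = np.array([-0.45, -0.30, -0.05, -0.60, 0.02, -0.25])  # Hedges' g per study
se = np.array([0.25, 0.20, 0.08, 0.30, 0.10, 0.15])      # standard errors

fit = sm.WLS(g, sm.add_constant(se), weights=1 / se**2).fit()

# A significant slope for the SE predictor indicates that less precise studies
# report systematically larger effects, a possible sign of publication bias.
print(fit.params, fit.pvalues)
```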

Since we conducted our meta-analysis reported in Chapter 2, new studies have been carried out in Uganda (Picho & Schmader, 2017) and the UK (Davies, Conner, Sedikes, & Hutter, 2016). Our own study described in Chapter 3 would fit the inclusion criteria of the meta-analysis as well. In the coming years, we might be able to update our meta-analysis with an analysis that will have sufficient power for moderating effects to emerge, if indeed these factors are important in determining the relative size of ST effects. For now, the most we can say is that ST experiments among schoolgirls show mixed results, with some authors finding large ST effects whereas others found no ST effects at all. This pattern also appeared in other meta-analyses on the ST effect among college, university and school students (Doyle & Voyer, 2016; Nguyen & Ryan, 2008; Picho et al., 2013). Future meta-analyses might also benefit from novel developments in correcting effect size estimates for publication bias (van Assen et al., 2014), although it is currently unclear how well these methods correct for particular effects due to researchers' opportunistic use of degrees of freedom in the analysis of data from ST experiments (Van Aert et al., 2016).

The overall small sample sizes in ST research are cause for concern. From the articles included in our meta-analysis, we learned that few authors reported power analyses before data collection, and almost no authors considered the clustered nature of the data that arises when studying school children. This ultimately leads to underpowered studies that aggravate problems related to publication bias and other QRPs in the analyses of data (e.g., Bakker et al., 2012). Also, none of the studies in the meta-analysis was pre-registered, which means we do not know whether forms of p-hacking or HARKing occurred. This finding inspired us to take pre-registration, power, and a multilevel structure as essentials in our own study of high school girls in Chapter 3.

6.3 Can we replicate effects of stereotype threat?

We planned two large scale pre-registered studies in an attempt to find the ST effect as it has been found in previous (non pre-registered) studies. In Chapter 3 we described our ST study in a population of Dutch high school students. All the hypotheses and statistical analyses were pre-registered using the format of a registered report (Chambers, 2013), in which peer review of the method section takes place before the study is carried out. In total we studied over 2,000 students in 21 high schools. Moreover, we selected high-performing students (i.e., students from the two highest education levels in the Netherlands), we used a stereotype threat condition and a control condition that have been successfully implemented in past stereotype threat experiments, we used a math test similar to past stereotype threat research, and the manipulation check rendered our manipulation successful. As such, we are confident that we did everything in our power to find a ST effect, within the limits of our pre-registered decisions. We did not, however, find a significant stereotype threat effect in our sample of Dutch high school students. We checked whether individual differences implicated in ST theory indeed heighten susceptibility to ST (i.e., domain identification, gender identification and math test anxiety), but these variables did not highlight any specific effects for test-takers that should be most vulnerable to ST. Overall, using Bayesian analysis we found strong evidence for the absence of a ST effect on math performance in our sample of Dutch high school students in the age range 13-14.

The second replication effort we initiated, which entails a registered replication of a well-known study (Johns et al., 2005), could unfortunately not be part of this dissertation because it exceeded available time and resources.

Registered Replication Reports (RRR; Simons, Holcombe, & Spellman, 2014) require researchers to write a detailed protocol for a replication study, which is peer reviewed before the replication study can take place. Once the research proposal has been accepted by the editors and vetted by the original researchers, the protocol will be placed online, the proposers will carry out the replication study, and other labs are encouraged to carry out the (same) replication study as well. In our proposed registered replication study, male and female students will be assigned to a ST condition, a control condition and a teaching condition. In the original study, women in the ST condition underperformed on a test of GRE items, whereas women in the teaching condition performed as well as women in the control condition. Recently, NWO started funding replication efforts, and we are glad to have obtained funding to perform this high-powered RRR in Tilburg. Other labs in the Netherlands and other countries have already expressed an interest in joining the effort. This RRR should shed more light on the generalizability of stereotype threat effects across cultures, colleges and universities.

Overall, the currently available pre-registered studies on ST research portray a bleak picture. Two direct replications (Gibson et al., 2014; Moon & Roeder, 2014) of a well-known ST priming study (M. Shih, Pittinsky, & Ambady, 1999), in which Asian women in three conditions were studied (a threat condition and two non-threatening conditions), failed to show convincing evidence of underperformance due to ST: one study was unable to replicate the original study at all (Moon & Roeder, 2014), whereas the other paper was able to replicate the effect among college students (Gibson et al., 2014). The effects in this replication were small, however, and they were only statistically significant when students were removed that were unfamiliar with the stereotype, and only when accuracy scoring was used (but not for other ways of scoring performance). Only the comparison with one of the non-threatening conditions showed a significant ST effect; the comparison with the other non-threatening condition did not show a ST effect. Finally, in a large replication study (N = 590) using a female Mechanical Turk sample, effects of ST were not replicated either (Finnigan & Corker, 2016).

To my knowledge, pre-registered ST studies are limited to these four studies. As (rigorous) pre-registration restricts researcher degrees of freedom (e.g., no opportunities to cherry pick, no flexibility in statistical analyses; Wicherts et al., 2016) and guarantees both a confirmatory approach (i.e., no opportunities to HARK; Wagenmakers et al., 2012) and eventual publication (or at least inclusion in future reviews or meta-analyses), such pre-registered ST studies give the most convincing and the least biased evidence. Based on these four studies, one is hesitant to attach too much meaning to previous small sample ST experiments that lacked control of the many biases caused by publication bias and researcher's degrees of freedom in the analysis. Obviously, these four pre-registered studies only studied particular populations (e.g., Dutch high school students, Mechanical Turk workers, US college students) using particular methods (e.g., ST priming, explicit ST manipulations) and are not representative of the larger body of work in the ST literature assembled over the years. However, the meagre results obtained from these pre-registered ST studies do raise a red flag, and will hopefully push more ST researchers to adopt pre-registration in future studies, to get a more proper overview of the replicability of ST research.

Even though we focused a lot on the role of power of the statistical tests in ST studies, another solution would be to adopt Bayesian analyses as a regular practice. With a Bayesian approach, low power and optional stopping are less of an issue. In Chapter 3 we supplemented our use of Null Hypothesis Significance Testing (NHST) with hypothesis tests based on approximated adjusted fractional Bayes factors. Given that most ST authors use NHST, we considered our focus on power to be justified. However, because user friendly Bayesian software is now readily available (e.g., the software program JASP; Love et al., 2015; Wagenmakers, Love, et al., 2016), it might be interesting to supplement or substitute NHST in ST research with Bayesian analyses. Although such analyses might not be entirely immune to biases, they would hopefully lead researchers away from an unhealthy habit of dichotomous decisions that no doubt played a role in the development of QRPs and publication bias. Such dichotomous decisions on effects also play a role in assessments of Differential Item Functioning (DIF), which we studied in Chapter 4.

6.4 What do we learn from empirical DIF studies?

In Chapter 4, we studied a large share of the methodological DIF literature, and systematically reviewed characteristics of 200 empirical DIF studies. It is well known that the most popular DIF methods require large samples. This is true of unobserved score methods (e.g., IRT-LR-DIF, MIMIC modeling, MG-CFA and DFIT), as well as observed score methods (e.g., the Mantel-Haenszel test, logistic regression, SIBTEST). Under the most favorable circumstances, minimum group size requirements range from 250 to 500 subjects per group. In the systematic review of 200 DIF studies from the literature we saw that sample sizes in most DIF applications were sufficiently large, and consequently the power of most DIF analyses seems to be sufficient to find a medium or large DIF effect.

For several aspects of DIF methods and results, reporting in the DIF literature was suboptimal. Specifically, clear descriptions of the flagging rule and statistical results were often not given in DIF articles. Authors frequently failed to report mean group differences (not reported in 58.0% of the articles), and sometimes failed to report the kind of statistical test that was used (not reported in 6.7% of the tests). In 18% of the articles neither results of significance tests nor effect sizes were reported for any of the items. Unfortunately, a large share of articles reporting DIF analyses failed to report any DIF effect sizes at all (46%). Only in 27% of the DIF articles were indicators of statistical significance and effect sizes reported for all items. For researchers using the scales in practice, it is highly relevant to know the severity of DIF. Notably, Millsap and Kwok (2004) studied the effects of violations of measurement invariance on the quality of decisions made in a selection context based on the scale. However, if DIF studies often fail to report the DIF effect size, this hinders such analyses of the practical impact of DIF on the validity of assessments or decisions made on the basis of scales.

Given the relevance of such information for users of the scales, follow-up studies of DIF on the same scales, and reviews of the DIF of the scales, we encourage researchers to report DIF analyses more extensively, either in the article or as online supplements at the publisher's website or in data repositories. Knowing the severity of DIF among items in a scale is also important for understanding why DIF occurs, which could help improve future scale development.

We found that most researchers base DIF flagging rules only on significance tests (54.4% of the DIF articles). Given this focus on NHST, the amount of power to find DIF in those studies is extremely important. However, power of DIF tests was rarely studied or mentioned in our sample of DIF articles. This aligns with the common lack of detailed reporting of DIF analyses and results. In power analysis for DIF tests there are many characteristics that authors need to take into consideration. A lack of DIF effect sizes and other information of relevance to group comparisons (e.g., reporting means and standard deviations for ability in both groups) will impede power analysis for future DIF analyses. Another interesting finding from our review of the DIF literature was that, in the subset of articles that used multiple statistical methods to test for DIF, different methods flagged a different number of DIF items in most of the cases. This implies that not every method is equally sensitive to DIF, and that the choice of method could influence the number of DIF items found. This aligns with previous findings showing that not all DIF methods produce the same results (Borsboom, 2006) and that not all DIF software packages produce the same results (Ong et al., 2015). Often DIF is considered to be a discrete characteristic of items, with items showing DIF or not, and deleted from scales accordingly. However, any finding of DIF is conditional on a host of factors, including the type of group comparison, various specifics of the analyses (including the method and criteria used), and sampling error. Therefore, it might be more realistic to consider the gradual nature of DIF (De Boeck, 2008), as DIF effects come in all different sizes. This gradual nature might be captured better in random item IRT models (De Boeck, 2008). Preferably, test developers and researchers should take this gradual nature into consideration in the future. Reporting DIF effect sizes for all studies (and items) would be a good start. Readers can then decide for themselves which DIF effects they consider to be sizeable and which DIF effects they consider negligible.

6.5 Can we study DIF in stereotype threat research?

In Chapter 5, we studied whether stereotype threat causes DIF in math tests among female students. It turned out to be difficult to study DIF in ST data sets in the systematic manner we originally envisioned. Most datasets from previous ST experiments were simply too small for well-known DIF tests. Of all the datasets that we were able to gather from ST colleagues, none had groups larger than 250-500 students, which is the minimum for most well-known DIF methods (Finch, 2005; French & Maller, 2007; Guilera et al., 2013; Khalid & Glas, 2014; Penfield, 2003; Rogers & Swaminathan, 1993; Woods, 2009b). When we tried to model DIF for the somewhat larger ST datasets (Cherney & Campbell, 2011; Delgado & Prieto, 2008) as an exploratory exercise, the DIF model did not fit properly (e.g., estimated parameters had large standard errors, estimated parameters were extremely large). Virtually none of the studies mentioned in published meta-analyses (Doyle & Voyer, 2016; Flore & Wicherts, 2015; Nguyen & Ryan, 2008; Picho et al., 2013) had sample sizes that would suffice for powerful DIF analyses, with the exception of two studies (Smeding et al., 2013; Stricker & Ward, 2004), each with around 300 students per group. Unfortunately, we did not have access to the data from these ST studies. Therefore, we decided to carry out formal DIF tests only on our own dataset from the registered report among Dutch high school students, while considering the other datasets in a straightforward graphical way by means of PP-plots based on statistics from classical test theory.

In Study 1 of Chapter 5, we first considered missing responses, item-level statistics, reliability estimates and patterns of effects on individual items in 13 datasets drawn from ten stereotype threat experiments. We saw that missing responses were highly frequent in some studies, and occurred mostly at the end of tests. This indicates that many of the math and spatial ability tests used in these stereotype threat experiments were speeded tests. For most of the studies with considerable average stereotype threat effects, the pattern of missing responses at the end of the tests suggested that part of the stereotype threat effect was caused by differences between the experimental groups in how fast female test takers worked through the items. This might indicate that stereotype threat leads to slower responding, which will most strongly affect test performance on speeded tests. In the relatively small data sets reanalyzed in Study 1 using classical p-values, we failed to find very clear patterns of DIF due to stereotype threat. In Study 2 of Chapter 5, we tested DIF formally using an IRT model on the data from our large-scale stereotype threat experiment reported in Chapter 3. In line with the lack of a mean effect in these data, we failed to find DIF with respect to experimental conditions among the female test takers in our own dataset. A power simulation study showed that power to find DIF in difficult and easy items was low in our study, notwithstanding the fact that it represents the largest stereotype threat experiment among school girls that we are aware of.
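To give a concrete impression of such a power simulation, the sketch below simulates 2PL responses for two conditions, plants uniform DIF of a chosen size in one item, and estimates the power of an observed-score logistic regression DIF test. All design values (group sizes, item parameters, DIF size, number of replications) are hypothetical choices for illustration, not the settings of our own simulation study.

```python
# Rough sketch of a DIF power simulation: simulate 2PL responses for two
# equal groups, add uniform DIF of size `dif` to item 0, and test that item
# with a likelihood-ratio logistic regression DIF test. All values hypothetical.
import numpy as np
from scipy import stats
import statsmodels.api as sm

rng = np.random.default_rng(1)
n_per_group, n_items, dif, n_reps = 250, 20, 0.4, 500
a = rng.uniform(0.8, 2.0, n_items)           # discriminations
b = rng.normal(0.0, 1.0, n_items)            # difficulties

def one_replication():
    theta = rng.normal(0, 1, 2 * n_per_group)
    group = np.repeat([0, 1], n_per_group)   # 1 = focal condition
    b_mat = np.tile(b, (2 * n_per_group, 1))
    b_mat[group == 1, 0] += dif              # uniform DIF in item 0
    p = 1 / (1 + np.exp(-a * (theta[:, None] - b_mat)))
    x = rng.binomial(1, p)
    total = x.sum(axis=1)                    # observed-score matching variable
    m0 = sm.Logit(x[:, 0], sm.add_constant(np.column_stack([total]))).fit(disp=0)
    m1 = sm.Logit(x[:, 0], sm.add_constant(np.column_stack([total, group]))).fit(disp=0)
    g2 = 2 * (m1.llf - m0.llf)               # LR test for uniform DIF, df = 1
    return stats.chi2.sf(g2, df=1) < 0.05

power = np.mean([one_replication() for _ in range(n_reps)])
print(f"estimated power: {power:.2f}")
```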

Although it is interesting to study patterns of DIF in ST data in a pre-registered manner, DIF testing in ST research proved challenging. We found it difficult to anticipate all elements needed for power analysis and pre-registration prior to data collection of our ST high school study, which meant we did not base the stopping rule of data collection on a DIF simulation study.25 In the end we decided to use an exploratory approach to illustrate DIF in ST research. It turned out, though, that even our large ST study was underpowered when we modeled DIF. If we would like to create DIF models in the future, this would only be realistic by means of large scale collaboration projects featuring multiple labs and yielding large data sets based on a common design and large Ns. For instance, a large data collection effort such as a study performed by many labs, or a (registered) replication report involving multiple universities and colleges, might provide us with enough data to study DIF. Another alternative would be pooling several independent ST studies using identical math tests. One of the problems of these solutions is that the same tests will probably not be equally difficult for the students at the different sites. If tests are too easy, ST effects are not expected to occur. If tests are too difficult, floor effects might obscure ST effects. Moreover, DIF over sites could occur for another reason, for instance when items are included that are not part of the curriculum in certain (but not all) countries, or if items are easier for students in particular cultures. These are certainly challenges for future international collaborative ST projects. One of the goals at the start of this project was to see whether we could build more extensive models in which DIF effects could be modeled alongside item covariates and/or person covariates.

25. Frankly, collecting a sample size larger than 2,162 students would not have been possible due to restrictions of time and resources.

Theoretically, it is tempting to think that DIF in ST data could be a function of person variables like domain identification, test anxiety or gender identification, and of item variables like item difficulty, item content, and item type. Explanatory IRT models can be used to test whether DIF could be explained by any of these person or item variables (De Boeck & Wilson, 2004). This type of modeling might be feasible in the future if large collaboration projects are carried out.
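As an illustration of what such an explanatory model could look like, one possibility is a Rasch-type model in which the DIF effect on a studied item is itself regressed on a person covariate. The notation below is our own sketch, not a model fitted in this dissertation:

$$\operatorname{logit} P(X_{pi} = 1) = \theta_p - b_i + (\gamma_0 + \gamma_1 A_p)\, g_p\, d_i,$$

where $A_p$ is a person covariate such as math anxiety, $g_p$ indicates group (e.g., the threat condition), $d_i$ flags the studied item(s), and $\gamma_1$ captures how the DIF effect varies with the covariate.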

However, sample size is not the only issue that needs to be handled. We saw that most tests used in ST studies appear speeded, which can influence both classical test theory statistics and IRT parameter estimates. Some ST researchers might argue that time pressure is a necessary requirement for ST effects to occur. In that case, it would be interesting to model missing data by means of IRTree models (DeBeer et al., n.d.) or other psychometric models that take missing responses or speededness into account (Holman & Glas, 2005; Pimentel & Glas, 2008). Again, really large sample sizes are needed to model missingness, as well as a considerable occurrence of missing responses. Currently we lack sufficiently large datasets to develop such computationally demanding models. However, it might be useful to compare the degree of speededness in ST studies to the degree of speededness in high stakes testing. If ST indeed only occurs under strong time pressure, a simple solution to negate the effects of stereotype threat would be to allot ample time to complete high stakes tests. When time pressure is not part of the construct under investigation, and there appears to be little reason why solving speed should be an important factor in math proficiency, researchers should avoid speeded tests (American Educational Research Association et al., 2014).

Notwithstanding the challenges we faced in using advanced psychometric analyses of stereotype threat effects, we made a start in stressing the importance of psychometric analyses in ST data. The lack of reporting on psychometric qualities of ST tests in the published literature (e.g., as shown in the literature review of stereotype threat experiments included in Appendix E, a reliability coefficient is reported in less than 10% of the included studies), and the varying characteristics of the 13 datasets we studied in Chapter 5, attest to a widespread indifference among stereotype threat researchers regarding the psychometric qualities of tests used to uncover stereotype threat effects. Although the large-scale data collection efforts needed for advanced psychometric modelling of stereotype threat effects require extensive collaboration, there can be little doubt that our understanding of a complex phenomenon like ST, which features differences in effect depending on items and persons, could eventually benefit from such psychometric modelling.

6.6 Future of stereotype threat research

The studies in this dissertation raise questions about whether, how, and when stereotype threat affects test and item performance. On the whole, I am not confident that ST has practical significance in real life testing, nor am I confident to declare that ST does not influence women's test scores. I feel that, given the current state of ST research (e.g., the use of small sample sizes, highly restricted samples, extremely difficult and short tests, improper use of covariates) and our current results, we should be cautious to turn ST findings into policy. Some rigorous confirmatory (replication) studies, with the dependent variable based on tests that closely mirror high stakes tests, would provide social scientists with more confidence to give well-founded policy advice. With me, there are several researchers critical of the robustness of ST effects, and this group urges for some changes in the field (Finnigan & Corker, 2016; Stoet & Geary, 2012; Zigerell, 2017). In my humble opinion, future ST studies would greatly benefit if the following practices were routinely implemented whenever possible.

First, more ST researchers should pre-register their studies (Zigerell, 2017). Pre-registration is straightforward and feasible in a field with such clear theories and an abundance of earlier results (S. J. Spencer et al., 2016). Researchers can simply post hypotheses, sampling plans, and detailed statistical analyses on the Open Science Framework and time-stamp the document before collecting the data. If a researcher would like to run additional, non-registered analyses, this is always possible as long as the researcher reports the extra analyses as exploratory, recognizes that it would be premature to attach a lot of weight to their outcomes (Wagenmakers et al., 2012), and eventually seeks replication of found patterns in fresh samples. Second, it would be great if ST researchers would work towards more large-scale collaborations. Several forms of collaboration are possible, like adversarial collaboration or many labs studies. Hopefully devoted ST researchers will initiate some of these types of rigorous replication studies, as they are often of high quality thanks to extensive preparation, shared knowledge and decent sample sizes. Third, authors could easily increase the information value of their ST studies by reporting more extensively on the psychometric characteristics of the tests they use. Not all authors can be expected to study DIF, but more transparent practices and data sharing could greatly contribute to our understanding of ST effects and the moderation thereof by individual differences and item characteristics. Overall, it is informative to see some psychometric characteristics of the tests reported: on what type of items the largest effects appear (e.g., easy or difficult items, items placed at the beginning or the end of the test), whether missing responses added to the ST effect, how difficult the items are, and how well items discriminate between those with higher or lower sum scores. Reporting item-level information is not only good practice, but it can sometimes explain the presence or absence of ST effects, particularly when tested formally in an IRT context. As a new item on the list of characteristics of replicable and reproducible ST research, we would like to add psychometric characteristics of the dependent variable. There can be little doubt also about the benefits of sharing the data of stereotype threat experiments for future re-analyses and better informed choices of the scales used in stereotype threat experiments.

Stereotype threat is without a doubt a highly relevant, but complex, subject to study. It deserves more extensive research with rigorous and bias-free studies, which allow the use of modern psychometric techniques. With large sample sizes, high quality tests and these advanced psychometric techniques, we will hopefully be able to fully grasp under which circumstances stereotype threat effects arise. This knowledge can then be used to create more confident, empirically supported policy advice, and eventually enhance fairness in high stakes tests for female students in STEM fields and other educational contexts.

Addendum

Appendix A
Appendix B
Appendix C
Appendix D
Appendix E
Appendix F
References
Dankwoord

Appendix A: Final model, psychometric analyses and exploratory analyses - chapter three

Graded response model for scales

We fitted a unidimensional graded response model to the data for the three scales math anxiety, domain identification, and gender identification with Flexmirt 3.51, reporting the RMSEA based on limited information statistics. For the scale math anxiety (10 items), the unidimensional graded response model (RMSEA = .03, AIC = 37343.12, BIC = 37624.81) fitted the data better than the null model (RMSEA = .16, AIC = 47037.93, BIC = 47263.28). Subsequently, factor loadings were requested, and 9 out of 10 items had factor loadings larger than 0.80. For the item "My hands sweat when I am taking a math exam" the factor loading equaled 0.66. The marginal reliability estimate of the math anxiety scale based on the IRT model was .87. For the scale domain identification (12 items), the unidimensional graded response model (RMSEA = .06, AIC = 66490.74, BIC = 66828.77) fitted the data better than the null model (RMSEA = .13, AIC = 73222.50, BIC = 73492.92). Factor loadings were all larger than 0.40. The marginal reliability estimate of the domain identification scale based on the IRT model was .88. Overall, we were pleased with the psychometric qualities of both scales.

However, exploratory factor analysis gave some interesting insights as well. When fitting a model with a two factor solution to the math anxiety scale (CF-Quartimax estimation, oblique rotation), items with high (rotated) loadings on one factor were concerned with bodily reactions to math anxiety (e.g., "When taking a math test I feel like I am going to cry", "When I start a math exam, my heart beats really fast" and "Before taking a math exam I feel nausea"), and items with high loadings on the other factor were concerned with cognitive worries about mathematics tests (e.g., "Math tests scare me", "Math tests make me feel insecure", and "The day before a math exam I think that everything is going to be wrong"). The two factors are highly correlated, r = .81. For the scale domain identification, a two factor solution showed the two distinct scales that we included, with the seven items of the first scale showing high loadings on one factor (liking and self-efficacy), and the five other items showing high loadings on the other factor (importance of math for the student's future). Two items of the first scale also loaded highly on the second factor ("I would like to do more math in school" and "I like studying mathematics"). The two factors are correlated, but not very strongly, r = .44. For this reason, we conduct a sensitivity analysis in the exploratory part of this paper, where we split the domain identification variable up into the two separate scales.

For the scale gender identification (4 items) the psychometric qualities were not as good. The RMSEA was high (unidimensional model, RMSEA = 0.13; null model, RMSEA = 0.13); however, the fit indices did prefer the unidimensional model (AIC = 22809.14, BIC = 22921.81) over the null model (AIC = 23595.62, BIC = 23685.77). Factor loadings were higher than 0.40, except for one item with a factor loading of 0.27 ("Being a boy/girl is an important reflection of who I am"). The marginal reliability estimate of the gender identification scale based on the IRT model was .67. Considering the length of this scale, we did not attempt exploratory factor analysis with two or more factors. As mentioned before, we will be careful with the interpretation of this scale.
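For reference, the graded response model (Samejima, 1969) used for these scales specifies, for an item $i$ with ordered categories $k = 0, \dots, K$:

$$P(X_i \geq k \mid \theta) = \frac{1}{1 + \exp[-a_i(\theta - b_{ik})]}, \qquad k = 1, \dots, K,$$

with $P(X_i \geq 0 \mid \theta) = 1$, discrimination $a_i$ and ordered category thresholds $b_{ik}$; the category probabilities follow by differencing, $P(X_i = k \mid \theta) = P(X_i \geq k \mid \theta) - P(X_i \geq k + 1 \mid \theta)$, where $P(X_i \geq K + 1 \mid \theta) = 0$.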

Final model

To create a final model we use math anxiety, domain identification and gender as predictor variables. To obtain the final model we included math anxiety and domain identification (Model 1), gender (Model 2), the two-way interactions gender x math anxiety, gender x domain identification and math anxiety x domain identification (Model 3), and finally a three-way interaction between the three predictors (Model 4). Model 1 predicted significantly better than the null model (χ² = 210.53, p < .001), Model 2 outperformed Model 1 (χ² = 60.33, p < .001), and Model 3 outperformed Model 2 (χ² = 6.75, p = .034). Model 4 did not predict better than Model 3. We report the regression coefficients for Model 3 in Table B4. In Model 3 we see interaction effects of gender and domain identification, and of math anxiety and domain identification. The positive effect of domain identification on math performance is stronger for girls than for boys. The positive effect of domain identification on math performance is strongest for students who have low scores on math anxiety (e.g., -1 SD), and least strong for students who have high scores on math anxiety (e.g., +1 SD).
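A minimal sketch of this kind of model-building sequence in Python (statsmodels) is given below. The data frame, variable names and likelihood-ratio comparisons are hypothetical illustrations with synthetic data, not the original analysis code.

```python
# Sketch of a nested multilevel model comparison with ML (not REML) estimation.
# All column names and the synthetic data are hypothetical placeholders.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "math": rng.normal(10, 3, n),          # math test score
    "gender": rng.integers(0, 2, n),       # 0 = boy, 1 = girl
    "anxiety": rng.normal(0, 1, n),        # math anxiety (centered)
    "domain_id": rng.normal(0, 1, n),      # domain identification (centered)
    "class_id": rng.integers(0, 20, n),    # class as level-two grouping
})

m1 = smf.mixedlm("math ~ anxiety + domain_id", df,
                 groups=df["class_id"]).fit(reml=False)
m2 = smf.mixedlm("math ~ anxiety + domain_id + gender", df,
                 groups=df["class_id"]).fit(reml=False)
m3 = smf.mixedlm("math ~ gender*anxiety + gender*domain_id + anxiety*domain_id",
                 df, groups=df["class_id"]).fit(reml=False)

# Likelihood-ratio tests between nested models
for small, big, df_diff in [(m1, m2, 1), (m2, m3, 3)]:
    chi2 = 2 * (big.llf - small.llf)
    print(round(chi2, 2), stats.chi2.sf(chi2, df_diff))
```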

Check assumptions

To check whether our models do not violate the assumptions of the multilevel model, we checked the assumptions for Model 2 and Model 4 of the main analyses (as described in Snijders & Bosker, 2012). Additionally, we checked the assumptions for the final model that we fitted to the data. For Model 2 we did not find indications of heteroscedasticity, or deviations from normality, based on inspection of level-one OLS studentized residuals. For Model 4 we did not find deviations from normality, based on inspection of level-two residuals. Influence diagnostics of level-two units were overall small. For the final model, we did not find clear deviations from normality, or indications of heteroscedasticity. We added a quadratic effect of math anxiety and domain identification to the final model, both separately and together, and we failed to find significant quadratic effects of these variables.

Adding random effects for gender, stereotype threat condition and their interaction term did not result in better fitting models. Turning the main analysis into a three-level model (with individuals being the first level, classes being the second level and schools being the third level) did not influence the results. The school level explained less variance (intercept variance = 0.52) than the class level (intercept variance = 2.54) in the random intercept model without predictors.

Exploratory analyses

To ensure the findings of our main analyses were robust, we carried out several exploratory analyses to check how stable our results are. In this section we give a brief overview of the extra analyses we carried out.

We tested children in VWO (the highest educational level in Dutch high schools), HAVO (the second highest level) and HAVO/VWO mixed classes. Adding dummy variables for the level of the class (HAVO as reference group, a dummy for VWO, and a dummy variable for HAVO/VWO) showed a main effect of type of class. The estimated amount of level-two variance decreased considerably by including the level (from 3.04 to 1.02). Students in HAVO classes scored higher than students in VWO classes and students in HAVO/VWO classes. We did not find a significant stereotype threat effect controlling for the type of class, nor did we find significant three-way or lower order interaction effects between the type of class, experimental condition and gender. Additionally, we checked whether the type of class influenced the results (i.e., whether we carried out the experiment during math class or during another type of class). We found no effect of type of class, nor did we find a significant interaction effect with stereotype threat and gender. Removing classes that were not tested during math class did not alter the original results. Including presence of the teacher as a dummy variable did not alter the original results, nor did removing classes where the teacher was absent.

The domain identification scale and the math test both consist of two subsets of items. Splitting the domain identification scale into two separate scales, and re-analyzing the main analyses for both halves, does not alter the conclusions. Similarly, splitting the math test into a geometry subtest and a number subtest, and re-analyzing the main analyses for these subsets, does not alter the results. Re-analyzing the data with different scoring rules (i.e., with the number of questions correct or accuracy as dependent variables) did not alter the pattern of results. Analyzing all moderators and subsequent interaction effects in one model (instead of running three different models, as we did in the main analyses) did not alter the results. Removing students that did not take the test seriously from the data set (N = 10), as well as somewhat noisy classes (e.g., classes in which the students were somewhat noisy before or after testing, or classes in which distractions occurred, like music playing outside), did not have an impact on the results. Solely analyzing the subset of students high in math anxiety did not alter the results. Solely analyzing the subset of students high in domain identification did not alter the results. Solely analyzing the subset of students that believe the negative stereotype (i.e., that boys are better in math tests than girls) did not alter the results either. Because the process of data gathering took approximately 6 months, we investigated whether time influenced performance on the math test. We plotted the number correct on the math test against time, and fitted a loess line to see whether students scored better or worse over time. Visual inspection gave a vague indication of a linear effect of time on math performance. Adding a linear effect of time to the model did not result in a better model, or alter the conclusions.


Appendix B: Extra tables - chapter three

Table B1. Fit measures, deviance, unstandardized regression coefficients and variance components for models without moderators, split for easy and difficult items.

Easy items:

| Model | Fixed effect | Coefficient (S.E.) | t | Level-two variance | Level-one variance | Deviance (difference, df) |
|---|---|---|---|---|---|---|
| M0 | Intercept | 6.83 (0.10) | 66.63 | 0.71 | 4.54 | 9113.5 |
| M1 | Intercept | 7.20 (0.11) | 63.70 | 0.71 | 4.41 | 9054.2 (59.28*, 1) |
| | Gender | -0.75 (0.10) | -7.76 | | | |
| M2 | Intercept | 7.26 (0.12) | 59.31 | 0.71 | 4.40 | 9052.6 (1.60, 1) |
| | Gender | -0.75 (0.10) | -7.77 | | | |
| | ST | -0.12 (0.10) | -1.27 | | | |
| M3 | Intercept | 7.24 (0.13) | 55.19 | 0.71 | 4.40 | 9052.5 (0.12, 1) |
| | Gender | -0.72 (0.13) | -5.35 | | | |
| | ST | -0.09 (0.13) | -0.65 | | | |
| | STxGender | -0.06 (0.19) | -0.34 | | | |
| M4 | Intercept | 7.00 (0.36) | 19.56 | 0.70 | 4.40 | 9051.5 (0.93, 3) |
| | Gender | -0.73 (0.13) | -5.38 | | | |
| | ST | -0.09 (0.13) | -0.65 | | | |
| | STxGender | -0.06 (0.19) | -0.34 | | | |
| | Prop gender | 0.32 (0.63) | 0.51 | | | |
| | Gender teacher.d1 | 0.12 (0.21) | 0.57 | | | |
| | Gender teacher.d2 | 0.48 (0.71) | 0.70 | | | |

Difficult items:

| Model | Fixed effect | Coefficient (S.E.) | t | Level-two variance | Level-one variance | Deviance (difference, df) |
|---|---|---|---|---|---|---|
| M0 | Intercept | 3.14 (0.11) | 27.48 | 0.90 | 5.25 | 9418.8 |
| M1 | Intercept | 3.52 (0.13) | 28.00 | 0.91 | 5.11 | 9365.3 (53.5*, 1) |
| | Gender | -0.76 (0.10) | -7.36 | | | |
| M2 | Intercept | 3.50 (0.14) | 25.76 | 0.91 | 5.11 | 9365.0 (0.3, 1) |
| | Gender | -0.76 (0.10) | -7.36 | | | |
| | ST | 0.05 (0.10) | 0.52 | | | |
| M3 | Intercept | 3.55 (0.15) | 24.43 | 0.92 | 5.10 | 9364.0 (1.04, 1) |
| | Gender | -0.87 (0.14) | -6.00 | | | |
| | ST | -0.05 (0.14) | -0.36 | | | |
| | STxGender | 0.21 (0.20) | 1.02 | | | |
| M4 | Intercept | 3.10 (0.40) | 7.75 | 0.89 | 5.10 | 9362.0 (2.01, 3) |
| | Gender | -0.88 (0.15) | -6.06 | | | |
| | ST | -0.05 (0.14) | -0.36 | | | |
| | STxGender | 0.21 (0.20) | 1.03 | | | |
| | Prop gender | 0.58 (0.71) | 0.82 | | | |
| | Gender teacher.d1 | 0.26 (0.24) | 1.09 | | | |
| | Gender teacher.d2 | 0.47 (0.80) | 0.59 | | | |

Note. Gender is dummy coded with males being the reference group. ST is dummy coded with the control group being the reference group. Gender of the teacher is dummy coded, with male teachers being the reference group, dummy 1 for female teachers and dummy 2 for both male and female teachers. In the deviance column, the value between brackets is the difference between the previous model and the current model. Models fit with Maximum Likelihood estimation.

Table B2. Sensitivity analyses 1: analysis with students excluded who did not answer the read check or the manipulation check correctly.

General (without moderators):

| Model | χ² (df) | AIC | BIC |
|---|---|---|---|
| Model 0: random intercept | - | 8722.2 | 8738.3 |
| Model 1: moderator | - | - | - |
| Model 2: gender | 80.71 (1)* | 8643.5 | 8665.0 |
| Model 3: ST condition | 0.00 (1) | 8645.5 | 8672.4 |
| Model 4: STxGender | 0.63 (1) | 8646.9 | 8679.1 |
| Model 5: STxGenderxmod | - | - | - |
| Model 6: class-level predictors | 2.33 (3) | 8650.5 | 8698.9 |

Math anxiety:

| Model | χ² (df) | AIC | BIC |
|---|---|---|---|
| Model 0: random intercept | - | 8625.3 | 8641.4 |
| Model 1: moderator | 85.99 (1)* | 8541.3 | 8562.8 |
| Model 2: gender | 65.57 (1)* | 8477.8 | 8504.6 |
| Model 3: ST condition | 0.01 (1) | 8479.7 | 8511.9 |
| Model 4: STxGender | 1.24 (1) | 8480.5 | 8518.1 |
| Model 5: STxGenderxmod | 1.86 (3) | 8484.6 | 8538.3 |
| Model 6: class-level predictors | 2.73 (3) | 8487.9 | 8557.7 |

Domain identification:

| Model | χ² (df) | AIC | BIC |
|---|---|---|---|
| Model 0: random intercept | - | 8449.7 | 8465.7 |
| Model 1: moderator | 159.86 (1)* | 8291.8 | 8313.2 |
| Model 2: gender | 65.65 (1)* | 8228.1 | 8254.9 |
| Model 3: ST condition | 0.03 (1) | 8230.1 | 8262.2 |
| Model 4: STxGender | 1.02 (1) | 8231.1 | 8268.5 |
| Model 5: STxGenderxmod | 6.33 (3) | 8230.8 | 8284.2 |
| Model 6: class-level predictors | 2.23 (3) | 8234.5 | 8304.0 |

Gender identification:

| Model | χ² (df) | AIC | BIC |
|---|---|---|---|
| Model 0: random intercept | - | 8431.9 | 8447.9 |
| Model 1: moderator | 1.61 (1) | 8432.3 | 8453.6 |
| Model 2: gender | 76.46 (1)* | 8357.8 | 8384.5 |
| Model 3: ST condition | 0.11 (1) | 8359.7 | 8391.7 |
| Model 4: STxGender | 0.40 (1) | 8361.3 | 8398.7 |
| Model 5: STxGenderxmod | 3.32 (3) | 8364.0 | 8417.4 |
| Model 6: class-level predictors | 2.33 (3) | 8367.6 | 8437.1 |

Note. AIC = Akaike Information Criterion; BIC = Bayesian Information Criterion; Domain ident. = domain identification; Gender ident. = gender identification.

Table B3. Sensitivity analyses 2: analysis with outliers removed.

Math anxiety:

| Model | χ² (df) | AIC | BIC |
|---|---|---|---|
| Model 0: random intercept | - | 10387 | 10403 |
| Model 1: moderator | 72.41 (1)* | 10316 | 10338 |
| Model 2: gender | 63.32 (1)* | 10255 | 10283 |
| Model 3: ST condition | 0.09 (1) | 10257 | 10290 |
| Model 4: STxGender | 0.23 (1) | 10259 | 10297 |
| Model 5: STxGenderxmod | 3.47 (3) | 10261 | 10317 |
| Model 6: class-level predictors | 2.34 (3) | 10265 | 10337 |

Domain identification:

| Model | χ² (df) | AIC | BIC |
|---|---|---|---|
| Model 0: random intercept | - | 10671 | 10688 |
| Model 1: moderator | 189.40 (1)* | 10484 | 10506 |
| Model 2: gender | 65.45 (1)* | 10420 | 10448 |
| Model 3: ST condition | 0.89 (1) | 10422 | 10455 |
| Model 4: STxGender | 0.38 (1) | 10423 | 10462 |
| Model 5: STxGenderxmod | 5.55 (3) | 10424 | 10479 |
| Model 6: class-level predictors | 1.49 (3) | 10428 | 10501 |

Gender identification:

| Model | χ² (df) | AIC | BIC |
|---|---|---|---|
| Model 0: random intercept | - | 10492 | 10509 |
| Model 1: moderator | 1.92 (1) | 10492 | 10514 |
| Model 2: gender | 82.14 (1)* | 10412 | 10440 |
| Model 3: ST condition | 0.02 (1) | 10414 | 10447 |
| Model 4: STxGender | 0.05 (1) | 10416 | 10455 |
| Model 5: STxGenderxmod | 2.88 (3) | 10419 | 10474 |
| Model 6: class-level predictors | 2.06 (3) | 10423 | 10495 |

Note. AIC = Akaike Information Criterion; BIC = Bayesian Information Criterion; Domain ident. = domain identification; Gender ident. = gender identification.

Table B4. Final model (Model 3): unstandardized regression coefficients and variance components.

| Fixed effect | Coefficient (S.E.) | t |
|---|---|---|
| Intercept | 10.520 (0.213) | 49.33 |
| Gender | -1.253 (0.158) | -7.94 |
| Domain identification | 0.078 (0.013) | 6.08 |
| Math anxiety | -0.058 (0.015) | -3.91 |
| Gender x domain identification | 0.046 (0.019) | 2.74 |
| Gender x math anxiety | 0.001 (0.020) | 0.07 |
| Math anxiety x domain identification | -0.004 (0.001) | -3.66 |

Random part: level-two variance = 2.816; level-one variance = 11.094.

Note. Gender is dummy coded with males being the reference group. ST is dummy coded with the control group being the reference group. Domain identification and math anxiety are grand mean centered. Models are fit with Maximum Likelihood estimation.


Appendix C: Statistical DIF methods and purification techniques - chapter four

Statistical methods – details

Observed Score Methods

A well-known observed score method to study DIF is the MH test (Holland & Thayer, 1988), with the observed total scores of the participants as matching criterion. This method uses contingency tables to obtain a DIF test. In its most basic form, the MH method is based on a 2-by-2-by-M contingency table (Dorans & Holland, 1993), in which M denotes the number of total score levels; see Table C1.

Table C1. The contingency table of the Mantel-Haenszel method (for one total score level m).

| Group | Score on item: 1 | Score on item: 0 | Total |
|---|---|---|---|
| Reference | $f_{1rm}$ | $f_{0rm}$ | $n_{rm}$ |
| Focal | $f_{1fm}$ | $f_{0fm}$ | $n_{fm}$ |
| Total | $n_{1m}$ | $n_{0m}$ | $n_m$ |

To test whether an item displays DIF, a χ² test can be used:

$$\text{MH-}\chi^2 = \frac{\left(\left|\sum_m f_{1rm} - \sum_m E(f_{1rm})\right| - .5\right)^2}{\sum_m \text{Var}(f_{1rm})}, \qquad (1)$$

where

$$E(f_{1rm}) = E(f_{1rm} \mid \alpha = 1) = \frac{n_{1m}\, n_{rm}}{n_m}, \qquad (2)$$

and

$$\text{Var}(f_{1rm}) = \text{Var}(f_{1rm} \mid \alpha = 1) = \frac{n_{1m}\, n_{rm}\, n_{fm}\, n_{0m}}{n_m^2 (n_m - 1)}. \qquad (3)$$

Effect size estimates are often reported in the form of the common odds ratio, given by

$$\hat{\alpha}_{\text{MH}} = \frac{\sum_m f_{1rm}\, f_{0fm} / n_m}{\sum_m f_{1fm}\, f_{0rm} / n_m}, \qquad (4)$$

or the MH D-DIF, which is a transformation of the common odds ratio,

$$\text{MH D-DIF} = -2.35 \ln(\hat{\alpha}_{\text{MH}}). \qquad (5)$$

∑ f1rm f 0 fm nk m MH = , (4) ∑ f1 fm f 0rm nk α m (4) or the MH D-DIF which is a transformation of the common odds ratio, MH D-DIF = - MHαMH]. orMH the D- DIFMH D-DIF which is a transformation of the common odds ratio, = -2.35ln[2.35ln[α ]. (5) the focal group is the group of interest and the reference group is the standard In traditional DIF analyses, a focal and a reference group are distinguished, where to which against will be compared (Holland & Thayer, 1988). The common odds ratio would be 1.0 if no DIF were present, larger than 1.0 if the item was more difficult for the focal group than the reference group (controlled for ability), and smaller than 1.0 if the item was more difficult for the reference group than the focal group (controlled for ability). The MH D-DIF measure is developed by ETS |MH(Dorans, D-DF| 1989) and has led to the ETS classification rules (Dorans & Holland, 1993): items with negligible DIF are labeled as “A” (not statistically significant or < 1.0), items with large DIF are labeled as “C” (statistically significant and |MH D-DIF| >1.5) and items with intermediate DIF are labeled as “B” (items- tensionnot labeled has beenas “A” developed or “C”). Originally, to test for one non-uniform of the downsides DIF as of the MH test was that it only tests for uniform DIF, and not for non-uniform DIF. However an ex well (Mazor, Clauser, & Hambleton, 1994). Instead of using the observed score as matching criterion, someA authors second advocate observed thick score matching method i.e.,that the does use allow of a matching testing for variable non-uniform which consists of intervals of the observed scores (Donoghue & Allen, 1993). item is treated as the dependent variable in separate analyses. Let be the ob- DIF is logistic regression analysis (Swaminathan & Rogers, 1990), in which eachg be the grouping variable and let Y be the item score variable. We can estimate the A served total score that serves as an estimator of participants’ ability, let uniformprobability DIF: of a correct response on an item, conditional on both the grouping variable and estimated ability, by simply applying the logistic model to test for exp(β +β θˆ + β g) P[Y = 1| θˆ, g] = 0 1 2 ˆ 1+ exp(β 0+β1θ + β 2 g) . (6) By adding an interaction term to the analysis a second model can test for non-uni- form DIF: Addendum

A second observed score method, one that does allow testing for non-uniform DIF, is logistic regression analysis (Swaminathan & Rogers, 1990), in which each item is treated as the dependent variable in separate analyses. Let $\hat{\theta}$ be the observed total score that serves as an estimator of participants' ability, let $g$ be the grouping variable and let $Y$ be the item score variable. We can estimate the probability of a correct response on an item, conditional on both the grouping variable and estimated ability, by simply applying the logistic model to test for uniform DIF:

$$P[Y = 1 \mid \hat{\theta}, g] = \frac{\exp(\beta_0 + \beta_1 \hat{\theta} + \beta_2 g)}{1 + \exp(\beta_0 + \beta_1 \hat{\theta} + \beta_2 g)}. \qquad (6)$$

By adding an interaction term to the analysis, a second model can test for non-uniform DIF:

$$P[Y = 1 \mid \hat{\theta}, g] = \frac{\exp(\beta_0 + \beta_1 \hat{\theta} + \beta_2 g + \beta_3 \hat{\theta} g)}{1 + \exp(\beta_0 + \beta_1 \hat{\theta} + \beta_2 g + \beta_3 \hat{\theta} g)}. \qquad (7)$$

The parameters can be estimated using maximum likelihood estimation (Swaminathan & Rogers, 1990). A significance test can be carried out by applying a likelihood ratio test, a Wald test or a score test, the latter alternatively known as the Lagrange Multiplier test (Paek, 2012). As effect size measure, either Nagelkerke R² (Zumbo, 1999; Zumbo & Thomas, 1997) or the (delta) (log) odds ratio, such as the adjusted odds ratio and its conversions to the delta metric of the Educational Testing Service (ETS), can be used for this method (Fidalgo, Alavi, & Amirian, 2014; Monahan, McHorney, Stump, & Perkins, 2007). Various studies show that the logistic regression procedure has more power to detect non-uniform DIF than the MH procedure, but less power to detect uniform DIF than the MH procedure (Güler & Penfield, 2009; Herrera & Gómez, 2008; Swaminathan & Rogers, 1990).
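The nested-model comparison of Equations (6) and (7) can be sketched as follows; `y`, `total` and `g` are hypothetical arrays holding the item score (0/1), the observed total score, and group membership (0/1), respectively.

```python
# Sketch of the logistic-regression DIF test: compare the baseline model,
# the uniform-DIF model of Equation (6), and the non-uniform-DIF model of
# Equation (7) with likelihood-ratio tests. Inputs are hypothetical arrays.
import numpy as np
import statsmodels.api as sm
from scipy import stats

def lr_dif_test(y, total, g):
    base = sm.Logit(y, sm.add_constant(np.column_stack([total]))).fit(disp=0)
    uni  = sm.Logit(y, sm.add_constant(np.column_stack([total, g]))).fit(disp=0)
    non  = sm.Logit(y, sm.add_constant(np.column_stack([total, g, total * g]))).fit(disp=0)
    lr_uniform = 2 * (uni.llf - base.llf)   # test of beta_2 (uniform DIF), df = 1
    lr_nonunif = 2 * (non.llf - uni.llf)    # test of beta_3 (non-uniform DIF), df = 1
    return stats.chi2.sf(lr_uniform, 1), stats.chi2.sf(lr_nonunif, 1)
```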

A third method, one that is a bit more difficult to classify as observed or unobserved, is SIBTEST (Shealy & Stout, 1993). SIBTEST is a non-parametric technique that statistically tests the weighted mean differences between groups, corrected for any differences in the ability distributions by means of a regression correction. For this method the parameter of interest is $\beta_{\text{UNI}}$, which is defined as:

$$\beta_{\text{UNI}} = \int B(\theta)\, f_F(\theta)\, d\theta, \qquad (8)$$

where $B(\theta) = P(\theta, R) - P(\theta, F)$ is the difference in probability for the reference and the focal group of correctly answering an item conditional on $\theta$, and $f_F(\theta)$ gives the density function of $\theta$ for the focal group. An estimate of $\beta_{\text{UNI}}$ is given by:

$$\hat{\beta}_{\text{UNI}} = \sum_{k=0}^{K} p_k \left( p^*_{Rk} - p^*_{Fk} \right), \qquad (9)$$

where $K$ is the maximum of the observed test score, $p_k$ is the proportion of participants in score group $k$ for the focal group, and $(p^*_{Rk} - p^*_{Fk})$ is the corrected mean difference between the reference group and the focal group on the studied item, for each score group $k$ (for more details on the regression correction, see Shealy and Stout (1993)). The estimate $\hat{\beta}_{\text{UNI}}$ can be used to create a statistical test for DIF:

$$\text{SIB} = \frac{\hat{\beta}_{\text{UNI}}}{\hat{\sigma}(\hat{\beta}_{\text{UNI}})}, \qquad (10)$$

in which $\hat{\sigma}(\hat{\beta}_{\text{UNI}})$ is the standard error. It has been shown that SIB approximately follows the normal distribution with a mean of zero and a variance of one under the null hypothesis of no DIF, which means a significance test can be carried out using cut-off values obtained from a standard normal distribution. When testing for DIF with the SIBTEST procedure, $\hat{\beta}_{\text{UNI}}$ can be used as effect size estimate. With SIBTEST it is not only possible to identify DIF at the item level, but also to obtain effect size estimates and carry out statistical tests for bundles of items, i.e., differential bundle functioning (DBF). SIBTEST was developed to detect uniform DIF; however, Crossing SIB was later developed to detect crossing DIF (Li & Stout, 1996). Crossing SIB is a powerful method to detect non-uniform DIF in comparison to other observed and unobserved score methods (Finch & French, 2007).
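As a bare-bones illustration of Equation (9), the sketch below computes the weighted mean difference across score groups; the regression correction of Shealy and Stout (1993) is deliberately omitted, so this is not a full SIBTEST implementation. `item`, `rest` and `focal` are hypothetical arrays (item score, rest score on the matching subtest, and a boolean focal-group indicator).

```python
# Uncorrected version of the beta_UNI estimate in Equation (9): the group
# means per score group are used directly, without the regression correction.
import numpy as np

def beta_uni_uncorrected(item, rest, focal):
    beta = 0.0
    for k in range(int(rest.max()) + 1):
        ref = item[(rest == k) & ~focal]
        foc = item[(rest == k) & focal]
        if len(ref) == 0 or len(foc) == 0:
            continue                         # skip empty score groups
        p_k = (rest[focal] == k).mean()      # focal-group weight of score group k
        beta += p_k * (ref.mean() - foc.mean())
    return beta
```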

(2PL) and three-parameter logistic (3PL) model, however polytomous exten limitsions ourselves of these models to descriptions exist as ofwell unobserved like the graded score methodsresponse using model dichotomous (Samejima, 1969), polytomous Rasch models (Andrich, 1978; Masters, 1982), etc. We will probabilityitems. Within of thea correct context response of IRT many to item DIF i detectionx methods exist,θ however and discrim they- are all based on basic IRT models like the (mosti general) 3PL model in which the ination item parameter ai, bi, and pseudo-guessing item parameter c : ( = 1) is the function of i difficulty item parameter

1− ci P(xi = 1) = ci + 1+ exp[−ai (θ − bi )] . (11) To test for DIF either the item parameters or the item characteristic curves need to be compared for the groups under investigation. One of the classical approach- A es to test for DIF is by calculating Lord’s X2 In this procedure using a linking procedure to ensure that both (F. M. groups Lord, 1980).are placed on the same item parameters are estimated for the focal2 and reference group separately, bFi and discrimination item parameter aFi for the focal group can be compared with the correspondingmetric (Kim & Cohen,parameters 1995). for With the referenceLord’s χ thegroup: difficulty item parameter

2 −1 Lord'sχ = (ξiR − ξiF )' ∑i (ξiR − ξiF ) , (12) 200 Addendum

where ˆ ξiR = (aˆiR ,biR )'

where ˆ , ξiF = (aˆiF ,biF )'

and where −1 is the, inverse of the asymptotic sampling variance-covariance ∑i matrix of the differences between the item parameter estimates. For large sam- ples this test statistic follows a X2 distribution with two degrees of freedom. The abovementioned test captures uniform and non-uniform DIF simultaneously.

Separate tests to qualify the difference between item discrimination parameters or difficulty parameters can also be conducted. A second widely used IRT method to test for DIF is IRT-LR-DIF (Thissen et al., 1993). In this approach a likelihood ratio test between a compact model whereas(C) and an in augmented the augmented model model (A) is item conducted, parameters in which under the investigation compact model for DIFthe areitem freely parameters estimated. for theThis focal test andstatistic reference tests thegroup invariance are constrained of the item to be parame equal,-

following formula: ters under investigation, hence the null hypothesis of no DIF, and is given by the

G 2(df) = -2ln  Likelihood[A] G2 (df ) = −2ln    Likelihood[C] . (13) The Likelihood

model respectively. equals The the test likelihood statistic Gof2 isthe assumed data given to be the X2 marginaldistributed maximum and the likelihood estimationsdf of the parameters, for the augmented and the compact augmented and the compact model. This procedure can incorporate tests for degrees of freedom ( ) correspond to the difference in parameters between the

uniform (i.e., constraining the difficulty item parameter in the compact model) and non-uniform DIF (i.e., constraining the discrimination item parameter in the A thiscompact approach model), the as area well between as a joint the test. item response function of the focal group and the itemA third response approach function to DIF of testing the reference is by using group. area In measures its most general(Raju, 1988). form anIn area measure can be represented as follows:

A = f s [PR (θ )− PF (θ )]

, s bounded by a lower limit and an upper limit(14) for θ. Various area measure which is defined for an interval exist due to variations in the width of the interval (i.e., whether the interval is bounded or unbounded), the integration method (i.e., continuous integration or discrete approximation), the kind of differences that Addendum 201

- ences in f are used (i.e., signed or unsigned) and the weights are used (i.e., are the differ equally weighted or not). A well-known method is developed by Raju (1990), who used continuous integration to obtain signed and unsigned area thatmeasures. give item On a effect test level size the measures effect size in themeasure same DFIT framework. has been These developed indices (Raju are closelyet al., 1995), related which to some also versions led to the of thedevelopment area measures. of the NCDIF and CDIF indices Area measures can be considered as effect size estimates in the context of

IRT, which is especially useful when a 2PL or 3PL model is used. Another effect size estimate is to simply interpret the difference in difficulty estimates (uniform DIF) orFor discrimination unobserved score parameters method (non-uniformDIF testing of DIF)scales between with continuous groups as indica effect- size (L. Steinberg & Thissen, 2006). group CFA - tors a CFA approach would be most appropriate. Often used techniques are multi- (French & Finch, 2008; Meredith, 1993; Widaman & Reise, 1997)re views confirmatory factor analysis (CFA, Restricted Factor Analysis (Barendse et al., 2010), or analyses using the Multiple Indicators Multiple Causes (MIMIC)- model, with especially the MIMIC model being popular due to its smaller sample size requirements (Woods, 2009) We will describe the MIMIC model in more de tail here, as it is popular in in connotation with the term DIF. The MIMIC model consists of two parts; a measurement part and a structural part (Wang, Shih, & Yang, 2009). In its simplest form the measurement part is given by yi* i z i ,

= λ θ + β’ + ε where y * , (15) z i i is the latent response variable* for item i, λ is the factor loading, θ is the groupingθ = γ + ζ variables on response y latent trait, β is the regression coefficienti for item i representing the effect of the to the focal group and z=0 referring to the reference group and εi represents mea- surement error for item i. Because, zy is* the grouping variable withy z=1is used referring as an yi* i z i , i i approximation. The structural part of the MIMIC model is given by is not observed, indicator = λ θ + β’ + ε A z

θ = γ + ζ , (16) whereγ z on θ ζ to grouping is the variable regression z and coefficient with mean=0. displaying Because the a meanmodel difference in which allfor items groups are , and the residual which is assumed to be normally distributed, unrelated tested for DIF at once would not be identified (Woods, 2009), researcher should either test for DIF according to a DIF free baseline model (i.e., testing DIF items one by one assuming all other items are invariant) (Finch, 2005) or test for DIF with a DIF free anchor, which can be created according to different strategies 202 Addendum

loading(Woods, of 2009). the item For under the MIMIC investigation. model the effect size measure MIMIC-ES has been developed (Jin et al., 2012), based on the regression coefficient and the factor

downsideThe upsides of the MIMIC to MIMIC model modeling was the are fact that that covariates the method can is easily not suited be included, to test forand non-uniform that continuous DIF becauseas well as the categorical model treats indicators item loadings can be as used. the sameOriginally, across a - action term to the MIMIC model which allows for non-uniform DIF testing as well groups (Woods, 2009), however tests have been developed to include an inter also show unacceptable high Type I error rates. A better alternative for non-uni- (Woods & Grimm, 2011). Those tests have been shown to be quite powerful, but

alwaysform DIF feasible testing for could some be type by means of research. of multi-group CFA, however this technique requires larger sample sizes (French & Finch, 2006; Woods, 2009) which is not - Whereas many traditional frequentist methods nowadays have more recent Bayesian counterparts, this holds for DIF testing as well. Describing these meth- ods goes beyond of the scope of this paper, for more information on Bayesian methods see Zwick, Thayer, and Lewis (1999) and Soares, Gonçalves, and Gamer man (2009). The last few years Bayesian methods gained in accessibility and popularity as good alternatives for frequentist statistical methods. Purification - details We distinguish three different kind of strategies that are based on some form -

of purification: the two-step purification approach, the iterative purification ap proach and the constant anchor approach. One of the first approaches mentioned used.to purify First the all matching items are criterion checked is for the DIF two-step using all approach other items (Clauser as matching et al., 1993; cri- Holland & Thayer, 1988; F. M. Lord, 1980), in which the following two steps are - A terion. Second, ability estimates are estimated for all participants based on the items that did not show statistically significant DIF in the first step. Those puri fied ability estimates can be used to re-estimate the final set of DIF parameters iffor it alldisplayed items. Holland DIF. & Thayer (1988) recommend this procedure, but suggest that the studied item always needs to be included in the matching criterion, even

Although Lord (1980) originally suggested a two-step purification method originalin the IRT iterative context, approach the iterative starts approach out with based the onsame this two two-step steps methodas the two became step popular in IRT (Candell & Drasgow, 1988; Park & Lautenschlager, 1990). The steps are repeated multiple times until two successive iterations would produce purification approach, but continues with an iterative process in which the two Addendum

203

- criterionthe same asset ability of biased estimate. items. ThisFinally iterative the difficulty procedure and discriminationbecame popular item not param only in eters could be estimated for both groups separately using the purified matching the context of IRT, where it can be used to find unbiased items both for linking or for the creation of a DIF-free anchor, but has also been adopted and used in- the context of other statistical methods like logistic regression (French & Maller, 2007) or MH (Clauser & Mazor, 1998). After this original iterative approach sev- eral variations have been proposed. The stepwise purification approach in which DIF estimations are made for all items, the item with the highest statistically sig- nificant DIF value is removed, ability is re-estimated, and the procedure carries on until no new DIF items are detected (Navas-Ara & Gómez-Benito, 2002). Re cently such a stepwise purification method using the Lagrange multiplier (i.e., score test) has been suggested, showing promising power and Type I error rates (Khalid & Glas, 2014). Yet another variation to iteratively select an appropriate anchor is the iterative forward approach (Kopf et al., 2015a, 2015b). In the first iteration of this approach the anchor exists of one item (empirically selected procedurethrough either goes auxiliary on as long DIF as tests the oranchor a ranking is shorter system, than for the more amount information of current see DIF-freeKopf et al., items. (2015b)), and with each iteration an item is added to the anchor. The The third procedure is known as the constant anchor selection strategy.

In this approach the anchor will comprise of a predetermined number of items, suggestedwith 1 (González-Betanzos standards. These & constant Abad, 2012; anchors Kopf can et al.,be 2015a),either selected 3 (Thissen based et al.,on 1993) or 4 anchor items (Kopf et al., 2015a; Thissen et al., 1993) as previously expert review or be empirically selected, which comes close to a purification method. For instance the DIF-free-then-DIF strategy can be used (Shih & Wang,- arately2009; Wang, as a single Shih, item& Sun, anchor. 2012), Items which with strives lowest to select mean a DIF DIF statistic free anchor are assumed through different strategies (e.g., estimate DIF statistics for each item, using all items sep A to be DIF free). 204 Addendum

Appendix D: Detailed description of scales and groups - chapter four

Method - Selection of DIF analysis Sometimes multiple DIF comparisons were performed on the same data in a se- lected article on DIF. In this case we used one of the following selection rules. If

multiple DIF analyses were performed on different scales we selected the first scale that was mentioned in the (a) title, (b) method, or (c) results. When the scale consisted of multiple subscales, and that subscale is used as unit within the DIF analysis, we selected the DIF analysis conducted over the first subscale mentioned in the (a) title, or (b) method. When DIF analyses were performed- odon thesection. same When scale thebut articlereanalyzed solely using contained different a DIF groups analysis (e.g., in gender which andmultiple age) we selected the analysis with the first group that was mentioned in the meth

groups were analyzed simultaneously, we selected this analysis. When multiple categories were used (e.g., comparisons between several countries) we selected- the first two categories that were mentioned in the (a) title, (b) abstract, or (c) method. Again, when the article solely contained a DIF analysis in which multi ple categories were analyzed simultaneously, we selected this analysis. When the article contained multiple DIF analyses over different waves (i.e., longitudinal data) we selected the oldest wave. When an article reported multiple studies that contained a DIF analysis we (a) selected a main study over a pilot study and (b) analysisselected basedthe first on main the new study. data. When When an anarticle article contained presented both DIF data analyses that has both been on ananalyzed item level before and in a abundle published level article, we selected as well the as analyses new data, on we the selected item level the over DIF the analyses on the bundle level. If DIF analyses were only carried out on the bundle level we coded the bundle level analyses. When multiple DIF methods A were used for the same scale, we selected and coded the information for each of response.the methods. Finally, when an analysis was carried out separately for DIF in the latent response and DIF in the attraction parameters, we chose DIF in the latent Addendum

205

Appendix E: Literature review - chapter five

Method In this review section we summarize what kind of psychometrics properties have the literature in stereotype threat experiments in experiments conducted on col- been reported in (a large share of the) stereotype threat experiments. We split lege students (79 articles, 101 studies) and students of elementary school, middle school or high school (23 articles, 27 studies). Those papers were obtained in schoolthree recent students meta-analyses we selected on all stereotype articles mentioned threat (Doyle in two & Voyer,recent 2016;meta-analysis Picho & Stephens, 2012, Nguyen & Ryan, 2008) for the college students section. For the-

(Doyle & Voyer, 2016; Flore & Wicherts, 2015). We only selected published pa pers, the test existed of multiple questions and the test was related to quantitative skills (i.e., mathematics, spatial ability, physics, chemistry). We excluded studies stereotypewhere participants threat. In were this review tested we in extra-ordinarysolely focused on circumstances the test used (e.g., as dependent in a MRI variable.scanner). As We such only we selected did not studies code anything that concerned with regards manipulation to scales related that were to gender used - as covariates, mediator variables or moderator variables. We coded whether au- thors reported a reliability coefficient (e.g., Cronbach’s alpha), item-level analyses that differentiate between the different groups (item difficulties, item-rest cor relations, IRT or CFA), and whether the authors used a time limit. For the variable- tionallyitem-level we analyses, coded whether we coded authors papers reported that reported the amount correlations of missing based responses on item for difficulties as “item difficulties” present (e.g., O’Brien & Crandall, 2003). Addi - the different groups (e.g., reporting number of items attempted split by groups or by carrying out significance tests over the number of attempted items) or sig nificance tests to gauge the difference in missing responses. Finally, we coded the scoring rule used (i.e., number correct, accuracy, or guess corrected scores). The A aused student coding assistant. sheet is Forreported the variable on OSF “Time (https://osf.io/wqnh9/). limit” we originally created three re- All 101 studies of college students were scored both by the first author and insponse which options: it was not yes, clearly no, and described unclear. whetherAs the category a time limit“unclear” was used introduced did not some have vagueness, we collapsed the categories no and unclear, assuming that the articles a time limit on the test. Overall, reliability coefficients of the variables was high: outcome variable (Fleiss’ kappa = .87), reliability reported (Fleiss’ kappa = .89), missing responses (Fleiss’ kappa = .68), and item-level statistics (Fleiss’ kappa =- .60). For time limit Fleiss’ kappa equals .67 without collapsing and Fleiss’ kappa equals .92 with collapsing of categories no and unclear. Simple agreement ex Addendum

206

ceeded 90% for all variables, except for the variable missing responses (simple agreement was 86%). Results

Results are reported in Table E1. Overall, we see that number of items answered correctly are selected most often (in 53% of the studies of college students and in 74% of the studies in school aged students). In the majority of the studies missing responses are not mentioned (in 63% of the studies of college students and in 78% of the studies in school aged students), neither by reporting the- amount of missing responses/attempted items, reporting significance tests, nor by reporting missing responses for all students. Moreover, psychometric quali ties of the tests are rarely studied, for instance a minority of the studies report a reliability coefficient. The majority of studies does not study any item-level analysis (in 92% of the studies of college students and in 81% of the studies in school aged students). Overall, the dependent variable receives little attention. Table E1 Reported statistics on the dependent variable in stereotype threat papers for college stu- dents and school students.

Variable Ncollege students Nschool students Outcome variable 101 27 20 Accuracy 2 Number correct (raw or percentage) 53 0 13 Combination of measures 22 Guess corrected 13 Reliability reported 101 27 5 Yes: Cronbach’s alpha 8 Yes: Other 1 0 3 No Time constraint 101 27 97 19 Yes No/unclear 4 76 23 Missing 101 27 A 26 Number or percentage of attempted items - not split for groups 0 0 Number or percentage of attempted items - split for groups 1 1 12 14 2 Significance test over number attempted only 3 No/no missings/missings removed 74 21 Combination (sig test and number attempted reported) Item level statistics 101 27 2 0 4 Item difficulties (CTT) 2 0 Split test easy/difficult test 5 No 22 MGCFA over item parcels Note. N gives the amount of experiments. 93 Addendum 207

Appendix F: Extra tables - chapter five

Table F1 Item fit statistics “Orlando-Thissen-Bjorner Summed-Score Based Item Diagnostic Tables” for separate IRT 2PL models.

ST group Control ST group Control group group S − χ 2 p S − χ 2 p S − χ 2 p S − χ 2 p Item 1 17.1 .104 22.1 Item 11 11.0 Item 2 7.1 .788 14.4 Item 12 .023 11.9 .372 .363 14.0 17.8 .120 11.1 .441 .213 12.9 .299 12.3 .342 Item 4 .204 .048 Item 14 17.0 .724 Item 3 .299 9.0 .621 Item 13 17.0 .108 20.4 10.4 15.7 18.5 .151 7.9 14.4 8.2 11.4 Item 5 .060 Item 15 16.0 .098 .498 Item 7 .002 .287 Item 17 Item 6 .209 .608 Item 16 15.8 .150 .416 Item 8 10.2 Item 18 7.4 22.8 29.5 13.1 15.5 .162 17.6 .091 .210 .214 18.8 .429 14.9 .136 .763 .019 Item 10 11.1 17.7 .088 Item 20 11.0 Item 9 13.2 13.2 Item 19 .094 15.4 .163 Note. ST = stereotype threat..435 16.8 .079 .362

A 208 Addendum I19 4.0p I18 -0.2p -0.3n I17 3.0p 0.3n -0.4n I16 0.1n -0.7p -0.3p -0.6n I15 4.3n -0.6p -0.5p -0.6n -0.6n I14 0.0p 2.2n -0.1p -0.2p -0.5n -0.3n I13 0.4p 0.6p -0.5p -0.2n -0.6n -0.5n -0.7n I12 0.1p 0.4p 0.1n -0.1p -0.7p -0.6p -0.5n -0.5n I11 0.8n 0.2n -0.4p -0.4p -0.6p -0.1p -0.6p -0.2p -0.6n I10 0.0p 0.6p 0.2p 0.5p 2.1n 0.3n 0.1n 3.4p -0.5p -0.6p I9 0.0n 0.0n 3.2n -0.5p -0.5p -0.7p -0.7p -0.7p -0.0n -0.5n -0.7n I8 0.3p 0.9p 0.3n 0.7n -0.4p -0.7p -0.6p -0.7n -0.5n -0.0n -0.5n -0.6n I7 0.2p 0.2p 2.8p 0.7p 2.6n 0.2n 0.6n 0.3n -0.5p -0.6p -0.4n -0.2n -0.6n I6 0.0n 1.4n -0.5p -0.7p -0.6p -0.4p -0.2p -0.1p -0.4p -0.0n -0.1n -0.5n -0.6n -0.7n I5 0.5p 0.3p 0.2p 0.4n 0.2n -0.2p -0.5p -0.5p -0.7p -0.7p -0.6p -0.6n -0.7n -0.1n -0.2n A I4 0.2p 1.6n 0.2n 0.6n -0.5p -0.5p -0.5p -0.1p -0.6n -0.6n -0.7n -0.6n -0.4n -0.7n -0.7n -0.7n I3 0.8p 0.5p 0.7p 1.0p 0.1n 0.9n 0.1n 3.9n -0.1p -0.4p -0.6p -0.7p -0.5p -0.6p -0.7p -0.7n -0.1n I2 0.2n 1.6n -0.3p -0.5p -0.2p -0.4p -0.5p -0.4p -0.5p -0.6p -0.7p -0.3p -0.4p -0.6p -0.3n -0.6n -0.4n -0.7n Separate 2PL IRT model: Local dependencies of stereotype threat group. threat of stereotype dependencies Local model: 2PL IRT Separate I1 0.4p 0.7p 2.2p 0.8n 0.2n 0.2n 0.1n -0.3p -0.5p -0.6p -0.5p -0.7p -0.5n -0.7n -0.7n -0.6n -0.6n -0.6n -0.6n Item pairs in bold face have local dependency indices larger than 3.0 (considered statistically significant). statistically than 3.0 (considered local dependency indices larger have pairs in bold face Item I20 I19 I18 I17 I16 I15 I14 I13 I12 I11 I10 I9 I8 I7 I6 I5 I4 I3 I2 I1 Note. Table F2 Table Addendum

209 I19 0.7p I18 0.0p -0.3n I17 -0.1p -0.6n -0.4n I16 0.2p -0.1p -0.7p -0.4n I15 0.8p -0.6p -0.6p -0.3n -0.5n I14 1.4n 0.1n -0.4p -0.6p -0.5p -0.7n I13 -0.5p -0.7p -0.1p -0.7p -0.4n -0.5n -0.1n I12 1.2p 0.3n -0.4p -0.6p -0.5p -0.2p -0.6n -0.7n I11 0.8p 1.4n -0.3p -0.1p -0.0p -0.7n -0.6n -0.7n -0.6n I10 1.4n 0.1n 0.7n -0.7p -0.6p -0.4p -0.3p -0.7n -0.6n -0.5n I9 0.1p 0.0p 0.8n 1.5n 0.1n -0.5p -0.7p -0.7p -0.3p -0.1n -0.3n I8 0.2p 1.3n 0.9n 0.2n -0.5p -0.4p -0.2p -0.4p -0.2p -0.6n -0.5n -0.6n I7 0.7p 0.1p 2.1p 0.5n 0.7n 0.2n 4.2p -0.7p -0.7p -0.2p -0.5p -0.5n -0.3n I6 0.9n -0.6p -0.3p -0.7p -0.3p -0.6p -0.7p -0.5n -0.7n -0.7n -0.5n -0.6n -0.6n -0.1n I5 2.3p 0.9n 1.4n 1.4n 1.2n 7.8p -0.6p -0.7p -0.2p -0.5p -0.5p -0.3p -0.4n -0.7n -0.7n A I4 0.3p 0.6n 2.9n -0.7p -0.3p -0.6p -0.6p -0.7p -0.4p -0.3p -0.2p -0.7n -0.5n -0.7n -0.6n -0.3n I3 0.6p 2.0p 0.4p 0.1n 0.5n 0.5n -0.5p -0.7p -0.7p -0.7p -0.6p -0.7n -0.5n -0.7n -0.6n -0.1n -0.6n I2 0.5p 1.5p 0.7p 0.1n 1.6n 0.5n -0.6p -0.5p -0.6p -0.4p -0.7p -0.5p -0.2p -0.7n -0.6n -0.0n -0.7n -0.4n Separate 2PL IRT model: Local dependencies of control group. of control dependencies Local model: 2PL IRT Separate I1 0.6p 0.9p 1.6n 0.5n -0.7p -0.5p -0.3p -0.7p -0.7p -0.5p -0.7p -0.6n -0.5n -0.7n -0.3n -0.7n -0.7n -0.2n -0.6n Item pairs in bold face have local dependency indices larger than 3.0 (considered statistically significant). statistically than 3.0 (considered local dependency indices larger have pairs in bold face Item I1 I2 I3 I4 I5 I6 I7 I8 I9 I10 I11 I12 I13 I14 I15 I16 I17 I18 I19 I20 Table F3 Table Note. 210 Addendum

Table F4 Multi-group IRT 2PL model: Item fit statistics “Orlando-Thissen-Bjorner Summed-Score Based Item Diagnostic Tables”.

ST group Control ST group Control group group S − χ 2 p S − χ 2 p S − χ 2 p S − χ 2 p Item 1 .117 22.4 .021 Item 11 Item 2 7.4 14.2 Item 12 .240 12.4 16.7 11.9 .373 11.3 .334 14.4 18.2 .111 11.4 .412 .769 .223 13.9 .337 Item 4 .178 20.1 .028 Item 14 Item 3 .275 9.8 .554 Item 13 17.7 .088 22.0 11.0 15.1 17.3 .137 8.6 .658 Item 5 .038 Item 15 15.7 .109 .357 Item 7 .002 14.8 Item 17 17.4 Item 6 13.0 .298 9.7 .466 Item 16 15.9 .146 12.5 .328 Item 8 .414 14.7 Item 18 .184 28.0 29.3 .193 15.6 .154 .095 14.2 18.0 17.7 .088 10.3 .143 15.0 .003 Item 10 .421 Item 20 17.7 Item 9 14.5 .151 .163 Item 19 .115 Note. ST = stereotype11.3 threat. 19.5 .052 .061 10.5 .396

A Addendum 211 (S.E.) C

ˆ γ 0.22 0.11) 1.42 (0.14) 0.86 (0.10) 1.52 (0.15) 0.11 (0.11) 1.31 (0.16) 1.44 (0.16) 0.55 (0.10) 1.77 (0.20) 0.63 (0.10) 0.87 (0.11) 0.35 (0.10) 1.99 (0.17) -0.01 (0.10) -1.26 (0.13) -0.85 (0.11) -0.29 (0.10) -0.29 (0.10) -0.35 (0.11) -0.33 (0.11) -value is lower is lower -value p (S.E.) ST

ˆ γ 0.18 (0.15) 1.11 (0.18) 1.19 (0.14) 1.62 (0.31) 0.21 (0.13) 1.11 (0.20) 1.33 (0.26) 0.54 (0.18) 1.81 (0.32) 0.54 (0.14) 0.12 (0.17) 0.83 (0.13) 0.83 (0.13) 1.89 (0.18) -1.21 (0.13) -0.67 (0.10) -0.19 (0.12) -0.22 (0.11) -0.46 (0.16) -0.56 (0.17) (S.E.) C ˆ α 0.43 (0.10) 0.43 (0.12) 0.38 (0.10) 0.63 (0.14) 0.22 (0.10) 0.73 (0.15) 0.52 (0.11) 0.91 (0.17) 0.78 (0.15) 0.36 (0.10) 1.00 (0.19) 0.34 (0.10) 0.28 (0.09) 0.40 (0.10) 0.48 (0.11) 0.44 (0.10) 0.32 (0.11) 0.25 (0.10) 0.37 (0.11) 0.45 (0.16) (S.E.) T S ˆ

α 0.39 0.10) 0.48 (0.11) 0.36 (0.12) 0.18 (0.09) 0.53 (0.13) 0.30 (0.11) 0.92 (0.19) 0.39 (0.10) 0.59 (0.13) 0.79 (0.16) 0.57 (0.12) 0.91 (0.19) 0.28 (0.10) 0.20 (0.10) 0.55 (0.12) 0.54 (0.12) 0.30 (0.11) 0.35 (0.11) 0.53 (0.12) 0.40 (0.15) c.v. .006 .023 .008 .004 .003 .024 .010 .025 .015 .018 .020 .013 .009 .016 .011 .014 .021 .001 .005 .019 B-H p .326 .834 .407 .224 .068 .844 .541 .865 .654 .712 .723 .561 .529 .665 .555 .568 .830 .006 .268 .713 -values -values can be considered significant if the reported P (1) 1.0 0.0 0.7 1.5 3.3 0.0 0.4 0.0 0.2 0.1 0.1 0.3 0.4 0.2 0.3 0.3 0.0 7.6 1.2 0.1 uniform 2

χ c.v. .018 .015 .003 .013 .014 .009 .008 .001 .024 .004 .020 .019 .025 .006 .016 .010 .023 .011 .005 .021 B-H p .731 .656 .151 .579 .583 .424 .410 .147 .963 .177 .756 .751 .968 .403 .695 .509 .897 .529 .323 .817 (1) 0.1 0.2 2.1 0.3 0.3 0.6 0.7 2.1 0.0 1.8 0.1 0.1 0.0 0.7 0.2 0.4 0.0 0.4 1.0 0.1 non-uniform 2

χ A c.v. .010 .020 .004 .009 .003 .015 .011 .006 .023 .008 .021 .018 .019 .013 .016 .014 .025 .001 .005 .024 B-H p .582 .886 .253 .409 .163 .713 .590 .341 .904 .375 .895 .803 .819 .642 .779 .682 .969 .018 .332 .910 (2) total 1.1 0.2 2.8 1.8 3.6 0.7 1.1 2.2 0.2 2.0 0.2 0.4 0.4 0.9 0.5 0.8 0.1 8.0 2.2 0.2 2 DIF analyses 4: Flexmirt DIF sweep multilevel analyses multilevel DIF sweep 4: Flexmirt DIF analyses χ B-H c.v. = Benjamini B-H Hochberg c.v. critical values. ST = stereotype threat, C= control. Item 1 Item Item 2 Item 3 Item 4 Item 5 Item 6 Item 7 Item 8 Item 9 Item 10 Item 11 Item 12 Item 13 Item 14 Item 15 Item 16 Item 17 Item 18 Item 19 Item 20 Item Table F5 Table Note. Note. than the BH-critical value. 212 Addendum

Table F6 Input parameters for simulation study.

α βST βC α βST βC (medium DIF/large DIF) (medium DIF/large DIF) Item 1 -0.14 -0.14 Item 11 Item 2 Item 12 -1.28 -1.28 0.58 1.15 -1.51 -1.51 0.71 0.71 0.50 (3.04/3.29) 2.54 0.45 Item 4 0.74 -1.71 -1.71 Item 14 0.40 Item 3 0.32 (2.84/3.09) 2.34 Item 13 0.35 0.27 0.62 0.62 1.12 -1.41 -1.41 -0.28 -0.28 Item 5 (-4.23/-4.48) -3.73 Item 15 0.68 0.60 0.60 Item 7 -0.27 -0.27 Item 17 0.42 Item 6 Item 16 0.62 Item 8 1.01 -1.22 -1.22 Item 18 -1.81 -1.81 0.58 (-2.55/-2.80) -2.05 1.07 0.32 Item 10 Item 20 0.44 -4.40 Item 9 -1.31 -1.31 Item 19 0.58 (1.26/1.51) 0.76 Note. 0.59 -0.92 -0.92 (-4.90/-5.15)

ST = stereotype threat, C= control. DIF occurs for the stereotype threat group.

A Addendum

213

References

▪▪ - forming DIF analyses. Applied Psychological Measurement 18 Ackerman, T. A., & Evans, J. A. (1994). The influence of conditioning scores in per , (4), 329–342. http://doi. ▪ ▪ org/10.1177/014662169401800404 ▪ ▪ Agnoli, F., Altoè, G., & Muzzatti, B. (n.d.). Unpublished study. ▪ - ▪ Agnoli, F., Altoè, G., & Pastro, M. (n.d.). Unpublished study. able research practices among italian research psychologists. PloS One 12 Agnoli, F., Wicherts, J. M., Veldkamp, C. L. S., Albiero, P., & Cubelli, R. (2017). Question , (3), 1–17. ▪ - ▪ http://doi.org/10.1371/journal.pone.0172792Lessons learned from DIF algebra problems Alexeev, N. (2008). . The University of Geor ▪ - ▪ gia, Athens. ative self-relevant stereotype activation: The effects of individuation. Journal of Experi- mentalAmbady, Social N., Paik, Psychology S. K., Steele,40 J., Owen-Smith, A., & Mitchell, J. P. (2004). Deflecting neg ▪ - ▪ , (3), 401–408. http://doi.org/10.1016/j.jesp.2003.08.003 Psychological Science 12Ambady, N., Shih, M., Kim, A., & Pittinsky, T. L. (2001). Stereotype susceptibility in chil dren: Effects of identity activation on quantitative performance. , ▪ - ▪ (5), 385–390. http://doi.org/10.1111/1467-9280.00371 al research association et al. as amici curiae in support of respondents. American Educational Research Association. (2012). Brief of the American education ▪▪ Standards for educational and psychologicalAmerican Educational testing. Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). ▪ ▪ Washington, DC: American Educational Research Association.Psycho- metrika 43 Andrich, D. (1978). A rating formulation for ordered response categories. ▪ - ▪ , (4), 561–573. ible paradigms? Medical Care 42 Andrich, D. (2004). Controversy and the Rasch Model: A characteristic of incompat , (Supplement), 7–16. http://doi.org/10.1097/01. ▪ ▪ mlr.0000103528.48582.7c Andrich, D., Sheridan, B., & Luo, G. (2010). Rasch models for measurement: RUMM2030. ▪ - ▪ Perth, Western Australia: RUMM Laboratory Pty Ltd. tude. Journal of Educational Measurement 10 Angoff, W. H., & Ford, S. F. (1973). Item-Race Interaction on a Test of Scholastic Apti A ▪ ▪ Investigating stereotype, (2), 95–106. threat as a source of black-white DIF - Arbuthnot, K. N. (2005). (Doctoral dissertation). Retrieved from https://www.ideals.illinois.edu/han ▪ ▪ dle/2142/79824 Stereotype threat. Theory, process, and application Aronson, J., & Dee, T. (2012). Stereotype threat in the real world. In M. Inzlicht & T. Schmader (Eds.), (pp. 264–280). ▪ - ▪ New York, NY: Oxford University Press. Adolescence and education: Vol.Aronson, 2. Academic J., & Good, motivation C. (2003). of adolescentsThe development and consequences of stereotype vul Agenerability Publishing. in adolescents. In F. Pajares & T. Urdan (Eds.), (pp. 299–330). Greenwich, CT: Information 214 Addendum

▪▪ - al components of test taking. Personnel Psychology 43 Arvey, R. D., Strickland, W., Drauden, G., & Martin, C. (1990). Motivation , (4), 695–716. http://doi. ▪ ▪ org/10.1111/j.1744-6570.1990.tb00679.x EuropeanAsendorpf, Journal J. B., Conner, of Personality M., De Fruyt,27 F., De Houwer, J., Denissen, J. J. A., Fiedler, K., … Wicherts, J. M. (2013). Recommendations for increasing replicability in psychology. ▪ ▪ , (2), 108–119. http://doi.org/10.1002/perApplied Psy- chological Measurement 29 Attali, Y. (2005). Reliability of speeded number-right multiple-choice tests. ▪ ▪ , (5), 357–368. http://doi.org/10.1177/0146621605276676 Augusteijn, H., van Aert, R. C. M., & van Assen, M. A. L. M. (2017). Preprint: The effect ▪ ▪ of publication bias on the assessment of heterogeneity. Retrieved from osf.io/gv25c a standardized mathematics evaluation situation: A hardworking role model or a giftedBagès, roleC., & model?Martinot, British D. (2011). Journal What of Social is the bestPsychology model for50 girls and boys faced with

, (3), 536–543. http://doi. ▪ - ▪ org/10.1111/j.2044-8309.2010.02017.x chological science. Perspectives on Psychological Science 7 Bakker, M., van Dijk, A., & Wicherts, J. M. (2012). The rules of the game called psy , (6), 543–554. http://doi. ▪ ▪ org/10.1177/1745691612459060 the type I error rate in independent samples t tests: The power of alternatives and recommendations.Bakker, M., & Wicherts, Psychological J. M. (2014). Methods Outlier19 removal, sum scores, and the inflation of met0000014 , (3), 409–427. http://doi.org/10.1037/ ▪▪ Psychology of Women Quarterly 37 Ball, L. C., Cribbie, R. A., & Steele, J. R. (2013). Beyond gender differences: using tests of equivalence to evaluate gender similarities. , (2), ▪ ▪ 147–154. http://doi.org/10.1177/0361684313480483 with latent moderated structures to detect uniform and nonuniform measurement Barendse, M. T., Oort, F. AdvancesJ., & Garst, in G.Statistical J. A. (2010). Analysis Using94 restricted factor analysis

bias; a simulation study. , (2), 117–127. http://doi. ▪ ▪ org/10.1007/s10182-010-0126-1 for publication bias. Biometrics 50 Begg, C. B., & Mazumdar, M. (1994). Operating characteristics of a rank correlation test ▪ ▪ , (4), 1088–1101. - sure.Beilock, Journal S. L., of& ExperimentalDecaro, M. S. (2007).Psychology. From Learning, poor performance Memory, and to Cognition success under33 stress: A Working memory, strategy selection, and mathematical problem solving under pres , (6), 983– ▪ ▪ 998. http://doi.org/10.1037/0278-7393.33.6.983 Journal of Experimental Psychology: GeneralBeilock, 136S. L., Rydell, R. J., & McConnell, A. R. (2007). Stereotype threat and working memory: mechanisms, alleviation, and spillover. ▪ ▪ , (2), 256–276. http://doi.org/10.1037/0096-3445.136.2.256 and powerful approach to multiple testing. Journal of the Royal Statistical Society: Se- riesBenjamini, B (Methodological) Y., & Hochberg,57 Y. (1995). Controlling the false discovery rate: A practical ▪ ▪ , (1), 289–300. Jour- nal of Experimental Social Psychology 41 Ben-zeev, T., Fein, S., & Inzlicht, M. (2005). Arousal and stereotype threat. , (2), 174–181. http://doi.org/10.1016/j. jesp.2003.11.007 Addendum

215

▪▪ prediction. Journal of the American Statistical Association 91 Berger, J. O., & Pericchi, L. R. (1996). The intrinsic Bayes factor for model selection and ▪ ▪ , (433), 109–122. Statistical theories of mental test scores Birnbaum, A. (1968). Some latent trait models. In F. M. Lord & M. R. Novick (Eds.), ▪ - ▪ (pp. 397–422). Information Age Publishing. od. British Medical Journal 310 Bland, J. M., & Altman, D. G. (1995). Multiple significance tests: The Bonferroni meth ▪ - ▪ , , 170. rameters: application of an EM algoritm. Psychometrika 46 Bock, R. D., & Aitkin, M. (1981). Marginal maximum likelihood estimation of item pa ▪ - ▪ , (4), 443–459. ditions of test speededness: Application of a mixture Rasch model with ordinal con- straints.Bolt, D. M., Journal Cohen, of A.Educational S., & Wollack, Measurement J. A. (2002).39 Item parameter estimation under con ▪ ▪ , (4), 331–348. regression correction to three nonparametric statistical tests. Journal of Educational MeasurementBolt, D. M., & Gierl,43 M. J. (2006). Testing features of graphical DIF: Application of a ▪ - ▪ , (4), 313–333. http://doi.org/10.1111/j.1745-3984.2006.00019.x formance: The role of interference in working memory. Journal of Experimental Social PsychologyBonnot, V., &43 Croizet, J. C. (2007). Stereotype internalization and women’s math per ▪ ▪ , (6), 857–866. http://doi.org/10.1016/j.jesp.2006.10.006 IntroductionBorenstein, M.,to Meta-Analysis Hedges, L. V., Higgins, J. P. T., & Rothstein, H. R. (2009). Cumulative meta-analysis. In M. Borenstein, L. V. Hedges, J. P. T. Higgins, & H. R. Rothstein (Eds.), ▪ ▪ (pp. 371–376). Chichester: John Wiley & SonsMedical Ltd. Care 44 Borsboom, D. (2006). When does measurement invariance matter? , ▪ - ▪ (11), 176–181. http://doi.org/10.1097/01.mlr.0000245143.08679.cc Gender and Fair Assesment Bridgeman, B., & Schmitt, A. P. (2006). Fairness issues in test development and ad ministration. In W. W. Willingham & N. C. Cole (Eds.), (pp. ▪ ▪ 185–225). Mahwah, New Jersey: Lawrence Erlbaum Associates, Inc. Brief of Experimental Psychologists et al. (2012). Brief of Experimental Psychologists et al. as Amici Curiae Supporting Respondents, Fisher v. University of Texas, August 13, ▪ ▪ 2012 (No. 01–1015). worry in mediating the relationship between stereotype threat and performance. JournalBrodish, of A. Experimental B., & Devine, Social P. G. (2009).Psychology The 45role of performance–avoidance goals and

, (1), 180–185. http://doi.org/10.1016/j. ▪ ▪ jesp.2008.08.005 A gender differences in math performance. Journal of Personality and Social Psychology 76Brown, R. P., & Josephs, R. A. (1999). A burden of proof: Stereotype relevance and , ▪ ▪ (2), 246–257. experience of stereotype threat. Journal of Experimental Social Psychology 39 Brown, R. P., & Pinel, E. C. (2003). Stigma on my mind: Individual differences in the , (6), 626– ▪ ▪ 633. http://doi.org/10.1016/S0022-1031(03)00039-8 differentiation. Psychological Review 106 Bussey, K., & Bandura, A. (1999). Social cognitive theory of gender development and , (4), 676–713. Retrieved from http://www. ncbi.nlm.nih.gov/pubmed/10560326 Addendum

216

▪▪ -

ETSBuzick, Research H., & Stone, Report E. Series (2011).2 Recommendations for conducting differential item func tioning (DIF) analyses for students with disabilities based on previous DIF studies. ▪ ▪ , , 1-26. threat: The effect of expectancy on performance. European Journal of Social Psycholo- Cadinu,gy 33 M., Maass, A., Frigerio, S., Impagliazzo, L., & Latinotti, S. (2003). Stereotype ▪ - ▪ , (2), 267–285. http://doi.org/10.1002/ejsp.145 Cai, L. (2017). FlexMIRT® version 3.51: Flexible multilevel multidimensional item anal ▪ ▪ ysis and test scoring [Computer software]. Chapel Hill, NC: Vector Psychometric Group. EM algorithm. British Journal of Mathematical and Statistical Psychology 61 Cai, L. (2008). SEM of another flavour: Two new applications of the supplemented , , 309–329. ▪ ▪ http://doi.org/10.1348/000711007X249603 a multidimensional item response model. Applied Psychological Measurement 16 Camilli, G. (1992). A conceptual analysis of differential item functioning in terms of , (2), ▪ ▪ 129–147. http://doi.org/10.1177/014662169201600203 performance on a novel visuospatial task. Psychology of Women Quarterly 33 Campbell, S. M., & Collaer, M. L. (2009). Stereotype threat and gender differences in , (4), 437– ▪ - ▪ 444. http://doi.org/10.1111/j.1471-6402.2009.01521.x sessing item bias in item response theory. Applied Psychological Measurement 12 Candell, G. L., & Drasgow, F. (1988). An iterative procedure for linking metrics and as , (3), ▪ ▪ 253–260. Cor- tex 49 Chambers, C. D. (2013). Registered Reports: a new publishing initiative at Cortex. ▪ ▪ , (3), 609–610. Response Theory. Journal of Educational and Behavioral Statistics 22 Chen, W.-H., & Thissen, D. (1997). Local dependence indexes for item pairs using Item ▪ - ▪ , (3), 265–289. crease girls’ participation in the physical sciences? Sex Roles 65 Cherney, I. D., & Campbell, K. L. (2011). A league of their own: Do single-sex schools in , (9), 712–724. http:// ▪ - ▪ doi.org/10.1007/s11199-011-0013-6 ed interest. Sex Roles 63 Cheryan, S., & Plaut, V. C. (2010). Explaining underrepresentation: A theory of preclud ▪ ▪ , (7–8), 475–488. http://doi.org/10.1007/s11199-010-9835-x Cheung, I., Campbell,Perspectives L., LeBel, E. on P., Psychological Ackerman, R. ScienceA., Aykutoglu,11 B., Bahník, Š., … Yong, J. C. (2016). Registered replication report: Study 1 from Finkel, Rusbult, Kumashiro, & Hannon (2002). , (5), 750–764. http://doi. A ▪ ▪ org/10.1177/1745691616664694 activity to a social category undermines children’s achievement. Psychological Science 23Cimpian, A., Mu, Y., & Erickson, L. C. (2012). Who is good at this game? Linking an , ▪ ▪ (5), 533–541. http://doi.org/10.1177/0956797611429803 - dure.Clauser, Applied B. E., Mazor, Measurement K., & Hambleton, in Education R. K. 6(1993). The effects of purification of the matching criterion on the ıdentification of DIF using the Mantel-Haenszel proce , (4), 269–279. http://doi.org/10.1207/ ▪ - ▪ s15324818ame0604_2 entially functioning test items. Educational Measurement: Issues and Practice 17 Clauser, B. E., & Mazor, K. M. (1998). Using statistical procedures to identify differ , (1), 31–44. http://doi.org/10.1111/j.1745-3992.1998.tb00619.x Addendum 217

▪▪ Journal of Educational Measurement 42 Cohen, A., & Bolt, D. (2005). A mixture model analysis of differential item functioning. ▪ - ▪ , (2), 133–148. Cohen, A. S., & Ibarra, R.Gender A. (2005). Differences Examining in Mathematics. gender-related An Differentialintegrative psychologiItem Func- caltioning approach using insights from psychometric and multicontext theory. In A. M. Gallagher & J. C. Kaufman (Eds.), (pp. 143–171). New York, NY: Cambridge University Press. http://doi. ▪ - ▪ org/10.2307/2786930 sures in detection of DIF. Applied Measurement in Education 17 Cohen, A. S., & Kim, S. (1993). A comparison of Lord’s chi-square and Raju’s area mea ▪ ▪ Quantitative Methods In , Psychology(1), 39–52.112 Cohen, J. (1992). A power primer. , (1), 155– ▪ . Retrieved from www. ▪ 159. Test specifications for the redesigned SAT collegeboard.org CollegeBoard. (2015). ▪▪ Psy- chological Bulletin 88 Conger, A. J. (1980). Integration and generalization of kappas for multiple raters. ▪ ▪ , (2), 322–328. studies submitted for review by a human subjects committee. Psychological Methods 2Cooper, H., Deneve, K., & Charlton, K. (1997). Finding the missing science: The fate of , ▪ ▪ (4), 447–452. groups: An experimental test of distinctiveness theory. Journal of Personality and So- cialCota, Psychology A. A., & Dion,50 K. L. (1986). Salience of gender and sex composition of ad hoc ▪ - ▪ , (4), 770–776. http://doi.org/10.1037//0022-3514.50.4.770 ceptance rates: A note on meta-analysis bias. Professional Psychology: Research and PracticeCoursol, A.,17 & Wagner, E. E. (1986). Effect of positive findings on submission and ac ▪ - ▪ , (2), 136–137. ty and degree of speeding. Psychometrika 16 Cronbach, L. J., & Warrington, W. G. (1951). Time-limit tests: estimating their reliabili ▪ - ▪ , (2), 167–188. mentary school children. Child Development 82 Cvencek, D., Meltzoff, A. N., & Greenwald, A. G. (2011). Math-gender stereotypes in ele , (3), 766–779. http://doi.org/10.1111/ ▪ ▪ j.1467-8624.2010.01529.x Annual Review of Sociology 40 Davidov, E., Meuleman, B., Cieciuch, J., Schmidt, P., & Billiet, J. (2014). Measurement equivalence in cross-national research. , , 55–75. http:// ▪ ▪ doi.org/10.1146/annurev-soc-071913-043137 A stereotype threat: Evidence from educational settings. Social Cognition 34 Davies, L. C., Conner, M., Sedikes, C., & Hutter, R. R. C. (2016). Math question type and , (3), 196– ▪ - ▪ 216. tioning: A mixture distribution conceptualization. International Journal of Testing De2 Ayala, R. J., Kim, S., Stapleton, L. M., & Dayton, C. M. (2002). Differential item func , ▪ ▪ (3–4), 243–276. http://doi.org/10.1080/15305058.2002.9669495Psychometrika 73 ▪ ▪ De Boeck, P. (2008). Random itemExplanatory IRT models. Item Response Models, (4), 533–559. De Boeck, P., & Wilson, M. (2004). . (P. De Boeck & M. Wilson, Eds.). New York, NY: Springer-Verlag. 218 Addendum

▪▪ using IRTrees. Journal of Educational Measurement, 54 Debeer, D., Janssen, R., & Boeck, P. (2017). Modeling skipped and not‐reached items ▪ ▪ (3), 333-363. sex–threat interaction. Intelligence 36 - tell.2008.01.008Delgado, A. R., & Prieto, G. (2008). Stereotype threat as validity threat: The anxiety– , (6), 635–640. http://doi.org/10.1016/j.in ▪▪ - tioning. Applied Measurement in Education 24 DeMars, C. E. (2011). An analytic comparison of effect Sizes for Differential Item Func , (3), 189–209. http://doi.org/10.1080/ ▪ ▪ 08957347.2011.580255 sDésert, progressive M., Préaux, matrices. M., &European Jund, R. Journal(2009). ofSo Psychology young and of already Education victims24 of stereotype threat: Socio-economic status and performance of 6 to 9 years old children on Raven’ ▪ ▪ , (2), 207–218. procedure for detecting DIF. Journal of Educational and Behavioral Statistics 18 Donoghue, J. R., & Allen, N. L. (1993). Thin versus thick matching in the Mantel-Haenszel , (2), ▪ ▪ 131–154. http://doi.org/10.3102/10769986018002131 mathematics achievement items. Journal of Educational Measurement 24 Doolittle, A. E., & Cleary, T. A. (1987). Gender-based differential item performance in ▪ ▪ , (2), 157–166. Standardization and the MH method. Applied Measurement in Education 2 Dorans, N. J. (1989). Two new approaches to assessing differential item functioning. , (3), 217– ▪ ▪ 233. Differential Item Functioning Dorans, N. J., & Holland, P. W. (1993). DIF Detection and Description: Mantel-Haenszel and Standardization. In P. W. Holland & H. Wainer (Eds.), ▪ - ▪ (pp. 35–66). Hillsdale, NJ: Lawrence Erlbaum Associates, Inc. proach to assessing unexpected differential item performance on the scholastic apti- tudeDorans, test. N. Journal J., & Kulick, of Educational E. (1986). Measurement Demonstrating23 the utility of the standardization ap ▪ ▪ , (4), 355–368. test performance: A meta-analysis. Learning and Individual Differences 47 Doyle, R. A., & Voyer, D. (2016). Stereotype manipulation effects on math and spatial , , 103–116. ▪ ▪ http://doi.org/10.1016/j.lindif.2015.12.018 theory models to multidimensional data. Applied Psychological Measurement 7 Drasgow, F., & Parsons, C. K. (1983). Application of unidimensional item response , (2), ▪ ▪ 189–199. http://doi.org/10.1177/014662168300700207 behavior. Developmental Psychology 29 A Droege, K. L., & Stipek, D. J. (1993). Children’s use of dispositions to predict classmates’ , (4), 646–654. http://doi.org/10.1037//0012- ▪ ▪ 1649.29.4.646 Bio- metrics 56 Duval, S., & Tweedie, R. (2000). Trim and fill: A simple funnel-plot-based method. ▪ - ▪ , (2), 455–463. Development of Achievement Motivation - demicDweck, Press. C. S. (2002). The development of ability conceptions. In A. Wigfield & J. S. Ec cles (Eds.), (pp. 57–88). San Diego, CA: Aca ▪▪ Self-Esteem: Relations and changes at early adolescence. Journal of Personality 57 Eccles, J. S., Wigfield, A., Constance, A., Miller, C., Reuman, D. A., & Yee, D. (1989). , (2), 283–310. Addendum

219

▪▪ - es in children’s self- and task perceptions during elementary school. Child Develop- mentEccles,64 J., Wigfield, A., Harold, R. D., & Blumenfeld, P. (1993). Age and gender differenc ▪ . Retrieved from ▪ , (3), 830–847. http://doi.org/10.2307/1131221GRE: Guide to the Use of Scores www.ets.org/gre/guide Educational Testing Service. (2016). ▪▪ out-group comparisons moderate stereotype threat effects. Current Psychology 27 Elizaga, R. A., & Markman, K. D. (2008). Peers and performance: How in-group and , (4), ▪ ▪ 290–300. http://doi.org/10.1007/s12144-008-9041-y differences in mathematics: A meta-analysis. Psychological Bulletin 136 Else-Quest, N. M., Hyde, J. S., & Linn, M. C. (2010). Cross-national patterns of gender , (1), 103–27. ▪ ▪ http://doi.org/10.1037/a0018053 Item response theory for psychologists Embretson, S. E., & Reise, S. P. (2000). . Mahwah, ▪ ▪ New Jersey: Lawrence Erlbaum Associates, Inc. Sweden.Eriksson, Scandinavian K., & Lindholm, Journal T. (2007). of Psychology Making 48gender matter: The role of gender-based expectancies and gender identification on women’s and men’s math performance in , (4), 329–338. http://doi.org/10.1111/ ▪ - ▪ j.1467-9450.2007.00588.x Cognition & Emotion 6 Eysenck, M. W., & Calvo, M. G. (1992). Anxiety and performance: The pro cessing efficiency theory. , (6), 409–434. http://doi. ▪ - ▪ org/10.1080/02699939208409696 Psychometrika 30 Feldt, L. S. (1965). The approximate sampling distribution of Kuder-Richardson reli ▪ - ▪ ability coefficient twenty. , (3), 357–370. Educational Studies in Mathemat- icsFennema,21 E., Peterson, P. L., Carpenter, T. P., & Lubinski, C. A. (1990). Teachers’ attribu tions and beliefs about girls, boys, and mathematics. ▪ ▪ , (1), 55–69. http://doi.org/10.1007/BF00311015 meta-analyses.Ferguson, C. J., &Psychological Brannick, M. Methods T. (2012).17 Publication bias in psychological science: Prevalence, methods for identifying and controlling, and implications for the use of , (1), 120–128. http://doi.org/10.1037/ ▪ ▪ a0024445 accuracy of introductory psychology textbooks in covering controversial topics and urbanFerguson, legends C. J., aboutBrown, psychology. M., & Torres, Current A. V. (2016).Psychology Education or indoctrination? The

, 1–9. http://doi.org/10.1007/ ▪ A ▪ s12144-016-9539-7 Language TestingFidalgo, 31A. M., Alavi, S. M., & Amirian, S. M. R. (2014). Strategies for testing statistical and practical significance in detecting DIF with logistic regression models. ▪ ▪ , (4), 433–451. http://doi.org/10.1177/0265532214526748 item functioning detection using Mantel-Haenszel and SIBTEST: Implications for type IFidalgo, and type A. IIM., error Ferreres, rates. TheD., & Journal Muñiz, of J. Experimental(2004). Liberal Education and conservative73 differential

, (1), 23–39. http:// ▪ ▪ doi.org/10.3200/JEXE.71.1.23-40 - dures.Fidalgo, Methods A. M., Mellenbergh,of Psychological G. Research J., & Muñiz, Online J. (2000).5 Effects of amount of DIF, test length, and purification type on robustness and power of Mantel-Haenszel proce , (3), 43–53. 220 Addendum

▪▪ Applied Psychological Mea- surementFinch, W. H.29 (2005). The MIMIC model as a method for detecting DIF: Comparison with Mantel-Haenszel, SIBTEST, and the IRT Likelihood Ratio. ▪ ▪ , (4), 278–295. http://doi.org/10.1177/0146621605275728 A comparison of four methods. Educational and Psychological Measurement 67 Finch, W. H., & French, B. F. (2007). Detection of crossing differential item functioning: , (4), ▪ ▪ 565–582. http://doi.org/10.1177/0013164406296975 effect of different types of stereotype threat on women’s math performance? Journal of ResearchFinnigan, inK. Personality M., & Corker,63 K. S. (2016). Do performance avoidance goals moderate the ▪ ▪ , , 36–43. http://doi.org/10.1016/j.jrp.2016.05.009Psycholog- ical Bulletin 76 Fleiss, J. L. (1971). Measuring nominal scale agreement among many raters. ▪ ▪ , (5), 378–382. http://doi.org/10.1037/h0031619The influence of gender stereotype threat on mathematics test scores of Dutch high school students: A registered report. Flore, P. C., Mulder, J., & Wicherts, J. M. (n.d.). ▪▪ Current and best practices in conduct- ing and reporting DIF analyses. Flore, P. C., Oberski, D. L., & Wicherts, J. M. (n.d.). ▪▪ of girls in stereotyped domains? A meta-analysis. Journal of School Psychology 53 Flore, P. C., & Wicherts, J. M. (2015). Does stereotype threat influence performance , (1), ▪ ▪ 25–44. http://doi.org/10.1016/j.jsp.2014.10.002 reduces effects of stereotype threat on women’s math performance. Personality & Social PsychologyFord, T. E., Ferguson, Bulletin 30M. A., Brooks, J. L., & Hagadone, K. M. (2004). Coping sense of humor ▪ - ▪ , (5), 643–653. http://doi.org/10.1177/0146167203262851 cal journals with respect to sample size and statistical power. PloS One 9 Fraley, R. C., & Vazire, S. (2014). The N-Pact factor: Evaluating the quality of empiri , (10), 1–12. ▪ - ▪ http://doi.org/10.1371/journal.pone.0109019 ence. Psychonomic bulletin & review 21 Francis, G. (2014). The frequency of excess success for articles in Psychological Sci ▪ - ▪ , (5), 1180-1187. chology. Perspectives on Psychological Science 7 Francis, G. (2012). The psychology of replication and replication in psy , (6), 585–594. http://doi. ▪ ▪ org/10.1177/1745691612459520 Journal of Mathematical Psychology 57 Francis, G. (2013). Replication, statistical consistency, and publication bias. ▪ - ▪ , (5), 153–169. http://doi.org/10.1016/j.jmp.2013.02.003 Science 345 A Franco, A., Malhotra, N., & Simonovits, G. (2014). Publication bias in the social scienc ▪ - ▪ es: Unlocking the file drawer. , (6203), 1502–1505. iments: Evidence from a study registry. Social Psychological and Personality Science 7Franco, A., Malhotra, N., & Simonovits, G. (2016). Underreporting in psychology exper , ▪ ▪ (1), 8–12. http://doi.org/10.1177/1948550615598377 mixture model to detect differential item functioning. Journal of Educational Measure- mentFrederickx,47 S., Tuerlinckx, F., De Boeck, P., & Magis, D. (2010). RIM : A random item ▪ ▪ , (4), 432–457. determination of measurement invariance. Structural Equation Modeling 13 French, B. F., & Finch, W. H. (2006). Confirmatory Factor Analytic procedures for the , (3), 378– 402. http://doi.org/10.1207/s15328007sem1303 Addendum 221

▪▪ the invariant referent sets. Structural Equation Modeling 15 French, B. F., & Finch, W. H. (2008). Multigroup confirmatory factor analysis: Locating , (1), 96–113. http://doi. ▪ ▪ org/10.1080/10705510701758349 regression for differential item functioning detection. Educational and Psychological MeasurementFrench, B. F., & 67Maller, S. J. (2007). Iterative purification and effect size use with logistic ▪ ▪ , (3), 373–393. http://doi.org/10.1177/0013164406294781 American Economic Review: Papers &Fryer, Proceedings R. G., Levitt,98 S. D., & List, J. A. (2008). Exploring the impact of financial incentives on stereotype threat : Evidence from a pilot study. ▪ - ▪ , (2), 370–375. tomatic associations disrupt girls' math performance. Child Development, 85 Galdi, S., Cadinu, M., & Tomasetto, C. (2014). The roots of stereotype threat: when au (1), 250- ▪ - ▪ 263. ple-choice and constructed response mathematics items. Applied Measurement in Ed- ucationGamer, M.,12 & Engelhard, G. (1999). Gender differences in performance on multi ▪ ▪ , (1), 29–51. http://doi.org/10.1207/s15324818ame1201 reliability and agreement. R package version 0.84. Retrieved from http://cran.r-proj- ect.org/package=irrGamer, M., Lemon, J., Fellows, I., & Singh, P. (2012). irr: Various coefficients of interrater ▪▪ examination of stereotype threat effects on girls’ mathematics performance. Develop- mentalGanley, PsychologyC. M., Mingle,49 L. A., Ryan, A. M., Ryan, K., Vasilyeva, M., & Perry, M. (2013). An ▪ - ▪ , (10), 1886–1897. http://doi.org/10.1037/a0031412 theGelman, research A., & hypothesis Loken, E. (2014).was posited The gardenahead of of time. forking American paths: ScientistWhy multiple102 compari sons can be a problem, even when there is no “fishing expedition” or “p-hacking” and ▪ ▪ , (6), 460. - typeGerstenberg, threat effect F. X. onR., mathematicalImhoff, R., & Schmitt, performance. M. (2012). European “Women Journal are of bad Personality at math, but26 I”m not, am I ?’ Fragile mathematical self-concept predicts vulnerability to a stereo , , ▪ - ▪ 588–599. http://doi.org/10.1002/per sion 1.0.4. Retrieved from https://cran.r-project.org/package=Scale Giallousis, N. (2015). Scale: Likert Type Questionnaire Item Analysis. R package ver ▪▪ - - tiveGibson, performance. C. E., Losee, Social J., & PsychologyVitiello, C. (2014).45 A replication attempt of stereotype sus ceptibility (Shih, Pittinsky, & Ambady, 1999): Identity salience and shifts in quantita A , (3), 194–198. http://doi.org/10.1027/1864- ▪ ▪ 9335/a000184 cognitive skills that produce gender differences in mathematics: A demonstration of Gierl, M. J., Bisanz, J., Bisanz, G. L., & Boughton, K. A. (2003).40 Identifying content and ▪ ▪ the multidimensionality-based DIF analysis paradigm, (4), 281–306. percentage of DIF items is large. Applied Measurement in Education 17 Gierl, M. J., Gotzmann, A., & Boughton, K. A. (2004). Performance of SIBTEST when the , (3), 241–264. ▪ - ▪ http://doi.org/10.1207/s15324818ame1703 plier tests. Statistica Sinica 8 Glas, C. A. W. (1998). Detection of differential item functioning using Lagrange multi , , 647–667. 222 Addendum

▪▪

StructuralGlockner-Rist, Equation A., & Modeling:Hoijtink, H.A Multidisciplinary(2003). The Best Journal of both10 worlds: Factor analysis of dichotomous data using Item Response Theory and Structural Equation Modeling. , (4), 544–565. http://doi. ▪ - ▪ org/10.1207/S15328007SEM1004 ments: gender differences. The Quartely Journal of Economics 118 Gneezy, U., Niederle, M., & Rustichini, A. (2003). Performance in competitive environ ▪ - ▪ , (3), 1049–1074. tion of differential item functioning with the likelihood ratio test. Methodology 8 González-Betanzos, F., & Abad, F. J. (2012). The effects of purification and the evalua , (4), ▪ ▪ 134–145. http://doi.org/10.1027/1614-2241/a000046 and women’s achievement in high-level math courses. Journal of Applied Developmen- talGood, Psychology C., Aronson,29 J., & Harder, J. A. (2008). Problems in the pipeline: Stereotype threat ▪ ▪ , (1), 17–28. http://doi.org/10.1016/j.appdev.2007.10.004 and counter-stereotypic textbook images on science performance. The Journal of So- cialGood, Psychology J. J., Woodzicka,150 J. A., & Wingfield, L. C. (2010). The effects of gender stereotypic ▪ - ▪ , (2), 132–147. The Handbook of Research Synthesis and Me- ta-AnalysisGreenhouse, J. B., & Iyengar, S. (2009). Sensitivity analysis and diagnostics. In H. Coo per, L. V. Hedges, & J. C. Valentine (Eds.), ▪ - ▪ (2nd editio, pp. 417–434). New York, NY: Russell Sage Foundation. ity constrained hypotheses. Psychological Methods 19 Gu, X., Mulder, J., Deković, M., & Hoijtink, H. (2014). Bayesian evaluation of inequal , (2), 511–527. http://doi.org/ ▪ ▪ http://dx.doi.org/10.1037/met0000017 A general method for testing informative hypotheses. British Journal of Mathematical andGu, X.,Statistical Mulder, PsychologyJ., & Hoijtink,. H. (n.d.). Approximated adjusted fractional Bayes factors: ▪▪ statistical power of the Mantel-Haenszel procedure for detecting DIF: A meta-analysis. PsychologicalGuilera, G., Gómez-Benito, Methods 18 J., Hidalgo, M. D., & Sánchez-Meca, J. (2013). Type I error and ▪ ▪ , (4), 553–571. http://doi.org/10.1037/a0034306Science 320 Guiso, L., Monte, F., & Sapienza, P. (2008). Culture, gender and math. , (5880), ▪ - ▪ 1164–1165. tingency table methods for simultaneous detection of uniform and nonuniform DIF. JournalGüler, N., of &Educational Penfield, R. Measurement D. (2009). A46 comparison of the logistic regression and con

, (3), 314–329. http://doi.org/10.1111/j.1745- A ▪ ▪ 3984.2009.00083.x overview and tutorial. Tutorials in Quantitative Methods for Psychology 8 Hallgren, K. A. (2012). Computing inter-rater reliability for observational data: An ▪ ▪ , (1), 23–34. The science of sex differences in science and mathematics. Psychological Science in the PublicHalpern, Interest D. F., Benbow,8 C. P., Geary, D. C., Gur, R. C., Hyde, J. S., & Gernsbache, M. A. (2007). ▪ ▪ , (1), 1–51. http://doi.org/10.1111/j.1529-1006.2007.00032.x.The Medical Care 44 Hambleton, R. K. (2006). Good practices for identifying differential item functioning. ▪ ▪ , (11), 182–188. functions. Journal of Educational and Behavioral Statistics 23 Hanson, B. A. (1998). Uniform DIF and DIF defined by differences in item response , (3), 244–253. Addendum

223

Harris, A. M., & Carlton, S. T. (1993). Patterns of gender differences on mathematics items on the Scholastic Aptitude Test. Applied Measurement in Education, 6(2), 137–151.
Hartgerink, C. H. J., van Aert, R. C. M., Nuijten, M. B., Wicherts, J. M., & van Assen, M. A. L. M. (2016). Distributions of p-values smaller than .05 in psychology: What is going on? PeerJ. http://doi.org/10.7717/peerj.1935
Hartley, B. L., & Sutton, R. M. (2013). A stereotype threat account of boys' academic underachievement. Child Development, 84(5), 1716–1733. http://doi.org/10.1111/cdev.12079
Hausmann, R., Tyson, L. D., & Zahidi, S. (2012). The global gender gap report. Retrieved from http://www3.weforum.org/docs/GGGR12/MainChapter_GGGR12.pdf
Hedges, L. V. (1981). Distribution theory for Glass's estimator of effect size and related estimators. Journal of Educational Statistics, 6(2), 107–128.
Hedges, L. V. (2007). Effect sizes in cluster-randomized designs. Journal of Educational and Behavioral Statistics, 32(4), 341–370. http://doi.org/10.3102/1076998606298043
Hedges, L. V., & Hedberg, E. C. (2007). Intraclass correlation values for planning group-randomized trials in education. Educational Evaluation and Policy Analysis, 29(1), 60–87. http://doi.org/10.3102/0162373707299706
Hedges, L. V., & Olkin, I. (1985). Parametric estimation of effect size from a series of experiments. In L. V. Hedges & I. Olkin (Eds.), Statistical methods for meta-analysis (pp. 108–148). San Diego, CA: Academic Press.
Hedges, L. V., & Pigott, T. D. (2004). The power of statistical tests for moderators in meta-analysis. Psychological Methods, 9(4), 426–445. http://doi.org/10.1037/1082-989X.9.4.426
Hedges, L. V., & Nowell, A. (1995). Sex differences in mental test scores, variability, and numbers of high-scoring individuals. Science, 269(5220), 41–45.
Herrera, A. N., & Gómez, J. (2008). Influence of equal or unequal comparison group sample sizes on the detection of differential item functioning using the Mantel-Haenszel and logistic regression techniques. Quality & Quantity, 42(6), 739–755. http://doi.org/10.1007/s11135-006-9065-z
Heyman, G. D., Dweck, C. S., & Cain, K. M. (1992). Young children's vulnerability to self-blame and helplessness: Relationship to beliefs about goodness. Child Development, 63(2), 401–415.
Hidalgo, M. D., Gómez-Benito, J., & Zumbo, B. D. (2014). Binary logistic regression analysis for detecting differential item functioning: Effectiveness of R2 and delta log odds ratio effect size measures. Educational and Psychological Measurement, 74(6), 927–949. http://doi.org/10.1177/0013164414523618
Hidalgo, M. D., & López-Pina, J. A. (2004). Differential item functioning detection and effect size: A comparison between logistic regression and Mantel-Haenszel procedures. Educational and Psychological Measurement, 64(6), 903–915. http://doi.org/10.1177/0013164403261769
Holland, P. W., & Thayer, D. T. (1988). Differential item performance and the Mantel-Haenszel procedure. In H. Wainer & H. I. Braun (Eds.), Test validity (pp. 129–145). Hillsdale, NJ: Lawrence Erlbaum Associates, Inc.

Holland, P. W., & Wainer, H. (1993). Differential item functioning. Hillsdale, NJ: Lawrence Erlbaum Associates, Inc.
Holm, S. (1979). A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6(2), 65–70. http://doi.org/10.2307/4615733
Holman, R., & Glas, C. A. W. (2005). Modelling non-ignorable missing-data mechanisms with item response theory models. British Journal of Mathematical and Statistical Psychology, 58, 1–17. http://doi.org/10.1348/000711005X47168
Houts, C. R., & Cai, L. (2016). flexMIRT® user's manual version 3.5: Flexible multilevel multidimensional item analysis and test scoring. Chapel Hill, NC: Vector Psychometric Group.
Huguet, P., & Régner, I. (2007). Stereotype threat among schoolgirls in quasi-ordinary classroom circumstances. Journal of Educational Psychology, 99(3), 545–560. http://doi.org/10.1037/0022-0663.99.3.545
Huguet, P., & Régner, I. (2009). Counter-stereotypic beliefs in math do not protect school girls from stereotype threat. Journal of Experimental Social Psychology, 45(4), 1024–1027. http://doi.org/10.1016/j.jesp.2009.04.029
Hunter, J. E., & Schmidt, F. L. (2004). Technical questions in meta-analysis of correlations. In Methods of meta-analysis: Correcting error and bias in research findings (pp. 189–240). Thousand Oaks, CA: Sage Publications.
Hyde, J. S., Fennema, E., Ryan, M., Frost, L. A., & Hopp, C. (1990). Gender comparisons of mathematics attitudes and affect: A meta-analysis. Psychology of Women Quarterly, 14(3), 299–324. http://doi.org/10.1111/j.1471-6402.1990.tb00022.x
Inzlicht, M., Aronson, J., Good, C., & McKay, L. (2006). A particular resiliency to threatening environments. Journal of Experimental Social Psychology, 42(3), 323–336. http://doi.org/10.1016/j.jesp.2005.05.005
Inzlicht, M., & Ben-Zeev, T. (2000). A threatening intellectual environment: Why females are susceptible to experiencing problem-solving deficits in the presence of males. Psychological Science, 11(5), 365–371.
Inzlicht, M., & Ben-Zeev, T. (2003). Do high-achieving female students underperform in private? The implications of threatening environments on intellectual processing. Journal of Educational Psychology, 95(4), 796–805. http://doi.org/10.1037/0022-0663.95.4.796
Inzlicht, M., & Schmader, T. (2012). Stereotype threat. (M. Inzlicht & T. Schmader, Eds.). New York, NY: Oxford University Press.
Ioannidis, J. P. A. (2005). Why most published research findings are false. PLoS Medicine, 2(8), 696–701. http://doi.org/10.1371/journal.pmed.0020124
Ioannidis, J. P. A. (2008). Why most discovered true associations are inflated. Epidemiology, 19(5), 640–648. http://doi.org/10.1097/EDE.0b013e31818131e7
Ioannidis, J. P. A., & Trikalinos, T. A. (2007). An exploratory test for an excess of significant findings. Clinical Trials, 4(3), 245–253. http://doi.org/10.1177/1740774507079441
Jackson, D. (2006). The implications of publication bias for meta-analysis' other parameter. Statistics in Medicine, 25, 2911–2921. http://doi.org/10.1002/sim.2293


Jamieson, J. P., & Harkins, S. G. (2009). The effect of stereotype threat on the solving of quantitative GRE problems: A mere effort interpretation. Personality & Social Psychology Bulletin, 35(10), 1301–1314. http://doi.org/10.1177/0146167209335165
Jeffreys, H. (1961). Theory of probability (3rd ed.). New York, NY: Oxford University Press.
Jin, Y., Myers, N. D., Ahn, S., & Penfield, R. D. (2012). A comparison of uniform DIF effect size estimators under the MIMIC and Rasch models. Educational and Psychological Measurement, 73(2), 339–358. http://doi.org/10.1177/0013164412462705
Jodoin, M. G., & Gierl, M. J. (2001). Evaluating type I error and power rates using an effect size measure with the logistic regression procedure for DIF detection. Applied Measurement in Education, 14(4), 329–349. http://doi.org/10.1207/S15324818AME1404
John, L. K., Loewenstein, G., & Prelec, D. (2012). Measuring the prevalence of questionable research practices with incentives for truth telling. Psychological Science, 23(5), 524–532. http://doi.org/10.1177/0956797611430953
Johns, M., Schmader, T., & Martens, A. (2005). Knowing is half the battle: Teaching stereotype threat as a means of improving women's math performance. Psychological Science, 16(3), 175–179. http://doi.org/10.1111/j.0956-7976.2005.00799.x
Jones, R. N. (2006). Identification of measurement differences between English and Spanish language versions of the Mini-Mental State Examination. Medical Care, 44(11), 124–133.
Jordan, A. H., & Lovett, B. J. (2007). Stereotype threat and test performance: A primer for school psychologists. Journal of School Psychology, 45(1), 45–59. http://doi.org/10.1016/j.jsp.2006.09.003
Jussim, L., Crawford, J. T., Anglin, S. M., Stevens, S. T., & Duarte, J. L. (2016). Interpretations and methods: Towards a more effectively self-correcting social psychology. Journal of Experimental Social Psychology, 66, 116–133. http://doi.org/10.1016/j.jesp.2015.10.003
Kahler, C. W., Strong, D. R., Read, J. P., Palfai, T. P., & Wood, M. D. (2004). Mapping the continuum of alcohol problems in college students: A Rasch model analysis. Psychology of Addictive Behaviors, 18(4), 322–333. http://doi.org/10.1037/0893-164X.18.4.322
Kalaycioglu, D. B., & Berberoglu, G. (2011). Differential item functioning analysis of the science and mathematics items in the university entrance examinations in Turkey. Journal of Psychoeducational Assessment, 29(5), 467–478. http://doi.org/10.1177/0734282910391623
Kass, R. E., & Raftery, A. E. (1995). Bayes factors. Journal of the American Statistical Association, 90(430), 773–795.
Katz, P. A., & Kofkin, J. A. (1997). Race, gender, and young children. In S. S. Luthar, J. A. Burack, D. Cicchetti, & J. R. Weisz (Eds.), Developmental psychopathology: Perspectives on adjustment, risk, and disorder (pp. 51–74). Cambridge, UK: Cambridge University Press.
Kelderman, H. (1989). Item bias detection using loglinear IRT. Psychometrika, 54(4), 681–697.


Keller, J. (2002). Blatant stereotype threat and women's math performance: Self-handicapping as a strategic means to cope with obtrusive negative performance expectations. Sex Roles, 47(3), 193–198.
Keller, J. (2007a). Stereotype threat in classroom settings: The interactive effect of domain identification, task difficulty and stereotype threat on female students' maths performance. British Journal of Educational Psychology, 77(2), 323–338. http://doi.org/10.1348/000709906XI
Keller, J. (2007b). When negative stereotypic expectancies turn into challenge or threat: The moderating role of regulatory focus. Swiss Journal of Psychology, 66(3), 163–168. http://doi.org/10.1024/1421-0185.66.3.163
Keller, J., & Bless, H. (2008). When positive and negative expectancies disrupt performance: Regulatory focus as a catalyst. European Journal of Social Psychology, 38, 187–212. http://doi.org/10.1002/ejsp
Keller, J., & Dauenheimer, D. (2003). Stereotype threat in the classroom: Dejection mediates the disrupting threat effect on women's math performance. Personality & Social Psychology Bulletin, 29(3), 371–381. http://doi.org/10.1177/0146167202250218
Kepes, S., Banks, G. C., & Oh, I. S. (2012). Avoiding bias in publication bias research: The value of "null" findings. Journal of Business and Psychology, 29(2), 183–203. http://doi.org/10.1007/s10869-012-9279-0
Kerr, N. L. (1998). HARKing: Hypothesizing after the results are known. Personality and Social Psychology Review, 2(3), 196–217.
Khalid, M. N., & Glas, C. A. W. (2014). A scale purification procedure for evaluation of differential item functioning. Measurement, 50, 186–197. http://doi.org/10.1016/j.measurement.2013.12.019
Kiefer, A. K., & Sekaquaptewa, D. (2007). Implicit stereotypes, gender identification, and math-related outcomes: A prospective study of female college students. Psychological Science, 18(1), 13–18. http://doi.org/10.1111/j.1467-9280.2007.01841.x
Kim, J., & Oshima, T. C. (2012). Effect of multiple testing adjustment in differential item functioning detection. Educational and Psychological Measurement, 73(3), 458–470. http://doi.org/10.1177/0013164412467033
Kim, S., & Cohen, A. S. (1995). A comparison of Lord's chi square, Raju's area measures, and the likelihood ratio test on detection of differential item functioning. Applied Measurement in Education, 8(4), 291–312.
Kinlaw, R. C., & Kurtz-Costes, B. (2003). The development of children's beliefs about intelligence. Developmental Review, 23(2), 125–161. http://doi.org/10.1016/S0273-2297(03)00010-8
Klockars, A. J., & Lee, Y. (2008). Simulated tests of differential item functioning using SIBTEST with and without impact. Journal of Educational Measurement, 45(3), 271–285. http://doi.org/10.1111/j.1745-3984.2008.00064.x
Kok, F. G., Mellenbergh, G. J., & Van Der Flier, H. (1985). Detecting experimentally induced item bias using the iterative logit method. Journal of Educational Measurement, 22(4), 295–303. http://doi.org/10.1111/j.1745-3984.1985.tb01066.x

Kopf, J., Zeileis, A., & Strobl, C. (2015a). A framework for anchor methods and an iterative forward approach for DIF detection. Applied Psychological Measurement, 39(2), 83–103. http://doi.org/10.1177/0146621614544195
Kopf, J., Zeileis, A., & Strobl, C. (2015b). Anchor selection strategies for DIF analysis: Review, assessment, and new approaches. Educational and Psychological Measurement, 75(1), 22–56. http://doi.org/10.1177/0013164414529792
Kruyen, P. M., Emons, W. H. M., & Sijtsma, K. (2013). On the shortcomings of shortened tests: A literature review. International Journal of Testing, 13(3), 223–248. http://doi.org/10.1080/15305058.2012.703734
Kurtz-Costes, B., Rowley, S. J., Harris-Britt, A., & Woods, T. A. (2008). Gender stereotypes about mathematics and science and self-perceptions of ability in late childhood and early adolescence. Merrill-Palmer Quarterly, 54(3), 386–409. http://doi.org/10.1353/mpq.0.0001
Lacy, S., Watson, B. R., Riffe, D., & Lovejoy, J. (2015). Issues and best practices in content analysis. Journalism & Mass Communication Quarterly, 92(4), 791–811. http://doi.org/10.1177/1077699015607338
Lai, J.-S., Cella, D., Chang, C. H., Bode, R. K., & Heinemann, A. W. (2003). Item banking to improve, shorten and computerize self-reported fatigue: An illustration of steps to create a core item bank from the FACIT-Fatigue Scale. Quality of Life Research, 12(5), 485–501. http://doi.org/10.1023/A:1025014509626
Lamont, R. A., Swift, H. J., & Abrams, D. (2015). A review and meta-analysis of age-based stereotype threat: Negative stereotypes, not facts, do the damage. Psychology and Aging, 30(1), 180–193. http://doi.org/10.1037/a0038586
Lebel, E. P., Borsboom, D., Giner-Sorolla, R., Hasselman, F., Peters, K. R., Ratliff, K. A., & Smith, C. T. (2013). PsychDisclosure.org: Grassroots support for reforming reporting standards in psychology. Perspectives on Psychological Science, 8(4), 424–432. http://doi.org/10.1177/1745691613491437
Leedy, M. G., LaLonde, D., & Runk, K. (2003). Gender equity in mathematics: Beliefs of students, parents, and teachers. School Science and Mathematics, 103(6), 285–292.
Lei, P.-W., & Li, H. (2013). Small-sample DIF estimation using SIBTEST, Cochran's Z, and log-linear smoothing. Applied Psychological Measurement, 37(5), 397–416. http://doi.org/10.1177/0146621613478150
Lesko, A. C., & Corpus, J. H. (2006). Discounting the difficult: How high math-identified women respond to stereotype threat. Sex Roles, 54(1–2), 113–125. http://doi.org/10.1007/s11199-005-8873-2
Levine, F. J., & Ancheta, A. N. (2013). The AERA et al. amicus brief in Fisher v. University of Texas at Austin: Scientific organizations serving society. Educational Researcher, 42(3), 166–171. http://doi.org/10.3102/0013189X13486765
Li, H., & Stout, W. (1996). A new procedure for detection of crossing DIF. Psychometrika, 61(4), 647–677.
Li, Q. (1999). Teachers' beliefs and gender differences in mathematics: A review. Educational Research, 41(1), 63–76. http://doi.org/10.1080/0013188990410106

Li, Y., Cohen, A. S., & Ibarra, R. A. (2004). Characteristics of mathematics items associated with gender DIF. International Journal of Testing, 4(2), 115–136. http://doi.org/10.1207/s15327574ijt0402_2
Li, Z. (2014a). A power formula for the SIBTEST procedure for differential item functioning. Applied Psychological Measurement, 38(4), 311–328. http://doi.org/10.1177/0146621613518095
Li, Z. (2014b). Power and sample size calculations for logistic regression tests for differential item functioning. Journal of Educational Measurement, 51(4), 441–462. http://doi.org/10.1016/0197-2456(90)90005-M
Li, Z. (2015). A power formula for the Mantel-Haenszel test for differential item functioning. Applied Psychological Measurement, 39(5), 373–388. http://doi.org/10.1177/0146621613518095
Lindberg, S. M., Hyde, J. S., Petersen, J. L., & Linn, M. C. (2010). New trends in gender and mathematics performance: A meta-analysis. Psychological Bulletin, 136(6), 1123–1135. http://doi.org/10.1037/a0021276
Logel, C. R., Walton, G. M., Spencer, S. J., Peach, J., & Mark, Z. P. (2012). Unleashing latent ability: Implications of stereotype threat for college admissions. Educational Psychologist, 47(1), 42–50. http://doi.org/10.1080/00461520.2011.611368
Lombard, M., Snyder-Duch, J., & Bracken, C. C. (2002). Content analysis in mass communication: Assessment and reporting of intercoder reliability. Human Communication Research, 28(4), 587–604.
Lord, C. G., & Saenz, D. S. (1985). Memory deficits and memory surfeits: Differential cognitive consequences of tokenism for tokens and observers. Journal of Personality and Social Psychology, 49(4), 918–926.
Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Erlbaum.
Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Charlotte, NC: Addison-Wesley.
Love, J., Selker, R., Marsman, M., Jamil, T., Dropmann, D., Verhagen, A. J., & Wagenmakers, E. J. (2015). JASP.
Lu, Y., & Sireci, S. G. (2007). Validity issues in test speededness. Educational Measurement: Issues and Practice, 26(4), 29–37.
Lummis, M., & Stevenson, H. W. (1990). Gender differences in beliefs and achievement: A cross-cultural study. Developmental Psychology, 26(2), 254–263. http://doi.org/10.1037//0012-1649.26.2.254
Magis, D., & Facon, B. (2012). Angoff's delta method revisited: Improving DIF detection under small samples. British Journal of Mathematical and Statistical Psychology, 65(2), 302–321. http://doi.org/10.1111/j.2044-8317.2011.02025.x
Magis, D., & Facon, B. (2012). Item purification does not always improve DIF detection: A counterexample with Angoff's delta plot. Educational and Psychological Measurement, 73(2), 293–311. http://doi.org/10.1177/0013164412451903

Mahoney, M. J. (1977). Publication prejudices: An experimental study of confirmatory bias in the peer review system. Cognitive Therapy and Research, 1(2), 161–175.
Marchand, G. C., & Taasoobshirazi, G. (2013). Stereotype threat and women's performance in physics. International Journal of Science Education, 35(18), 3050–3061. http://doi.org/10.1080/09500693.2012.683461
Markus, H. (1978). The effect of mere presence on social facilitation: An unobtrusive test. Journal of Experimental Social Psychology, 14, 389–397.
Marszalek, J. M., Barber, C., Kohlhart, J., & Holmes, C. B. (2011). Sample size in psychological research over the past 30 years. Perceptual and Motor Skills, 112(2), 331–348. http://doi.org/10.2466/03.11.PMS.112.2.331-348
Martens, A., Johns, M., Greenberg, J., & Schimel, J. (2006). Combating stereotype threat: The effect of self-affirmation on women's intellectual performance. Journal of Experimental Social Psychology, 42, 236–243. http://doi.org/10.1016/j.jesp.2005.04.010
Martin, C. L., & Little, J. K. (1990). The relation of gender understanding to children's sex-typed preferences and gender stereotypes. Child Development, 61(5), 1427–1439.
Martin, M. O., Gonzalez, E. J., & Chrostowski, S. J. (2003). TIMSS 2003 international mathematics report.
Martin, M. O., Mullis, I. V. S., Gonzalez, E. J., & Chrostowski, S. J. (2004). TIMSS 2003 international science report. TIMSS & PIRLS International Study Center, Boston College.
Martinot, D., Bagès, C., & Désert, M. (2012). French children's awareness of gender stereotypes about mathematics and reading: When girls improve their reputation in math. Sex Roles, 66(3–4), 210–219. http://doi.org/10.1007/s11199-011-0032-3
Martinot, D., & Désert, M. (2007). Awareness of a gender stereotype, personal beliefs and self-perceptions regarding math ability: When boys do not surpass girls. Social Psychology of Education, 10(4), 455–471. http://doi.org/10.1007/s11218-007-9028-9
Marx, D. M., & Ko, S. J. (2012). Superstars "like" me: The effect of role model similarity on performance under threat. European Journal of Social Psychology, 42(7), 807–812. http://doi.org/10.1002/ejsp.1907
Marx, D. M., & Roman, J. S. (2002). Female role models: Protecting women's math test performance. Personality and Social Psychology Bulletin, 28(9), 1183–1193. http://doi.org/10.1177/01461672022812004
Marx, D. M., Stapel, D. A., & Muller, D. (2005). We can do it: The interplay of construal orientation and social comparisons under threat. Journal of Personality and Social Psychology, 88(3), 432–446. http://doi.org/10.1037/0022-3514.88.3.432
Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47(2), 149–174. http://doi.org/10.1017/CBO9781107415324.004
Maxwell, S. E. (2004). The persistence of underpowered studies in psychological research: Causes, consequences, and remedies. Psychological Methods, 9(2), 147–163. http://doi.org/10.1037/1082-989X.9.2.147
Maydeu-Olivares, A., & Joe, H. (2005). Limited- and full-information estimation and goodness-of-fit testing in 2^n contingency tables. Journal of the American Statistical Association, 100(471), 1009–1020. http://doi.org/10.1198/016214504000002069

Mazor, K. M., Clauser, B. E., & Hambleton, R. K. (1994). Identification of nonuniform differential item functioning using a variation of the Mantel-Haenszel procedure. Educational and Psychological Measurement, 54(2), 284–291.
Mazor, K. M., Hambleton, R. K., & Clauser, B. E. (1998). Multidimensional DIF analyses: The effects of matching on unidimensional subtest scores. Applied Psychological Measurement, 22(4), 357–367.
McAlinden, C., Pesudovs, K., & Moore, J. E. (2010). The development of an instrument to measure quality of vision: The Quality of Vision (QoV) questionnaire. Investigative Ophthalmology and Visual Science, 51(11), 5537–5545. http://doi.org/10.1167/iovs.10-5341
McGuire, W. J., McGuire, C. V., & Winton, W. (1979). Effects of household sex composition on the salience of one's gender in the spontaneous self-concept. Journal of Experimental Social Psychology, 15, 77–90.
McIntyre, R. B., Paulson, R. M., Taylor, C. A., Morin, A. L., & Lord, C. G. (2011). Effects of role model deservingness on overcoming performance deficits induced by stereotype threat. European Journal of Social Psychology, 41(3), 301–311. http://doi.org/10.1002/ejsp.774
Meade, A. W. (2010). A taxonomy of effect size measures for the differential functioning of items and scales. Journal of Applied Psychology, 95(4), 728–743. http://doi.org/10.1037/a0018966
Mellenbergh, G. J. (1982). Contingency table models for assessing item bias. Journal of Educational Statistics, 7(2), 105–118.
Mellenbergh, G. J. (1989). Item bias and item response theory. International Journal of Educational Research, 13(2), 127–143.
Mellenbergh, G. J. (2011). Classical analysis of item scores. In G. J. Mellenbergh (Ed.), A conceptual introduction to psychometrics (pp. 147–195). The Hague: Eleven International Publishing.
Meredith, W. (1993). Measurement invariance, factor analysis and factorial invariance. Psychometrika, 58(4), 525–543.
Meulders, M., & Xie, Y. (2004). Person-by-item predictors. In P. De Boeck & M. Wilson (Eds.), Explanatory item response models (pp. 213–240). New York, NY: Springer-Verlag.
Miller, D. I., Eagly, A. H., & Linn, M. C. (2015). Women's representation in science predicts national gender-science stereotypes: Evidence from 66 nations. Journal of Educational Psychology, 107(3), 631–644.
Miller, M. D., & Oshima, T. C. (1992). Effect of sample size, number of biased items, and magnitude of bias on a two-stage item bias estimation method. Applied Psychological Measurement, 16(4), 381–388. http://doi.org/10.1177/014662169201600410
Millsap, R. E. (2006). Comments on methods for the investigation of measurement bias in the Mini-Mental State Examination. Medical Care, 44(11), 171–175.
Millsap, R. E. (2011). Item response theory: Tests of invariance. In R. E. Millsap (Ed.), Statistical approaches to measurement invariance (pp. 191–231). New York, NY: Routledge.

Millsap, R. E., & Everson, H. T. (1993). Methodology review: Statistical approaches for assessing measurement bias. Applied Psychological Measurement, 17(4), 297–334. http://doi.org/10.1177/014662169301700401
Millsap, R. E., & Kwok, O.-M. (2004). Evaluating the impact of partial factorial invariance on selection in two populations. Psychological Methods, 9(1), 93–115. http://doi.org/10.1037/1082-989X.9.1.93
Miyake, A., Kost-Smith, L. E., Finkelstein, N. D., Pollock, S. J., Cohen, G. L., & Ito, T. A. (2010). Reducing the gender achievement gap in college science: A classroom study of values affirmation. Science, 330, 1234–1237. http://doi.org/10.1126/science.1195996
Moè, A. (2009). Are males always better than females in mental rotation? Exploring a gender belief explanation. Learning and Individual Differences, 19(1), 21–27. http://doi.org/10.1016/j.lindif.2008.02.002
Moè, A. (2012). Gender difference does not mean genetic difference: Externalizing improves performance in mental rotation. Learning and Individual Differences, 22(1), 20–24. http://doi.org/10.1016/j.lindif.2011.11.001
Moè, A., & Pazzaglia, F. (2006). Following the instructions! Effects of gender beliefs in mental rotation. Learning and Individual Differences, 16(4), 369–377. http://doi.org/10.1016/j.lindif.2007.01.002
Monahan, P. O., McHorney, C. A., Stump, T. E., & Perkins, A. J. (2007). Odds ratio, delta, ETS classification, and standardization measures of DIF magnitude for binary logistic regression. Journal of Educational and Behavioral Statistics, 32(1), 92–109. http://doi.org/10.3102/1076998606298035
Moon, A., & Roeder, S. S. (2014). A secondary replication attempt of stereotype susceptibility (Shih, Pittinsky, & Ambady, 1999). Social Psychology, 45(3), 199–201. http://doi.org/10.1027/1864-9335/a000193
Mulder, J. (2014). Prior adjusted default Bayes factors for testing (in)equality constrained hypotheses. Computational Statistics & Data Analysis, 71, 448–463. http://doi.org/10.1016/j.csda.2013.07.017
Mullis, I. V. S., Martin, M. O., Foy, P., & Arora, A. (2012). TIMSS 2011 international results in mathematics. Chestnut Hill, MA: TIMSS & PIRLS International Study Center, Boston College.
Murphy, M. C., Steele, C. M., & Gross, J. J. (2007). Signaling threat: How situational cues affect women in math, science, and engineering settings. Psychological Science, 18(10), 879–885. http://doi.org/10.1111/j.1467-9280.2007.01995.x
Muthén, B., & Lehman, J. (1985). Multiple group IRT modeling: Applications to item bias analysis. Journal of Educational Statistics, 10(2), 133–142.
Muzzatti, B., & Agnoli, F. (2007). Gender and mathematics: Attitudes and stereotype threat susceptibility in Italian children. Developmental Psychology, 43(3), 747–759. http://doi.org/10.1037/0012-1649.43.3.747

Narayanan, P., & Swaminathan, H. (1994). Performance of the Mantel-Haenszel and simultaneous item bias procedures for detecting differential item functioning. Applied Psychological Measurement, 18(4), 315–328. http://doi.org/10.1177/014662169401800403
Narayanan, P., & Swaminathan, H. (1996). Identification of items that show nonuniform DIF. Applied Psychological Measurement, 20(3), 257–274. http://doi.org/10.1177/014662169602000306
Navas-Ara, M. J., & Gómez-Benito, J. (2002). Effects of ability scale purification on the identification of DIF. European Journal of Psychological Assessment, 18(1), 9–15. http://doi.org/10.1027//1015-5759.18.1.9
Neuburger, S., Jansen, P., Heil, M., & Quaiser-Pohl, C. (2012). A threat in the classroom: Gender stereotype activation and mental-rotation performance in elementary-school children. Zeitschrift für Psychologie, 220(2), 61–69. http://doi.org/10.1027/2151-2604/a000097
Neuville, E., & Croizet, J. C. (2007). Can salience of gender identity impair math performance among 7–8 years old girls? The moderating role of task difficulty. European Journal of Psychology of Education, 12(3), 307–316.
Nguyen, H. H. D., & Ryan, A. M. (2008). Does stereotype threat affect test performance of minorities and women? A meta-analysis of experimental evidence. The Journal of Applied Psychology, 93(6), 1314–1334. http://doi.org/10.1037/a0012702
Nicholls, J. G. (1979). Development of perception of own attainment and causal attributions for success and failure in reading. Journal of Educational Psychology, 71(1), 94–99. Retrieved from http://www.ncbi.nlm.nih.gov/pubmed/438417
Nosek, B. A., & Lakens, D. (2014). Registered reports. Social Psychology, 45(3), 137–141. http://doi.org/10.1027/1864-9335/a000192
Nosek, B. A., Smyth, F. L., Sriram, N., Lindner, N. M., Devos, T., Ayala, A., … Greenwald, A. G. (2009). National differences in gender-science stereotypes predict national sex differences in science and math achievement. Proceedings of the National Academy of Sciences of the United States of America, 106(26), 10593–10597. http://doi.org/10.1073/pnas.0809921106
O'Brien, L. T., & Crandall, C. S. (2003). Stereotype threat and arousal: Effects on women's math performance. Personality and Social Psychology Bulletin, 29(6), 782–789. http://doi.org/10.1177/0146167203252810
O'Hagan, A. (1995). Fractional Bayes factors for model comparison. Journal of the Royal Statistical Society: Series B (Methodological), 57(1), 99–138.
OECD. (2010). PISA 2009 results: What students know and can do: Student performance in reading, mathematics and science (Vol. I). http://doi.org/10.1787/9789264091450-en
Ong, M. L., Kim, S., Cohen, A., & Cramer, S. (2015). A comparison of differential item functioning detection for dichotomously scored items using IRTPRO, BILOG-MG, and IRTLRDIF. In L. A. van der Ark, D. M. Bolt, W.-C. Wang, J. A. Douglas, & S.-M. Chow (Eds.), Quantitative psychology research: The 79th annual meeting of the Psychometric Society, Madison, Wisconsin, 2014 (pp. 121–132). Springer International Publishing AG Switzerland.

Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349(6251), aac4716. http://doi.org/10.1126/science.aac4716
Orlando, M., & Thissen, D. (2003). Further investigation of the performance of S-X2: An item fit index for use with dichotomous item response theory models. Applied Psychological Measurement, 27(4), 289–298. http://doi.org/10.1177/0146621603253011
Osborne, J. W. (2001). Testing stereotype threat: Does anxiety explain race and sex differences in achievement? Contemporary Educational Psychology, 26(3), 291–310. http://doi.org/10.1006/ceps.2000.1052
Osborne, J. W. (2007). Linking stereotype threat and anxiety. Educational Psychology, 27(1), 135–154. http://doi.org/10.1080/01443410601069929
Oshima, T. C. (1994). The effect of speededness on parameter estimation in item response theory. Journal of Educational Measurement, 31(3), 200–219.
Oshima, T. C., & Morris, S. B. (2008). Raju's differential functioning of items and tests (DFIT). Educational Measurement: Issues and Practice, 27(3), 43–50.
Oswald, D. L., & Harvey, R. D. (2001). Hostile environments, stereotype threat, and math performance among undergraduate women. Current Psychology, 19(4), 338–356.
Paek, I. (2012). A note on three statistical tests in the logistic regression DIF procedure. Journal of Educational Measurement, 49(2), 121–126. http://doi.org/10.1111/j.1745-3984.2012.00164.x
Paek, I., & Wilson, M. (2011). Formulating the Rasch differential item functioning model under the marginal maximum likelihood estimation context and its comparison with Mantel-Haenszel procedure in short test and small sample conditions. Educational and Psychological Measurement, 71(6), 1023–1046. http://doi.org/10.1177/0013164411400734
Park, D.-G., & Lautenschlager, G. J. (1990). Improving IRT item bias detection with iterative linking and ability scale purification. Applied Psychological Measurement, 14(2), 163–173. http://doi.org/10.1177/014662169001400205
Penfield, R. D. (2001). Assessing differential item functioning among multiple groups: A comparison of three Mantel-Haenszel procedures. Applied Measurement in Education, 14(3), 235–259. http://doi.org/10.1207/S15324818AME1403
Penfield, R. D. (2003). Applying the Breslow-Day test of trend in odds ratio heterogeneity to the analysis of nonuniform DIF. The Alberta Journal of Educational Research, 49(3), 231–243.
Picho, K., Rodriguez, A., & Finnie, L. (2013). Exploring the moderating role of context on the mathematics performance of females under stereotype threat: A meta-analysis. The Journal of Social Psychology, 153(3), 299–333.
Picho, K., & Schmader, T. (2017). When do gender stereotypes impair math performance? A study of stereotype threat among Ugandan adolescents. Sex Roles, 1–12. http://doi.org/10.1007/s11199-017-0780-9
Picho, K., & Stephens, J. M. (2012). Culture, context and stereotype threat: A comparative analysis of young Ugandan women in coed and single-sex schools. The Journal of Educational Research, 105(1), 52–63. http://doi.org/10.1080/00220671.2010.517576

Pimentel, J. L., & Glas, C. A. W. (2008). Modeling nonignorable missing data in speeded tests. Educational and Psychological Measurement, 68(6), 907–922.
Prast, H. (2017). De mythe van de voltooide emancipatie [The myth of completed emancipation]. Retrieved August 27, 2017, from http://www.mejudice.nl/artikelen/detail/de-mythe-van-de-voltooide-emancipatie
Prieto, G., & Delgado, A. R. (2007). Measuring math anxiety (in Spanish) with the Rasch rating scale model. Journal of Applied Measurement, 8(2), 149–160.
Prieto-Marañón, P., Aguerri, M. E., Galibert, M. S., & Attorresi, H. F. (2012). Detection of differential item functioning. Methodology: European Journal of Research Methods for the Behavioral and Social Sciences, 8(2), 63–70. http://doi.org/10.1027/1614-2241/a000038
Pronin, E., Steele, C. M., & Ross, L. (2004). Identity bifurcation in response to stereotype threat: Women and mathematics. Journal of Experimental Social Psychology, 40(2), 152–168. http://doi.org/10.1016/S0022-1031(03)00088-X
Raju, N. S. (1988). The area between two item characteristic curves. Psychometrika, 53(4), 495–502.
Raju, N. S. (1990). Determining the significance of estimated signed and unsigned areas between two item response functions. Applied Psychological Measurement, 14(2), 197–207.
Raju, N. S., van der Linden, W. J., & Fleer, P. F. (1995). IRT-based internal measures of differential functioning of items and tests. Applied Psychological Measurement, 19(4), 353–368. http://doi.org/10.1177/014662169501900405
Raykov, T., Marcoulides, G. A., Lee, C.-L., & Chang, C. (2013). Studying differential item functioning via latent variable modeling: A note on a multiple-testing procedure. Educational and Psychological Measurement, 73(5), 898–908. http://doi.org/10.1177/0013164413478165
Robinson, J. P., & Lubienski, S. T. (2011). The development of gender achievement gaps in mathematics and reading during elementary and middle school: Examining direct cognitive assessments and teacher ratings. American Educational Research Journal, 48(2), 268–302. http://doi.org/10.3102/0002831210372249
Rogers, H. J., & Swaminathan, H. (1993). A comparison of logistic regression and Mantel-Haenszel procedures for detecting differential item functioning. Applied Psychological Measurement, 17(2), 105–116. http://doi.org/10.1177/014662169301700201
Rosenthal, R. (1979). The file drawer problem and tolerance for null results. Psychological Bulletin, 86(3), 638–641. http://doi.org/10.1037//0033-2909.86.3.638
Rossi, J. S. (1990). Statistical power of psychological research: What have we gained in 20 years? Journal of Consulting and Clinical Psychology, 58(5), 646–656.
Rothstein, H. R. (2007). Publication bias as a threat to the validity of meta-analytic results. Journal of Experimental Criminology, 4(1), 61–81. http://doi.org/10.1007/s11292-007-9046-9

Roussos, L. A., & Stout, W. F. (1996). Simulation studies of the effects of small sample size and studied item parameters on SIBTEST and Mantel-Haenszel type I error performance. Journal of Educational Measurement, 33(2), 215–230. http://doi.org/10.1111/j.1745-3984.1996.tb00490.x
Ryan, K. E., & Chiu, S. (2001). An examination of item context effects, DIF, and gender DIF. Applied Measurement in Education, 14(1), 73–90. http://doi.org/10.1207/S15324818AME1401
Rydell, R. J., McConnell, A. R., & Beilock, S. L. (2009). Multiple social identities and stereotype threat: Imbalance, accessibility, and working memory. Journal of Personality and Social Psychology, 96(5), 949–966. http://doi.org/10.1037/a0014846
Rydell, R. J., Rydell, M. T., & Boucher, K. L. (2010). The effect of negative performance stereotypes on learning. Journal of Personality and Social Psychology, 99(6), 883–896. http://doi.org/10.1037/a0021139
Sackett, P. R. (2003). Stereotype threat in applied selection settings: A commentary. Human Performance, 16(3), 295–309. http://doi.org/10.1207/S15327043HUP1603
Sackett, P. R., & Ryan, A. M. (2012). Concerns about generalizing stereotype threat research findings to operational high-stakes testing. In M. Inzlicht & T. Schmader (Eds.), Stereotype threat (pp. 249–263). New York, NY: Oxford University Press.
Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores. Richmond, VA: Psychometric Society.
Scheuneman, J. D. (1987). An experimental, exploratory study of causes of bias in test items. Journal of Educational Measurement, 24(2), 97–118.
Schmader, T. (2002). Gender identification moderates stereotype threat effects on women's math performance. Journal of Experimental Social Psychology, 38(2), 194–201.
Schmader, T., & Johns, M. (2003). Converging evidence that stereotype threat reduces working memory capacity. Journal of Personality and Social Psychology, 85(3), 440–452. http://doi.org/10.1037/0022-3514.85.3.440
Schmader, T., Johns, M., & Barquissau, M. (2004). The costs of accepting gender differences: The role of stereotype endorsement in women's experience in the math domain. Sex Roles, 50(11–12), 835–850.
Schmader, T., Johns, M., & Forbes, C. (2008). An integrated process model of stereotype threat effects on performance. Psychological Review, 115(2), 336–356. http://doi.org/10.1037/0033-295X.115.2.336
Schmitt, A. P., & Dorans, N. J. (1990). Differential item functioning for minority examinees on the SAT. Journal of Educational Measurement, 27(1), 67–81.
Schmitt, A. P., Dorans, N. J., Crone, C. R., & Maneckshana, B. T. (1991). Differential speededness and item omit patterns on the SAT.
Schmitt, N., & Kuljanin, G. (2008). Measurement invariance: Review of practice and implications. Human Resource Management Review, 18(4), 210–222. http://doi.org/10.1016/j.hrmr.2008.03.003

Schuster, C., & Martiny, S. E. (2017). Not feeling good in STEM: Effects of stereotype activation and anticipated affect on women's career aspirations. Sex Roles, 76(1–2), 40–55. http://doi.org/10.1007/s11199-016-0665-3
Sekaquaptewa, D., & Thompson, M. (2003). Solo status, stereotype threat, and performance expectancies: Their effects on women's performance. Journal of Experimental Social Psychology, 39(1), 68–74.
Seybert, J., Stark, S., & Chernyshenko, O. S. (2013). Detecting DIF with ideal point models: A comparison of area and parameter difference methods. Applied Psychological Measurement, 38(2), 151–165. http://doi.org/10.1177/0146621613508306
Shadish, W. R., Cook, T. D., & Campbell, D. T. (2002). Construct validity and external validity. In Experimental and quasi-experimental designs for generalized causal inference (pp. 64–102). Houghton Mifflin Company.
Shapiro, J. R., & Williams, A. M. (2012). The role of stereotype threats in undermining girls' and women's performance and interest in STEM fields. Sex Roles, 66(3–4), 175–183. http://doi.org/10.1007/s11199-011-0051-0
Shealy, R., & Stout, W. (1993). A model-based standardization approach that separates true bias/DIF from group ability differences and detects test bias/DTF as well as item bias/DIF. Psychometrika, 58(2), 159–194.
Shih, C.-L., & Wang, W.-C. (2009). Differential item functioning detection using the multiple indicators, multiple causes method with a pure short anchor. Applied Psychological Measurement, 33(3), 184–199. http://doi.org/10.1177/0146621608321758
Shih, M., Pittinsky, T. L., & Ambady, N. (1999). Stereotype susceptibility: Identity salience and shifts in quantitative performance. Psychological Science, 10(1), 80–83.
Shrout, P. E., & Fleiss, J. L. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86(2), 420–428.
Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11), 1359–1366. http://doi.org/10.1177/0956797611417632
Simons, D. J., Holcombe, A. O., & Spellman, B. A. (2014). An introduction to registered replication reports at Perspectives on Psychological Science. Perspectives on Psychological Science, 9(5), 552–555. http://doi.org/10.1177/1745691614543974
Simonsohn, U., Nelson, L. D., & Simmons, J. P. (2013). P-curve: A key to the file drawer. Journal of Experimental Psychology: General, 1–40.
Simonsohn, U., Nelson, L. D., & Simmons, J. P. (2014). p-Curve and effect size: Correcting for publication bias using only significant results. Perspectives on Psychological Science, 9(6), 666–681. http://doi.org/10.1177/1745691614553988
Sireci, S. G., & Rios, J. A. (2013). Decisions that make a difference in detecting differential item functioning. Educational Research and Evaluation, 19(2–3), 170–187. http://doi.org/10.1080/13803611.2013.767621

Smeding, A., Dumas, F., Loose, F., & Régner, I. (2013). Order of administration of math and verbal tests: An ecological intervention to reduce stereotype threat on girls' math performance. Journal of Educational Psychology, 105(3), 850–860. http://doi.org/10.1037/a0032094
Smith, J. L., & White, P. H. (2002). An examination of implicitly activated, explicitly activated, and nullified stereotypes on mathematical performance: It's not just a woman's issue. Sex Roles, 47(3–4), 179–191.
Snijders, T. A., & Bosker, R. J. (2012). Assumptions of the hierarchical linear model. In T. A. Snijders & R. J. Bosker (Eds.), Multilevel analysis: An introduction to basic and advanced multilevel modeling (pp. 152–173). London: Sage Publications.
Soares, T. M., Gonçalves, F. B., & Gamerman, D. (2009). An integrated Bayesian model for DIF analysis. Journal of Educational and Behavioral Statistics, 34(3), 348–377. http://doi.org/10.3102/1076998609332752
Spencer, B., & Castano, E. (2007). Social class is dead. Long live social class! Stereotype threat among low socioeconomic status individuals. Social Justice Research, 20(4), 418–432. http://doi.org/10.1007/s11211-007-0047-7
Spencer, S. J., Logel, C., & Davies, P. G. (2016). Stereotype threat. Annual Review of Psychology, 67(1), 415–437. http://doi.org/10.1146/annurev-psych-073115-103235
Spencer, S. J., Steele, C. M., & Quinn, D. M. (1999). Stereotype threat and women's math performance. Journal of Experimental Social Psychology, 35(1), 4–28. http://doi.org/10.1006/jesp.1998.1373
Srisurapanont, M., Kittiratanapaiboon, P., Likhitsathian, S., Kongsuk, T., Suttajit, S., & Junsirimongkol, B. (2012). Patterns of alcohol dependence in Thai drinkers: A differential item functioning analysis of gender and age bias. Addictive Behaviors, 37(2), 173–178. http://doi.org/10.1016/j.addbeh.2011.09.014
Ståhl, T., Van Laar, C., & Ellemers, N. (2012). The role of prevention focus under stereotype threat: Initial cognitive mobilization is followed by depletion. Journal of Personality and Social Psychology, 102(6), 1239–1251. http://doi.org/10.1037/a0027678
Stark, S., Chernyshenko, O. S., & Drasgow, F. (2004). Examining the effects of differential item (functioning and differential) test functioning on selection decisions: When are statistically significant effects practically important? Journal of Applied Psychology, 89(3), 497–508. http://doi.org/10.1037/0021-9010.89.3.497
Stark, S., Chernyshenko, O. S., & Drasgow, F. (2006). Detecting differential item functioning with confirmatory factor analysis and item response theory: Toward a unified strategy. Journal of Applied Psychology, 91(6), 1292–1306. http://doi.org/10.1037/0021-9010.91.6.1292
Steegen, S., Tuerlinckx, F., Gelman, A., & Vanpaemel, W. (2016). Increasing transparency through a multiverse analysis. Perspectives on Psychological Science, 11(5), 702–712. http://doi.org/10.1177/1745691616658637
Steele, C. M. (1997). A threat in the air: How stereotypes shape intellectual identity and performance. American Psychologist, 52(6), 613–629.
Steele, C. M. (2010). Whistling Vivaldi and other clues to how stereotypes affect us. New York, NY: W.W. Norton & Company.

Steele, C. M., & Aronson, J. (1995). Stereotype threat and the intellectual test performance of African Americans. Journal of Personality and Social Psychology, 69(5), 797–811. Retrieved from http://www.ncbi.nlm.nih.gov/pubmed/7473032
Steele, C. M., Spencer, S. J., & Aronson, J. (2002). Contending with group image: The psychology of stereotype and social identity threat. Advances in Experimental Social Psychology, 34, 379–440. http://doi.org/10.1016/S0065-2601(02)80009-0
Steenkamp, J.-B. E. M., & Baumgartner, H. (1998). Assessing measurement invariance in cross-national consumer research. Journal of Consumer Research, 25(1), 78–107.
Steffens, M. C., & Jelenec, P. (2011). Separating implicit gender stereotypes regarding math and language: Implicit ability stereotypes are self-serving for boys and men, but not for girls and women. Sex Roles, 64(5–6), 324–335. http://doi.org/10.1007/s11199-010-9924-x
Steinberg, J. R., Okun, M. A., & Aiken, L. S. (2012). Calculus GPA and math identification as moderators of stereotype threat in highly persistent women. Basic and Applied Social Psychology, 34(6), 534–543. http://doi.org/10.1080/01973533.2012.727319
Steinberg, L., & Thissen, D. (2006). Using effect sizes for research reporting: Examples using item response theory to analyze differential item functioning. Psychological Methods, 11(4), 402–415. http://doi.org/10.1037/1082-989X.11.4.402
Sterling, T. D. (1959). Publication decisions and their possible effects on inferences drawn from tests of significance—or vice versa. Journal of the American Statistical Association, 54(285), 30–34.
Sterne, J. A. C., & Egger, M. (2005). Regression methods to detect publication and other bias in meta-analysis. In H. R. Rothstein, A. J. Sutton, & M. Borenstein (Eds.), Publication bias in meta-analysis: Prevention, assessment and adjustments (pp. 99–110). New York, NY: Wiley.
Stipek, D. J., & Daniels, D. H. (1990). Children's use of dispositional attributions in predicting the performance and behavior of classmates. Journal of Applied Developmental Psychology, 11(1), 13–28. http://doi.org/10.1016/0193-3973(90)90029-J
Stoet, G., & Geary, D. C. (2012). Can stereotype threat explain the gender gap in mathematics performance and achievement? Review of General Psychology, 16(1), 93–102. http://doi.org/10.1037/a0026617
Stout, W., Li, H., Nandakumar, R., & Bolt, D. (1997). MULTISIB: A procedure to investigate DIF when a test is intentionally two-dimensional. Applied Psychological Measurement, 21(3), 195–213. http://doi.org/10.1177/01466216970213001
Stricker, L. J., & Ward, W. C. (2004). Stereotype threat, inquiring about test takers' ethnicity and gender, and standardized test performance. Journal of Applied Social Psychology, 34(4), 665–693.
Strobl, C., Kopf, J., & Zeileis, A. (2015). Rasch trees: A new method for detecting differential item functioning in the Rasch model. Psychometrika, 80(2), 289–316. http://doi.org/10.1007/s11336-013-9388-3
Strong, D. R., Kahler, C. W., Greene, R. L., & Schinka, J. (2005). Isolating a primary dimension within the Cook-Medley hostility scale: A Rasch analysis. Personality and Individual Differences, 39(1), 21–33. http://doi.org/10.1016/j.paid.2004.08.011

Sutton, A. J., Duval, S. J., Tweedie, R. L., Abrams, K. R., & Jones, D. R. (2000). Empirical assessment of effect of publication bias on meta-analyses. BMJ, 320, 1574–1577.
Swaminathan, H., & Rogers, H. J. (1990). Detecting differential item functioning using logistic regression procedures. Journal of Educational Measurement, 27(4), 361–370.
Tay, L., Meade, A. W., & Cao, M. (2015). An overview and practical guide to IRT measurement equivalence analysis. Organizational Research Methods, 18(1), 3–46. http://doi.org/10.1177/1094428114553062
Tay, L., Vermunt, J. K., & Wang, C. (2013). Assessing the item response theory with covariate (IRT-C) procedure for ascertaining differential item functioning. International Journal of Testing, 13(3), 201–222. http://doi.org/10.1080/15305058.2012.692415
Taylor, C. A., Lord, C. G., McIntyre, R. B., & Paulson, R. M. (2011). The Hillary Clinton effect: When the same role model inspires or fails to inspire improved performance under stereotype threat. Group Processes & Intergroup Relations, 14(4), 447–459. http://doi.org/10.1177/1368430210382680
Taylor, C. S., & Lee, Y. (2012). Gender DIF in reading and mathematics tests with mixed item formats. Applied Measurement in Education, 25(3), 246–280. http://doi.org/10.1080/08957347.2012.687650
Teresi, J. A. (2006a). Different approaches to differential item functioning in health applications. Medical Care, 44(11), 152–170.
Teresi, J. A. (2006b). Overview of quantitative measurement methods: Equivalence, invariance, and differential item functioning in health applications. Medical Care, 44(11), 39–49. http://doi.org/10.1097/01.mlr.0000245452.48613.45
Teresi, J. A., & Fleishman, J. A. (2007). Differential item functioning and health assessment. Quality of Life Research, 16(1), 33–42. http://doi.org/10.1007/s11136-007-9184-6
Thissen, D., Steinberg, L., & Kuang, D. (2002). Quick and easy implementation of the Benjamini-Hochberg procedure for controlling the false positive rate in multiple comparisons. Journal of Educational and Behavioral Statistics, 27(1), 77–83. http://doi.org/10.3102/10769986027001077
Thissen, D., Steinberg, L., & Wainer, H. (1993). Detection of differential item functioning using the parameters of item response models. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 67–114). Hillsdale, NJ: Lawrence Erlbaum Associates, Inc.
Tiedemann, J. (2000). Gender-related beliefs of teachers in elementary school mathematics. Educational Studies in Mathematics, 41(2), 191–207.
Titze, C., Jansen, P., & Heil, M. (2010). Mental rotation performance in fourth graders: No effects of gender beliefs (yet?). Learning and Individual Differences, 20(5), 459–463. http://doi.org/10.1016/j.lindif.2010.04.003
Tomasetto, C., Alparone, F. R., & Cadinu, M. (2011). Girls' math performance under stereotype threat: The moderating role of mothers' gender stereotypes. Developmental Psychology, 47(4), 943–949. http://doi.org/10.1037/a0024047

Tomasetto, C., Matteucci, M. C., & Pansu, P. (2010). Genere e matematica: Si può ridurre la minaccia dello stereotipo in classe? [Gender and mathematics: Can stereotype threat be reduced in the classroom?]. In R. Ghigi (Ed.), Adolescenti in genere: Stili di vita e atteggiamenti dei giovani in Emilia Romagna (pp. 99–104). Roma: Carocci.
Twamley, E. E. (2009). The role of gender identity on the effects of stereotype threat: An examination of girls' math performance in a single-sex classroom. Honors Projects, Paper 15.
Ulrich, R., & Miller, J. (2015). p-Hacking by post hoc selection with multiple opportunities: Detectability by skewness test? Comment on Simonsohn, Nelson and Simmons (2014). Journal of Experimental Psychology: General, 144(6), 1137–1145.
Van Aert, R. C. M., Wicherts, J. M., & van Assen, M. A. L. M. (2016). Conducting meta-analyses based on p values: Reservations and recommendations for applying p-uniform and p-curve. Perspectives on Psychological Science, 11(5), 713–729. http://doi.org/10.1177/1745691616650874
Van den Noortgate, W., & De Boeck, P. (2005). Assessing and explaining differential item functioning using logistic mixed models. Journal of Educational and Behavioral Statistics, 30(4), 443–464. http://doi.org/10.3102/10769986030004443
Vandenberg, R. J., & Lance, C. E. (2000). A review and synthesis of the measurement invariance literature: Suggestions, practices, and recommendations for organizational research. Organizational Research Methods, 3(1), 4–70. http://doi.org/10.1177/109442810031002
Viechtbauer, W. (2005). Bias and efficiency of meta-analytic variance estimators in the random-effects model. Journal of Educational and Behavioral Statistics, 30(3), 261–293. http://doi.org/10.3102/10769986030003261
Viechtbauer, W. (2010). Conducting meta-analyses in R with the metafor package. Journal of Statistical Software, 36(3), 1–48.
Wagenmakers, E. J., Beek, T., Dijkhoff, L., Gronau, Q. F., Acosta, A., Adams, R. B., … Zwaan, R. A. (2016). Registered replication report: Strack, Martin, & Stepper (1988). Perspectives on Psychological Science, 11(6), 917–928. http://doi.org/10.1177/1745691616674458
Wagenmakers, E. J., Love, J., Marsman, M., Jamil, T., Ly, A., Verhagen, J., … Morey, R. D. (2016). Bayesian inference for psychology. Part II: Example applications with JASP. Psychonomic Bulletin & Review, 1–19. http://doi.org/10.3758/s13423-017-1323-7
Wagenmakers, E. J., Wetzels, R., Borsboom, D., van der Maas, H. L. J., & Kievit, R. A. (2012). An agenda for purely confirmatory research. Perspectives on Psychological Science, 7(6), 632–638. http://doi.org/10.1177/1745691612463078
Walton, G. M., & Cohen, G. L. (2003). Stereotype lift. Journal of Experimental Social Psychology, 39(5), 456–467. http://doi.org/10.1016/S0022-1031(03)00019-2
Walton, G. M., & Spencer, S. J. (2009). Latent ability: Grades and test scores systematically underestimate the intellectual ability of negatively stereotyped students. Psychological Science, 20(9), 1132–1139. http://doi.org/10.1111/j.1467-9280.2009.02417.x
Walton, G. M., Spencer, S. J., & Erman, S. (2013). Affirmative meritocracy. Social Issues and Policy Review, 7(1), 1–35. http://doi.org/10.1111/j.1751-2409.2012.01041.x

Wang, W.-C., Shih, C.-L., & Sun, G.-W. (2012). The DIF-free-then-DIF strategy for the assessment of differential item functioning. Educational and Psychological Measurement, 72(4), 687–708. http://doi.org/10.1177/0013164411426157
Wang, W.-C., Shih, C.-L., & Yang, C.-C. (2009). The MIMIC method with scale purification for detecting differential item functioning. Educational and Psychological Measurement, 69(5), 713–731. http://doi.org/10.1177/0013164409332228
Wang, W.-C., & Su, Y.-H. (2004). Effects of average signed area between two item characteristic curves and test purification procedures on the DIF detection via the Mantel-Haenszel method. Applied Measurement in Education, 17(2), 113–144. http://doi.org/10.1207/s15324818ame1702
Wax, A. L. (2009). Stereotype threat: A case of overclaim syndrome? Faculty Scholarship, Paper 207.
Weinstock, L. M., Strong, D., Uebelacker, L. A., & Miller, I. W. (2009). Differential item functioning of DSM-IV depressive symptoms in individuals with a history of mania versus those without: An item response theory analysis. Bipolar Disorders, 11(3), 289–297. http://doi.org/10.1111/j.1399-5618.2009.00681.x
Whitener, E. M. (1990). Confusion of confidence intervals and credibility intervals in meta-analysis. Journal of Applied Psychology, 75(3), 315–321. http://doi.org/10.1037//0021-9010.75.3.315
Wicherts, J. M. (2005). Stereotype threat research and the assumptions underlying analysis of covariance. The American Psychologist, 60(3), 267–269. http://doi.org/10.1037/0003-066X.60.3.267
Wicherts, J. M. (2007). Group differences in intelligence test performance. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&db=psyh&AN=2000-07612-009&site=ehost-live
Wicherts, J. M., Dolan, C. V., & Hessen, D. J. (2005). Stereotype threat and group differences in test performance: A question of measurement invariance. Journal of Personality and Social Psychology, 89(5), 696–716. http://doi.org/10.1037/0022-3514.89.5.696
Wicherts, J. M., Veldkamp, C. L. S., Augusteijn, H. E. M., Bakker, M., van Aert, R. C. M., & van Assen, M. A. L. M. (2016). Degrees of freedom in planning, running, analyzing, and reporting psychological studies: A checklist to avoid p-hacking. Frontiers in Psychology, 7, 1–12. http://doi.org/10.3389/fpsyg.2016.01832
Widaman, K. F., & Reise, S. P. (1997). Exploring the measurement invariance of psychological instruments: Applications in the substance use domain. In K. J. Bryant, M. Windle, & S. G. West (Eds.), The science of prevention: Methodological advances from alcohol and substance abuse research (pp. 281–324). Washington, D.C.: American Psychological Association. http://doi.org/10.1037/10222-009
Wigfield, A., Eccles, J. S., Yoon, K. S., Harold, R. D., Arbreton, A. J. A., Freedman-Doan, C., & Blumenfeld, P. C. (1997). Change in children's competence beliefs and subjective task values across the elementary school years: A 3-year study. Journal of Educational Psychology, 89(3), 451–469. http://doi.org/10.1037/0022-0663.89.3.451

▪▪ - ment.Williams, Journal V. S. of L., Educational Jones, L. V, and& Tukey, Behavioural J. W. (1999). Statistics Controlling24 error in multiple comparisons, with examples from state-to-state differences in educational achieve , (1), 42–69. http://doi. ▪ - ▪ org/10.3102/10769986024001042 trieved from http://cran.r-project.org/package=CTT Willse, J. T. (2014). CTT: Classical Test Theory Functions. R package version 2.1. Re ▪▪ - ralization Test. Language Assessment Quarterly 8 Winke, P. (2011). Investigating the reliability of the civics component of the U.S. Natu , (4), 317–341. http://doi.org/10.108 ▪ - ▪ 0/15434303.2011.614031 tests.”Wise, S.Applied L., & Demars, Psychological C. E. (2009). Measurement A clarification33 of the effects of rapid guessing on co efficient alpha: A note on Attali’s “reliability of speeded number-right multiple-choice ▪ ▪ , (6), 488–490. item functioning. Applied Psychological Measurement 33 Woods, C. M. (2009a). Empirical selection of anchors for tests of differential , (1), 42–57. http://doi. ▪ - ▪ org/10.1177/0146621607314044 parison to two-group analysis. Multivariate Behavioral Research 44 Woods, C. M. (2009b). Evaluation of MIMIC-model methods for DIF testing with com , (1), 1–27. http:// ▪ ▪ doi.org/10.1080/00273170802620121 DIF testing with multiple groups: Evaluation and comparison to two-group IRT.Woods, Educational C. M., Cai, and L., & Psychological Wang, M. (2012). Measurement The Langer-improved73 Wald test for

, (3), 532–547. http://doi. ▪ - ▪ org/10.1177/0013164412464875 tioning with multiple indicator multiple cause models. Applied Psychological Measure- mentWoods,35 C. M., & Grimm, K. J. (2011). Testing for nonuniform differential item func ▪ ▪ , (5), 339–361. http://doi.org/10.1177/0146621611405984 DIF testing with the schedule for nonadaptive and adaptive personality. Journal of Psy- chopathologyWoods, C. M., andOltmanns, Behavioral T. F., & Assessment Turkheimer,31 E. (2009). Illustration of MIMIC-Model

, (4), 320–330. http://doi.org/10.1007/ ▪ ▪ s10862-008-9118-9 Journal of Experimental Social Psychology 44 Wout, D., Danso, H., Jackson, J., & Spencer, S. J. (2008). The many faces of stereotype threat: Group- and self-threat. , (3), 792– ▪ ▪ 799. http://doi.org/10.1016/j.jesp.2007.07.005 A habit-formation. Journal of Comparative Neurology and Psychology 18 Yerkes, R. M., & Dodson, J. D. (1908). The relation of strength of stimulus to rapidity of ▪ ▪ , (5), 459–482. female drivers in a simulator run over jaywalkers. Accident Analysis & Prevention 40Yeung, N. C. J., & von Hippel, C. (2008). Stereotype threat increases the likelihood that , ▪ ▪ (2), 667–674. http://doi.org/10.1016/j.aap.2007.09.003Science 149 ▪ ▪ Zajonc, R. B. (1965). Social facilitation. , (3681), 269–274. Differential Item Functioning - Zieky, M. (1993). Practical questions in the use of DIF statistics in test development. In P. W. Holland & H. Wainer (Eds.), (pp. 337–348). Hills ▪ - ▪ dale, New Jersey: Lawrence Erlbaum Associates, Inc. Journal of Applied Psychology 102 Zigerell, L. J. (2017). Potential publication bias in the stereotype threat literature: Com ment on Nguyen and Ryan (2008). , (8), 1159–1168. Addendum

243

▪▪ -

Zumbo, B. D. (1999). A handbook on the theory and methods of differential item func- tioning (DIF): Logistic regression modeling as a unitary framework for binary and likert-type (ordinal) item scores. Ottawa, ON: Directorate of Human Resources Re ▪ ▪ search and Evaluation, Department of NationalA measure Defense. of effect size for a model-based ap- proach for studying DIF Zumbo, B. D., & Thomas, D. R. (1997). . Working Paper of the Edgeworth Laboratory for Quantitative ▪ ▪ Behavioral Science,A reviewUniversity of ETS of Northern differential British item functioningColumbia: Prince assessment George, procedures: B.C. Flagging rules, minimum sample size requirements, and criterion refinement. ETS Re- searchZwick, ReportR. (2012). Series

. Retrieved from http://doi.wiley.com/10.1002/j.2333-8504.2012. ▪ - ▪ tb02290.x tel-Haenszel DIF analysis. Journal of Educational Measurement 36 Zwick, R., Thayer, D. T., & Lewis, C. (1999). An empirical Bayes approach to Man , (1), 1–28.

A 244 Addendum

Dankwoord (Acknowledgments)

This book cost a bit of blood, sweat, and tears (admittedly, the blood was limited to paper cuts caused by the large number of questionnaires that passed through my hands). Besides hard work and a bit of luck, family, friends, and colleagues are indispensable for finishing a dissertation. In this final section I would like to thank them.

Jelte, without you this whole adventure would never have started. You arrived with a well-thought-out plan, raked in a Talent Grant, and gave us a flying start. You gave me plenty of freedom to make this dissertation my own, but you were also a source of support at the moments I needed it. At times I could happily have shot down our ambitious plans, but in hindsight I am glad we both persevered. Most of all, I want to thank you for the fun, and for the booming laugh that regularly rose from your office during our meetings. I hope our paths will keep crossing in the years to come.

Jeroen, thank you for the confidence you placed in me and for letting me be part of a fantastic department. Daniël, thank you for your enthusiasm, your critical eye, and your wonderful sense of humor during our meetings. Joris, thank you for our highly efficient collaboration. Paul, thank you for the hospitable, warm welcome in Columbus and for all the knowledge you shared with me. Franca and Steve, I would like to thank you for the warm welcome in Padova, and for your extraordinary help in removing the grammatical errors from my dissertation. I am also truly grateful for the enthusiasm, the positive remarks, and the data and materials shared by several stereotype threat researchers. I would like to thank all members of the psychometrics meetings and the reading club for the sessions and the valuable advice. Our secretaries Marieke and Anne-Marie, thank you for taking such good care of us: what would we absent-minded scientists do without you. Close to my heart are my colleagues and buddies Chris, Erwin, Florian, Hilde, Lianne, and Inga. To all the other MTO colleagues whom I am selling short here: thank you for the many drinks, the relaxed lunches, and the great atmosphere.

I would like to thank the members, lecturers, and board of the IOPS graduate school for my development as a methodologist. In particular, I want to thank secretary Edith for the pleasant collaboration. Paul van Veen of the TSB graduate school, thank you for your support over the years. I would like to thank the members of the PhD council for the enjoyable meetings. On the teaching side, I would like to thank Marcel, Wilco, Marjan, Robbie, Michèle, Pieter, Luc, Nico, Josine, Elise, and Sara for the pleasant collaboration.

I would also like to thank the dozens of teachers who welcomed me to their secondary schools. Teaching hours are scarce, and I greatly appreciate that I was not only allowed to come by to administer my study, but was also showered with fun suggestions and curious questions. Dear students, thank you for your participation and for the sweet, critical, and cheeky messages you left at the end of my questionnaire; I was thoroughly entertained. Chapter three would never have turned out so well without my wonderful assistants: Charlotte, Chris, Eda, Iris, Marloes, Maud, Marion, Myrthe, Pleun, Tara, Tessa, and Zhané. How proud I am of your hard work at the secondary schools! Assistant Sofie deserves an honorable mention: your independent and professional attitude helped me enormously during this project. Besides all the hard work, it was also great fun, together on the train or at school.

Dear roomies Coosje, Miggel, and Roberto, fellow residents of the batcave and S720, how happy I was to be allowed to join your cozy office as the latecomer. Weekend debriefs; celebrating publications and birthdays in colorful style; relationship, publication, and guinea-pig troubles; weddings and births; road trips: they meant we were perhaps not always equally productive, but they turned the special days as well as the perfectly ordinary ones into a party. Pia, Leonie, and Inga, as the final arrivals, provided the calm that was needed to actually finish this dissertation.

I would like to thank the members of improvisational theater group Rataplan, my colleagues in the MTO band, the gentlemen of Een Ander Team, and my dear family at the Studio Klasse Theater for providing the cheerful note, quite literally, of the past four years. The hours on stage were a delightful and much-needed distraction. My dear friends, who all understand the ups and downs of PhD life like no other: thanks for the game nights, movie nights, city trips, dinners, and girls' nights. Chrissy, Gaby, Byron, Nina, Maaike, Tünde, Willem, Bastian, and Paul, without you the last years would have been far less fun.

My three buddies through thick and thin deserve special thanks: Michèle, Fieke, and Lisanne. Michèle: bandmate, RSI walking partner, and fellow MTO member. Thank you for the many jokes, your listening ear, and your infectious ambition, but above all for the push in the back at the moment I needed it most. Fieke: dear friend, ReMa survivor, drinks buddy, and fellow social psychologist. Thank you for the wonderful friendship, the shoulder to lean on, and the reminder that there is more to life than the dissertation and science. Like no one else, you can make me laugh again with a single sentence. Lisanne: warm friend, sweet pillar of support, and dachshund healer. A friendship forged while surrounded by long-haired men with swords is not easily broken, and I hope we will enjoy it for many years to come. These three wonderful ladies each inspire me in their own way, and I am proud that they will stand behind me as paranymphs during my defense.

Finally, my dear family. Loes and Jayden, thank you for the dinners and all the cheerfulness of the past years. Little brother, thank you for the protection, the healthy competition, and a bond that grows stronger with the years; I hope that trend continues for a long time to come. Dear Nico and Lily, thank you for your warmth and your patience with all my stories. Dear Mom and Dad, thank you for everything. Not only did you encourage my love of learning from an early age, you also taught me that there is so much more in life to enjoy. I am, moreover, very grateful for the cups of tea and coffee, the glasses of wine, the pieces of cheese, the charger delivery service, and the warm meals during the last few months before my deadline. Finally, I thank you for the open door during the first few months after my deadline. I will never forget the safe haven in Heerjansdam, the hours by the fireplace, and the quiet walks around Kijfhoek. We have hit plenty of bumps in the past, but fortunately we have always come out the better for it. This dissertation is for you.
