Another bit of jargon… Reliability & Validity

• More is Better
• Properties of a “good measure”
  – Standardization
  – Reliability
  – Validity
• Reliability
  – Inter-rater
  – Internal
  – External
• Validity
  – Criterion-related
  – Face
  – Content
  – Construct

In sampling…
• We are trying to represent a population of individuals
• We select participants
• The resulting sample of participants is intended to represent the population

In measurement…
• We are trying to represent a domain of behaviors
• We select items
• The resulting scale/test of items is intended to represent the domain

For both, “more is better” – more gives greater representation.

Whenever we’ve considered research designs and statistical conclusions, we’ve always been concerned with “sample size”

We know that larger samples (more participants) lead to ...
• more reliable estimates of the mean and std, r, F & χ²
• more reliable statistical conclusions
  • quantified as fewer Type I and Type II errors
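For example, the precision of a sample mean improves with the square root of the sample size (a standard result, added here for reference):

\[ SE_{\bar{X}} = \frac{s}{\sqrt{n}} \]

so quadrupling the number of participants cuts the standard error of the mean in half.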

The same principle applies to scale construction – “more is better”
• but now it applies to the number of items comprising the scale
• more (good) items lead to a better scale…
  • they more adequately represent the content/construct domain
  • they provide a more consistent total score (a respondent can change more items before the total changes much)

Desirable Properties of Psychological Measures
• Reliability (Agreement or Consistency)
• Validity
• Standardization
• Population Norms
• Interpretability of Individual and Group Scores

Reliability (Agreement or Consistency)

Inter-rater (inter-observer) reliability
• do multiple observers/coders score an item the same way?
• critical whenever using subjective measures
• dependent upon standardization

Internal reliability -- do the items measure a central “thing”?
• Cronbach’s alpha  α = .00 – 1.00   higher is better
• more correlated items & more items  →  higher α
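For a k-item scale, Cronbach’s alpha is computed from the item variances and the variance of the total score (standard formula, shown here for reference):

\[ \alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k}\sigma_i^{2}}{\sigma_{\text{total}}^{2}}\right) \]

where \(\sigma_i^2\) is the variance of item i and \(\sigma_{\text{total}}^2\) is the variance of the summed scale score; alpha rises as the items intercorrelate more strongly and as (good) items are added.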

External reliability – stability of scale/test scores over time
• test-retest reliability – correlate scores from the same test given 3-18 weeks apart
• alternate forms reliability – correlate scores from two “versions” of the test
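Either form of external reliability is just a correlation between two sets of scores from the same people. A minimal sketch, using hypothetical score arrays (not data from the slides):

```python
import numpy as np

# hypothetical total scores for 8 respondents at two testings, 3-18 weeks apart
time1 = np.array([12, 15, 9, 20, 17, 11, 14, 18])
time2 = np.array([13, 14, 10, 19, 18, 10, 15, 17])

# test-retest (or alternate forms) reliability = Pearson r between the two administrations
r_tt = np.corrcoef(time1, time2)[0, 1]
print(f"test-retest reliability r = {r_tt:.2f}")
```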

Assessing internal reliability

Corrected item-total correlation: the correlation between each item and a total comprised of all the other items
• negative item-total correlations indicate either...
  • a very “poor” item
  • reverse-keying problems
“Alpha if item deleted”: what the alpha would be if that item were dropped
• drop items with alpha-if-deleted larger than the scale’s alpha
Coefficient alpha tells the internal reliability (α) for this set of items
• usually do several “passes” rather than drop several items at once

Item   corrected item-total r   alpha if deleted
i1      .1454                   .65
i2      .2002                   .58
i3     -.2133                   .71
i4      .1882                   .59
i5      .1332                   .68
i6      .2112                   .56
i7      .1221                   .60
Coefficient Alpha = .58

Pass #1
Item   corrected item-total r   alpha if deleted
i1      .0854                   .65
i2      .2002                   .58
i3     -.2133                   .71
i4      .1882                   .59
i5      .0832                   .68
i6      .0712                   .56
i7      .0621                   .60
Coefficient Alpha = .58
• all items with “-” item-total correlations are “bad”
• check to see that they have been keyed correctly
• if they have been correctly keyed -- drop them
• i3 would likely be dropped
• recheck on the next “pass”
• it is better to drop 1-2 items on each of several “passes”

Pass #2, etc.
Item   corrected item-total r   alpha if deleted
i1      .0812                   .73
i2      .2202                   .68
i4      .1822                   .70
i5      .0877                   .74
i6      .2343                   .64
i7      .0621                   .78
Coefficient Alpha = .71
• look for items with alpha-if-deleted values that are substantially higher than the scale’s alpha value
• don’t drop too many at a time
• probably i7
• probably not drop i1 & i5
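The item statistics in these tables can be reproduced with a few lines of code. A minimal sketch, assuming a hypothetical respondents × items score matrix (the function names are mine, and the random data will not match the slide values):

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha; items is a 2-D array (rows = respondents, cols = items)."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    return (k / (k - 1)) * (1 - items.var(axis=0, ddof=1).sum()
                            / items.sum(axis=1).var(ddof=1))

def item_analysis(items):
    """Corrected item-total r and alpha-if-deleted for each item (one 'pass')."""
    items = np.asarray(items, dtype=float)
    total = items.sum(axis=1)
    for j in range(items.shape[1]):
        rest = total - items[:, j]                           # total of all OTHER items
        r_it = np.corrcoef(items[:, j], rest)[0, 1]          # corrected item-total r
        a_del = cronbach_alpha(np.delete(items, j, axis=1))  # alpha if this item dropped
        print(f"i{j + 1}: item-total r = {r_it:6.3f}   alpha if deleted = {a_del:.3f}")
    print(f"Coefficient Alpha = {cronbach_alpha(items):.3f}")

# hypothetical 1-5 ratings from 10 respondents on 7 items
rng = np.random.default_rng(0)
item_analysis(rng.integers(1, 6, size=(10, 7)))
# As in the slides: drop an item only if its item-total r is negative (and it is keyed
# correctly) or its alpha-if-deleted clearly exceeds the scale's alpha, then re-run a new pass.
```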

Validity (Consistent Accuracy)

Criterion-related Validity -- does the test correlate with a “criterion”?
• statistical -- requires a criterion that you “believe in”
• predictive, concurrent, postdictive validity

Face Validity -- do the items come from the “domain of interest”?
• non-statistical -- decision of the “target population”

Content Validity -- do the items come from the “domain of interest”?
• non-statistical -- decision of an “expert in the field”

Construct Validity -- does the test relate to other measures as it should?
• non-statistical – does the measure match the theory of the construct?
• statistical -- discriminant validity
  • convergent validity – +/- r with selected tests, as it should?
  • divergent validity – r = 0 correlations with other tests, as it should?

“Is the test valid?”
Jum Nunnally (one of the founders of modern psychometrics) claimed this was a “silly question”!
The point wasn’t that tests shouldn’t be “valid”, but that a test’s validity must be assessed relative to…
• the construct it is intended to measure
• the population for which it is intended (e.g., age, level)
• the application for which it is intended (e.g., for classifying folks into categories vs. assigning them quantitative values)
So, the real question is, “Is this test a valid measure of this construct for this population in this application?”
That question can be answered!

Criterion-related Validity
Do the test scores correlate with criterion behavior scores?
• concurrent -- test taken now “replaces” criterion measured now
  • often the goal is to substitute a “shorter” or “cheaper” test
  • e.g., the written drivers test replaces the road test
• predictive -- test taken now predicts criterion measured later
  • we want to estimate what will happen before it does
  • e.g., your GRE score (taken now) predicts grad school performance (later)
• postdictive – test taken now captures behavior & affect from before
  • most of the behavior we study “has already happened”
  • e.g., adult memories of childhood feelings or medical history
Timing: the test is taken now; the criterion behavior occurs before (postdictive), now (concurrent), or later (predictive).

Conducting a Predictive Validity Study
Example -- a test designed to identify qualified “front desk personnel” for a major hotel chain -- 200 applicants and 20 position openings
A “proper” predictive validity study…
• give each applicant the test (and “seal” the results)
• give each applicant a job working at a front desk
• assess work performance after 6 months (the criterion)
• correlate the test (predictor) and work performance (criterion)

Anybody see why the chain might not be willing to apply this design?

Substituting concurrent validity for predictive validity
• assess the work performance of all folks currently doing the job
• give them each the test
• correlate the test (predictor) and work performance (criterion)

Problems?
• Not working with the population of interest (applicants)
• Range restriction -- work performance and test score variability are “restricted” by this approach
• Range restriction will artificially lower the validity coefficient (r) -- see the simulation sketch below

What happens to the sample ...
• Applicant pool -- the target population
• Selected (hired) folks
  • current hiring practice is probably not “random”
  • assuming the selection basis is somewhat reasonable/functional
• Sample used in the concurrent validity study
  • the worst of those hired have been “released”
  • the best of those hired have “changed jobs”
  • good workers “move up” -- poor ones “move out”
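A minimal simulation sketch of this range-restriction effect (all numbers are simulated; only the 200-applicant / 20-opening setup comes from the slides; compare the figure that follows):

```python
import numpy as np

rng = np.random.default_rng(42)
n_applicants, n_openings = 200, 20

# predictor (interview/measure) and criterion (job performance) that correlate ~.75
predictor = rng.normal(size=n_applicants)
criterion = 0.75 * predictor + np.sqrt(1 - 0.75**2) * rng.normal(size=n_applicants)
r_pool = np.corrcoef(predictor, criterion)[0, 1]

# "hire" only the 20 highest scorers on the predictor, then correlate within that group
hired = np.argsort(predictor)[-n_openings:]
r_hired = np.corrcoef(predictor[hired], criterion[hired])[0, 1]

print(f"applicant pool r = {r_pool:.2f}")   # near .75
print(f"hired-only r     = {r_hired:.2f}")  # usually much lower: range restriction
```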

What happens to the validity coefficient -- r ?

[Figure: scatterplot of predictor (interview/measure) scores against criterion (job performance) scores. Across the full applicant pool, r = .75; within the hired folks, the sample actually used in the validity study shows r = .20.]

Face Validity
Does the test “look like” a measure of the construct of interest?
• “looks like” a measure of the desired construct to a member of the target population
• will someone recognize the type of information they are responding to?
• Possible advantage of face validity ...
  • if the respondent knows what information we are looking for, they can use that “context” to help interpret the questions and provide more useful, accurate answers
• Possible limitation of face validity …
  • if the respondent knows what information we are looking for, they might try to “bend & shape” their answers to what they think we want -- “fake good” or “fake bad”

“Continuum of content expertise”
Target population members  →  Content Experts  →  Researchers
• Target population members  assess Face Validity
• Content experts  assess Content Validity
• Researchers – “should evaluate the validity evidence provided about the scale, rather than the scale items !! – unless they are truly a content expert”

Content Validity
Does the test contain items from the desired “content domain”?
• Based on assessment by “subject matter experts” (SMEs) in that content domain
• Is especially important when a test is designed to have low face validity
  • e.g., tests of “honesty” used for hiring decisions
• Is generally simpler for “achievement tests” than for “psychological constructs” (or other “less concrete” ideas)
  • e.g., it is a lot easier for “math experts” to agree whether or not an item should be on an algebra test than it is for “psychological experts” to agree whether or not an item should be on a measure of depression
• Content validity is not “tested for”. Rather, it is “assured” by the informed item selections made by experts in the domain.

Construct Validity
• Does the test correspond with the theory of the construct & does it interrelate with other tests as a measure of this construct should?
• We use the term construct to remind ourselves that many of the terms we use do not have an objective, concrete reality.
  • Rather, they are “made up” or “constructed” by us in our attempts to organize and make sense of behavior and other psychological processes
• Attention to construct validity reminds us that our defense of the constructs we create is really based on the “whole package” of how the measures of different constructs relate to theory and to each other
• So, construct validity “begins” with content validity (are these the right types of items?) and then adds the question, “does this test relate as it should to other tests of similar and different constructs?”

The statistical assessment of Construct Validity … Discriminant Validity
• Does the test show the “right” pattern of interrelationships with other variables? -- has two parts
  • Convergent validity -- the test correlates with other measures of similar constructs
  • Divergent validity -- the test isn’t correlated with measures of “other, different constructs”
• e.g., a new measure of depression should …
  • have “strong” correlations with other measures of “depression”
  • have negative correlations with measures of “happiness”
  • have “substantial” correlation with measures of “anxiety”
  • have “minimal” correlations with tests of “physical health”, “faking bad”, “self-evaluation”, etc.

Evaluate this measure of depression….

            New Dep   Dep1   Dep2    Anx   Happy   PhyHlth   FakBad
New Dep
Old Dep1      .61
Old Dep2      .49     .76
Anx           .43     .30    .28
Happy        -.59    -.61   -.56   -.75
PhyHlth       .60     .18    .22    .45   -.35
FakBad        .55     .14    .26    .10   -.21      .31

Tell the elements of discriminant validity tested and the “conclusion”:
• New Dep with Dep1 (.61) & Dep2 (.49) -- convergent validity, but a bit lower than r(Dep1, Dep2) = .76
• New Dep is more correlated with Anx (.43) than Dep1 or Dep2 are
• New Dep’s correlation with Happy (-.59) is about the same as Dep1’s & Dep2’s
• New Dep is “too” correlated with PhyHlth (.60)
• New Dep is “too” correlated with FakBad (.55)
This pattern of results does not show strong discriminant validity !!

Population Norms
In order to interpret a score from an individual or group, you must know what scores are typical for that population
• Requires a large representative sample of the target population
  • preferably random, research-selected & stratified
• Requires solid standardization  both administrative & scoring
• Requires great inter-rater reliability (if subjective items)
The result ?? A scoring distribution of the population.
• lets us identify “normal,” “high” and “low” scores
• lets us identify “cutoff scores” to define important populations and subpopulations (e.g., 70 for MMPI & 80 for WISC)
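A minimal sketch of the convergent/divergent check being asked for, using only the New Dep correlations from the matrix above (the comparison logic is mine, for illustration):

```python
# correlations of the new depression measure with the other scales (from the matrix)
new_dep_r = {"Old Dep1": .61, "Old Dep2": .49,   # convergent: other depression measures
             "Anx": .43, "Happy": -.59,          # related constructs
             "PhyHlth": .60, "FakBad": .55}      # divergent: should be near zero

convergent = ["Old Dep1", "Old Dep2"]
divergent = ["PhyHlth", "FakBad"]

weakest_convergent = min(abs(new_dep_r[m]) for m in convergent)   # .49
strongest_divergent = max(abs(new_dep_r[m]) for m in divergent)   # .60

if strongest_divergent >= weakest_convergent:
    print("a 'different construct' correlates as strongly as another depression "
          "measure -> weak discriminant validity")
```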

Desirable Properties of Psychological Measures
• Interpretability of Individual and Group Scores
• Population Norms – Scoring Distribution & Cutoffs
• Validity – Face, Content, Criterion-Related, Construct
• Reliability – Inter-rater, Internal Consistency, Test-Retest & Alternate Forms
• Standardization – Administration & Scoring