Another bit of jargon… Reliability & Validity

• More is Better
• Properties of a “good measure”
  – Standardization
  – Reliability
  – Validity
• Reliability
  – Inter-rater
  – Internal
  – External
• Validity
  – Criterion-related
  – Face
  – Content
  – Construct

In sampling…
• We are trying to represent a population of individuals
• We select participants
• The resulting sample of participants is intended to represent the population

In measurement…
• We are trying to represent a domain of behaviors
• We select items
• The resulting scale/test of items is intended to represent the domain

For both, “more is better” – more gives greater representation.

Whenever we’ve considered research designs and statistical conclusions, we’ve always been concerned with “sample size”

We know that larger samples (more participants) lead to ...
• more reliable estimates of the mean and std, r, F & χ²
• more reliable statistical conclusions
  • quantified as fewer Type I and Type II errors
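For example, the precision of a sample mean improves with the square root of the sample size (a standard result, added here for reference):

\[ SE_{\bar{X}} = \frac{s}{\sqrt{n}} \]

so quadrupling the number of participants cuts the standard error of the mean in half.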

The same principle applies to scale construction – “more is better”
• but now it applies to the number of items comprising the scale
• more (good) items lead to a better scale…
  • they more adequately represent the content/construct domain
  • they provide a more consistent total score (a respondent can change more items before the total changes much)

Desirable Properties of Psychological Measures
• Reliability (Agreement or Consistency)
• Validity
• Standardization
• Population Norms
• Interpretability of Individual and Group Scores

Reliability (Agreement or Consistency)

Inter-rater (inter-observer) reliability
• do multiple observers/coders score an item the same way?
• critical whenever using subjective measures
• dependent upon standardization

Internal reliability -- do the items measure a central “thing”?
• Cronbach’s alpha  α = .00 – 1.00   higher is better
• more correlated items & more items  →  higher α
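For a k-item scale, Cronbach’s alpha is computed from the item variances and the variance of the total score (standard formula, shown here for reference):

\[ \alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k}\sigma_i^{2}}{\sigma_{\text{total}}^{2}}\right) \]

where \(\sigma_i^2\) is the variance of item i and \(\sigma_{\text{total}}^2\) is the variance of the summed scale score; alpha rises as the items intercorrelate more strongly and as (good) items are added.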

External reliability – stability of scale/test scores over time
• test-retest reliability – correlate scores from the same test given 3-18 weeks apart
• alternate forms reliability – correlate scores from two “versions” of the test
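Either form of external reliability is just a correlation between two sets of scores from the same people. A minimal sketch, using hypothetical score arrays (not data from the slides):

```python
import numpy as np

# hypothetical total scores for 8 respondents at two testings, 3-18 weeks apart
time1 = np.array([12, 15, 9, 20, 17, 11, 14, 18])
time2 = np.array([13, 14, 10, 19, 18, 10, 15, 17])

# test-retest (or alternate forms) reliability = Pearson r between the two administrations
r_tt = np.corrcoef(time1, time2)[0, 1]
print(f"test-retest reliability r = {r_tt:.2f}")
```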

Assessing internal reliability

Corrected item-total correlation: the correlation between each item and a total comprised of all the other items
• negative item-total correlations indicate either...
  • a very “poor” item
  • reverse-keying problems
“Alpha if item deleted”: what the alpha would be if that item were dropped
• drop items with alpha-if-deleted larger than the scale’s alpha
Coefficient alpha tells the internal reliability (α) for this set of items
• usually do several “passes” rather than drop several items at once

Item   corrected item-total r   alpha if deleted
i1      .1454                   .65
i2      .2002                   .58
i3     -.2133                   .71
i4      .1882                   .59
i5      .1332                   .68
i6      .2112                   .56
i7      .1221                   .60
Coefficient Alpha = .58

Pass #1
Item   corrected item-total r   alpha if deleted
i1      .0854                   .65
i2      .2002                   .58
i3     -.2133                   .71
i4      .1882                   .59
i5      .0832                   .68
i6      .0712                   .56
i7      .0621                   .60
Coefficient Alpha = .58
• all items with “-” item-total correlations are “bad”
• check to see that they have been keyed correctly
• if they have been correctly keyed -- drop them
• i3 would likely be dropped
• recheck on the next “pass”
• it is better to drop 1-2 items on each of several “passes”

Pass #2, etc.
Item   corrected item-total r   alpha if deleted
i1      .0812                   .73
i2      .2202                   .68
i4      .1822                   .70
i5      .0877                   .74
i6      .2343                   .64
i7      .0621                   .78
Coefficient Alpha = .71
• look for items with alpha-if-deleted values that are substantially higher than the scale’s alpha value
• don’t drop too many at a time
• probably i7
• probably not drop i1 & i5
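The item statistics in these tables can be reproduced with a few lines of code. A minimal sketch, assuming a hypothetical respondents × items score matrix (the function names are mine, and the random data will not match the slide values):

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha; items is a 2-D array (rows = respondents, cols = items)."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    return (k / (k - 1)) * (1 - items.var(axis=0, ddof=1).sum()
                            / items.sum(axis=1).var(ddof=1))

def item_analysis(items):
    """Corrected item-total r and alpha-if-deleted for each item (one 'pass')."""
    items = np.asarray(items, dtype=float)
    total = items.sum(axis=1)
    for j in range(items.shape[1]):
        rest = total - items[:, j]                           # total of all OTHER items
        r_it = np.corrcoef(items[:, j], rest)[0, 1]          # corrected item-total r
        a_del = cronbach_alpha(np.delete(items, j, axis=1))  # alpha if this item dropped
        print(f"i{j + 1}: item-total r = {r_it:6.3f}   alpha if deleted = {a_del:.3f}")
    print(f"Coefficient Alpha = {cronbach_alpha(items):.3f}")

# hypothetical 1-5 ratings from 10 respondents on 7 items
rng = np.random.default_rng(0)
item_analysis(rng.integers(1, 6, size=(10, 7)))
# As in the slides: drop an item only if its item-total r is negative (and it is keyed
# correctly) or its alpha-if-deleted clearly exceeds the scale's alpha, then re-run a new pass.
```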

Validity (Consistent Accuracy)

Criterion-related Validity -- does the test correlate with a “criterion”?
• statistical -- requires a criterion that you “believe in”
• predictive, concurrent, postdictive validity

Face Validity -- do the items come from the “domain of interest”?
• non-statistical -- decision of the “target population”

Content Validity -- do the items come from the “domain of interest”?
• non-statistical -- decision of an “expert in the field”

Construct Validity -- does the test relate to other measures as it should?
• non-statistical – does the measure match the theory of the construct?
• statistical -- discriminant validity
  • convergent validity – +/- r with selected tests, as it should?
  • divergent validity – r = 0 correlations with other tests, as it should?

“Is the test valid?”
Jum Nunnally (one of the founders of modern psychometrics) claimed this was a “silly question”!
The point wasn’t that tests shouldn’t be “valid”, but that a test’s validity must be assessed relative to…
• the construct it is intended to measure
• the population for which it is intended (e.g., age, level)
• the application for which it is intended (e.g., for classifying folks into categories vs. assigning them quantitative values)
So, the real question is, “Is this test a valid measure of this construct for this population in this application?”
That question can be answered!

Criterion-related Validity
Do the test scores correlate with criterion behavior scores?
• concurrent -- test taken now “replaces” criterion measured now
  • often the goal is to substitute a “shorter” or “cheaper” test
  • e.g., the written drivers test replaces the road test
• predictive -- test taken now predicts criterion measured later
  • we want to estimate what will happen before it does
  • e.g., your GRE score (taken now) predicts grad school performance (later)
• postdictive – test taken now captures behavior & affect from before
  • most of the behavior we study “has already happened”
  • e.g., adult memories of childhood feelings or medical history
Timing: the test is taken now; the criterion behavior occurs before (postdictive), now (concurrent), or later (predictive).

Conducting a Predictive Validity Study
Example -- a test designed to identify qualified “front desk personnel” for a major hotel chain -- 200 applicants and 20 position openings
A “proper” predictive validity study…
• give each applicant the test (and “seal” the results)
• give each applicant a job working at a front desk
• assess work performance after 6 months (the criterion)
• correlate the test (predictor) and work performance (criterion)

Anybody see why the chain might not be willing to apply this design?

Substituting concurrent validity for predictive validity
• assess the work performance of all folks currently doing the job
• give them each the test
• correlate the test (predictor) and work performance (criterion)

Problems?
• Not working with the population of interest (applicants)
• Range restriction -- work performance and test score variability are “restricted” by this approach
• Range restriction will artificially lower the validity coefficient (r) -- see the simulation sketch below

What happens to the sample ...
• Applicant pool -- the target population
• Selected (hired) folks
  • current hiring practice is probably not “random”
  • assuming the selection basis is somewhat reasonable/functional
• Sample used in the concurrent validity study
  • the worst of those hired have been “released”
  • the best of those hired have “changed jobs”
  • good workers “move up” -- poor ones “move out”
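A minimal simulation sketch of this range-restriction effect (all numbers are simulated; only the 200-applicant / 20-opening setup comes from the slides; compare the figure that follows):

```python
import numpy as np

rng = np.random.default_rng(42)
n_applicants, n_openings = 200, 20

# predictor (interview/measure) and criterion (job performance) that correlate ~.75
predictor = rng.normal(size=n_applicants)
criterion = 0.75 * predictor + np.sqrt(1 - 0.75**2) * rng.normal(size=n_applicants)
r_pool = np.corrcoef(predictor, criterion)[0, 1]

# "hire" only the 20 highest scorers on the predictor, then correlate within that group
hired = np.argsort(predictor)[-n_openings:]
r_hired = np.corrcoef(predictor[hired], criterion[hired])[0, 1]

print(f"applicant pool r = {r_pool:.2f}")   # near .75
print(f"hired-only r     = {r_hired:.2f}")  # usually much lower: range restriction
```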

What happens to the validity coefficient -- r ?

[Figure: scatterplot of predictor (interview/measure) scores against criterion (job performance) scores. Across the full applicant pool, r = .75; within the hired folks, the sample actually used in the validity study shows r = .20.]

Face Validity
Does the test “look like” a measure of the construct of interest?
• “looks like” a measure of the desired construct to a member of the target population
• will someone recognize the type of information they are responding to?
• Possible advantage of face validity ...
  • if the respondent knows what information we are looking for, they can use that “context” to help interpret the questions and provide more useful, accurate answers
• Possible limitation of face validity …
  • if the respondent knows what information we are looking for, they might try to “bend & shape” their answers to what they think we want -- “fake good” or “fake bad”

“Continuum of content expertise”
Target population members  →  Content Experts  →  Researchers
• Target population members  assess Face Validity
• Content experts  assess Content Validity
• Researchers – “should evaluate the validity evidence provided about the scale, rather than the scale items !! – unless they are truly a content expert”

Content Validity
Does the test contain items from the desired “content domain”?
• Based on assessment by “subject matter experts” (SMEs) in that content domain
• Is especially important when a test is designed to have low face validity
  • e.g., tests of “honesty” used for hiring decisions
• Is generally simpler for “achievement tests” than for “psychological constructs” (or other “less concrete” ideas)
  • e.g., it is a lot easier for “math experts” to agree whether or not an item should be on an algebra test than it is for “psychological experts” to agree whether or not an item should be on a measure of depression
• Content validity is not “tested for”. Rather, it is “assured” by the informed item selections made by experts in the domain.

Construct Validity
• Does the test correspond with the theory of the construct & does it interrelate with other tests as a measure of this construct should?
• We use the term construct to remind ourselves that many of the terms we use do not have an objective, concrete reality.
  • Rather, they are “made up” or “constructed” by us in our attempts to organize and make sense of behavior and other psychological processes
• Attention to construct validity reminds us that our defense of the constructs we create is really based on the “whole package” of how the measures of different constructs relate to theory and to each other
• So, construct validity “begins” with content validity (are these the right types of items?) and then adds the question, “does this test relate as it should to other tests of similar and different constructs?”

The statistical assessment of Construct Validity … Discriminant Validity
• Does the test show the “right” pattern of interrelationships with other variables? -- has two parts
  • Convergent validity -- the test correlates with other measures of similar constructs
  • Divergent validity -- the test isn’t correlated with measures of “other, different constructs”
• e.g., a new measure of depression should …
  • have “strong” correlations with other measures of “depression”
  • have negative correlations with measures of “happiness”
  • have “substantial” correlation with measures of “anxiety”
  • have “minimal” correlations with tests of “physical health”, “faking bad”, “self-evaluation”, etc.

Evaluate this measure of depression….

            New Dep   Dep1   Dep2    Anx   Happy   PhyHlth   FakBad
New Dep
Old Dep1      .61
Old Dep2      .49     .76
Anx           .43     .30    .28
Happy        -.59    -.61   -.56   -.75
PhyHlth       .60     .18    .22    .45   -.35
FakBad        .55     .14    .26    .10   -.21      .31

Tell the elements of discriminant validity tested and the “conclusion”:
• New Dep with Dep1 (.61) & Dep2 (.49) -- convergent validity, but a bit lower than r(Dep1, Dep2) = .76
• New Dep is more correlated with Anx (.43) than Dep1 or Dep2 are
• New Dep’s correlation with Happy (-.59) is about the same as Dep1’s & Dep2’s
• New Dep is “too” correlated with PhyHlth (.60)
• New Dep is “too” correlated with FakBad (.55)
This pattern of results does not show strong discriminant validity !!

Population Norms
In order to interpret a score from an individual or group, you must know what scores are typical for that population
• Requires a large representative sample of the target population
  • preferably random, research-selected & stratified
• Requires solid standardization  both administrative & scoring
• Requires great inter-rater reliability (if subjective items)
The result ?? A scoring distribution of the population.
• lets us identify “normal,” “high” and “low” scores
• lets us identify “cutoff scores” to define important populations and subpopulations (e.g., 70 for MMPI & 80 for WISC)
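A minimal sketch of the convergent/divergent check being asked for, using only the New Dep correlations from the matrix above (the comparison logic is mine, for illustration):

```python
# correlations of the new depression measure with the other scales (from the matrix)
new_dep_r = {"Old Dep1": .61, "Old Dep2": .49,   # convergent: other depression measures
             "Anx": .43, "Happy": -.59,          # related constructs
             "PhyHlth": .60, "FakBad": .55}      # divergent: should be near zero

convergent = ["Old Dep1", "Old Dep2"]
divergent = ["PhyHlth", "FakBad"]

weakest_convergent = min(abs(new_dep_r[m]) for m in convergent)   # .49
strongest_divergent = max(abs(new_dep_r[m]) for m in divergent)   # .60

if strongest_divergent >= weakest_convergent:
    print("a 'different construct' correlates as strongly as another depression "
          "measure -> weak discriminant validity")
```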

Desirable Properties of Psychological Measures
• Interpretability of Individual and Group Scores
• Population Norms – Scoring Distribution & Cutoffs
• Validity – Face, Content, Criterion-Related, Construct
• Reliability – Inter-rater, Internal Consistency, Test-Retest & Alternate Forms
• Standardization – Administration & Scoring