
PSY 5130 – Lecture 2 Validity

Validity is one of the most overused words in statistics and research methods.

We’ve already encountered statistical conclusion validity, internal validity, construct validity I, and external validity.

Now we’ll introduce criterion-related validity, concurrent validity, predictive validity, construct validity II, convergent validity, discriminant validity, and content validity. Whew!

A general conceptualization of the “validities” we’ll consider here . . .

All but content validity are concerned with the extent to which scores on a test correlate with positions of people on some dimension of interest to us.

Specific types of Validity

I. Criterion-Related Validity

“Dimension of interest” is performance on some task or job, e.g., job performance, GPA.

So Criterion-related Validity refers to the extent to which pre-employment or pre-admission test scores correlate with performance on some measurable criterion.

This is the type of validity that is most important for I/O selection specialists. But it is also applicable, for example, to schools deciding among applicants for admission.

When someone uses the term "validity coefficient," he or she is most likely referring to criterion-related validity. It is the actual Pearson r between test scores and the criterion measure.
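As a sketch of what a validity coefficient is numerically, the following computes a Pearson r between a small set of test scores and criterion scores. The data are invented purely for illustration:

```python
import math

def pearson_r(x, y):
    """Pearson correlation between test scores x and criterion scores y."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical pre-employment test scores and later performance ratings.
test = [52, 61, 70, 74, 83, 90]
perf = [2.9, 3.1, 3.4, 3.3, 3.9, 4.1]
print(round(pearson_r(test, perf), 3))  # 0.968
```

With real selection data the r would of course be far smaller, as the table of typical validities below shows.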

Two specific types of criterion-related validity are often used in I/O psychology when choosing pre-employment tests to predict performance on the job.

A. Concurrent Validity

The correlation of test scores with job performance of current employees. The test scores and the criterion scores are obtained at the same time, e.g., from current employees of an organization. Most often computed.

B. Predictive Validity.

The correlation of test scores with later job performance scores of persons just hired. The test scores are obtained prior to hiring. Criterion scores are obtained after those who took the pretest have been hired. Computed only occasionally.

Validation Study. A study carried out by an organization in order to assess the validity of a test.

P513 Lecture 2: Validity - 1 Printed on 4/6/2018

Typical Criterion-related Validities. How good a job are I/O Psychologists doing?

From Schmidt, F. L. (2012). Validity and Utility of Selection Methods. Keynote presentation at River Cities Industrial-Organizational Psychology Conference, Chattanooga, TN, October.

Unless otherwise noted, all operational validity estimates are of the specific type of test as the only predictor and corrected for measurement error (i.e., unreliability) in the criterion measure and indirect range restriction (IRR) on the predictor measure to estimate operational validity for applicant populations.

This means that the correlations below are somewhat larger than those you would obtain from computing Pearson r without the corrections.

                                        2012   1998
GMA tests                               .65    .51
Integrity tests                         .46    .41
Employment interviews (structured)      .58    .51
Employment interviews (unstructured)    .60    .38
Conscientiousness                       .22    .31   Really!!??
Reference checks                        .26    .26
Biographical data measures              .35    .35
Job experience                          .13    .18
Person-job fit measures                 .18
SJT (knowledge)                         .26
Assessment centers                      .37    .37
Peer ratings                            .49    .49
T & E point method                      .11    .11
Years of education                      .10    .10
Interests                               .10    .10
Emotional Intelligence (ability)        .24
Emotional Intelligence (mixed)          .24
GPA                                     .34
Person-organization fit measures        .13
Work sample tests                       .33    .54
SJT (behavioral tendency)               .26
Emotional Stability                     .12
Job tryout procedure                           .44
Behavioral consistency method                  .45
Job knowledge                                  .48

For a sample of undergraduates collected over the past 3 years, the correlation of GPA with ACT Comprehensive scores is .378.

Factors affecting validity coefficients in selection of employees or students. Why aren't correlations = 1?

1. Deficiencies in or Contamination of the selection test.

A. Test is deficient - doesn't measure characteristics that predict some parts of the job. The test may predict one thing; the job may require something else.

Example:

Job: Manager. Requirements: Cognitive ability, Conscientiousness, Interpersonal skills

Test: Cognitive Ability Test

B. Test is contaminated - affected by factors other than the factors important for the job.

Example:

Job: Small parts assembly. Requirements: Manual dexterity

Test: Computerized Manual Dexterity Test. Measures: Manual dexterity, Computer skills

2. Reliability of the Test and Reliability of the Criterion

This was covered in the section on the reliability ceiling.

3. Range differences between the validation sample and the population in which the test will be used.

If the range (difference between largest and smallest) within the sample used for validation does not equal the range of scores in the population for which the test will be used, the validity coefficient obtained in the validation study will be different from that which would be obtained in use of the test.

A. Validation sample range is restricted relative to the population for which the test will be used - the typical case. E.g., the test is validated using current employees. It will then be used for an applicant pool consisting of persons from all walks of life, some of whom would not have been capable enough to be hired.

The result is that the correlation coefficient computed from the validation group will be smaller than that which would have been computed had the whole applicant pool been included in the validation study.

Why do we care about differences in range? When choosing tests, comparing different advertised validities requires that the testing conditions be comparable. A bad predictor validated on a heterogeneous sample may have a larger r than a good predictor validated on a homogeneous sample.

B. Validation sample range is larger than that of the population for which the test will be used - less often encountered.

A test of mechanical ability is validated on a sample from the general college population, including liberal arts majors.

But the test is used for an applicant pool consisting of only persons who believe they have the capability to perform the job which requires considerable mechanical skill. So the range of scores in the applicant pool will be restricted relative to the range in the validation sample.

Bottom Line: I feel that criterion-related validity is the most important characteristic of a pre-employment test.

The Issue of Mindless Empiricism as a criticism of the focus on Criterion-related Validity. Note that the issue of criterion-related validity of a measure has nothing to do with that measure’s intrinsic relationship to the criterion. A test may be a good predictor of job performance even though the content of the test bears no relationship to the content of the job. This means that it does not have to make sense that a given test is a good predictor of the criterion. The bottom line in the evaluation of a predictor is the correlation coefficient. If it is sufficiently large, then that’s good. If it’s not large enough, that’s not good. Whether there is any theoretical or obvious reason for a high correlation is not the issue here.

Thus, focus on criterion-related validity only is a very empirical approach to the study of relationships of tests to criteria, with the primary emphasis on the correlation, and little thought given to the theory of the relationship.

Such a focus gets psychologists in trouble with those to whom they're trying to explain the results. Consider the Miller Analogies Test (MAT) for example. Example item: "Lead is to a pencil as bullet is to a) lead, b) gun, c) killing d) national health care policy." How do you explain to the parent of a student denied admission that the student's failure to correctly identify enough analogies on the MAT prevents the student from being admitted to a graduate nursing program? There is a significant, positive correlation between MAT scores and performance in nursing programs, but the reason, if known, is very difficult to explain.

Do companies conduct validation studies?

Alas, many do not – because they lack expertise, because they don’t see the value, because of small sample sizes, or because of difficulty in getting the criterion scores, to name four reasons.

Correcting validity coefficients for reliability differences and range effects

Why correct?

1. If I’m evaluating a new predictor, I want to compare it with others on a “level” playing field. That includes removing the effects of unreliability and of range differences between the different samples.

So the corrections here permit comparisons with correlations computed in different circumstances.

Corrections are in the spirit of removing confounds whenever we can. Examples are standard scores and standardized tests. Both remove specific characteristics of the test from the report of performance.

2. In meta-analyses, the correlations that are aggregated must be "equivalent."

When comparing different selection tests validated on different samples, we need equivalence.

Standard corrections

1. Correcting for unreliability of the measures (This is based on the reliability ceiling formula.)

rtX,tY(1) = rXY / (sqrt(rXX') * sqrt(rYY'))

The corrected r is labeled (1) because there is a 2nd correction, shown below, that is typically made.

Suppose rXY = .6, but assume rXX’ = .9 and rYY’ = .8.

Then rtX,tY(1) would be .6 / (sqrt(.9) * sqrt(.8)) = .6 / ((.95)(.89)) = .6 / .85 = .71. This is 18% larger than the observed r.
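The worked example above can be checked with a few lines of code. This is only a sketch of the correction formula, not production psychometrics software:

```python
import math

def correct_for_unreliability(r_xy, r_xx, r_yy):
    """Correct an observed correlation for unreliability in both X and Y."""
    return r_xy / (math.sqrt(r_xx) * math.sqrt(r_yy))

# Values from the example: rXY = .6, rXX' = .9, rYY' = .8.
r1 = correct_for_unreliability(0.6, 0.9, 0.8)
print(round(r1, 2))  # 0.71
```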

Caution: The reliabilities and the observed r have to be good estimates of the population values, otherwise correction can result in absurd estimates of the true correlation.

In selection situations, we correct for unreliability in the criterion measure only.

The reasoning is as follows: We correct because we want to assess the “true” amount of a construct. In selection situations, the “true” job performance is available – it’s what we’ll observe over the years an employee is with the firm. So we correct for unreliability of a single measure of job performance. But we don’t correct for unreliability in the test because in selection situations, the observed test is the only thing we have. We might be interested in the true scores on the dimension represented by the test, but in selection, we can’t use the true scores, we can only use the observed scores.

So, in selection situations, the correction for unreliability is

rX,tY(1) = rXY / sqrt(rYY')

Note that the corrected correlation is labeled rX,tY, not rtX,tY to indicate that it is corrected only for unreliability of the criterion, Y.

2. Correcting for Range Differences on the criterion variable.

This is applicable in some selection situations.

After correcting for unreliability of X and Y, a 2nd correction, for range differences, is made.

rtX,tY(2) = [rtX,tY(1) * (SUse / SVal)] / sqrt(1 - rtX,tY(1)^2 + rtX,tY(1)^2 * (SUse^2 / SVal^2))

In this formula, rtX,tY(1) is the correlation corrected for unreliability from the previous page.

SUse is the standard deviation of Y in the population in which the test will be used. SVal is the standard deviation of Y in the validation sample.
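The range correction can also be sketched in code. The numbers below are hypothetical, chosen only to show the direction of the correction when the validation sample is restricted (SVal < SUse):

```python
import math

def correct_for_range(r1, s_use, s_val):
    """Range-difference correction applied to an already
    reliability-corrected correlation r1.
    s_use: SD in the population where the test will be used.
    s_val: SD in the validation sample."""
    u = s_use / s_val
    return (r1 * u) / math.sqrt(1 - r1**2 + (r1 * u)**2)

# Hypothetical: r1 = .40 in a restricted validation sample (SD = 8),
# while the population of use has SD = 12.
r2 = correct_for_range(0.40, s_use=12, s_val=8)
print(round(r2, 3))  # 0.548 - larger, as expected for a restricted sample
```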

3. Other corrections.

There is a growing literature on corrections in meta analyses and in selection situations.

References . . .

Stauffer, J. M., & Mendoza, J. L. (2001). The proper sequence for correcting correlation coefficients for range restriction and unreliability. Psychometrika, 66(1), 63-68.

Sackett, P. R., & Yang, H. (2000). Correction for range restriction: An expanded typology. Journal of Applied Psychology, 85(1), 112-118.

Hunter, J. E., & Schmidt, F. L. (2004). Methods of Meta-Analysis: Correcting Error and Bias in Research Findings. 2nd Ed. Thousand Oaks, CA: Sage.

The bottom line of this is that if you’re involved in selection, you should be familiar with the language used when discussing validity of selection tests. That language will include the concepts of reliability, correction, and range restriction discussed here.

II. Construct Validity II (recall that Construct Validity I was presented in the fall.)

Definition: the extent to which observable test scores represent positions of people on underlying, not directly observable, dimensions, often called latent variables or theoretical constructs.

Typically, these constructs are characteristics that refer to complex collections of behaviors, so complex that it is difficult to determine whether a person possesses the characteristics in any reasonable length of time without using a specifically formulated test for that characteristic.

Examples of such variables are intelligence, depression, conscientiousness, emotional stability, leadership ability, motivating potential, resilience, etc.

Such variables are typically the building blocks of psychological theory. Theories are about the relationships among such general, complex characteristics.

Such variables are often found when describing persons in high-level positions within organizations.

Assessing Construct Validity.

How do we know a test measures an unobservable dimension?

This is kind of like pulling oneself up by one’s own bootstraps.

Solution: Ultimately construct validity is based on a subjective consensus of opinion of persons knowledgeable about the construct under consideration.

We begin with a measure of the construct that knowledgeable people agree measures the construct.

The construct validity of this first measure of a construct is purely subjective.

We correlate subsequent measures of the construct with the existing measure (or measures).

The construct validity of the 2nd and subsequent measures of a construct use the first measure as a baseline. It is based on objectively determined correlation coefficients.

Subsequent evaluations add to our knowledge of what the construct is.

Assessing Construct Validity of the subsequent measures

Generally speaking, a new measure of a construct has high construct validity if

a. the new measure has high convergent validity, i.e., it correlates strongly with other purported measures of the construct, and

b. the new measure has high discriminant validity, i.e., it correlates negligibly (i.e., near zero) with measures of other constructs which are unrelated to (correlate zero with) the construct under consideration. Discriminant validity refers to lack of correlation.

Convergent validity: The correlation of a test with other purported measures of the same construct.

Two ways of assessing Convergent validity of a test

1. Correlation approach: Correlate scores on the test with other measures of same construct.

High positive correlations indicate that your test measures the same thing as those other tests.

2. Group Differences approach: Find groups known to differ on the construct.

Determine if they are significantly different on the test.

Example: Assessing the construct validity of a new measure of extraversion?

Suppose sales are determined to a considerable extent by extraversion.

Get a group of people with good sales and a group of people with low sales.

Presumably, the high sales group will be high in extraversion and the low sales group will be low.

Give both groups your test.

Compare means on your test using the t-test.

If the high sales group scores higher on your test than the low sales group, this is consistent with your test measuring a characteristic associated with high sales - extraversion.
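A minimal sketch of the group-differences comparison, with invented scores; Welch's t is used here since the two groups need not have equal variances:

```python
import math

def welch_t(group1, group2):
    """Welch's t statistic for comparing mean test scores of two groups."""
    n1, n2 = len(group1), len(group2)
    m1, m2 = sum(group1) / n1, sum(group2) / n2
    v1 = sum((x - m1) ** 2 for x in group1) / (n1 - 1)
    v2 = sum((x - m2) ** 2 for x in group2) / (n2 - 1)
    return (m1 - m2) / math.sqrt(v1 / n1 + v2 / n2)

# Hypothetical extraversion-test scores for high-sales and low-sales groups.
high_sales = [34, 38, 41, 36, 39]
low_sales  = [28, 31, 27, 33, 30]
t = welch_t(high_sales, low_sales)
print(round(t, 2))  # 4.84 - high-sales group scores higher, as predicted
```

A large positive t, with the high-sales group on top, is the pattern consistent with convergent validity.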

Discriminant validity: The absence of correlation of a test with measures of other theoretically unrelated constructs.

Two ways of assessing Discriminant validity of a test.

1. Correlation approach: Correlate the test with measures of other, unrelated constructs. Near-zero correlations mean good discriminant validity.

Conscientiousness: correlation with extraversion should be zero since they’re different constructs.

2. Group Differences approach: Show that groups known to differ on other constructs are not significantly different on the test.

Suppose you’ve developed a new measure of conscientiousness. To assess its discriminant validity

Find a group high on extraversion (sales people) and a group low on extraversion (clerks).

Give them all the conscientiousness test and compare the mean difference between the two groups on the test.

If the conscientiousness test has good discriminant validity, there will not be a significant difference between the 2 groups, since sales people and clerks are probably about equal in conscientiousness.

If it does not have discriminant validity, the two groups will differ significantly.

So, establishing construct validity involves correlating a test with other measures of the same construct and of different constructs. We expect high correlations with measures of the same construct and zero correlations with different constructs.

Note that high power is required when demonstrating discriminant validity. If there is discriminant validity, there will not be a relationship in true scores (approach 1 above) or there will not be a difference in true means (approach 2 above). You must be able to argue that the absence of a relationship or difference was not due to low power.

“The Good Test” (Sundays at 10 PM on The Learning Channel)

High reliability: Large positive reliability coefficient - .8 or better to guarantee acceptance

Good Validity

Good Convergent Validity: Large correlations in the expected direction with other measures of the same construct.

Good Discriminant Validity: Small correlations with measures of other independent constructs.

Examples of assessing Construct validity from our research

Convergent Validity of Bifactor model measures of the Big 5 vs. Scale score measures.

Typically, psychological constructs are assessed using summated scores. (See the next lecture for more than you ever wanted to know about summated scores.)

We have been investigating the possibility that responses to personality items are affected by an "affective bias", a tendency to express the affective state of the respondent in his or her response to the content of an item.

We believe that measures of the Big Five dimensions with this “affective bias” removed will be better estimates of the dimensions – “purifying” them, if you will.

At the same time, there is considerable evidence of the usefulness of the Big Five summated scale scores.

For that reason, our “purified” measures should still exhibit convergent validity with the summated scale scores.

Evidence

NEO-FFI-3 questionnaire. N=736.


Convergent validity correlations of scale scores with “purified” scores.

Extraversion  Agreeableness  Conscientiousness  Stability  Openness
    .867          .915             .881            .981      .909

So the “Purified” measures correlate strongly with the scale score measures, as they should.

But the correlations are not perfect, meaning that perhaps the “contamination” that is present in the scale scores is not in the “purified” scores.

Do the HEXACO-PI-R measures of the Big Five exhibit convergent validity with NEO-FFI-3 measures?

This is a simple, straightforward test of convergent validity.

The NEO-FFI questionnaire has been used for many years.

The HEXACO questionnaire has been more recently promoted.

The HEXACO is said to measure the Big Five plus one more measure – Honesty/Humility.

What is the convergent validity of corresponding scale scores from the two questionnaires?

F2014 Neo-FFI plus HEXACO DualResponders 141227. N= 1195


Convergent validity correlations of NEO-FFI-3 scale scores with HEXACO-PI-R scale scores.

Extraversion  Agreeableness  Conscientiousness  Stability  Openness
    .781          .532             .763            .444      .733

Discriminant validity correlations are the off-diagonal entries below. Most of them are reasonably small, although one, the correlation of NEO Stability with HEXACO Extraversion, is quite large.

             hx     ha     hc     hs     ho
nx   r     .781   .174   .185  -.052  -.026
     Sig   .000   .000   .000   .075   .371
na   r     .233   .532   .307  -.216   .105
     Sig   .000   .000   .000   .000   .000
nc   r     .326   .174   .763  -.003  -.048
     Sig   .000   .000   .000   .923   .095
ns   r     .509   .348   .237   .444  -.023
     Sig   .000   .000   .000   .000   .421
no   r     .035   .059   .077  -.086   .733
     Sig   .228   .042   .008   .003   .000

Wow!! If you’re measuring Agreeableness or Stability, you have to decide which questionnaire to use. It appears that those two scales from the NEO-FFI-3 measure something different from the HEXACO scales of the same name.

Others have reported similar results. The NEO and HEX measure Agreeableness and Stability differently.

Convergent and Discriminant Validity of Response Inconsistency

We’ve been studying a measure of response inconsistency, defined by the standard deviation of responses to items within the same scale.

An overall measure for a questionnaire is the average of standard deviations of responses to items within all the scales within that questionnaire.

We compute an Inconsistency measure from the NEO-FFI-3 and an Inconsistency measure from the HEXACO-PI-R administered to the same respondents.
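The inconsistency measure just described can be sketched as follows. The scale names and responses are invented, and the SD here is the population (n-denominator) form; the original computation may use the n-1 form:

```python
import math

def scale_sd(responses):
    """SD of one respondent's answers to the items within one scale."""
    n = len(responses)
    m = sum(responses) / n
    return math.sqrt(sum((x - m) ** 2 for x in responses) / n)

def inconsistency(responses_by_scale):
    """Overall inconsistency: mean of the within-scale SDs across scales."""
    sds = [scale_sd(items) for items in responses_by_scale.values()]
    return sum(sds) / len(sds)

# Hypothetical 1-5 Likert responses for one person, grouped by scale.
person = {
    "extraversion":      [4, 4, 5, 4],
    "conscientiousness": [2, 5, 1, 4],  # very inconsistent within this scale
}
print(round(inconsistency(person), 3))  # 1.007
```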

Here’s a scatterplot illustrating the convergent validity of inconsistency measured in the two questionnaires.

Pearson r of NEOSD with HEXSD is .613, p < .001.

So the two measures exhibit pretty good convergent validity.

Because they’re so highly correlated, I formed a composite Inconsistency measure, MEANSD, which is the mean of the NEO and HEX SD measures.

Is Inconsistency separate from the Big 5 dimensions? Here are discriminant validity correlations of inconsistency with scale scores from the two questionnaires.

MEANSD with:     nx     na     nc     ns     no     hx     ha     hc     hs     ho
Pearson r      .097   .075   .122  -.077   .132   .066  -.081   .089  -.051   .037
Sig (2-tailed) .000   .004   .000   .004   .000   .007   .001   .000   .039   .130
N              1445   1445   1445   1445   1445   1670   1670   1670   1670   1670

Although some of the correlations are significantly different from zero, the largest in absolute value is only .132, so generally, I feel that it's reasonable to conclude that inconsistency has high discriminant validity. It's not measuring what the Big Five or HEXACO scales are measuring.

Of course, the question is, “Is it measuring anything worthwhile?” We’ll see.

Is grit different from conscientiousness?

Recently, there has been some attention given to a concept called grit.

(Mike – play the excerpt from Chicago Med here.)

Sample Conscientiousness item: I plan ahead and organize things, to avoid scrambling at the last minute.

Sample Grit item: I have overcome setbacks to conquer an important challenge.

The question: Should we "allow" grit into the museum of psychological constructs, along with Extraversion, Agreeableness, Conscientiousness, Emotional Stability, Openness to Experience, Self-esteem, Depression, etc.?

Correlations of the Grit 12-item Scale (Duckworth, Peterson, Matthews, & Kelly, 2007) with the HEXACO scales:

                 hx     ha     hc     hs     ho     hh
Pearson r       .375   .140   .534   .129  -.020   .266
Sig (2-tailed)  .000   .000   .000   .001   .612   .000
N                645    645    645    645    645    645

The correlation of 0.534 with Conscientiousness is one of those gray area correlations, in my view.

If it were .8 or .9, there would be no question that the grit scale was just measuring Conscientiousness.

If the correlation were .1 or .2, I’d have no problem with considering grit as a psychological characteristic, separate from Conscientiousness.

But, .534??

III. Content Validity

The extent to which test content represents the content of the dimension of interest.

Example of a test with high content validity: A test of general arithmetic ability that contains items representing all the common arithmetic operations - addition, subtraction, multiplication, and division.

Example of a test with low content validity: A test of general arithmetic ability that contains only measurement of reaction time and spatial ability.

Note that the issue of whether or not a test contains the content of the dimension of interest has no direct bearing on whether or not scores on the test are correlated with position on that dimension. Of course, the assumption is that tests with high content validity will show high correlations with the dimensions represented by the tests.

Why bother with Content Validity in view of the previous emphasis on correlations?

1. Time and Money. In many selection situations, it is easier to demonstrate content validity than criterion-related validity. A validation study designed to assess criterion-related validity requires at least 200 participants. In a small or medium-sized company, it may be impossible to gather such data in a reasonable period of time. (The VALDAT data on the validity of the formula score as a predictor of performance in our programs has been gathered over a period of 12 years, increasing at the rate of about 20 per year. We didn't have 200 until 10 years into the project.)

2. Politics. It is easier to make lay persons understand the results of a content-valid test than one that has high criterion-related validity but is not content valid. This includes the courts.

Assessing Content Validity: The Content Validity Ratio

1. Convene a panel of subject matter experts (SMEs).

2. Have each judge rate each item on the test as a) Essential, b) Useful, or c) Not necessary. Label the total number of judges as N.

3. For each item, count the number of judges rating the item as essential. Label this count NE.

4. For each item, compute the Content Validity Ratio:

CVR = (NE - N/2) / (N/2)

5. Compute the mean of the individual item CVRs as the test CVR.

Note that the CVR can range from +1, representing highest possible content validity, to -1, representing lowest possible content validity. Tables of "statistically significant" CVRs have been prepared.

Aiken, L. R. (1980). Content validity and reliability of single items or questionnaires. Educational and Psychological Measurement, 40, 955-959.

Penfield, R., & Giacobbi, P. (2004). Applying a score confidence interval to Aiken's item content-relevance index. Measurement in Physical Education and Exercise Science, 8(4), 213-225.

Schmidt, F. L. (2012). Cognitive tests used in selection can have content validity as well as criterion validity: A broader research review and implications for practice. International Journal of Selection and Assessment, 20, 1-13.

