
Problems With Individual Difference Measures Based on Some Componential Cognitive Paradigms

William P. Dunlap, Tulane University
Robert S. Kennedy, Essex Corporation
Mary M. Harbeson, Tulane University
Jennifer E. Fowlkes, Essex Corporation

This article demonstrates that slope and ratio scores may have the same psychometric difficulties—low reliability—as difference scores. Empirically, direct measures and derived scores from Baron's, Collins', Meyer's, and Posner's cognitive paradigms were examined in terms of their reliabilities and cross-correlations. Reliabilities of the direct measures and their intercorrelations were high. The derived measures, which were slope, ratio, and difference scores, had reliabilities near zero and, therefore, their cross-correlations were also low. It is concluded that derived scores, although intuitively appealing as measures of mental operations, may have inherent psychometric difficulties that render them of little value for differential prediction. Index terms: cognitive paradigms, difference scores, individual differences, prediction, ratio scores, reliability, slope scores.

In recent years measures of individual differences based on componential cognitive theory have been developed to supplement or replace traditional psychometric measures frequently used for selecting and training individuals (Carroll, 1976; Farr, 1984; Hunt, 1978, 1983, 1984; Office of Naval Research, 1987; Sternberg, 1986). Most componential theories of intelligence are variations of an information-processing model involving sequences of mental operations (Carroll, 1988; Farr, 1984; Wickens, 1984). Researchers often attempt to isolate these mental operations by computing derived measures such as slope, difference, or ratio scores. This approach appears promising, as aspects of human cognitive performance exist that are not adequately measured by traditional IQ and ability tests (Carroll, 1976).

This research stemmed from three basic observations concerning the constructs developed in some cognitive paradigms:
1. Although existing cognitive paradigms have intuitive appeal because they are factorially rich (i.e., they measure a number of different mental constructs), the variance shared across various paradigms may be high, because a thread common to many of them involves contrasting latency-based performance on simple versions of a task to performance on increasingly complex forms of the same task;
2. Many derived scores in cognitive paradigms are based on slopes, which may have the same inherent statistical difficulties as difference and percent scores (Carter, Krause, & Harbeson, 1986); and
3. Ratio scores may have the same psychometric difficulties as difference and slope scores, which may cause their reliabilities to be low and render them impotent as predictors.

Clearly, measures denoting structure in intelligence function do not need to be good differential predictors to provide useful descriptions of cognitive processes. The extent to which most persons show the characteristic in question may be


important, correct, and descriptive; however, if there is not a large and reliable range of individual differences in the characteristic, it will suffer as a predictor. The present study was concerned with the use of derived scores for predictive purposes.

The use of derived cognitive scores as individual difference measures is typified by the work of Rose and colleagues (Rose, 1978; Rose & Fernandes, 1977), who developed an information-processing performance battery to be used as a selection tool. When developing the battery, they attempted to include information-processing tests with high reliability, statistical independence, and construct validity. Each information-processing task was supposed to involve at least one of the following operations in addition to encoding and responding: constructing, transforming, storing, retrieving, searching, and comparing. Different information-processing tests were assumed to tap different mental operations, and individual differences on different cognitive tasks were assumed to indicate relative skill in each operation.

From a differential prediction standpoint, the question is not whether mental operations or information-processing components exist, but whether the individual differences in these constructs are sufficient to make them useful predictors. Mental operations intuitively appear to take place in stages that take a measurable amount of time. The latency of certain processes can be demonstrated (Brown, 1958; Peterson & Peterson, 1959; Sperling, 1960). The fact that mental events take a finite amount of time, however, does not necessarily mean that there are reliable indicators of individual differences in those events.

For example, it has been known since the time of Donders (1868/1969; Smith, 1968) that reaction time for four choices takes longer than simple reaction time. However, the reaction time slope, which purportedly measures speed of complex information processing, has long been known to be unreliable. Another example is the Stroop (1935) phenomenon. It virtually always takes longer to name the color of words printed in conflicting colors than it does to read the names of the colors printed in black and white. Harbeson, Krause, Kennedy, and Bittner (1982), however, found that the difference score, which supposedly measures interference, is not reliable. In addition, the difference score does not appear to measure a construct separable from the scores from which it was derived.

In both examples, although differences exist in group performance between a simple and a more complex condition, individual difference scores are not reliable. In each case, the basic scores were reliable and were as highly correlated with each other as was statistically possible.

Slope, ratio, and difference scores do not always have reliabilities so low that they would be incapacitated as predictors. For examples of derived scores with reasonably useful reliabilities, see Jensen (1965) regarding derived scores from the Stroop (1935) phenomenon, or Rose (1974) and Rose and Fernandes (1977) for derived scores from other paradigms. However, given the following derivation and demonstration, it behooves researchers to address the reliability of derived scores before recommending them as predictors.

Reliability

Practice effects. Evaluating performance based on tests that are not stable with repeated testing creates at least two related dangers. First, constructs that change over time are unstable, and predicting from measures of such a construct will be compromised. Second, measurements of nonstable constructs can be misconstrued as unreliable. An adequate assessment of reliability can only be performed on a stabilized variable. Because reliability defines the upper limit to validity, it is essential to establish the reliabilities of putative measures of individual differences in cognitive ability prior to employing such scores in prediction or selection.

Difference scores. It has long been recognized that difference scores frequently have low reliability (Cronbach & Furby, 1970). Whenever the correlation between different tasks is near the reliabilities of those tasks, difference scores will be severely limited in reliability. This can be seen from the following equation for the reliability of difference scores (Cohen & Cohen, 1975, p. 64):


$$r_{dd} = \frac{(r_{aa} + r_{bb})/2 - r_{ab}}{1 - r_{ab}}\,, \qquad (1)$$

where $r_{aa}$ and $r_{bb}$ are the reliabilities of Task A and Task B, and $r_{ab}$ is the correlation between them.

Slope scores. Measures based on slope scores, which appear with increasing frequency in cognitive research paradigms, have demonstrable inherent statistical weaknesses. Empirical studies have shown that a number of slope measures have low reliabilities (Carter et al., 1986), which indicates that slope scores may have lower reliabilities—often near zero—than the scores from which they are derived. This occurs because, mathematically, slope scores can be recognized as either difference scores or as weighted averages of difference scores. The same statistical difficulties arise under common conditions (Cronbach & Furby, 1970). If the base line (usually task difficulty) is fixed, then the slope can be defined in terms of linear trend coefficients. For three points along the base line, these coefficients are -1, 0, +1, representing nothing more than the difference between the most and least difficult tasks. For four points, the coefficients are -3, -1, +1, +3, amounting to three times the difference score derived from the most extreme tasks, plus one times the difference score on the intermediate tasks.

Ratio scores. When the base line is not fixed, but is a variable in its own right, the slope becomes a ratio of two variables. The reliability of ratio scores, as shown below, is under quite reasonable conditions identical to the reliability of difference scores or slope scores, and thus suffers from the same statistical deficiencies. The reliability of a ratio can be derived by substituting a and b appropriately into Cohen and Cohen's (1975, p. 68) equation, where a is the numerator and b is the denominator of the ratio variable, which results in

$$r_{dd} = \frac{S_a^2 r_{aa} + S_b^2 r_{bb} - 2 S_a S_b r_{ab}}{S_a^2 + S_b^2 - 2 S_a S_b r_{ab}}\,. \qquad (2)$$

Under the condition that $S_a$ (the standard deviation of scores on Task A) equals $S_b$ (the standard deviation of scores on Task B), Equation 2 obviously simplifies to Equation 1; therefore, difference scores and ratio scores share the common statistical difficulty of low reliability whenever $r_{ab}$ approaches the average of $r_{aa}$ and $r_{bb}$.
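To make the implications of Equations 1 and 2 concrete, the following Python sketch (ours, not part of the original article; the function names are illustrative) computes the predicted reliability of a derived score from the component reliabilities and their intercorrelation. It shows how the derived score's reliability collapses as r_ab approaches the average of r_aa and r_bb, even when both components are highly reliable.

```python
def derived_reliability(r_aa, r_bb, r_ab):
    """Equation 1: reliability of a difference (or slope, or ratio)
    score when the two component scores have equal variances."""
    return ((r_aa + r_bb) / 2.0 - r_ab) / (1.0 - r_ab)

def derived_reliability_general(r_aa, r_bb, r_ab, s_a, s_b):
    """Equation 2: the general form for unequal standard deviations;
    it reduces to Equation 1 when s_a == s_b."""
    num = s_a**2 * r_aa + s_b**2 * r_bb - 2.0 * s_a * s_b * r_ab
    den = s_a**2 + s_b**2 - 2.0 * s_a * s_b * r_ab
    return num / den

# Two tasks that are themselves highly reliable (r_aa = r_bb = .90):
print(derived_reliability(0.90, 0.90, 0.50))  # 0.80: still useful
print(derived_reliability(0.90, 0.90, 0.85))  # 0.33: sharply degraded
print(derived_reliability(0.90, 0.90, 0.89))  # 0.09: essentially unreliable
```

Note the last case: when the basic scores are correlated almost as highly as their own reliabilities permit, the predicted reliability of the derived score falls toward zero.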

The goal of the present research was to study the reliability of derived measures from four cognitive paradigms. These paradigms were: graphemic and phonemic analysis (Baron, 1973; Baron & McKillop, 1975); semantic memory retrieval (Collins & Quillian, 1969); lexical decision making (Meyer, Schvaneveldt, & Ruddy, 1974); and letter classification (Posner & Mitchell, 1967).

The paradigms studied used a common method, reaction time to letters or words; therefore, the raw scores should share common variance. The derived scores were either differences, slopes, or ratios; the question addressed was whether such scores were sufficiently reliable to support their use in selection or prediction. The usefulness of these derived scores in explaining or describing cognitive processing was not questioned; rather, these paradigms were examined in terms of the psychometric usefulness of the emergent scores for differential prediction.

Method

Examinees

The examinees were 19 Navy enlisted men between the ages of 18 and 24 who volunteered for duty at the Naval Biodynamics Laboratory in New Orleans. All examinees were recruited, evaluated, and employed in accordance with procedures specified in Secretary of the Navy Instruction 2900.30 Series and Bureau of Medicine and Surgery Instruction 3900.6. These instructions are based on voluntary consent and meet or exceed the provisions of prevailing national and international guidelines.

Task Descriptions

Graphemic and phonemic analysis. This task was developed by Baron (1973; Baron & McKillop, 1975) to study visual versus auditory or articulatory reading strategies. Examinees were

required to judge whether phrases made sense or not under three conditions: Sense (e.g., our new car), Homophone (e.g., it's knot so), or Nonsense (e.g., a drop of ran). These were combined in pairs to form three basic conditions: Sense vs. Nonsense, Sense vs. Homophone, and Homophone vs. Nonsense. Theoretically, graphemic encoders would do better on Sense phrases and acoustic encoders would do better on Homophone phrases. There were 20 phrases in each condition, and the stimuli were displayed for approximately 3.5 seconds each. Four variables were recorded: response time for each of the three conditions, and the ratio of Sense vs. Homophone time to Homophone vs. Nonsense time.

Semantic memory retrieval. Collins and Quillian (1969) designed this task to study the hierarchical organization of semantic information in memory. Examinees were required to judge whether sentences describing three complexities of superset and property relationships were true; for example, A peach is a peach, A peach is a fruit, or A peach is food. Sentences describing more complex relationships were expected to require more processing time than those describing simple relationships. There were 24 sentences per day, including two sentences in each category. Each sentence was displayed for approximately 3.5 seconds. The variables were response times for the correct positive sentences for superset and property relationships at three levels of difficulty (accounting for six variables), and the slopes of the superset and property sentences, for a total of eight scores.

Lexical decision making. Meyer et al. (1974) used this task to study the effects of graphemic and phonemic factors on word recognition. Letter strings were paired according to graphemic and phonemic similarity or dissimilarity: for example, clash, flash; cheap, deep; rough, dough; brake, note. Each item was displayed for approximately 3.5 seconds. Examinees were asked to judge whether they were words or nonwords. Each day the examinees were shown a list of 40 letter strings, of which half were words and half were nonwords. Nine scores were recorded, including the response times to the second stimulus of each pair that were either phonemically or graphemically the same or different (four basic scores); three difference scores derived from the basic scores, representing Phonemic Facilitation, Graphemic Interference, and Phonemic Similarity; and response times for words and for nonwords correctly identified (Rose & Fernandes, 1977).

Letter classification. Posner and Mitchell (1967) used this task to study matching or recognition of stimuli at various levels of complexity. Examinees made same or different judgments on pairs of letters based on three criteria. Letters were classified by physical appearance (AA vs. AB), name identity (Aa vs. Ab), or category (both vowels or consonants, such as AE or BC, or not matched, such as AB). There were 36 trials per day in each of the first two conditions and 32 in the third. The stimulus items were each displayed for 2 seconds. Five scores were calculated, including response times for each of the three conditions for same judgments, and two difference scores (search time for name, and search time for category).
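Each derived score above is a simple function of the basic response times. The following Python sketch is purely illustrative (the numeric values and variable names are hypothetical, not data from the study) and shows one difference score, slope scores defined through linear trend coefficients, and Baron's ratio score.

```python
import numpy as np

# Hypothetical per-examinee mean response times in msec (names ours).
name_identity_rt = 620.0   # e.g., a Posner "Aa" name-identity match
physical_match_rt = 540.0  # e.g., a Posner "AA" physical match
search_time_for_name = name_identity_rt - physical_match_rt  # difference score

# Slope score over three fixed difficulty levels via linear trend
# coefficients (-1, 0, +1): nothing more than hardest minus easiest.
superset_rts = np.array([980.0, 1100.0, 1260.0])  # hypothetical Collins RTs
slope3 = np.dot([-1.0, 0.0, 1.0], superset_rts)   # 280.0

# With four levels the coefficients are (-3, -1, +1, +3): three times the
# extreme difference plus one times the intermediate difference.
rts4 = np.array([900.0, 1000.0, 1150.0, 1300.0])
slope4 = np.dot([-3.0, -1.0, 1.0, 3.0], rts4)
assert slope4 == 3 * (rts4[3] - rts4[0]) + (rts4[2] - rts4[1])

# Baron's ratio score: Sense-vs-Homophone time over
# Homophone-vs-Nonsense time.
sense_vs_homophone_rt = 1650.0
homophone_vs_nonsense_rt = 1560.0
ratio_score = sense_vs_homophone_rt / homophone_vs_nonsense_rt
```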
Apparatus and Procedure

The stimulus material was presented by means of black and white slides shown on a Kodak Ektograph 240 Audio Viewer™. The rate of presentation was controlled by preprogrammed tape cassettes. Examinees responded by pushing one of two buttons (yes or no) on boxes that were fastened to their desk tops. The answers and the response times were displayed on an automatic timing device and recorded on an answer sheet by the experimenter.

The examinees were tested in groups of four beginning at 8 a.m. for 15 consecutive workdays. Each group was tested at the same time each day. The four tests were administered in the same order to each group of examinees, but the order was varied across groups and days. There was a break of two or three minutes between tests while the experimenter changed carousels and cassette tapes, and a five-minute break between tests halfway through testing. The duration of each test was approximately 15 minutes per day. Total testing time was approximately 1.5 hours per day per group, including breaks.

Results

The 26 variables analyzed are presented in Table 1.


Table 1 Descriptions and Means and Standard Deviations (in msec) for the 26 Variables Analyzed

Scores 5, 6, and 7 from Meyer's paradigm and Scores 13 and 14 from Posner's paradigm are difference scores and, therefore, might be anticipated to have psychometric problems (Carter et al., 1986). Slope Scores 21 and 22 from Collins' paradigm should show psychometric problems similar to difference scores. Finally, the ratio score (Score 26) from Baron's paradigm should be vulnerable to the same psychometric problems as the difference and slope scores above.

Means and standard deviations for each variable (shown in Table 1) and average reliabilities and cross-correlations were computed in the following manner. Preliminary examination revealed that most variables had stabilized by day 5, and that performance on the final two days, 14 and 15, tended to be somewhat erratic on some tests. Therefore, the data were averaged over days 6 to 13, yielding eight data points per examinee per measure. For each measure, the 28 possible reliabilities [(8 × 7)/2] were computed, then these correlation coefficients were averaged. For cross-correlations, the 64 possible correlation coefficients were computed and averaged. These averaged reliabilities (diagonal elements) and cross-correlations (off-diagonal elements) are presented in Table 2.
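Expressed in code, the averaging procedure is straightforward. The following Python sketch is our reconstruction of the computation just described, not the authors' software; it assumes each measure is stored as an examinees-by-days array covering days 6 through 13.

```python
import numpy as np
from itertools import combinations, product

def averaged_reliability(x):
    """x: (n_examinees, 8) scores for one measure, days 6-13.
    Averages the (8 * 7) / 2 = 28 between-day correlations."""
    rs = [np.corrcoef(x[:, i], x[:, j])[0, 1]
          for i, j in combinations(range(x.shape[1]), 2)]
    return float(np.mean(rs))

def averaged_cross_correlation(x, y):
    """x, y: (n_examinees, 8) scores for two different measures.
    Averages the 8 * 8 = 64 between-day, between-measure correlations."""
    rs = [np.corrcoef(x[:, i], y[:, j])[0, 1]
          for i, j in product(range(x.shape[1]), range(y.shape[1]))]
    return float(np.mean(rs))
```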

Table 2 Averaged Reliabilities (Diagonal Elements) and Cross-Correlations (Off-Diagonal Elements) for the 26 Variables

As the correlation matrix in Table 2 shows, the rows and columns corresponding to the difference, slope, and ratio scores have consistently low correlations. On the diagonal, it can be seen that these variables also have very low reliabilities. Therefore, the failure of these variables to predict other variables is not surprising: They do not even predict themselves. The average rs shown in Table 2 demonstrate that the derived scores are poorly related to other variables, a fact easily predictable from their very low reliabilities and from the literature (Carter et al., 1986). Although this lack of cross-correlation may indicate that different domains of cognitive functioning are being sampled, examination of the reliabilities reveals a more parsimonious explanation: Because reliability places an upper limit on validity, such measures were psychometrically deficient at the outset.

To demonstrate the problems inherent in the difference, slope, and ratio scores from a psychometric perspective, Table 3 presents the derived measures (d), the basic scores from which they were derived (i, j), and the relevant cross-correlations and reliabilities (from Table 2), to permit computation of the predicted reliabilities of derived scores from Equation 1. The next-to-last column in Table 3 presents the predicted reliabilities, and the last column presents the actual reliabilities. It can be seen that only Score 14, Posner's Category Retrieval, has any psychometric basis for reliability, and its actual reliability, although a modest .55, was the highest of the derived scores.

Discussion

The results indicate that the theoretically derived scores from these cognitive paradigms have inherent statistical weaknesses that might undermine their usefulness for selection and/or prediction. However, one potential problem in this research is that only 19 examinees were studied. Although a sample size of 19 argues for substantial instability of the correlation coefficients estimated, these are not single estimates but are correlations averaged over a number of trials. Dunlap, Silver, Hunter, and Bittner (1985) and Dunlap, Silver, and Bittner (1986) studied the impact of averaging cross-correlations and reliabilities, respectively, across trials with small sample sizes, and concluded that the standard errors of the resulting estimates were dramatically improved by the averaging process, particularly when the underlying estimated correlation was small. With a population correlation near zero, the precision of a cross-correlation estimated across eight repeated measures, as in the present study with an actual sample of 19, is as accurate as a single estimate of the same correlation from a sample of slightly less than 200 examinees. Reliabilities near zero are estimated with a precision equivalent to a single estimate from somewhat over 100 examinees. Therefore, the low average reliabilities and cross-correlations revealed for difference, slope, and ratio scores in the present study are not an artifact of small sample size.
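This precision argument can be checked by simulation. The Python sketch below is ours and purely illustrative: it assumes a simple model in which each measure mixes a stable person component with independent daily noise, draws two measures with a true cross-correlation of zero for 19 simulated examinees over 8 days, and converts the empirical standard error of the averaged cross-correlation into the sample size a single correlation would need for the same precision. The equivalent sample size it reports depends on the assumed stability parameter; the figures quoted above (about 200 and about 100) come from Dunlap et al.'s analyses, not from this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def se_of_averaged_r(n=19, days=8, stability=0.5, reps=500):
    """Empirical SD of the mean of the 64 cross-correlations between two
    measures whose true cross-correlation is zero.  Each measure mixes a
    stable person component (proportion `stability` of variance) with
    independent daily noise."""
    means = []
    for _ in range(reps):
        pa, pb = rng.standard_normal((2, n, 1))  # stable person components
        a = np.sqrt(stability) * pa + np.sqrt(1 - stability) * rng.standard_normal((n, days))
        b = np.sqrt(stability) * pb + np.sqrt(1 - stability) * rng.standard_normal((n, days))
        rs = [np.corrcoef(a[:, i], b[:, j])[0, 1]
              for i in range(days) for j in range(days)]
        means.append(np.mean(rs))
    return float(np.std(means))

se = se_of_averaged_r()
# A single correlation near zero has SE ~ 1/sqrt(n - 1); invert that to
# find the equivalent single-estimate sample size:
print(f"SE of averaged r: {se:.3f}; equivalent single-sample n: {1 + 1 / se**2:.0f}")
```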

Table 3 Predicted and Actual Reliabilities (r_dd) of Derived Scores (d) From Basic Scores (i, j) by Equation 1 for Four Cognitive Paradigms

Another potential problem in interpreting the correlations may be range restriction, because the examinees were a highly selected group. On the other hand, if range restriction were a substantial problem, all correlations, including the reliabilities and intercorrelations of raw scores, should be low, which is clearly not the case.

Of course, theoretical constructs may be important to description and understanding, although less useful for differential prediction purposes. Cognitive constructs need not be independent or explain new and untapped variance to be useful in structural theory. For differential prediction purposes, however, the latter two qualities are important.

Finally, not all difference, slope, or ratio scores are necessarily unreliable or of low validity (e.g., Mumaw, Pellegrino, Kail, & Carter, 1984); but on statistical grounds, they are quite vulnerable to problems with reliability, and certainly issues of reliability with the target population should be addressed before they are included in a prediction/selection battery.

No factor analysis of the present data was attempted because of the small sample size, but the present findings have important implications for factor analyses of the derived scores using these paradigms. In the full-factor model, the diagonal of the correlation matrix being factored is replaced with best estimates of the reliabilities of the variables analyzed. These reliabilities establish the upper limit on the communality, h², of each variable, where h² is the proportion of variance a given variable shares with the factors. In the past, the fact that these derived scores do not load on factors that load heavily on the basic scores has been interpreted as an indication that the derived scores index some "new" or "emergent" factor. However, the present data argue strongly that the reason these derived scores do not load heavily on the more traditional factors is that, because of their low reliability, they have little "true score" variance to share with any factor.
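The arithmetic behind this last point is brief: with orthogonal factors, a variable's communality h² is the sum of its squared loadings, so no single loading can exceed the square root of h², which in turn cannot exceed the variable's reliability. A minimal illustration, assuming a derived-score reliability of .10 (an illustrative value, not one taken from Table 2):

```python
import math

r_dd = 0.10                    # illustrative derived-score reliability
max_loading = math.sqrt(r_dd)  # upper bound on any single factor loading
print(round(max_loading, 2))   # 0.32: too little true-score variance to load
```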
References

Baron, J. (1973). Phonemic stage not necessary for reading. Quarterly Journal of Experimental Psychology, 25, 214-246.
Baron, J., & McKillop, B. J. (1975). Individual differences in speed of phonemic analysis, visual analysis, and reading. Acta Psychologica, 39, 91-96.
Brown, J. (1958). Some tests of the decay theory of immediate memory. Quarterly Journal of Experimental Psychology, 10, 12-21.
Carroll, J. B. (1976). Psychometric tests as cognitive tasks: A new "structure of intellect." In L. B. Resnick (Ed.), The nature of intelligence (pp. 27-56). Hillsdale NJ: Erlbaum.
Carroll, J. B. (1988). Individual differences in cognitive functioning. In R. C. Atkinson, R. J. Herrnstein, G. Lindzey, & R. D. Luce (Eds.), Stevens' handbook of experimental psychology (pp. 813-862). New York: Wiley.
Carter, R. C., Krause, M., & Harbeson, M. M. (1986). Beware the reliability of slope scores for individuals. Human Factors, 28, 673-683.
Cohen, J., & Cohen, P. (1975). Applied multiple regression/correlation analysis for the behavioral sciences. Hillsdale NJ: Erlbaum.
Collins, A. M., & Quillian, M. R. (1969). Retrieval time from semantic memory. Journal of Verbal Learning and Verbal Behavior, 8, 240-247.
Cronbach, L. J., & Furby, L. (1970). How we should measure "change"—or should we? Psychological Bulletin, 74, 68-80.
Donders, F. C. (1969). On the speed of mental processes (W. G. Koster, Trans.). Acta Psychologica, 30, 412-431. (Original work published 1868)
Dunlap, W. P., Silver, N. C., & Bittner, A. C., Jr. (1986). Estimating reliability with small samples: Increased precision with averaged correlations. Human Factors, 28, 685-690.
Dunlap, W. P., Silver, N. C., Hunter, R. E., & Bittner, A. C., Jr. (1985). Averaged cross-correlations: A methodology for validity assessment in small samples. In R. Eberts & C. G. Eberts (Eds.), Trends in ergonomics/human factors II (pp. 13-21). Amsterdam: Elsevier.
Farr, M. J. (1984). Naval Research Reviews, 1, 33-36.
Harbeson, M. M., Krause, M., Kennedy, R. S., & Bittner, A. C., Jr. (1982). The Stroop as a performance evaluation test for environmental research. Journal of Psychology, 111, 223-233.
Hunt, E. (1978). Mechanics of verbal ability. Psychological Review, 85, 109-130.
Hunt, E. (1983). On the nature of intelligence. Science, 219, 141-146.
Hunt, E. (1984). Intelligence and mental competence. Naval Research Reviews, 1, 37-42.
Jensen, A. R. (1965). Scoring the Stroop test. Acta Psychologica, 24, 398-408.
Meyer, D. E., Schvaneveldt, R. W., & Ruddy, M. G. (1974). Functions of graphemic and phonemic codes in visual word recognition. Memory and Cognition, 2, 309-321.


Mumaw, R. J., Pellegrino, J. W., Kail, R. V., & Carter, P. (1984). Different slopes for different folks: Process analysis of spatial aptitude. Memory and Cognition, 12, 515-521.
Office of Naval Research. (1987). Psychological Sciences Division 1985 programs. Arlington VA: Office of Naval Research.
Peterson, L. R., & Peterson, M. J. (1959). Short-term retention of individual verbal items. Journal of Experimental Psychology, 58, 193-198.
Posner, M. I., & Mitchell, R. F. (1967). Chronometric analysis of classification. Psychological Review, 74, 392-409.
Rose, A. M. (1974). Human information processing: An assessment and research battery (Tech. Rep. No. 46). Ann Arbor MI: University of Michigan, Department of Psychology.
Rose, A. M. (1978). An information processing approach to performance assessment (Rep. No. AIR 58500-11/78-FR). Washington DC: American Institutes for Research.
Rose, A. M., & Fernandes, K. (1977). An information processing approach to performance assessment: Experimental investigation on an information processing performance battery (Tech. Rep. No. 1). Arlington VA: American Institutes for Research.
Smith, E. E. (1968). Choice reaction time: An analysis of the major theoretical positions. Psychological Bulletin, 69, 77-110.
Sperling, G. (1960). The information available in brief visual presentations. Psychological Monographs, 74 (Whole No. 498).
Sternberg, R. J. (1986). Inside intelligence. American Scientist, 74, 137-143.
Stroop, J. R. (1935). Studies of interference in serial verbal reactions. Journal of Experimental Psychology, 18, 643-662.
Wickens, C. D. (1984). Engineering psychology and human performance. Columbus OH: Charles E. Merrill.

Acknowledgments

The authors thank Andrew Rose, who generously loaned the stimulus materials for the basic data collected in these studies.

Author's Address

Send requests for reprints or further information to Robert S. Kennedy, Essex Corporation, 1040 Woodcock Road, Suite 227, Orlando FL 32803, U.S.A.
