Analyzing Test Content Using Cluster Analysis and Multidimensional Scaling

Stephen G. Sireci and Kurt F. Geisinger
Fordham University

Applied Psychological Measurement, Vol. 16, No. 1, March 1992, pp. 17-31.

A new method for evaluating the content representation of a test is illustrated. Item similarity ratings were obtained from content domain experts in order to assess whether their ratings corresponded to item groupings specified in the test blueprint. Three expert judges rated the similarity of items on a 30-item multiple-choice test of study skills. The similarity data were analyzed using a multidimensional scaling (MDS) procedure followed by a hierarchical cluster analysis of the MDS stimulus coordinates. The results indicated a strong correspondence between the similarity data and the arrangement of items as prescribed in the test blueprint. The findings suggest that analyzing item similarity data with MDS and cluster analysis can provide substantive information pertaining to the content representation of a test. The advantages and disadvantages of using MDS and cluster analysis with item similarity data are discussed. Index terms: cluster analysis, content validity, multidimensional scaling, similarity data, test construction.

The term "content validity" traditionally has been used to refer to how well the items on a test represent the underlying domain of skill or knowledge tested (Thorndike, 1982). However, use of this term has been criticized by several theorists who describe validity as a unitary concept (Fitzpatrick, 1983; Guion, 1977; Messick, 1975, 1989a, 1989b; Tenopyr, 1977). Content validity has become controversial because many current test specialists define validity in terms of inferences derived from test scores, but studies of content validation rarely employ test score data. Rather, a test's content is usually "validated" through more subjective methods, such as ratings of test items by content experts (Osterlind, 1989).

In order to retain the importance of ensuring content domain representation and at the same time to avoid use of the term "validity," several test specialists have suggested replacing "content validity" with a more technically correct term. Messick (1975, 1980) suggested the terms "content relevance" and "content representation," Guion (1978) offered the term "content domain sampling," and Fitzpatrick (1983) recommended use of the term "content representativeness." Thus, regardless of the terminology employed, the ability of a test to represent its underlying content domain continues to be an issue of paramount importance in test construction.

The present paper presents and explores a new approach designed to evaluate the content representation of a test. Because item response data are not used in this approach, the terms "content representation" and "content representativeness" are used to encompass the psychometric concerns previously attributed to content validity.

Previous Methods of Evaluating Test Content

Evaluations of test content can be classified as either subjective or empirical. Subjective methods employ subject matter experts (SMEs) to evaluate and rate the relevance and representativeness of test items to the domain of knowledge tested. Examples of subjective methods for evaluating the content representation of a test are provided by Hambleton (1980, 1984), Lawshe (1975), and Morris and Fitz-Gibbon (1978). Hambleton's (1980, 1984) method provides an index of item-objective congruence that is appropriate for criterion-referenced tests in which each item is linked to a single objective.

This index reflects an averaging of SME ratings regarding the extent to which items measure their specified objective. Lawshe's (1975) procedure results in a content validity index that represents an averaging of SME item relevance ratings for items retained on a test after the review process. The procedure developed by Morris and Fitz-Gibbon (1978) provides several indices that reflect an averaging of SME judgments regarding the objectives measured by each item, the relevance of these objectives to the purpose of the testing, the appropriateness of the item formats, and the appropriateness of the expected item difficulties.

Rather than relying on subjective evaluations of test items, some test theorists recommend empirical analyses of item response data to discover underlying content structure. Empirical studies of test content include applications of multidimensional scaling (MDS) and cluster analysis (Napior, 1972; Oltman, Stricker, & Barrows, 1990) and applications of factor analysis (Cattell, 1957). These analytic procedures result in dimensions, factors, and clusters that are presumed to be relevant to the content domains measured by the test. Davison (1985) reviewed applications of factor analysis and MDS to test intercorrelations. His review, accompanied by Monte Carlo comparisons of the two procedures, revealed that MDS often led to a more parsimonious representation of test structure than did factor analysis. This finding was explained by the presence of item difficulty factors that appeared in the factor analyses but did not emerge as dimensions in the MDS solutions.

The subjective methods of evaluating test content have been criticized severely due to their lack of practicality. Crocker, Miller, and Franks (1989) and Thorndike (1982) point out that these techniques are rarely used in practice. One reason for this lack of application is that subjective methods tend to support the content structure of a test implicitly, because presenting judges with the content objectives of the test may bias their judgments by imposing an external structure on their ratings. Indeed, both Crocker et al. (1989) and Osterlind (1989) recommended that the judges not be informed of the item-objective specifications of the test blueprint. Furthermore, Crocker et al. pointed out that item ratings often differ due to minor changes in the wording of the directions to the judges. Clearly, it would be beneficial to modify these procedures to avoid imposing the test blueprint on the content domain experts' judgments and to gain economy of time, money, and personnel.

The empirical methods of content assessment have also been criticized. These criticisms stem from the presence of content-irrelevant factors associated with item response data (Green, 1983). The ability of a test to represent its corresponding content domain is a quality inherent in the test, independent of examinee responses to the items. Item difficulty, the ability level and variability of the examinee population, motivation, guessing, differential item functioning, and social desirability are variables that may affect the results of item response analyses but are irrelevant to assessment of content representation. Analyses employing test response data allow the performance of the tested population to determine the relationships between the test items while ignoring inherent item characteristics. Although such analyses may be relevant in evaluations of construct validity or criterion-related studies, they are not central to evaluations of test content (Messick, 1989b).

A pseudo-empirical study conducted by Tucker (1961) used factor analysis to evaluate test content, yet avoided the use of item response data. Tucker factor analyzed the ratings provided by the SMEs of the relevance of test items to the content domain tested. Two factors were identified: The first factor was interpreted as "a measure of general approval of the sample items" (p. 584). The second factor was interpreted as a measure of the differences between two groups of judges regarding which item types they considered more relevant (recognition items or reasoning items). Through this procedure, Tucker avoided the problems associated with factor analyzing dichotomous test data. However, his study did not directly evaluate the content areas comprising the test. The items were not rated in terms of their relevance to the specific content areas of the test, and the arrangement of items in the test blueprint was not directly evaluated.

The present study borrows from Tucker (1961) in that subjective data obtained from SMEs were employed to evaluate test content. It deviates from Tucker's method by altering the instructions given to the SMEs and by analyzing the data using MDS and cluster analysis, rather than factor analysis. MDS and cluster analysis were considered more appropriate than factor analysis in this case because similarity ratings were gathered, and because previous research (e.g., Napior, 1972; Oltman et al., 1990) demonstrated the ability of these techniques to uncover content-relevant test structure. The obtained item clusters also were compared directly with the test's blueprint specifications.

Method

Description of the Test

The content analysis was performed on a test of study skills (SST; Sireci, 1988) constructed to test the knowledge acquired by students at the end of a five-session Study Skills course. This test is a 30-item, four-option multiple-choice exam keyed to the concepts and skills that were taught in the course. The blueprint of the test specified six content areas derived directly from the course syllabus (see Table 1).

Table 1
Study Skills Test Blueprint

The Judges

Three judges (SMEs) were employed to evaluate the test items. Two of the judges had formerly taught a Study Skills course and were selected for their knowledge of the subject domain. The third judge, also familiar with the subject domain, was a psychometrician with many years of experience in the construction and evaluation of educational tests. All the judges were ignorant of the test blueprint.

Procedure

The SST was distributed to each of the three judges independently. The original order of the items on the test was randomly scrambled using a random sorting program. This procedure was used to control for any order effects that might have influenced the judges' similarity ratings. The task of each judge was to "Judge how similar the test questions are to each other according to the following scale." The scale presented to the judges was a 5-point Likert-type scale ranging from 1, "not at all similar," to 5, "extremely similar." The judges were not given any criteria on which to rate the similarity of the test items. This ambiguity in instructions was used to avoid biasing their ratings in favor of supporting the test blueprint. The judges rated the similarity of every item pair and entered their ratings into a matrix. Because reciprocal comparisons were not requested, each judge provided a 30 x 30 lower-triangular matrix.

Analyses

A series of weighted MDS (INDSCAL) analyses were performed on the three matrices of item similarity ratings. The INDSCAL analyses were followed by hierarchical cluster analyses of the MDS stimulus coordinates.

INDSCAL analyses. The three similarity matrices were analyzed using the INDSCAL model with the ALSCAL MDS program of SPSSx (Takane, Young, & de Leeuw, 1977; Young, Takane, & Lewyckyj, 1978).
INDSCAL is an individual differences MDS model that allows for differences among the raters in their relative weighting of the dimensions. The INDSCAL model was originally formulated by Carroll and Chang (1970) and is a generalization of the classical MDS model introduced by Torgerson (1958) and later expanded by Shepard (1962) and Kruskal (1964). In the INDSCAL model, each rater's dissimilarity matrix is multiplied by a vector of weights (w) consisting of elements w_ka that represent the relative emphasis rater k places on dimension a. The distances between stimuli in the INDSCAL model are computed by incorporating this weighting factor into the formula used by classical MDS. Thus, the INDSCAL model defines the distance between two objects i and j as

d_{ij} = \left[ \sum_{a=1}^{r} w_{ka} (x_{ia} - x_{ja})^{2} \right]^{1/2} ,

where d_{ij} = the weighted Euclidean distance between points i and j for rater k, x_{ia} = the coordinate of point i on dimension a, and r = the maximum dimensionality of the solution.

Young et al. (1978) describe ALSCAL as an "alternating least-squares approach" to MDS that transforms the observed similarity data into distances that are subsequently configured in the multidimensional space. This alternating least-squares approach specifies a loss function (S-STRESS) that is minimized during the data transformation process. The original similarity data were by necessity transformed to dissimilarity data as a preliminary step in the MDS analysis. With dissimilarities, a larger number indicates greater dissimilarity, rather than greater similarity. For this dataset, values of 5 were converted to 1, values of 4 were converted to 2, and so forth. The data for all the ALSCAL analyses were treated as ordinal data, and ties in the data were untied using the primary approach to ties (Kruskal, 1964).

Table 2
RSQ and STRESS Values for INDSCAL Analysis

Cluster analyses. After the appropriate multidimensional solutions were identified, the item coordinates from these solutions were analyzed using the hierarchical cluster analysis program in SPSSx. The cluster analyses were performed on the item coordinates to identify homogeneous item subsets within the multidimensional space. An inspection of these clusters facilitated interpretation of the item configurations and provided a more direct comparison with the item groupings specified in the test blueprint.

The method of average linkage between groups (Johnson, 1967; Sokal & Michener, 1958) was used as the basis for the clustering. This method maximizes the average distance between all pairs of items belonging to different clusters. Items are joined to a cluster when their average similarity value is most similar to the average similarity values of the other members of the cluster.
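The analyses above were carried out with the ALSCAL and hierarchical clustering programs in SPSSx, which are not part of common open-source toolkits. For readers who want to experiment with the same two-stage logic (scale the dissimilarities, then cluster the stimulus coordinates), the following sketch uses scikit-learn's nonmetric MDS and SciPy's average-linkage clustering as stand-ins. The pooling of the three judges' matrices, the random placeholder ratings, and all variable names are illustrative assumptions rather than the original procedure; in particular, ordinary nonmetric MDS does not estimate the judge weights w_ka that INDSCAL provides.

# Illustrative stand-in for the two-stage analysis (not the original SPSSx/ALSCAL run):
# 1) convert 5-point similarity ratings to dissimilarities, 2) fit a nonmetric MDS,
# 3) cluster the resulting item coordinates with average linkage.
import numpy as np
from sklearn.manifold import MDS
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
n_judges, n_items = 3, 30

# Placeholder ratings: each judge rates every item pair on a 1-5 similarity scale.
# In the study these came from the SMEs' lower-triangular rating matrices.
ratings = rng.integers(1, 6, size=(n_judges, n_items, n_items))
ratings = np.triu(ratings, 1) + np.triu(ratings, 1).transpose(0, 2, 1)  # symmetrize

# Similarity -> dissimilarity, mirroring the 5->1, 4->2, ... recoding in the text.
dissim = 6 - ratings
for d in dissim:
    np.fill_diagonal(d, 0)

# Pool the judges by averaging (an assumption; INDSCAL instead fits judge weights).
pooled = dissim.mean(axis=0)

# Nonmetric (ordinal) MDS on the pooled dissimilarities.
mds = MDS(n_components=3, metric=False, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(pooled)          # 30 x 3 item coordinates
print("stress:", round(mds.stress_, 3))

# Average linkage between groups on the item coordinates, cut into six clusters
# to compare against the six blueprint content areas.
Z = linkage(coords, method="average")
clusters = fcluster(Z, t=6, criterion="maxclust")
print("cluster assignments:", clusters)

# The weighted Euclidean distance from the INDSCAL equation above, for reference
# only (the nonmetric MDS stand-in does not estimate the weights w_k).
def indscal_distance(x_i, x_j, w_k):
    """d_ij = sqrt(sum_a w_ka * (x_ia - x_ja)^2)."""
    return float(np.sqrt(np.sum(w_k * (x_i - x_j) ** 2)))

With real data, the placeholder ratings would be replaced by the judges' matrices, and the resulting cluster memberships would be compared against the blueprint content areas, as is done in the Results section.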


Results

MDS Analyses

Although one- through six-dimensional representations of the data were attempted, the six-dimensional solution could not be computed. This was due to the presence of a singular matrix, generated internally by ALSCAL, that could not be inverted. Thus, only one- through five-dimensional representations of the data were obtained.

Interjudge agreement. Table 2 presents the fit indices of STRESS (departure of data from the model) and RSQ (proportion of variance accounted for by the model) for the INDSCAL solutions. The STRESS goodness-of-fit index generated by ALSCAL is Kruskal's STRESS formula 1 (see Schiffman, Reynolds, & Young, 1981, p. 175) and should be distinguished from S-STRESS, the loss function minimized in computation of the coordinates. Table 3 provides the dimension weights for each of the three judges. Inspection of the individual dimension weights and the individual fit indices for each matrix (judge) revealed slight differences among the judges. The individual STRESS and RSQ values listed in Table 2 indicate that there is a moderate degree of error variance in the data, especially for Judges 1 and 2. The third judge is the least aberrant, exhibiting consistently lower values of STRESS and higher values of RSQ. Table 3 shows that in the four- and five-dimensional solutions, Judge 3 has relatively smaller weights on the highest dimension in comparison to the other judges. In contrasting these weights with the rater weights obtained in the three-dimensional solution, it appears that the judges are most similar in the lower-dimensional space (two and three dimensions), whereas the differences among the judges are magnified in the higher-dimensional solutions (four and five dimensions).

Table 3 Weights for the Judges from the INDSCAL Analysis

*Proportion of variance among the judges accounted for by the dimension. The sum of the weights across dimensions equals RSQ.
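A note on the fit indices in Table 2: the STRESS value reported by ALSCAL is Kruskal's formula 1, whose standard form (stated here from the general MDS literature, not reproduced from this article) is

\mathrm{STRESS}_1 = \left[ \frac{\sum_{i<j} \left( d_{ij} - \hat{d}_{ij} \right)^{2}}{\sum_{i<j} d_{ij}^{2}} \right]^{1/2} ,

where d_{ij} are the distances in the fitted configuration and \hat{d}_{ij} are the disparities (the monotone transformation of the observed dissimilarities). RSQ, as conventionally computed, is the squared correlation between the disparities and the fitted distances.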


Figure 1
Rater Space From the Three-Dimensional INDSCAL Solution (1a Through 1c), and a Subset of the Four-Dimensional INDSCAL Rater Space (1d): (a) Dimension 1 Versus Dimension 2; (b) Dimension 1 Versus Dimension 3

The addition of a fourth or fifth dimension appears to be contributing information regarding individual differences among the judges. However, because no data were gathered on the differential characteristics of the judges and because the primary objective of the analysis was to discover information about the stimuli rather than the raters, an empirical investigation of these differences was not conducted.

Figures 1a through 1c display the rater space from the three-dimensional INDSCAL solution, and Figure 1d displays the first dimension plotted against the fourth dimension from the four-dimensional solution. In comparing these configurations, it can be seen that the judges are relatively similar in three-dimensional space, whereas the fourth dimension separates Judge 3 from the other two. Thus, in two- or three-dimensional space, the judges can be perceived as similar in their item ratings; in the higher-dimensional space, the judges appear less similar.

The average STRESS and RSQ values (see Table 2) indicate that there is a relatively large amount of error in these data. This finding may be due to the fact that the 5-point scale used to rate the stimuli was too restrictive and, therefore, more ties were present in the data than would be ideal. Davison (1983) stated that a scale containing 6 through 9 points "usually works quite well" (p. 42). However, many researchers using the method of paired comparisons employ larger scales. For example, Messick (1958) employed an 11-point scale; Wainer, Hurt, and Aiken (1976) and Wainer and Kaye (1978) employed a 15-point scale.

Selection of dimensionality. Two criteria often used to determine the appropriate dimensional representation of a dataset are fit and interpretability (Davison, 1983).

MDS solutions that fit the data well and are readily interpretable are desired. However, a paradox exists between fit and interpretability: Higher-dimensional solutions usually exhibit better fit but are usually more difficult to interpret.

The STRESS and RSQ values provided by ALSCAL indicated that the five-dimensional solution exhibited the best fit. To investigate further the best-fitting solution, these data were reanalyzed using the MULTISCALE-II computer program (Ramsay, 1981, 1986). Because MULTISCALE computes MDS distances using maximum likelihood, the log likelihood of each solution was used in a chi-square difference test to determine whether the addition of another dimension provided significant improvement in fit (Ramsay, 1980, 1986). The results of this analysis also indicated that the five-dimensional solution exhibited the best fit. [However, not all researchers, e.g., Arabie et al. (1987), agree that this test is appropriate for the INDSCAL model.]

Although the five-dimensional solution exhibited the best fit, Kruskal and Wish (1978) pointed out that higher-dimensional solutions may provide better fit to the data because the spatial configurations adapt to random error. Therefore, goodness of fit is not a sufficient criterion for the selection of appropriate dimensionality.

Davison (1983) and Kruskal and Wish (1978) asserted that readily interpretable solutions are preferable over solutions containing uninterpretable dimensions, even if the higher-dimensional solutions exhibit superior fit. Certain MDS solutions are more intuitively appealing to investigators and, therefore, the interpretability criterion often carries the most weight in determining dimensionality (Arabie et al., 1987, p. 36). However, accepting solutions based on interpretability is controversial. For example, Schiffman et al. (1981) asserted that "Dimensions which can not be interpreted probably do not exist" (p. 12), although Kruskal and Wish (1978) pointed out that "the fact that a particular investigator can not interpret a dimension does not mean that the dimension has no interpretation" (p. 57). Because the focus of this analysis was to obtain information regarding the relationships among the test items, acceptance of the higher-dimensional configurations will be contingent on their interpretability. In this study, interpretation of the solution space was enhanced by a priori knowledge of the items, namely, their content area specifications designated in the test blueprint. Therefore, interpretation of the stimulus space focused on item groupings that reflected common content attributes.

Figures 2 through 5 present selected stimulus configurations from the two- through four-dimensional INDSCAL solutions. The item content area specifications and the item symbols necessary to interpret the configurations are presented in Table 1. In the INDSCAL model, the orientation of the stimulus space is unique. Therefore, rotation of the dimensions is not necessary and the dimensions are directly interpretable. This feature obviates the need to search for other meaningful directions in the stimulus space (Arabie et al., 1987; Schiffman et al., 1981).

Figure 2
Stimulus Configuration From the Two-Dimensional INDSCAL Solution
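The dimensionality comparison described above is a standard likelihood-ratio test. In general form (stated here generically, not quoted from Ramsay), the statistic for comparing nested m- and (m + 1)-dimensional solutions is

\chi^{2} = 2\left[ \ln L_{m+1} - \ln L_{m} \right] ,

referred to a chi-square distribution with degrees of freedom equal to the difference in the number of parameters estimated by the two solutions.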


Figure 3 Three-Dimensional INDSCAL Stimulus Configuration Illustrating Item Content Area Specifications: Study Habits (Diamonds), Time Management (Squares), Classroom Learning (Flags), Textbook Learning (Balloons), Preparing for Exams (Cylinders), and Taking Exams (Spades); Ellipsoids Illustrate Results From Hierarchical Cluster Analysis

Figure 2 presents the stimulus configuration resulting from the two-dimensional solution. Dimension 1 (horizontal), labeled Actions Versus Preparations, appears to separate the content areas pertaining to scheduling and organizing activities (Time Management, Study Habits) from the more action-oriented activities (Taking Exams, Textbook Learning). Dimension 2 (vertical), Exam-Related Versus Non-Exam-Related, appears to separate the content areas pertaining to exam-related activities (Preparing for Exams, Taking Exams) from those less related to examinations (Classroom Learning, Textbook Learning).

The three-dimensional solution is presented in Figure 3, in which common symbols represent the content area specifications of the items. The first two dimensions reflect the dimensions obtained in the two-dimensional solution. The addition of the third dimension, Text-Related Versus Non-Text-Related, further separated the content area of Textbook Learning from the other content areas. With only a few exceptions, items corresponding to the same content areas were perceived as highly similar by the SMEs. The circles surrounding the items in Figure 3 reflect the results of the cluster analysis that are described below.

The four-dimensional solution was more difficult to interpret than the lower-dimensional solutions. Dimensions 1 through 3 reproduced the three dimensions obtained in the three-dimensional solution, but the fourth dimension was not readily interpretable in terms of the content attributes of the items; it distinguished somewhat between the content areas of Taking Exams and Preparing for Exams.


Figure 4 Three-Dimensional Subspace From the Four-Dimensional INDSCAL Stimulus Configuration (Content Area Symbols are the Same as Figure 3)

A three-dimensional subset of the four-dimensional solution that illustrates this distinction is presented in Figure 4. All possible three-dimensional subsets of the four-dimensional solution are not presented because the first three dimensions were highly similar to the three-dimensional solution.

The five-dimensional solution resulted in configurations highly similar to the lower-dimensional solutions. Three dimensions emerged corresponding to those noted in the three-dimensional solution, and another dimension separated "Taking Exams" items from "Preparing for Exams" items. However, the fifth dimension could not be interpreted, even when considering characteristics of the items not relevant to content (e.g., positively worded versus negatively worded items). Thus, the five-dimensional solution was not regarded as a valid representation of the data. The results generally indicate that at least three dimensions were necessary to distinguish between the content areas of "Classroom Learning" and "Textbook Learning," and that four dimensions were necessary to distinguish between the content areas of "Taking Exams" and "Preparing for Exams."

Cluster Analysis

Tables 4 and 5 show the items clustered at each stage of the clustering solution. In Figure 3, circles have been placed around those items that formed clusters at each stage, with the exception of the last three stages, where the major content areas were collapsed.

The cluster analysis performed on the three-dimensional coordinates revealed two item clusters that corresponded directly to two of the content areas specified in the test blueprint, "Classroom Learning" and "Textbook Learning" (see Table 4 and Figure 3). The three-dimensional cluster analysis merged the content areas of "Time Management" and "Study Habits" to form one cluster, and merged "Taking Exams" and "Preparing for Exams" to form another.


Table 4 Clustering Solution From Three-Dimensional INDSCAL Coordinates

One exception to these item groupings was Item 6, whose content area designation was "Preparing for Exams" but which clustered together with the "Classroom Learning" items. Thus, with the exception of one item, of the four substantive clusters that emerged in the three-dimensional cluster analysis, two directly corresponded to content areas prescribed in the test blueprint, and two represented combinations of two highly related content areas. Figure 5 presents a revised grouping of the items in which the content areas of "Time Management" and "Study Habits" are combined (plotted as diamonds), and Item 6 is reclassified as belonging to the "Classroom Learning" content area (plotted as flags).

The cluster analysis resulting from the four-dimensional INDSCAL coordinates revealed several item clusters that were also congruous with the test blueprint (see Table 5). Two of the clusters observed in the three-dimensional analysis also emerged: the cluster consisting of "Textbook Learning" items and the cluster consisting of the "Time Management" and "Study Habits" items. The content areas of "Taking Exams" and "Preparing for Exams" clustered as specified in the blueprint with two exceptions: Items 6 and L. Item L was specified as a "Taking Exams" item, but it clustered with the "Preparing for Exams" cluster. Item 6 again clustered with items belonging to the "Classroom Learning" content domain. The four-dimensional cluster analysis did not fully support the retention of a "Classroom Learning" cluster.


Table 5 Clustering Solution From Four-Dimensional INDSCAL Coordinates

Four of the items in this domain did cluster together (Items 4, B, E, and F); however, two of the items (C and O) clustered together with the "Textbook Learning" cluster. Thus, the cluster analysis stemming from the four-dimensional INDSCAL solution made finer distinctions between items corresponding to the same content area (e.g., "Classroom Learning" items) and items corresponding to different content areas ("Taking Exams" and "Preparing for Exams" items).

Discussion

The results indicate that the SMEs relied primarily on content characteristics of test items when making their similarity judgments. This finding is interesting considering that the SMEs were not informed of the content areas of the test and that they were not instructed to rate the items according to content characteristics. The task presented to the SMEs in this study might be effective in circumventing an expectancy bias that may occur when SMEs are informed of the content areas of the test blueprint or of item-content area specifications.

The results also illustrated the utility of MDS and cluster analysis for identifying groups of items perceived to be similar by the SMEs. The MDS configurations allow for visual inspection of item groupings, and the cluster analyses facilitated this visual inspection.


Figure 5
Stimulus Configuration From the Three-Dimensional Solution Illustrating "Revised" Content Area Specifications: Time Management/Study Habits (Diamonds), Classroom Learning (Flags), Textbook Learning (Balloons), Preparing for Exams (Squares), and Taking Exams (Cylinders)

It should be noted that, in this study, the purpose of the hierarchical cluster analysis was to facilitate visual inspection of the MDS configurations, not to provide an alternative inspection of the data. Because the clustering was performed on the MDS item coordinates, it is not surprising that the items within the resulting clusters were proximal in the MDS space. Had the cluster analyses been performed on the original similarity data (using the INDCLUS program designed for three-way data; see Carroll & Arabie, 1983), discrepancies between the MDS and clustering solutions could have been investigated.

It should also be noted that because the clustering was performed on the unweighted item coordinates, an assumption was made that the dimensions were equally important to the SMEs. Future analyses should consider weighting the dimensions according to their proportion of variance accounted for, or performing separate cluster analyses for each SME.

Inspection of a series of MDS and cluster analyses allows an investigator to determine the level of scrutiny to be imposed on the items. If the investigator wishes to have only highly homogeneous content areas, items that cluster as prescribed in the lower-dimensional space but do not cluster as prescribed in the higher-dimensional space may be rejected. An investigator can decide whether to flag these items for inspection (as the four-dimensional solution would suggest), consider them homogeneous (as the three-dimensional solution would suggest), or integrate both positions within a hierarchical model. Thus, the eclectic information obtained through joint MDS and cluster analyses illustrates Napior's (1972) contention that joint MDS/cluster analyses can detect the "more subtle and complex relations" among stimuli that would not be detected by cluster analysis alone (p. 165).
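The dimension-weighting suggestion above can be sketched in a few lines. Assuming the item coordinates and a set of dimension weights (for example, the variance proportions reported in Table 3) are available as arrays, each coordinate column could be rescaled before clustering. The weight values and variable names below are placeholders, not figures from the study.

# Illustrative only: rescale MDS item coordinates by dimension weights before
# clustering, so that dimensions accounting for more variance influence the
# cluster solution more strongly.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

coords = np.random.default_rng(1).normal(size=(30, 3))  # placeholder 30 x 3 MDS coordinates
weights = np.array([0.40, 0.25, 0.15])                   # placeholder variance proportions per dimension

# Scaling each dimension by the square root of its weight makes squared Euclidean
# distances between items proportional to weighted (INDSCAL-style) distances.
weighted_coords = coords * np.sqrt(weights)

clusters = fcluster(linkage(weighted_coords, method="average"), t=6, criterion="maxclust")
print(clusters)

The alternative mentioned in the text, separate cluster analyses for each SME, would simply repeat the clustering on coordinates rescaled by each judge's own weights.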


Benefits of the Procedure for Test Construction

In addition to portraying global content structure, the method presented here may also prove useful for item selection purposes. Items that emerge as outliers or do not cluster with other items in their prescribed content area can be flagged for removal or modification. The proposed method may also be useful in item sensitivity review. Judges who are members of concerned (e.g., minority) groups could be employed to judge the similarity of the items. The stimulus configurations of test items from these judges could be compared to the stimulus configuration of items derived from an original group of judges to determine if there are any discrepancies regarding individual items. Items that cluster differently between the two groups of judges may be flagged for further sensitivity review. INDSCAL analyses of the two or more groups of judges may also identify sensitive items and/or content areas.

The present method also may be useful in providing information regarding the appropriate number of items to include in a given content area. If content areas overlap, it may be due to an insufficient number of items in the content areas, rather than to truly overlapping content domains. In this way, the MDS and cluster analyses can provide information regarding the number of items necessary to adequately represent a content domain.

Limitations of the Procedure

Although the proposed procedure shows potential as a test construction tool, there are problems and limitations. One problem is that rating the similarity of test items becomes increasingly complex as test size increases. A 30-item test requires 435 item comparisons, and the number of comparisons grows quadratically with test length [n(n - 1)/2 pairs for an n-item test]. The larger the number of comparisons to be made, the greater the demand on the judges. This problem may be alleviated through a reduction in the number of stimulus comparisons (Spence, 1982, 1983), by dividing the item comparisons among groups of judges, or by increasing the time allotted for the judges to make their comparisons.

Another limitation of the current study was that only one sample of three SMEs was employed. Future research should employ larger groups of SMEs to determine if the dimensions and clusters are consistent across samples comprising different numbers of judges, and to cross-validate the results obtained in the different samples. Osterlind (1989) recommended that a minimum of four or five judges is necessary to evaluate a test of moderate size.

The quality of the SMEs employed is of crucial importance. For this method, or any method of evaluating test content, to be successful, it is imperative that the judges be knowledgeable of the specified content domain and that they be representative of the domain of all possible qualified judges.

A major limitation of the present study was that data on item-domain relevance were not gathered. Thus, although the present study assessed the content structure of the blueprint, it did not directly assess the relevance of the items to the content areas defined by the test blueprint. This limitation could be remedied by gathering both item similarity data and item relevance data, as described below.

Implications for Future Research

Although the results of the present study proved valuable in understanding the content structure of the test, they could be supplemented by other methods to provide additional information pertaining to the test's content representation. For example, the results from the present method could be compared with results from item analyses employing test response data. It would be interesting to identify the item-to-total score correlations or item-to-content area score correlations for those items that do not cluster as predicted. If these correlations are relatively small, it may support the removal of those items from the test. Napior's (1972) method of multidimensional item analysis would provide results that could be directly comparable to the data collected in the present study.

Davison (1985) describes advantages of using MDS to analyze test item intercorrelations. Such analyses are likely to supplement analyses of item similarity ratings.

The data gathered in the present study could also be subjected to a confirmatory MDS analysis in which the distances between the items are constrained according to their blueprint specifications. Borg and Lingoes (1980) describe a procedure in which a "pseudo-matrix" is constructed that represents the hypothesized item relations specified in the test blueprint. The distances resulting from the data could be constrained by the distances resulting from an MDS of the pseudo-data matrix. The degree to which the data fit the pseudo-data would then be taken as an index of the test's fit to its blueprint. This procedure would also allow for items to be deleted until a satisfactory fit of test to blueprint was obtained. Heiser and Meulman (1983) describe other methods of constrained MDS that could also be used to impose blueprint constraints on the item configurations.
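A crude illustration of the pseudo-matrix idea is sketched below. It simply derives a blueprint-implied dissimilarity matrix (0 when two items share a content area, 1 when they do not) and correlates its off-diagonal entries with a judge's observed dissimilarities. The full Borg and Lingoes (1980) procedure constrains the MDS solution itself rather than computing a correlation, and the blueprint assignments, placeholder data, and variable names here are illustrative assumptions only.

# Illustrative only: compare observed dissimilarities with a blueprint-implied
# "pseudo-matrix" (0 = items share a content area, 1 = items do not).
import numpy as np

rng = np.random.default_rng(2)
blueprint = np.repeat(np.arange(6), 5)                     # placeholder: 6 areas x 5 items each
observed = rng.integers(1, 6, size=(30, 30)).astype(float)
observed = np.triu(observed, 1) + np.triu(observed, 1).T   # placeholder observed dissimilarities

pseudo = (blueprint[:, None] != blueprint[None, :]).astype(float)

# Correlate the off-diagonal entries as a rough index of fit to the blueprint.
iu = np.triu_indices(30, k=1)
fit_index = np.corrcoef(observed[iu], pseudo[iu])[0, 1]
print(f"blueprint congruence (correlation): {fit_index:.2f}")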
Future applications of this procedure should also gather item-domain ratings from the SMEs to evaluate the relevance of the items to their perceived content areas. This could be accomplished by having the SMEs provide the similarity ratings, and then asking them to rate how strongly each item corresponds to each of the content areas specified in the test blueprint. These "item-relevance" ratings could be used in a multiple regression procedure in which the relevance data were regressed on the dimensions. If the items were relevant to their specified content areas and the MDS solution supported the test blueprint, then the content areas would help in interpretation of the dimensions.
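A minimal sketch of that regression step follows, assuming the item coordinates from the MDS solution and an items-by-content-areas matrix of relevance ratings are available; the arrays and names below are placeholders rather than data from the study.

# Illustrative only: regress each content area's item-relevance ratings on the
# MDS item coordinates to see which dimensions align with which content areas.
import numpy as np

rng = np.random.default_rng(3)
coords = rng.normal(size=(30, 3))            # placeholder 30 x 3 MDS item coordinates
relevance = rng.uniform(1, 5, size=(30, 6))  # placeholder ratings: 30 items x 6 content areas

X = np.column_stack([np.ones(30), coords])   # add an intercept column
betas, *_ = np.linalg.lstsq(X, relevance, rcond=None)

# betas[1:] holds one column of dimension weights per content area; large absolute
# weights suggest which dimension(s) a content area projects onto.
for area, b in enumerate(betas[1:].T, start=1):
    print(f"content area {area}: dimension weights {np.round(b, 2)}")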


References

Arabie, P., Carroll, J. D., & DeSarbo, W. J. (1987). Three-way scaling and clustering. Newbury Park, CA: Sage.
Borg, I., & Lingoes, J. C. (1980). A model and algorithm for multidimensional scaling with external constraints on the distances. Psychometrika, 45, 25-38.
Carroll, J. D., & Arabie, P. (1983). INDCLUS: An individual differences generalization of the ADCLUS model and the MAPCLUS algorithm. Psychometrika, 48, 157-169.
Carroll, J. D., & Chang, J. J. (1970). An analysis of individual differences in multidimensional scaling via an n-way generalization of "Eckart-Young" decomposition. Psychometrika, 35, 283-319.
Cattell, R. B. (1957). Personality and motivation structure and measurement. New York: Harcourt, Brace, and World.
Crocker, L. M., Miller, D., & Franks, E. A. (1989). Quantitative methods for assessing the fit between test and curriculum. Applied Measurement in Education, 2, 179-194.
Davison, M. L. (1983). Multidimensional scaling. New York: Wiley.
Davison, M. L. (1985). Multidimensional scaling versus components analysis of test intercorrelations. Psychological Bulletin, 97, 94-105.
Fitzpatrick, A. R. (1983). The meaning of content validity. Applied Psychological Measurement, 7, 3-13.
Green, S. B. (1983). Identifiability of spurious factors with linear factor analysis with binary items. Applied Psychological Measurement, 7, 139-147.
Guion, R. M. (1977). Content validity: The source of my discontent. Applied Psychological Measurement, 1, 1-10.
Guion, R. M. (1978). Scoring of content domain samples: The problem of fairness. Journal of Applied Psychology, 63, 499-506.
Hambleton, R. K. (1980). Test score validity and standard setting methods. In R. A. Berk (Ed.), Criterion-referenced measurement: The state of the art (pp. 80-123). Baltimore: Johns Hopkins University Press.
Hambleton, R. K. (1984). Validating the test score. In R. A. Berk (Ed.), A guide to criterion-referenced test construction (pp. 199-230). Baltimore: Johns Hopkins University Press.
Heiser, W. J., & Meulman, J. (1983). Constrained multidimensional scaling, including confirmation. Applied Psychological Measurement, 7, 381-404.
Johnson, S. C. (1967). Hierarchical clustering schemes. Psychometrika, 32, 241-254.
Kruskal, J. B. (1964). Nonmetric multidimensional scaling: A numerical method. Psychometrika, 29, 115-129.
Kruskal, J. B., & Wish, M. (1978). Multidimensional scaling. Newbury Park, CA: Sage.
Lawshe, C. H. (1975). A quantitative approach to content validity. Personnel Psychology, 28, 563-575.
Messick, S. (1958). The perception of social attitudes. Journal of Abnormal and Social Psychology, 52, 57-66.
Messick, S. (1975). The standard problem: Meaning and values in measurement and evaluation. American Psychologist, 30, 955-966.
Messick, S. (1980). Test validity and the ethics of assessment. American Psychologist, 35, 1012-1027.
Messick, S. (1989a). Meaning and values in test validation: The science and ethics of assessment. Educational Researcher, 18, 5-11.
Messick, S. (1989b). Validity. In R. Linn (Ed.), Educational measurement (3rd ed., pp. 13-103). Washington, DC: American Council on Education.
Morris, L. L., & Fitz-Gibbon, C. T. (1978). How to measure achievement. Beverly Hills, CA: Sage.
Napior, D. (1972). Nonmetric multidimensional techniques for summated ratings. In R. N. Shepard, A. K. Romney, & S. B. Nerlove (Eds.), Multidimensional scaling: Theory and applications in the behavioral sciences. Volume 1: Theory (pp. 157-178). New York: Seminar Press.
Oltman, P. K., Stricker, L. J., & Barrows, T. S. (1990). Analyzing test structure by multidimensional scaling. Journal of Applied Psychology, 75, 21-27.
Osterlind, S. J. (1989). Constructing test items. Norwell, MA: Academic Press.
Ramsay, J. O. (1980). Some small sample results for maximum likelihood estimation in multidimensional scaling. Psychometrika, 45, 139-144.
Ramsay, J. O. (1981). How to use MULTISCALE. In S. S. Schiffman, M. L. Reynolds, & F. W. Young (Eds.), Multidimensional scaling: Theory, methods, and applications (pp. 211-235). New York: Academic Press.
Ramsay, J. O. (1986). MULTISCALE-II manual [Computer program manual]. Montreal, Quebec: McGill University.
Schiffman, S. S., Reynolds, M. L., & Young, F. W. (1981). Multidimensional scaling: Theory, methods, and applications. New York: Academic Press.
Shepard, R. N. (1962). The analysis of proximities: Multidimensional scaling with an unknown distance function. Psychometrika, 27, 125-140.
Sireci, S. G. (1988). The SST: A test of study skills. Unpublished test, Fordham University, Bronx, NY.
Sokal, R., & Michener, C. D. (1958). A statistical method for evaluating systematic relationships. University of Kansas Scientific Bulletin, 38, 1409-1438.
Spence, I. (1982). Incomplete experimental designs for multidimensional scaling. In R. G. Golledge & J. N. Rayner (Eds.), Proximity and preference: Problems in the multidimensional analysis of large data sets (pp. 29-46). Minneapolis: University of Minnesota Press.
Spence, I. (1983). Monte carlo simulation studies. Applied Psychological Measurement, 7, 405-426.
Takane, Y., Young, F. W., & de Leeuw, J. (1977). Nonmetric individual differences multidimensional scaling: An alternating least squares method with optimal scaling features. Psychometrika, 42, 7-67.
Tenopyr, M. L. (1977). Content-construct confusion. Personnel Psychology, 30, 47-54.
Thorndike, R. L. (1982). Applied psychometrics. Boston: Houghton Mifflin.
Torgerson, W. S. (1958). Theory and methods of scaling. New York: Wiley.
Tucker, L. R. (1961). Factor analysis of relevance judgements: An approach to content validity. Paper presented at the Invitational Conference on Testing Problems, Princeton, NJ. [Reprinted in A. Anastasi (Ed.), Testing problems in perspective (1966, pp. 577-586). Washington, DC: American Council on Education.]
Wainer, H., Hurt, S., & Aiken, L. (1976). Rorschach revisited: A new look at an old test. Journal of Consulting and Clinical Psychology, 44, 390-399.
Wainer, H., & Kaye, K. (1978). Multidimensional scaling of concept learning in an introductory course. Journal of Educational Psychology, 66, 591-598.
Young, F. W., Takane, Y., & Lewyckyj, R. (1978). ALSCAL: A nonmetric multidimensional scaling program with several individual differences options. Behavioral Research Methods and Instrumentation, 10, 451-453.

Acknowledgments

The authors are indebted to the helpful comments and suggestions provided by two anonymous reviewers of an earlier version of this paper. The authors thank the following for their comments on the work as it progressed: Bruce Biskin, Herman Friedman, Barbara Helms, Janet F. Carlson, Soonmook Lee, Samuel Messick, Thanos Patelis, and Howard Wainer. Gratitude is also extended to James Ramsay for his helpful comments and for providing the MULTISCALE-II program. An earlier version of this paper was presented at the 1990 Annual Meeting of the Northeastern Educational Research Association, Ellenville NY.

Author's Address

Send requests for reprints or further information to Kurt F. Geisinger, 601 Culkin Hall, SUNY College, Oswego NY 13126, U.S.A.
