
Intelligence 36 (2008) 574–583

Investigating the ‘g’-saturation of various stratum-two factors using automatic item generation

Martin E. Arendasy ⁎, Andreas Hergovich, Markus Sommer

University of Vienna, Department of Psychology, Psychometric Technology Group, Austria

Received 9 December 2006; received in revised form 27 November 2007; accepted 27 November 2007
Available online 4 February 2008

Abstract

Even though researchers agree on the hierarchical structure of intelligence, there is considerable disagreement on the g-saturation of the lower stratum-two factors. In this article it is argued that the mixed evidence in the research literature can be at least partially attributed to the construct representation of the tests used to measure the stratum-two factors. In the study described here, two top-down approaches to automatic item generation were used to build construct representation directly into the items. This enabled a clearer substantive interpretation of the shared commonalities extracted by the various stratum-two factors and helped to rule out alternative, but mathematically equivalent, structural models based on independent empirical evidence. This approach was used to investigate the g-saturation of five stratum-two factors using a sample of 240 male and female respondents. The results support the assumption that fluid intelligence exhibits the highest g-saturation.

© 2007 Elsevier Inc. All rights reserved.

Keywords: CHC models; g-saturation; Construct representation; Automated item generation

⁎ Corresponding author. University of Vienna, Department of Psychology, Liebiggasse 5, A-1010 Vienna, Austria. Tel.: +43 1 4277 47848. E-mail address: [email protected] (M.E. Arendasy).

0160-2896/$ - see front matter © 2007 Elsevier Inc. All rights reserved. doi:10.1016/j.intell.2007.11.005

1. Theoretical background

Despite a growing consensus on a hierarchical structure of intelligence with g at the apex and several broader stratum-two factors, researchers still disagree on the g-saturation of the broader stratum-two factors. Recently some researchers (cf. Gignac, 2006; Robinson, 1999) have argued that general intelligence is more closely associated with crystallized intelligence (Gc). This claim is based on the results of exploratory and confirmatory factor analyses of Wechsler-like test batteries. For instance, Gignac (2006) investigated the g-saturation of the subtests included in different Wechsler-like test batteries. The author used a single-trait-correlated uniqueness confirmatory factor analysis and specified a g-factor in addition to correlated residuals between subtests assumed to measure either (1) crystallized intelligence (Gc), (2) visual processing (Gv), or (3) fluid intelligence (Gf) in order to control for their shared commonalities when estimating their g-saturation. The results indicated that the crystallized intelligence (Gc) subtests had the highest average factor loadings on the g-factor. This is in contrast to the results obtained by Roberts, Goff, Anjoul, Kyllonen, Pallier and Stankov (2000) as well as Gustafsson (1984, 2002; Gustafsson & Balke, 1993; Undheim & Gustafsson, 1987), who reported that fluid intelligence (Gf) was virtually indistinguishable from the higher-order g-factor in their studies.

The picture is further complicated by results obtained by Bickley, Keith and Wolfe (1995), who investigated the factorial structure of 16 subtests of the revised Woodcock–Johnson battery (WJ-R) using a hierarchical confirmatory factor analysis. These researchers reported an almost equal g-saturation of fluid intelligence (Gf: .88), quantitative reasoning (Gq: .86) and crystallized intelligence (Gc: .87). This suggests that all three stratum-two factors are equal with regard to their g-saturation. This conclusion was further supported by the finding that the model fit decreased significantly once the standardized factor loadings of the three stratum-two factors on ‘g’ were set to 1. These results were also replicated by Carroll (2003), who used a more extended set of variables taken from the WJ-R sample.

Taken together, the literature provides rather mixed evidence with regard to the g-factor saturation of the stratum-two factors fluid intelligence (Gf), crystallized intelligence (Gc) and quantitative reasoning (Gq). In the following section we discuss several reasons for this mixed evidence.

2. Reasons for the mixed evidence on the g-factor saturation

The inconsistencies in the results on the g-saturation of the stratum-two factors can be attributed to differences between the studies in: (1) the stratum-two factors and (2) the homogeneity of the sample investigated, as well as (3) the construct representation (Embretson, 1983) of the individual subtests. In the following discussion we focus on the latter and outline how it affects the results obtained with confirmatory factor analyses.

In several studies (e.g. Flanagan & McGrew, 1998; Johnson & Bouchard, 2005; McGrew, 1997) some subtests have been observed to load on more than one stratum-two factor. This has been taken as evidence that some subtests are far from pure measures of the intended stratum-two factor and would be best described as mixed measures. This claim is in line with research on the cognitive processes respondents use to solve such subtests (e.g. Arendasy & Sommer, 2005; Embretson & Gorin, 2001; Meo, Roberts & Marucci, 2007). However, the use of mixed measures turns the substantive interpretation of the extracted factors into a difficult enterprise and might even lead to biased results (cf. Ashton & Lee, 2006; Fischer, 1974).

Since factor analyses do not establish the meaning of the shared commonalities captured by each factor, researchers need to resort to supplementary empirical evidence in order to come up with a substantive interpretation of the results obtained. Usually this empirical evidence is derived from prior factor analyses conducted by other researchers or expert ratings of the cognitive demands of the individual subtests (cf. McGrew, 1997). Neither of these two approaches is without difficulties. When resorting to prior factor-analytic research it must be borne in mind that the assignment of a subtest to a certain stratum-two factor may change depending on the other subtests included in the analysis (cf. McGrew, 1997).

Expert ratings, on the other hand, depend largely on the expertise of the raters and involve a certain amount of subjectivity in comparison to other methodological approaches. Furthermore, expert–novice research indicates that experts sometimes either disagree on the cognitive processes respondents use to solve a given item set, or even fail to accurately identify them. This phenomenon is commonly referred to as the experts' blind spot (cf. Nathan & Petrosino, 2003). This problem is highlighted in Ashton and Lee's (2006) criticism of the model specifications chosen by Gignac (2006). Using the same data set and method of analysis, the authors specified alternative correlated uniquenesses based on their own task analysis and demonstrated that their model not only fitted the data equally well but was in fact psychometrically identical to the model specified by Gignac (2006). However, the model specified by Ashton and Lee (2006) led to substantively different results with regard to the g-saturation of the individual subtests. Their results indicated that the non-crystallized subtests had a higher average factor loading on the g-factor than the crystallized subtests.

Alternatively, one could circumvent the problem associated with the construct representation (Embretson, 1983) of the individual subtests by building the construct representation directly into the items. However, this requires a systematic theory-driven item construction approach that seeks to maximize the construct-related variance in the item difficulties and simultaneously minimizes unwanted variance arising from non-construct-related factors. In the following section we will thus outline such a theory-driven item construction approach, using the current debate on the g-saturation of the broader stratum-two factors of the Cattell–Horn–Carroll model (CHC model) outlined above. However, it should be noted that the approach could also be applied to research on alternative models of intelligence in order to circumvent problems arising from issues of construct representation within the framework of these models.

3. Automatic Item Generation as a tool for building construct representation into the item construction process

In the last few years Automatic Item Generation (AIG) has been introduced as a new item construction technology in applied psychometrics (Irvine, 2002). According to the various top-down approaches to automatic item generation, the construction process of the item generator starts with the definition of the latent trait to be measured. In the present study we used global definitions of the stratum-one factors as our starting point (cf. McGrew, 1997). In a second step a literature review was conducted to identify the cognitive processes, solution strategies and knowledge structures that characterize the latent trait. This literature review took account of research on individual differences as well as studies in cognitive psychology in order to identify item features that affect the cognitive processes and solution strategies respondents use to solve the given items. These item features constitute the radicals (Irvine, 2002) that are incorporated into the generative framework of an automatic item generator (cf. Arendasy, 2005; Arendasy & Sommer, 2005, 2007). These radicals are assumed to systematically affect certain properties of an item – e.g. its difficulty – due to their effect on the cognitive processes used by respondents to solve the items.

Arendasy et al. (Arendasy, 2005; Arendasy & Sommer, 2005, 2007) have recently pointed out that the specification of radicals (Irvine, 2002) alone does not suffice to ensure (1) a high psychometric quality (e.g. fit of the Rasch Model: Rasch, 1960/1980) and (2) a satisfying level of construct representation (Embretson, 1983). The authors argue that items must be generated in a manner that triggers solution strategies inherently related to the latent trait to be measured and simultaneously reduces interfering variance arising from non-construct-related cognitive processes. This latter aim is accomplished through the use of a set of ‘functional constraints’ that must be integrated into the quality control framework (cf. Arendasy, 2005; Arendasy & Sommer, 2005, 2007) of the item generator. The ‘functional constraints’ are derived from studies in cognitive psychology and psychometric experiments specifically designed to investigate their effect on the psychometric quality of automatically generated items (e.g. Arendasy & Sommer, 2005; Arendasy, Sommer, Gittler & Hergovich, 2006).

Once the item generator has been constructed and the necessity and sufficiency of the ‘functional constraints’ have been demonstrated in a series of psychometric experiments, the validity of the model can be evaluated in terms of its ability to predict the difficulties of the items generated on the basis of the specified radicals, using either multiple regression analyses or the linear logistic test model (LLTM: Fischer, 1973). If the radicals (Irvine, 2002) derived from the theoretical model contribute at least R² = .80 to the item difficulty parameters, the construct representation (Embretson, 1983) of the subtest can be assumed. In this sense top-down approaches such as the automatic min-max approach (Arendasy, 2005; Arendasy & Sommer, 2007) outlined above serve to build construct representation directly into the item construction process by combining research in cognitive psychology, individual differences and applied psychometrics.

4. Formulation of the problem

Based on these theoretical and methodological considerations, the aim of the present study was the investigation of the g-saturation of the stratum-two factors fluid intelligence (Gf), crystallized intelligence (Gc), quantitative reasoning (Gq) and visual processing (Gv) using a test battery consisting of ten subtests constructed by means of top-down approaches to automatic item generation, in order to demonstrate the utility of this approach in research on the structure of human intelligence. In this test battery each of the five stratum-two factors is represented by two subtests. We chose this approach in order to ensure the construct representation (Embretson, 1983) of the individual subtests and to allow for a less ambiguous model specification and interpretation of the commonalities captured by each of the five stratum-two factors.

5. Method

5.1. Procedure

The study was carried out during a university seminar held by the first author. Eight of the ten subtests were administered via TestWeb, which enables web-based psychological assessment. The remaining two subtests were administered using the Vienna Test System, which enables computerized psychological assessment. Participants were tested in groups of one to three respondents. The tests were presented in one session lasting around 2.5 h, including two breaks of about 10 min each. Each group of respondents was tested by a student test administrator, who was personally present during that time. The respondents were informed that they were going to work on different ability tests in three consecutive sessions and that there would be two breaks between the sessions. All respondents worked on the test battery in the same sequence: (1) Visual Short-term Memory, (2) Verbal Short-term Memory, (3) Arithmetic

Table 1
Item generators, names of the tests, task descriptions, size of the item pool (K), size of the calibration sample (N), radicals used in the generative framework and their joint explanative power regarding the item difficulty parameters (R²), psychometric benchmarks used to evaluate the potential of the radicals and functional constraints, and references to prior studies.

EstiGen – CE: Computational Estimation. Task: estimate the result of complex arithmetic problems. K = 80; N = 746; R² = .88; benchmark: RM / LLTM. Radicals: number sense, required level of estimation accuracy, arithmetic complexity, rounding difficulty. Reference: Arendasy and Sommer (2007).

NGen – ANF: Arithmetic Fluency. Task: find the proper arithmetic operations to complete an equation. K = 80; N = 890; R² = .90; benchmark: RM / LLTM. Radicals: number of arithmetic operators, number of different operators, arithmetic proficiency, identifiability of arithmetic operations. Reference: Arendasy and Sommer (2007).

GeomGen – GEOM: Figural Inductive Reasoning. Task: find the rules governing the figural matrices and complete the matrices by applying these rules. K = 180; N = 6817; R² = .92; benchmark: RM / LLTM. Radicals: number of rules, rule complexity, number of elements, degree of … . References: Arendasy (2005), Arendasy and Sommer (2005).

ZafoGen – NID: Numerical Inductive Reasoning. Task: find the rules governing the number series and continue the series by applying these rules. K = 80; N = 646; R² = .88; benchmark: RM / LLTM. Radicals: length of the number series, rule complexity, number of rules, periodicity. Reference: Arendasy, Hornke, Sommer, Häusler, Wagner-Menghin, Gittler, Bognar, and Wenzl (2007).

WGen – SYN: Synonyms. Task: choose a German word from a list of four answer alternatives that has the same meaning as the given source word. K = 77; N = 672; R² = .89; benchmark: RM / MRA. Radicals: word frequency of the source word, word frequency of the target word, word frequency of the distractors, semantic similarity of the source word and the answer alternatives. Reference: Arendasy et al. (2007).

WFGen – VF: Verbal Fluency. Task: ensample character strings into a German word. K = 76; N = 553; R² = .88; benchmark: RM / MRA. Radicals: word frequency, number of characters to be ensampled, relative amount of common character combinations, dissimilarity between the character string and the target word. Reference: Arendasy et al. (2007).

– VI: Visual Short-term Memory. Task: remember and recall the location of icons presented on a schematic city map. K = 114; N = 560; R² = .88; benchmark: RM / MRA. Radicals: number of icons to be remembered, size of the area enclosed by a circumference around the icons, poor vs. good gestalt of the icon pattern. Reference: Hornke (2002).

– VE: Verbal Short-term Memory. Task: remember and recall station names that are presented sequentially along a bus route shown on a schematic city map. K = 114; N = 560; R² = .88; benchmark: RM / MRA. Radicals: number of bus stops to be remembered, presentation rate of the individual bus stops, level of concreteness of the bus stop names. Reference: Hornke (2002).

– 3DC: Spatial Comprehension. Task: mentally rotate the distractor cubes and compare them to the target cube. K = 142; N = 1298; R² = .80; benchmark: RM / LLTM. Radicals: number of rotations, pattern complexity, position of the correct answer alternative, similarity of the distractors. Reference: Gittler (1990).

EsGen – ELT: Endless Loops. Task: determine from which side a picture of a given endless loop has been taken. K = 160; N = 4196; R² = .88; benchmark: RM / LLTM. Radicals: visual complexity of the loops, 3D-complexity, global dominance effect, complete overlap of loop elements. References: Arendasy (2000, 2005), Gittler and Arendasy (2003).

Note: K = number of items in item pool. N = size of calibration sample. R² = proportion of variance of the item difficulty parameters explained by the radicals. Psychometric benchmark (quality control framework): RM = Rasch Model (Rasch, 1960/1980), LLTM = Linear Logistic Test Model (Fischer, 1973), MRA = Multiple Linear Regression Analysis.
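The RM / LLTM benchmarks and R² values in Table 1 rest on a simple quantitative relationship: under the LLTM, an item's Rasch difficulty is a weighted sum of its radical levels, and the construct-representation criterion requires the radicals to reproduce at least 80% of the variance in the calibrated difficulty parameters. The sketch below illustrates both steps with simulated data; the design matrix, weights and random seed are invented for illustration and do not reproduce any generator from Table 1.

```python
import numpy as np

def rasch_p(theta, b):
    """Rasch model: probability of solving an item of difficulty b at ability theta."""
    return 1.0 / (1.0 + np.exp(-(theta - b)))

rng = np.random.default_rng(1)

# Hypothetical calibration: 80 items described by 4 radicals (cf. a Table 1 row).
Q = rng.integers(0, 4, size=(80, 4)).astype(float)   # radical levels per item
eta = np.array([0.8, 0.5, 0.3, 0.2])                 # basic parameters (invented)
b = Q @ eta + rng.normal(0.0, 0.3, size=80)          # LLTM prediction + residual

# Regress the calibrated difficulties on the radical design matrix (with intercept) ...
X = np.column_stack([np.ones(len(b)), Q])
beta, *_ = np.linalg.lstsq(X, b, rcond=None)
r2 = 1.0 - ((b - X @ beta) ** 2).sum() / ((b - b.mean()) ** 2).sum()

# ... and apply the R² >= .80 criterion for construct representation.
construct_representation_ok = r2 >= 0.80
```

With noise-free difficulties the regression would recover the basic parameters exactly (R² = 1); the .80 threshold tolerates some residual, non-construct-related variance.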

Fluency, (4) Synonyms, (5) Figural Inductive Reasoning, (6) Computational Estimation, (7) Verbal Fluency, (8) Endless Loops, (9) Numerical Inductive Reasoning and (10) Spatial Comprehension. All the subtests were presented as computerized adaptive tests (CAT).

5.2. Measures

The test battery used in this study consisted of ten subtests. Two subtests were used to measure each of the five stratum-two factors quantitative reasoning (Gq), fluid intelligence (Gf), crystallized intelligence (Gc), short-term memory (Gsar) and visual processing (Gv). The selection of the subtests was based on the following criteria: (1) the item construction process of each subtest should be based on a theoretical model specifying the cognitive processes needed to solve the items, (2) the items should have been constructed using a top-down approach to automatic item generation, (3) in prior studies the radicals contributed at least R² = .80 to the item difficulty parameters, and (4) the radicals of subtests assumed to load on different factors should not overlap (for details see Table 1). In using these criteria we ensured that the construct representation (Embretson, 1983) of each subtest could be assumed based on independent psychometric evidence obtained in prior validation studies by means of multiple regression analyses and/or the linear logistic test model (Fischer, 1973).

5.3. Sample

The sample consisted of 126 (52.5%) male and 114 (47.5%) female respondents aged between 16 and 63 years (M = 37.11; SD = 12.13). The median age was 36 years. A total of 25 (10.4%) respondents had completed 9 years of schooling but no vocational training, while 91 (37.9%) respondents had completed vocational education. Ninety-nine (41.3%) respondents had a high school leaving certificate qualifying them for university entrance, and 25 (10.4%) respondents had graduated from university or college. Neither the gender distribution (χ² [1] = .04; p = .831) nor the age distribution (χ² [9] = 16.45; p = .058) differed significantly from that expected on the basis of the Austrian census of 2001. Furthermore, the distribution of educational level (χ² [4] = 9.44; p = .051) was also in line with expectations based on the Austrian census of 2001. The sample can thus be regarded as representative within the age range investigated.
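The representativeness checks in Section 5.3 are χ² goodness-of-fit tests of the observed sample composition against census-based expected frequencies. A minimal sketch using the reported gender counts; the census proportions are assumed here (the paper does not list them), so the resulting statistic will not reproduce the reported χ² [1] = .04.

```python
from scipy.stats import chisquare

observed = [126, 114]                # male / female respondents (Section 5.3)

# Assumed benchmark proportions -- a placeholder 50/50 split, not the actual
# 2001 Austrian census figures used in the paper.
n = sum(observed)
expected = [0.5 * n, 0.5 * n]

stat, p = chisquare(f_obs=observed, f_exp=expected)
representative = p > 0.05            # retain H0: composition matches the benchmark
```

The same call, with nine age bands or five education levels and their census shares, reproduces the age and education checks.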

Table 2
Descriptive statistics for all measures (above) and fit statistics for the CFA (below)

Measure   Mean    SD     Skew    Kurtosis   rtt
CE        −.12    0.70   −.04     1.14      .80 a
ANF       −.50    1.64    .38      .26      .80 a
GEOM      −.89    1.11    .35      .33      .80 a
NID       −.08    1.06    .95     1.05      .80 a
SYN        .69    1.11   −.95     1.61      .80 a
VF        −.04    1.34   −.19     −.02      .80 a
VI        −.02    1.97    .14     −.63      .80 a
VE        −.37    1.06   −.24     −.94      .80 a
3DC       −.76    1.18    .16    −1.06      .80 a
ELT        .62    1.19    .37      .94      .80 a

Model                     χ²       df   p       χ²/df   CFI   RMSEA   SRMR
CHC model                 45.96    31   .041    1.48    .98   .045    .039
g-factor model            92.51    35   <.001   2.64    .93   .083    .055
Content factor model      57.01    32   .004    1.78    .96   .060    .050
G=Gf model                47.27    32   .040    1.48    .97   .045    .039
G=Gq model                61.15    32   .001    1.91    .96   .062    .047
G=Gc model                92.40    32   <.001   2.89    .93   .089    .108
G=Gv model               165.59    32   <.001   5.18    .84   .132    .109
G→Gf=Gc=Gq=Gv model      158.81    33   <.001   4.81    .85   .126    .118

Note: ANF: Arithmetic Fluency; CE: Computational Estimation; GEOM: Figural Inductive Reasoning; NID: Numerical Inductive Reasoning; SYN: Synonyms; VF: Verbal Fluency; VI: Visual Short-term Memory; VE: Verbal Short-term Memory; ELT: Endless Loops; 3DC: Spatial Comprehension.
a Target reliability (CAT).
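The evaluation of the models in Table 2 combines two ingredients: the global fit cut-offs used in the paper (non-significant χ², χ²/df < 2, RMSEA ≤ .06, SRMR ≤ .06, CFI ≥ .95) and, for nested models, the Δχ² (likelihood-ratio) test. A sketch of both, fed with values taken from Table 2; the function names are ours, not the authors'.

```python
from scipy.stats import chi2

def meets_fit_criteria(p, chi2_df, rmsea, srmr, cfi):
    """Global fit cut-offs (cf. Hu & Bentler, 1999, among others)."""
    return {"chi2 non-significant": p > .05,
            "chi2/df < 2": chi2_df < 2,
            "RMSEA <= .06": rmsea <= .06,
            "SRMR <= .06": srmr <= .06,
            "CFI >= .95": cfi >= .95}

def delta_chi2(chi2_restricted, df_restricted, chi2_full, df_full):
    """Delta-chi-square test of a restricted model nested in the unrestricted model."""
    d, ddf = chi2_restricted - chi2_full, df_restricted - df_full
    return d, ddf, chi2.sf(d, ddf)

# Table 2 values for the CHC model and the pure g-factor model.
chc = meets_fit_criteria(p=.041, chi2_df=1.48, rmsea=.045, srmr=.039, cfi=.98)
g_only = meets_fit_criteria(p=.001, chi2_df=2.64, rmsea=.083, srmr=.055, cfi=.93)

# G=Gf model (chi2 = 47.27, df = 32) against the unrestricted CHC model (45.96, df = 31).
d, ddf, p = delta_chi2(47.27, 32, 45.96, 31)
gf_restriction_tenable = p > .05    # fixing the Gf -> g loading to 1 is defensible
```

The computed Δχ² [1] = 1.31 with p ≈ .25 matches the value reported in the Results up to rounding.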

5.4. Tested models

Based on the CHC model, we assumed that the observed correlations between these ten subtests can be explained by five stratum-two factors and a higher-order g-factor. According to this theoretical model, the subtests ‘Arithmetic Fluency’ and ‘Computational Estimation’ would load on Gq, while the subtests ‘Figural Inductive Reasoning’ and ‘Numerical Inductive Reasoning’ were assumed to load on Gf. ‘Synonyms’ and ‘Verbal Fluency’ were assumed to load on Gc, while the subtests ‘Verbal Short-term Memory’ and ‘Visual Short-term Memory’ would load on Gsar. The subtests ‘Endless Loops’ and ‘Spatial Comprehension’ were assumed to load on Gv. Furthermore, the five stratum-two factors were assumed to load on a higher-order g-factor (cf. Bickley et al., 1995; Carroll, 1993, 2003). The variance of the higher-order g-factor was set to 1 for identification and scaling purposes. In the following sections this model is referred to as the CHC model.

The CHC model was contrasted with two alternative models. In the first alternative model we assumed that the correlations between the ten subtests can be explained by a single g-factor. This model is thus referred to as the pure g-factor model.

In the second alternative model we assumed that the correlations between the ten subtests can be explained by three stratum-two method factors and a higher-order g-factor. In this model we assumed that the subtests ‘Arithmetic Fluency’, ‘Computational Estimation’ and ‘Numerical Inductive Reasoning’ load on a numerical intelligence factor (N), while the subtests ‘Synonyms’, ‘Verbal Fluency’ and ‘Verbal Short-term Memory’ load on a verbal intelligence factor (V), and the subtests ‘Endless Loops’, ‘Spatial Comprehension’, ‘Figural Inductive Reasoning’ and ‘Visual Short-term Memory’ load on a figural intelligence factor (F). Furthermore, these three content factors were assumed to load on a higher-order g-factor. Even though this model is not supported by prior research on the construct representation of our subtests, we specified it to rule out the possibility that the correlations between our ten subtests can be explained equally well by an alternative model. This model is referred to as the content factor model.

We also tested several restricted CHC models in order to investigate the g-saturation of the stratum-two factors Gf, Gc, Gq and Gv. Since each restricted CHC model is nested within the CHC model, Δχ² tests were used to compare the fit of each of these models to the original unrestricted CHC model.

In the first restricted CHC model we assumed that the standardized factor loading of Gf on the higher-order g-factor equals 1. This model is referred to as the G=Gf model. The second restricted CHC model tested the

Fig. 1. Standardized solution of the CHC model. G: General Intelligence; Gq: Quantitative Reasoning; Gc: Crystallized Intelligence; Gf: Fluid Intelligence; Gv: Visual Processing; Gsar: Short-term Memory; ANF: Arithmetic Fluency; CE: Computational Estimation; GEOM: Figural Inductive Reasoning; NID: Numerical Inductive Reasoning; SYN: Synonyms; VF: Verbal Fluency; VI: Visual Short-term Memory; VE: Verbal Short-term Memory; ELT: Endless Loops; 3DC: Spatial Comprehension.
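In a higher-order model such as the one depicted in Fig. 1, a subtest's implied standardized loading on g is the product of the loadings along its path, and a stratum-two factor's g-saturation is its squared loading on g. A small sketch: the Gf → g loading of .98 is taken from the Results below, while the subtest loading is a hypothetical value, not read off Fig. 1.

```python
def loading_on_g(subtest_on_factor, factor_on_g):
    """Implied standardized loading of a subtest on g via its stratum-two factor."""
    return subtest_on_factor * factor_on_g

gf_on_g = 0.98       # standardized Gf -> g loading reported in the Results
geom_on_gf = 0.80    # hypothetical GEOM -> Gf loading (illustrative only)

geom_on_g = loading_on_g(geom_on_gf, gf_on_g)   # implied subtest loading on g
gf_g_saturation = gf_on_g ** 2                  # share of Gf variance explained by g
```

With a loading of .98, g accounts for roughly 96% of the variance of Gf, which is why the text describes Gf as nearly indistinguishable from g.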

assumption that the standardized factor loading of Gq on the higher-order g-factor can be set to 1 (G=Gq model), while the same assumption was made with regard to the standardized factor loading of Gc on the higher-order g-factor in the third restricted CHC model (G=Gc model). The fourth model tested the assumption that Gv is virtually indistinguishable from the higher-order g-factor by setting the standardized factor loading of Gv on the g-factor to 1 (G=Gv model). Finally, we tested the assumption that the stratum-two factors Gf, Gc, Gq and Gv do not differ in their g-saturation by constraining the factor loadings of Gf, Gc, Gq and Gv on the higher-order g-factor to be equal. This model is referred to in the following sections as the G→Gf=Gc=Gq=Gv model.

6. Results

The data were analysed using AMOS 5.0 (Arbuckle, 2003). The results are presented in two sections. The first section presents descriptive statistics for the individual subtests. The second section describes the results of confirmatory factor analyses conducted using Maximum Likelihood estimation to calculate the parameters of the postulated models.

The means, standard deviations and reliability estimates of the subtests are presented in Table 2. Note that all measures meet the standard criteria for univariate normality, as indicated by the fact that the skew and kurtosis values for the individual measures are ≤ 2 (cf. Kline, 1998). In addition, the assumption of multivariate normality was evaluated using the Mardia test, resulting in a multivariate kurtosis of 1.46. Thus the assumption of multivariate normality can be retained. The global fit of the models was assessed using the following cut-off values for the global fit indices (cf. Browne & Cudeck, 1993; Byrne, 2001; Hu & Bentler, 1999; Marsh, Hau, & Wen, 2004): non-significant χ²-test, χ²/df < 2, RMSEA ≤ .06, SRMR ≤ .06 and CFI ≥ .95. The fit statistics of all the models tested are presented in Table 2 (below).

As Table 2 shows, the CHC model provides a good fit to the data. Only the χ²-test turned out to be significant. However, this can be partially attributed to the sample size used in this study (cf. Byrne, 2001). An inspection of the standardized residuals further revealed that none of them was statistically significant at α = .05; this provides further evidence of the fit of the CHC model to our data.

In the case of the g-factor model, none of the fit indices meets the standard criteria for an adequate model fit. The superiority of the CHC model is confirmed by the improvement in the χ² statistic over the pure g-factor model (Δχ² [4] = 46.55; p < .001). Furthermore, the g-factor model also fits significantly less well (Δχ² [3] = 35.50; p < .001) than the content factor model. However, an inspection of the standardized factor loadings of the ten subtests on their corresponding content factors revealed that their magnitude was generally low and in several cases even non-significant (p > .05). On the basis of this result the content factor model was rejected.

In a further step we examined the statistical significance of the factor loadings of the CHC model, since it provided the best model for our data. The factor loadings of the individual subtests on their corresponding stratum-two factors were all moderate to high and reached statistical significance at α = .01. The standardized factor loadings are depicted in Fig. 1.

The results also indicated that the g-factor is nearly indistinguishable from Gf. The standardized factor loading of .98 is close to 1.00, and the error variance of Gf did not reach the significance level (p > .05) and was thus excluded. It was thus hardly surprising that the G=Gf model not only exhibited a good fit to the data but fitted the data no worse than the unrestricted CHC model (Δχ² [1] = 1.31; p = .250).

With regard to the G=Gq model, the fit statistics indicate a mediocre model fit, and a Δχ² test (Δχ² [1] = 15.19; p < .001) revealed that the G=Gq model fits significantly worse than the unrestricted CHC model. Similar results were obtained for the G=Gc model and the G=Gv model. Both models failed to meet the standard criteria of an adequate or a good model fit and also fitted the data significantly worse than the unrestricted CHC model (G=Gc: Δχ² [1] = 46.44; p < .001; G=Gv: Δχ² [1] = 119.63; p < .001). With regard to the G→Gf=Gc=Gq=Gv model, the fit statistics and the Δχ² test (Δχ² [2] = 112.85; p < .001) contradict the assumption that the four stratum-two factors contribute equally to the higher-order g-factor.

Taken together, the results obtained in this study support the position that Gf is virtually indistinguishable from the g-factor, while this is not the case for the remaining stratum-two factors.

7. Discussion

Even though there is a consensus on the hierarchical structure of human intelligence, researchers still disagree on the g-saturation of various stratum-two factors due to mixed evidence in the literature (cf. Ashton & Lee, 2006; Bickley et al., 1995; Carroll, 1993, 2003; Gignac, 2006; Gustafsson, 1984, 2002; Gustafsson & Balke, 1993; Roberts et al., 2000; Robinson, 1999; Undheim & Gustafsson, 1987). In this article we have argued that the mixed evidence can be attributed to (1) differences in the stratum-two factors covered in the test batteries investigated and (2) the construct representation (Embretson, 1983) of the individual subtests. In previous studies several subtests have turned out to be mixed measures of two or more stratum-two factors, which cause problems in the substantive interpretation of these factors. We have argued that these problems are closely linked to the construct representation (Embretson, 1983) of the subtests. Since a confirmatory factor analysis does not establish the meaning of the extracted factors, some sort of independent evidence is needed in order to come up with a substantive interpretation of the shared variance captured by each stratum-two factor.

In this article we have suggested that top-down approaches to automatic item generation, such as the automatic min-max approach (cf. Arendasy, 2005; Arendasy & Sommer, 2005, 2007), can be used to provide this kind of independent evidence. In our own study we used the results of prior studies regarding the construct representation (Embretson, 1983) of the individual subtests to select two subtests to measure each of five of the eleven stratum-two factors of the Cattell–Horn–Carroll model. Even though the application of the top-down approach to automatic item generation presented in this article is by no means restricted to the Cattell–Horn–Carroll model, we chose this particular model of intelligence due to the disagreement on the g-saturation of the broader stratum-two factors outlined in the current research literature, in order to demonstrate the usefulness of our approach.

In order to investigate whether the Cattell–Horn–Carroll model indeed provides the best description of our data, we also investigated the fit of a pure g-factor model and a content factor model. The results indicate that the Cattell–Horn–Carroll model provides a good fit to the data, while both alternative models fail to meet the standard criteria for an acceptable model fit. It should be noted that we were unable to test all plausible alternative models of human intelligence, such as the one proposed by Vernon (1964) and associates, due to the lack of specific subtests in our test battery associated with these models. The conclusion that the Cattell–Horn–Carroll model provides the best fit is thus tentative. Given that the Cattell–Horn–Carroll model indeed provides the best fit to the data, the comparison of different restricted CHC models suggests that Gf is virtually identical to the higher-order g-factor. However, the identity of the Gf-factor and psychometric ‘g’ is sometimes taken as evidence of a misspecification of the Cattell–Horn–Carroll model (cf. Johnson & Bouchard, 2005). Proponents of this view argue that the correlations between the various subtests might be better explained by an alternative model. The results of the model comparisons carried out in our study indicate that this explanation can be ruled out at least for the two other plausible models we compared against the CHC model in this study.

The finding that Gf and psychometric ‘g’ are virtually identical nevertheless poses the question of whether Gf is misclassified as a stratum-two factor and instead represents the essence of psychometric g. This question is particularly interesting in the light of current revisions of Cattell's investment theory. Cattell (1963) originally assumed that people utilize their fluid intelligence in order to acquire proficiency in various other mental abilities. However, the limited empirical examinations of the investment theory failed to provide strong evidence for a directional relation between Gf and Gc (cf. Blair, 2006). Moreover, these studies indicated that Gf and Gc are not only distinct early in the lifespan but their developmental course is also different (cf. Horn & Noll, 1997; McArdle, Ferrer-Caja, Hamagami, & Woodcock, 2002). Thus more recent revisions of the investment theory assume that the influence of Gf on Gc is mediated by learning (Schweizer & Koch, 2002) and that fluid intelligence and learning share a common cognitive basis – namely capacity or executive functioning and processing speed. This revision turned out to be rather successful in predicting knowledge gain. Also note that the theoretical model bears some similarities to the one outlined by Anderson (2001). The theoretical model proposed by Schweizer and Koch (2002) is particularly interesting with regard to the radicals (Irvine, 2002) of our two measures of fluid intelligence. The radicals ‘number of elements / length of the number series’ and ‘number of rules’ are commonly assumed to represent aspects of capacity, mental load and … modelling of the item difficulties in traditional subtests to provide independent evidence that can be used to distinguish between psychometrically identical but substantively different CFA models. Such an analysis would – for instance – be useful in the case of the models Gignac (2006) and Ashton and Lee (2006) proposed. Theory-based approaches to automatic item generation such as the cognitive design system approach (cf. Embretson & Gorin, 2001) and the automatic min-max approach (cf. Arendasy, 2005) even recognize this possibility as a means to derive a set of radicals in constructing an automatic item generator or in deciding between psychometrically identical but substantively different models on the structure of human intelligence.

References

Anderson, M. (2001). Annotation: Conceptions of intelligence. Journal of Child Psychology and Psychiatry, 42, 287−298.
Arbuckle, J. L. (2003). Amos 5.0 update to the Amos user's guide. Chicago, IL: Smallwaters Corporation.
Arendasy, M. (2000). Psychometrischer Vergleich computergestützter Vorgabeformen bei Raumvorstellungsaufgaben: Stereoskopisch-dreidimensionale und herkömmlich-zweidimensionale Darbietung [Psychometric comparison of computer-based presentation modes of spatial ability tasks: Stereoscopic-3D and traditional-2D presentation modes].
PhD Dissertation: University of Vienna. executive functioning while the radical ‘rule complexity’ Arendasy, M. (2005). Automatic generation of Rasch-calibrated items: is assumed to be linked to processing speed (for an over- Figural matrices test GEOM and endless loops Test Ec. Interna- view: Arendasy & Sommer, 2005). Furthermore, expert- tional Journal of Testing, 5, 197−224. novice research (cf. Bransford, Brown & Cocking, 2000) Arendasy, M., & Sommer, M. (2005). The effect of different types of perceptual manipulations on the dimensionality of automatically has indicated that respondents who scored high on fluid generated figural matrices. Intelligence, 33, 307−324. intelligence measures posed more questions while learn- Arendasy, M., & Sommer, M. (2007). Psychometrische Technologie: ing various subject matter and also exhibited a higher Automatische Itemgenerierung eines neuen Aufgabentyps zur knowledge gain. This might point to the influence of Messung der Numerischen Flexibilität [Psychometric Technology: abstraction ability, which constitutes the third set of rad- Automated item generation of a new item type for the measure- ment of Arithmetic Fluency]. Diagnostica, 53(3), 119−130. icals (Irvine, 2002) in our fluid intelligence measures. One Arendasy, M., Sommer, M., Gittler, G., & Hergovich, A. (2006). would thus expect Gf to exhibit the highest level of g- Automatic generation of quantitative reasoning items: A pilot saturation in cross-sectional studies of the structure of study. Individual Differences, 27,2−14. human intelligence. Results such as those presented by Arendasy, M., Hornke, L. F., Sommer, M., Häusler, J., Wagner-Menghin, Gignac (2006) and Robinson (1999) are therefore not in M., Gittler, G., Bognar, B., & Wenzl, M. (2007). Manual Intelligence- Structure-Battery (INSBAT) Mödling: SCHUHFRIED GmbH. line with these theoretical assumptions. 
Ashton,M.C.,&Lee,K.(2006).“Minimally biased” g-loadings of Furthermore, the approach we used in this study also crystallized and non-crystallized abilities. Intelligence, 34,469−477. has some interesting implications for previous studies. Bickley, P. G., Keith, T. Z., & Wolfe, L. M. (1995). The three-stratum Even though automatic item generation (AIG) methods theory of cognitive abilities: Test of the structure of intelligence − can help to solve several problems which arise with regard across the life span. Intelligence, 20, 309 328. Blair, C. (2006). How similar are fluid and general to (1) the substantive interpretation of the extracted factors intelligence? A developmental neuroscience perspective on fluid and (2) the choice between competing models we do not cognition as an aspect of human cognitive ability. Behavioural and intend to that one approach (CFA models using tests Brain , 29, 109−160. constructed by AIG) should entirely replace the other one Browne, M. W., & Cudeck, R. (1993). Alternative ways of assessing (CFA analyses of traditional test batteries) since both model fit. In K. A. Bollen & J. S. Long (Eds.), Testing structural equation models (pp. 445−455). Newbury Park: Sage. have their own strength and weakness. In fact the two Bransford, J. D., Brown, A. L., & Cocking, R. R. (2000). How people approaches do not necessarily exclude each other. For learn. Brain, mind, experience, and school. Washington D.C.: instance one could use methods of psychometrically National Academy Press. M.E. Arendasy et al. / Intelligence 36 (2008) 574–583 583
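The model comparisons reported above are nested-model χ² difference tests: the difference in χ² between a restricted model and the unrestricted CHC model is referred to a χ² distribution whose degrees of freedom equal the difference in model degrees of freedom. The following sketch, which is not part of the original analyses, reconstructs the reported p-values from the Δχ² values alone using the closed-form tail probabilities available for df = 1 and df = 2 (the only cases needed here):

```python
import math

def chi2_sf(x: float, df: int) -> float:
    """Upper-tail probability P(X > x) for a chi-square variate.

    Closed forms: for df = 1, P = erfc(sqrt(x/2)); for df = 2 the
    distribution is exponential with mean 2, so P = exp(-x/2).
    These two cases cover every difference test reported above.
    """
    if df == 1:
        return math.erfc(math.sqrt(x / 2.0))
    if df == 2:
        return math.exp(-x / 2.0)
    raise ValueError("sketch handles df = 1 or 2 only")

# Delta-chi-square values and df differences as reported in the text.
comparisons = {
    "G=Gf": (1.31, 1),
    "G=Gq": (15.19, 1),
    "G=Gc": (46.44, 1),
    "G=Gv": (119.63, 1),
    "G->Gf=Gc=Gq=Gv": (112.85, 2),
}

for model, (delta_chi2, delta_df) in comparisons.items():
    p = chi2_sf(delta_chi2, delta_df)
    verdict = "ns" if p >= .05 else "significantly worse than CHC"
    print(f"{model}: delta_chi2({delta_df}) = {delta_chi2}, p = {p:.3f} ({verdict})")
```

Applied to the reported values, the G=Gf comparison yields p ≈ .25 (non-significant; the small discrepancy from the reported p = .250 reflects rounding of the Δχ² value itself), while all other comparisons fall well below .001, matching the pattern described above.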

References

Anderson, M. (2001). Annotation: Conceptions of intelligence. Journal of Child Psychology and Psychiatry, 42, 287–298.
Arbuckle, J. L. (2003). Amos 5.0 update to the Amos user's guide. Chicago, IL: Smallwaters Corporation.
Arendasy, M. (2000). Psychometrischer Vergleich computergestützter Vorgabeformen bei Raumvorstellungsaufgaben: Stereoskopisch-dreidimensionale und herkömmlich-zweidimensionale Darbietung [Psychometric comparison of computer-based presentation modes of spatial ability tasks: Stereoscopic-3D and traditional-2D presentation modes]. PhD dissertation, University of Vienna.
Arendasy, M. (2005). Automatic generation of Rasch-calibrated items: Figural matrices test GEOM and endless loops test EC. International Journal of Testing, 5, 197–224.
Arendasy, M., & Sommer, M. (2005). The effect of different types of perceptual manipulations on the dimensionality of automatically generated figural matrices. Intelligence, 33, 307–324.
Arendasy, M., & Sommer, M. (2007). Psychometrische Technologie: Automatische Itemgenerierung eines neuen Aufgabentyps zur Messung der Numerischen Flexibilität [Psychometric technology: Automated item generation of a new item type for the measurement of arithmetic fluency]. Diagnostica, 53(3), 119–130.
Arendasy, M., Sommer, M., Gittler, G., & Hergovich, A. (2006). Automatic generation of quantitative reasoning items: A pilot study. Journal of Individual Differences, 27, 2–14.
Arendasy, M., Hornke, L. F., Sommer, M., Häusler, J., Wagner-Menghin, M., Gittler, G., Bognar, B., & Wenzl, M. (2007). Manual Intelligence-Structure-Battery (INSBAT). Mödling: SCHUHFRIED GmbH.
Ashton, M. C., & Lee, K. (2006). "Minimally biased" g-loadings of crystallized and non-crystallized abilities. Intelligence, 34, 469–477.
Bickley, P. G., Keith, T. Z., & Wolfe, L. M. (1995). The three-stratum theory of cognitive abilities: Test of the structure of intelligence across the life span. Intelligence, 20, 309–328.
Blair, C. (2006). How similar are fluid cognition and general intelligence? A developmental neuroscience perspective on fluid cognition as an aspect of human cognitive ability. Behavioral and Brain Sciences, 29, 109–160.
Bransford, J. D., Brown, A. L., & Cocking, R. R. (2000). How people learn: Brain, mind, experience, and school. Washington, DC: National Academy Press.
Browne, M. W., & Cudeck, R. (1993). Alternative ways of assessing model fit. In K. A. Bollen & J. S. Long (Eds.), Testing structural equation models (pp. 445–455). Newbury Park: Sage.
Byrne, B. M. (2001). Structural equation modeling with AMOS: Basic concepts, applications, and programming. London: Lawrence Erlbaum.
Carroll, J. B. (1993). Human cognitive abilities: A survey of factor-analytic studies. New York: Cambridge University Press.
Carroll, J. B. (2003). The higher-stratum structure of cognitive abilities: Current evidence supports g and about ten broad factors. In H. Nyborg (Ed.), The scientific study of general intelligence: Tribute to Arthur R. Jensen (pp. 5–22). San Diego: Pergamon.
Cattell, R. B. (1963). Theory of fluid and crystallized intelligence: A critical experiment. Journal of Educational Psychology, 54, 1–22.
Embretson, S. E. (1983). Construct validity: Construct representation versus nomothetic span. Psychological Bulletin, 93, 179–197.
Embretson, S. E., & Gorin, J. (2001). Improving construct validity with cognitive psychology principles. Journal of Educational Measurement, 38, 343–368.
Fischer, G. H. (1973). The linear logistic test model as an instrument in educational research. Acta Psychologica, 36, 207–220.
Fischer, G. H. (1974). Einführung in die Theorie psychologischer Tests [Introduction to the theory of psychological tests]. Bern: Huber.
Flanagan, D. P., & McGrew, K. S. (1998). Interpreting intelligence tests from contemporary Gf–Gc theory: Joint confirmatory factor analysis of the WJ-R and the KAIT in a non-white sample. Journal of School Psychology, 36, 151–182.
Gignac, G. E. (2006). Evaluating the subtest 'g' saturation level via the single trait-correlated uniqueness (STCU) SEM approach: Evidence in favor of crystallized subtests as the best indicators of 'g'. Intelligence, 34, 29–46.
Gittler, G. (1990). Dreidimensionaler Würfeltest—Ein Rasch-skalierter Test zur Messung des räumlichen Vorstellungsvermögens. Theoretische Grundlagen und Manual [Three-dimensional cubes test—A Rasch-calibrated test for the measurement of spatial ability: Theoretical background and manual]. Weinheim: Beltz.
Gittler, G., & Arendasy, M. (2003). Endlosschleifen: Psychometrische Grundlagen des Aufgabentyps EP [Endless loops: Psychometric foundations of the item type EP]. Diagnostica, 49, 164–175.
Gustafsson, J. E. (1984). A unifying model for the structure of intellectual abilities. Intelligence, 8, 179–203.
Gustafsson, J. E. (2002). Measurement from a hierarchical point of view. In J. I. Braun, D. N. Jackson, & D. E. Wiley (Eds.), The role of constructs in psychological and educational measurement (pp. 73–95). Mahwah: Erlbaum.
Gustafsson, J. E., & Balke, G. (1993). General and specific abilities as predictors of school achievement. Multivariate Behavioral Research, 28, 407–434.
Horn, J. L., & Noll, J. (1997). Human cognitive capabilities: Gf–Gc theory. In D. P. Flanagan, J. L. Genshaft, & P. L. Harrison (Eds.), Contemporary intellectual assessment: Theories, tests, and issues (pp. 49–91). New York: The Guilford Press.
Hornke, L. F. (2002). Item-generation models for higher order cognitive functions. In S. H. Irvine & P. C. Kyllonen (Eds.), Item generation for test development (pp. 159–178). Mahwah, NJ: Lawrence Erlbaum Associates.
Hu, L. T., & Bentler, P. M. (1999). Cutoff criteria for fit indices in covariance structure analysis: Conventional criteria versus new alternatives. Structural Equation Modeling: A Multidisciplinary Journal, 6, 1–55.
Irvine, S. H. (2002). The foundations of item generation for mass testing. In S. H. Irvine & P. C. Kyllonen (Eds.), Item generation for test development (pp. 3–34). Mahwah, NJ: Lawrence Erlbaum Associates.
Johnson, W., & Bouchard, T. J., Jr. (2005). The structure of human intelligence: It is verbal, perceptual, and image rotation (VPR), not fluid and crystallized. Intelligence, 33, 393–416.
Kline, R. B. (1998). Principles and practice of structural equation modeling. New York: Guilford Press.
Marsh, H. W., Hau, K. T., & Wen, Z. (2004). In search of golden rules: Comment on hypothesis-testing approaches to setting cutoff values for fit indexes and dangers in overgeneralising Hu and Bentler's (1999) findings. Structural Equation Modeling, 11, 320–341.
McArdle, J. J., Ferrer-Caja, E., Hamagami, F., & Woodcock, R. W. (2002). Comparative longitudinal structural analyses of the growth and decline of multiple intellectual abilities over the life span. Developmental Psychology, 38, 115–142.
McGrew, K. S. (1997). Analysis of the major intelligence batteries according to a proposed comprehensive Gf–Gc framework. In D. P. Flanagan, J. L. Genshaft, & P. L. Harrison (Eds.), Contemporary intellectual assessment: Theories, tests, and issues (pp. 151–179). New York: Guilford.
Meo, M., Roberts, M. J., & Marucci, F. S. (2007). Element salience as a predictor of item difficulty for the Raven's Progressive Matrices. Intelligence, 35, 359–368. doi:10.1016/j.intell.2006.10.001
Nathan, M. J., & Petrosino, A. (2003). Expert blind spot among preservice teachers. American Educational Research Journal, 40, 905–928.
Rasch, G. (1980). Probabilistic models for some intelligence and attainment tests. Chicago: University of Chicago Press. (Original work published 1960)
Roberts, R. D., Goff, G. N., Anjoul, F., Kyllonen, P. C., Pallier, G., & Stankov, L. (2000). The Armed Services Vocational Aptitude Battery (ASVAB)—Little more than acculturated learning (Gc)!? Learning and Individual Differences, 12, 81–103.
Robinson, D. L. (1999). The 'IQ' factor: Implications for intelligence theory and measurement. Personality and Individual Differences, 27, 715–735.
Schweizer, K., & Koch, W. (2002). A revision of Cattell's investment theory: Cognitive properties influence learning. Learning and Individual Differences, 13, 57–82.
Undheim, J. O., & Gustafsson, J. E. (1987). The hierarchical organisation of cognitive abilities: Restoring general intelligence through the use of linear structural relations (LISREL). Multivariate Behavioral Research, 22, 149–171.
Vernon, P. E. (1964). The structure of human abilities. London: Methuen.