Florida State University Libraries
Electronic Theses, Treatises and Dissertations
The Graduate School

2014

A Comparison of Three Approaches to Confidence Interval Estimation for Coefficient Omega

Jie Xu
FLORIDA STATE UNIVERSITY
COLLEGE OF EDUCATION
A COMPARISON OF THREE APPROACHES TO CONFIDENCE
INTERVAL ESTIMATION FOR COEFFICIENT OMEGA
By
JIE XU
A Thesis submitted to the Department of Educational Psychology and Learning Systems in partial fulfillment of the requirements for the degree of Master of Science
Degree Awarded: Fall Semester, 2014

Jie Xu defended this thesis on August 11, 2014. The members of the supervisory committee were:
Yanyun Yang Professor Directing Thesis
Betsy Becker Committee Member
Russell Almond Committee Member
The Graduate School has verified and approved the above-named committee members, and certifies that the thesis has been approved in accordance with university requirements.
TABLE OF CONTENTS
List of Tables ...... v
List of Figures ...... vi
Abstract ...... vii
INTRODUCTION ...... 1
LITERATURE REVIEW ...... 6
Reliability ...... 6
Reliability Estimation from CTT ...... 6
Test-retest ...... 7
Alternative-forms ...... 7
Internal Consistency ...... 8
Coefficient α ...... 9
Violation of the homogeneity assumption ...... 9
Violation of the essential tau equivalence assumption ...... 10
Violation of the uncorrelated errors assumption ...... 11
Misinterpretation of α as an index of homogeneity ...... 12
Reliability Estimation Based on CFA within SEM Framework ...... 12
Confirmatory Factor Analysis ...... 13
An illustration of the one-factor CFA Model ...... 13
The link between CTT and CFA for reliability estimation ...... 15
Coefficient ω ...... 16
Definition and formula for computing coefficient ω ...... 16
Relationship between coefficient ω and coefficient α ...... 17
Confidence Interval ...... 18
Null Hypothesis Significance Testing ...... 18
Interval Estimation ...... 19
Three Approaches to Interval Estimation for Coefficient ω ...... 21
Wald Method ...... 21
Wald CI for an individual parameter ...... 21
Wald CI for a function of parameters ...... 22
Wald CI for coefficient ω based on the delta method ...... 25
Likelihood Method ...... 26
Likelihood ratio statistic ...... 26
Likelihood-based CI computed via the likelihood function of a single parameter ...... 27
Likelihood-based CI computed via likelihood function of multiple parameters ...... 28
Likelihood-based CI for coefficient ω ...... 30
Bias-corrected and Accelerated Bootstrap Method ...... 30
A brief introduction to bootstrap technique ...... 31
Construction of bias-corrected and accelerated bootstrap CI ...... 32
BCa bootstrap CI for coefficient ω ...... 34
A Comparison among Three Interval Estimation Methods ...... 34
Statistical Test ...... 34
Statistics Applied ...... 35
Assumption of Multivariate Normality ...... 36
Symmetry ...... 36
Sample Size ...... 37
Variance/Invariance to Parameter Transformation ...... 37
Previous Research ...... 38
The Rationale and Purpose of the Proposed Study ...... 40
METHODS ...... 42
Design Factors ...... 43
Data Generation Procedure ...... 44
Data Analysis ...... 46
RESULTS ...... 51
Non-convergence ...... 51
Coverage Probability ...... 51
Interval Width ...... 55
Relative Bias of Point Estimates ...... 57
DISCUSSION AND CONCLUSIONS ...... 61
Major Findings from Simulation Study ...... 62
Discussion and Suggestions ...... 64
Limitations and Future Research ...... 65
APPENDICES ...... 68
A. R-CODES FOR DATA GENERATION ...... 68
B. CODES FOR CONFIDENCE INTERVAL ESTIMATION IN R ...... 70
C. NONNORMAL DISTRIBUTIONS OF OBSERVED SCORES ...... 71
D. EFFECTS OF DESIGN FACTORS ON COVERAGE PROBABILITY ...... 77
E. EFFECTS OF DESIGN FACTORS ON INTERVAL WIDTH ...... 80
F. MEANS OF CI BOUNDS AND POPULATION COEFFICIENT OMEGA ...... 91
REFERENCES ...... 101
BIOGRAPHICAL SKETCH ...... 108
LIST OF TABLES
1 Transformation Coefficients Corresponding to Three Types of Distribution ...... 45
2 Skewness and Kurtosis of Observed Item Scores ...... 47
3 Population Coefficient ω under Different Conditions ...... 49
4 Coverage Probability for Each Condition ...... 52
5 Results from Logistic Regression and ANOVA ...... 54
6 Interval Widths for 6 Items ...... 58
7 Interval Widths for 12 Items ...... 59
8 Relative Bias of Point Estimates for Coefficient ω ...... 60
LIST OF FIGURES
1 An Example of a One-factor CFA Model ...... 14
2 An Illustration of Confidence Interval Computed via the Log-likelihood Function ...... 28
3 Distributions of Observed Scores with Sk=2 and K=7 for Factor Scores ...... 71
4 Distributions of Observed Scores with Sk=3 and K=21 for Factor Scores ...... 74
5 Coverage Probabilities for Normally Distributed Data ...... 77
6 Coverage Probabilities for Moderately Nonnormally Distributed Data ...... 78
7 Coverage Probabilities for Seriously Nonnormally Distributed Data ...... 79
8 Interval Widths on Different Levels of Sample Size ...... 80
9 Interval Widths on Different Levels of Item Count ...... 83
10 Interval Widths on Different Levels of Factor Loading ...... 88
11 Means of CI Bounds and Population Coefficient Omega for 6 Items ...... 91
12 Means of CI Bounds and Population Coefficient Omega for 12 Items ...... 96
ABSTRACT
Coefficient ω was introduced by McDonald (1978) as a reliability coefficient of composite scores for the congeneric model. Interval estimation for coefficient ω provides a range of plausible values that is likely to capture the population reliability of composite scores. The Wald method, the likelihood method, and the bias-corrected and accelerated (BCa) bootstrap method are three methods for constructing confidence intervals for coefficient ω. Only a very limited number of studies evaluating these three methods can be found in the literature, and no simulation study has been conducted to evaluate their performance in interval construction for coefficient ω. In the current simulation study, I assessed these three methods by comparing their empirical performance in interval estimation for coefficient ω. Four factors were included in the simulation design: sample size, number of items, factor loading, and degree of nonnormality.
Two thousand datasets were generated in R 2.15.0 (R Core Team, 2012) for each condition. For each generated dataset, the three approaches (i.e., the Wald method, likelihood method, and BCa bootstrap method) were used to construct a 95% confidence interval for coefficient ω in R 2.15.0. The results showed that when the data were multivariate normally distributed, the three methods performed equally well, with coverage probabilities very close to the prespecified .95 confidence level. When the data were nonnormally distributed, coverage probabilities decreased and interval widths became wider for all three methods as the degree of nonnormality increased. In general, when the data departed from multivariate normality, the BCa bootstrap method performed better than the other two methods, with relatively higher coverage probabilities, while the Wald and likelihood methods were comparable and yielded narrower intervals than the BCa bootstrap method.
INTRODUCTION
Measurement consistency is an essential and unavoidable issue in the social and behavioral
sciences (Raykov, 2000). Reliability is an important index of measurement consistency. Scale
reliability originated from the framework of Classical Test Theory (CTT), and is defined as the
ratio of two variances: true score variance over observed score variance (Crocker & Algina,
1986; Lord & Novick, 1968). Several methods have been proposed for reliability estimation within CTT: test-retest, alternative-forms, and internal consistency (e.g., split-half and coefficient alpha). Coefficient alpha is used most commonly by researchers in applied studies.
However, coefficient alpha is an unbiased estimate of reliability only when the underlying assumptions are met (e.g., Green & Hershberger, 2000). The assumptions are homogeneity, essential tau equivalence, and uncorrelated errors. Researchers have argued against the use of coefficient alpha given its untenable assumptions in practice. For example, Green and Yang
(2009) discouraged the use of coefficient alpha to assess reliability, because the assumptions of coefficient alpha are unlikely to hold. The bias due to violation of these assumptions is often substantial and can’t be ignored. In addition, coefficient alpha is frequently misinterpreted as an index of homogeneity (Cortina, 1993; Miller, 1995; Sijtsma, 2009).
The true score model in CTT can be expressed using a confirmatory factor analysis
(CFA) model, also known as a measurement model. Scale reliability can then be assessed by proposing an interpretable CFA model given good model-data fit (Green & Yang, 2009). In a one-factor CFA model, a link between CTT and CFA models as regards reliability estimation is that the true score in CTT can be expressed as the factor score weighted by the factor loading in the CFA model. In the one-factor CFA model, factor loadings and error variances are model parameters (assuming the variance of the factor is fixed to one for identification purposes). The
reliability coefficient can be computed based on the parameter estimates from the CFA model.
The reliability estimate based on the one-factor, so-called congeneric CFA model, is known as
coefficient ω, which is defined as a ratio of composite true-score variance to the total score
variance of a scale (McDonald, 1978). Coefficient ω is viewed as a generalization of coefficient
alpha in reliability estimation of homogeneous measurements (Kelley & Cheng, 2012).
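The definition just given can be sketched numerically. Assuming a one-factor congeneric model with uncorrelated errors and the factor variance fixed at one, coefficient ω is the squared sum of the loadings over that quantity plus the summed error variances. The loadings and error variances below are hypothetical, chosen only for illustration (Python is used here, although the thesis itself works in R):

```python
# Sketch of coefficient omega for a one-factor congeneric model, computed from
# hypothetical loadings and error variances (factor variance fixed at 1,
# uncorrelated errors assumed).

def coefficient_omega(loadings, error_variances):
    """(sum of loadings)^2 over composite variance: McDonald's (1978) omega."""
    lam_sum = sum(loadings)
    return lam_sum ** 2 / (lam_sum ** 2 + sum(error_variances))

loadings = [0.7] * 6                 # hypothetical standardized loadings
errors = [1 - 0.7 ** 2] * 6          # unit item variances: error = 1 - loading^2
print(round(coefficient_omega(loadings, errors), 3))  # -> 0.852
```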
A reliability coefficient computed from a sample is a point estimate of the population
reliability, and it is sample dependent. How to determine the accuracy of reliability estimates has
received increasing attention in the literature. Testing of statistical significance (Neyman &
Pearson, 1933) is the most commonly used method to determine the accuracy of the point
estimate. Based on this method, a point estimate is judged as either significantly or non-
significantly different from a value specified by the null hypothesis (e.g., zero) in the population,
by comparing an obtained p value to a predetermined significance level (e.g., .05). In spite of the
popularity of null hypothesis significance testing, the appropriateness of its use is still a matter of
debate (e.g., Chow, 1988; Cohen, 1994; Nickerson, 2000; Schmidt, 1996; Wainer, 1999).
Interval estimation (Neyman, 1935, 1937) is proposed as a more informative and preferable
alternative to null hypothesis significance testing. A confidence level (often denoted as 1 − α,
where α is the Type I error rate) tells how likely the interval is to include the population
parameter (Kelley, 2007). The confidence interval (CI) is commonly reported at the .95
confidence level (i.e., Type I error rate of .05), which means that 95 out of 100 times under
repeated sampling the true population value of a parameter falls in the constructed CI. However,
empirical studies reporting confidence intervals along with corresponding parameter estimates
are still limited for two reasons (Cheung, 2009b; Steiger & Fouladi, 1997): (1) statistical
packages for interval estimation are not available; and (2) methods for CI construction on
different statistics and psychometric indices lack development. Interval estimation for coefficient ω also lacks development and investigation (Padilla & Divers, 2013).
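The repeated-sampling interpretation of a 95% CI described above can be illustrated with a small simulation. This sketch uses the simplest possible case, a CI for a normal mean with known variance; the sample size and replication count are arbitrary illustrative choices:

```python
# Illustration of the repeated-sampling meaning of a 95% CI: estimate the
# coverage probability of the known-sigma CI for a normal mean by simulation.
import random

random.seed(1)
mu, sigma, n, reps = 0.0, 1.0, 50, 2000
z = 1.96                                  # .975 standard normal quantile
covered = 0
for _ in range(reps):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    mean = sum(sample) / n
    se = sigma / n ** 0.5                 # known sigma, for simplicity
    covered += (mean - z * se <= mu <= mean + z * se)

print(covered / reps)                     # close to the nominal .95
```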
This study reviews three methods for estimating confidence intervals for coefficient ω
within the Structural Equation Modeling (SEM) framework; they are the Wald method, the likelihood method, and the bias-corrected and accelerated bootstrap method. The Wald CI is constructed based on the standard error estimate of the parameter of interest (e.g., Cheung, 2007,
2009a, 2009b; Raykov, 2002). To construct a Wald CI for coefficient ω, which is a function of
multiple parameters, the delta method is applied. The delta method is an analytic method widely
used to approximate the asymptotic standard errors for parameters that are functions of a set of
parameters (Casella & Berger, 2002; Cheung, 2009b; Raykov, 2002; Raykov & Marcoulides,
2004; Oehlert, 1992). The application of this method is based on the linear approximation of
smooth parametric functions using the Taylor expansion of the function (Raykov & Marcoulides,
2004), under the assumption of multivariate normality of variables (Ogasawara, 1999). The
likelihood-based CI is developed from the asymptotic chi-square distribution of the likelihood
ratio test (Cheung, 2009b; Venzon & Moolgavkar, 1988). If the likelihood function has multiple
parameters, the profile likelihood method (Venzon & Moolgavkar, 1988) can be used to
construct the likelihood-based CI. The third method is the bias-corrected and accelerated
bootstrap method. This method has a distinct advantage over the other bootstrap methods;
that is, it improves interval estimates by taking asymmetry, bias, and nonconstant variance into
consideration (Efron, 1987; Kelley & Cheng, 2012).
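As a toy illustration of the delta method mentioned above (not the full omega case, which requires a fitted CFA model), the sketch below approximates the standard error of a smooth function of a sample mean via a first-order Taylor expansion and checks it against a Monte Carlo estimate. All numeric settings are arbitrary:

```python
# Toy illustration of the delta method (not the omega case itself): approximate
# the standard error of g(sample mean) by |g'(mu)| * SE(mean), then compare
# with a Monte Carlo estimate. All numeric settings are arbitrary.
import math
import random

random.seed(2)
mu, sigma, n = 1.0, 0.5, 100
g = math.exp          # a smooth function of the estimator
g_prime = math.exp    # its derivative

# First-order (delta method) approximation of SE[g(mean)]
se_delta = abs(g_prime(mu)) * sigma / math.sqrt(n)

# Monte Carlo check: empirical SD of g(mean) over many replications
reps = 3000
vals = []
for _ in range(reps):
    m = sum(random.gauss(mu, sigma) for _ in range(n)) / n
    vals.append(g(m))
grand = sum(vals) / reps
se_mc = math.sqrt(sum((v - grand) ** 2 for v in vals) / (reps - 1))

print(round(se_delta, 4), round(se_mc, 4))  # the two should agree closely
```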
Similarities and differences exist among the Wald, the likelihood, and the BCa bootstrap
methods. (1) The Wald method is based on the asymptotically normally distributed Wald statistic
(Cheung, 2009b; Enders, 2010), and the likelihood method is based on the likelihood ratio
statistic, which asymptotically follows a chi-square distribution (Venzon & Moolgavkar, 1988).
However, the BCa bootstrap method does not require parametric statistical testing, so normality
of variables and homoscedasticity of error scores are not assumed. (2) Both the Wald and likelihood
methods assume multivariate normality of variables, while the BCa bootstrap method, as a
nonparametric method, doesn’t require this assumption. (3) CIs constructed with the likelihood
and BCa bootstrap methods are asymmetric, while CIs constructed with the Wald method are
symmetric (Cheung, 2007, 2009b). An issue with symmetric CIs is that when the parameter
estimate is near the limits, the estimated CIs may include some values that are outside the
meaningful boundaries of the parameter (Neale & Miller, 1997). (4) Both the Wald and
likelihood methods require a large sample size to accurately estimate reliability, while the BCa
bootstrap method has a relatively relaxed sample size requirement; even so, when the sample size
is smaller than 100, the performance of the bootstrap method is poor (Nevitt & Hancock, 2001).
(5) CIs constructed using the likelihood and BCa bootstrap methods are invariant to
transformations on model parameters, while CIs constructed through the Wald method are
variant to transformation (Neale & Miller, 1997).
In the current literature, only a very limited number of studies have compared the performance of
these methods. Cheung (2007) compared the empirical performances of four methods to
construct CIs (the Wald, percentile bootstrap, bias-corrected bootstrap, and likelihood methods)
on indirect effects (e.g., mediating effects of variable A on variable C through variable B)
through Monte Carlo studies. Cheung (2009b) compared the likelihood-based CI and Wald CI
for several psychometric indices and statistics, such as the correlation coefficient, indirect effect,
and reliability estimate, with numerical examples. He also conducted a Monte Carlo study on
Pearson correlation CIs in the same study (Cheung, 2009b). In another study (Cheung, 2009a),
he compared six Wald CIs, three bootstrap CIs, a likelihood-based CI, and the PRODCLIN CI
(i.e., CIs constructed using a PRODCLIN program with the R interface) on the standardized indirect effect through a simulation study. Ye and Wen (2011) conducted a simulation study to compare a bootstrap method, the delta method, and the method of using the outputs (i.e., point estimate, standard error) directly from the LISREL program to construct CIs for coefficient ω. Kelley and
Cheng (2012) used an empirical example to examine the performance of the Wald, percentile bootstrap, and BCa bootstrap methods for CI construction for coefficient ω. Padilla and Divers
(2013) conducted a simulation study to assess the performance of the normal theory bootstrap
CI, percentile-based CI, and the BCa bootstrap CI for coefficient ω.
Based on my literature review, there is a gap in the literature on the performance of the three interval estimation methods (the Wald method, likelihood method, and BCa bootstrap method) for coefficient ω. No simulation study has been conducted to evaluate the performance of the three approaches to interval construction for coefficient ω. Given that numerical examples are mainly used for illustrative purposes (Yung & Bentler, 1996), I chose to conduct a simulation study to assess these three methods under different conditions. Specifically, four factors were manipulated: sample size, number of items, factor loading, and level of nonnormality. The purpose of the simulation study is to compare these three interval estimation methods for coefficient ω and to help applied researchers choose an appropriate method for obtaining interval estimates of reliability.
LITERATURE REVIEW
This chapter presents a literature review for the proposed study. First, reliability and the
estimation methods within the framework of classical test theory (CTT) and confirmatory factor
analysis (CFA) are introduced, respectively. Three approaches to constructing confidence
intervals for coefficient ω are then reviewed and compared. Last, the rationale behind the study
and the purpose of the study are provided.
Reliability
Measurement consistency is an essential and inevitable issue in the social and behavioral
sciences (Raykov, 2000). As an important index of measurement consistency, reliability is one
of the main concerns in these disciplines. Reliability refers to “the degree to which individuals’
deviation scores, or z-scores, remain relatively consistent over repeated administrations of the
same test or alternative test forms” (Crocker & Algina, 1986, p. 105). Reliability is a
characteristic of test scores for a group of examinees, not a characteristic of a test (Feldt &
Brennan, 1989; Wilkinson & The Task Force on Statistical Inference, 1999).
The concept of reliability originated from the framework of classical test theory (CTT).
In CTT, an observed test score (X) consists of two parts, a true score (T) and an error score (E).
The variance of X can be expressed in terms of the variances of T and E, as σ_X² = σ_T² + σ_E², under
the assumption that there is no correlation between true score and error score. The reliability of
observed test scores is defined as the ratio of true score variance to observed score variance, and
is expressed as ρ_XX′ = σ_T² / σ_X² (Crocker & Algina, 1986; Lord & Novick, 1968).
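A minimal numeric illustration of this definition, using hypothetical variance components:

```python
# Numeric illustration of the CTT definition: reliability is the ratio of
# true-score variance to observed-score variance (values are hypothetical).
var_true = 8.0                        # sigma_T^2
var_error = 2.0                       # sigma_E^2, uncorrelated with true scores
var_observed = var_true + var_error   # sigma_X^2 = sigma_T^2 + sigma_E^2
reliability = var_true / var_observed
print(reliability)                    # -> 0.8
```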
Reliability Estimation from CTT
Several methods have been proposed for scale reliability estimation within CTT: test-retest, alternative-forms, and internal consistency (e.g., split-half and coefficient alpha). The use
of each method requires several assumptions. The assumptions underlying applications of the different methods, and the effects of violating specific assumptions on reliability estimation, are discussed in detail in this section.
Test-retest
For the test-retest method, a group of examinees take the same test twice at two different points in time. The correlation between the two sets of test scores is used to estimate the reliability of the test scores. A critical issue with the test-retest method is how to determine the elapsed time between the two administrations so that the time period is “long enough to allow effects of memory or practice to fade but not so long as to allow maturational or historical changes to occur in the examinees” (Crocker & Algina, 1986, p. 133). The test-retest method requires that error scores associated with the first test and the retest are uncorrelated. The two error scores are also required to be independent from true scores of the two test administrations.
Because the correlation between errors of the first test and of the retest tends to be positive due to possible memory effects, the estimated reliability using the test-retest method is likely to be greater than the true reliability (Miller, 1995).
Alternative-forms
The alternative-forms method follows the same procedures as the test-retest method, except that an alternative form is used at the second administration. The correlation between the two sets of observed scores is the estimated reliability coefficient for the test scores. Compared to the test-retest method, the assumption of uncorrelated errors is less likely to be violated for the alternative-forms method. However, the alternative-forms method is not an optimal choice for several reasons.
For example, the time and monetary cost to develop alternative forms is considerable; the alternative-forms reliability coefficient may be affected by content sampling during test
construction (Crocker & Algina, 1986); and the alternative-forms method requires essential tau equivalence of the two test forms, which requires that true-score variances of the alternative forms are equivalent (i.e., true scores of the two test forms can differ only by a constant). This assumption is unlikely to hold in practice. The alternative-forms method tends to underestimate the true reliability if the essential tau equivalence assumption for the two test forms is not met (Miller, 1995).
Internal Consistency
The split-half method and coefficient alpha are widely used for estimation of internal consistency reliability. One assumption underlying internal consistency methods is uncorrelated errors. This assumption requires that error scores associated with different test components are uncorrelated with each other, and that error scores are uncorrelated with true scores of these components. For the split-half method, test components are two halves of a test; while for coefficient alpha test components refer to individual items in a test. Homogeneity of the test is another assumption of internal consistency methods, which requires only a single common latent factor underlying a set of items in a test. A third assumption associated with internal consistency estimation procedures is essential tau equivalency. For the split-half method, two halves of a test should be essential tau equivalent, while for coefficient alpha all items in a test are required to be essential tau equivalent.
The split-half method is used to estimate reliability through splitting a test into two halves,
computing the correlation of the two halves, and then applying the Spearman-Brown prophecy
formula to correct the correlation and obtain the reliability estimate for the full length test
(Crocker & Algina, 1986). A limitation for this method is the non-uniqueness of splitting a test
into two halves. For a test with k items, there are k!/(2[(k/2)!]²) different ways to split the test
(Brownell, 1933). Another way to estimate reliability with the split-half method is Rulon’s
formula (Rulon, 1939), in which the variance of the difference score between the two halves is
used as an estimate of the variance of error scores in reliability calculation. Rulon’s coefficient
equals coefficient alpha for a two-component test (Miller, 1995). Coefficient alpha is derived
as a general form of Rulon’s formula, that is, the number of test components can be more than
two. Coefficient alpha is also the average of split-half reliability coefficients (Cronbach, 1951;
Cronbach & Shavelson, 2004; Guttman, 1945). Coefficient alpha is routinely reported as an
estimate of reliability in applied studies. However, it requires restrictive assumptions that are
often violated in application, as summarized below.
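Two of the quantities discussed above are easy to verify directly: the number of distinct splits of a k-item test and the Spearman-Brown step-up of a half-test correlation. The sketch below checks both for hypothetical values (Python is used here for illustration, although the thesis itself works in R):

```python
# Verifying two quantities from the split-half discussion with hypothetical
# inputs: the number of distinct half-splits and the Spearman-Brown correction.
from math import factorial

def n_splits(k):
    """Distinct ways to split an even k-item test into two halves:
    k! / (2 * ((k/2)!)^2)."""
    return factorial(k) // (2 * factorial(k // 2) ** 2)

def spearman_brown(r_half):
    """Full-length reliability from a half-test correlation."""
    return 2 * r_half / (1 + r_half)

print(n_splits(6))                    # -> 10
print(round(spearman_brown(0.6), 3))  # -> 0.75
```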
Coefficient α
Coefficient alpha (α; Cronbach, 1951; Guttman, 1945) is defined as
α = [k/(k − 1)][1 − (Σ σ_i²)/σ_X²], (1)

where X is the total test score for a set of items X_1, X_2, …, X_k; k is the number of items; σ_i² is the variance of item X_i; and σ_X² is the variance of the total test score.
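Equation 1 can be computed directly from an item-score matrix. The sketch below does so for a small hypothetical dataset (Python rather than R, purely for illustration):

```python
# Illustrative computation of coefficient alpha (Equation 1) from a small
# hypothetical item-score matrix (rows = examinees, columns = items).
def coefficient_alpha(scores):
    n = len(scores)            # number of examinees
    k = len(scores[0])         # number of items
    def var(xs):               # sample variance with n - 1 denominator
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
    item_vars = [var([row[i] for row in scores]) for i in range(k)]
    total_var = var([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

data = [  # hypothetical responses of 5 examinees to 4 items
    [2, 3, 3, 2],
    [4, 4, 5, 4],
    [3, 3, 4, 3],
    [1, 2, 2, 2],
    [5, 4, 4, 5],
]
print(round(coefficient_alpha(data), 3))  # -> 0.949
```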
Even though coefficient α is widely applied, misapplications and misinterpretations of
it are not rare in practice (e.g., Schmidt, 1996). Coefficient α is identical to reliability only when
the underlying assumptions are satisfied. The assumptions include homogeneity, essential tau
equivalence, as well as uncorrelated errors. Violation of these assumptions could lead to
substantial bias in the estimate of reliability (Graham, 2006; Green & Yang, 2009; Raykov,
1997a, 1998a, 1998b; Sijtsma, 2009; Zimmerman, Zumbo, & Lalonde, 1993).
Violation of the homogeneity assumption. The homogeneity assumption, also known as the unidimensionality assumption, requires that a test measure only a single common latent construct. If the test measures more than one underlying construct, the homogeneity assumption is violated. Violation of this assumption can be conceptualized as a specific case of the violation of the essential tau equivalence assumption (see more details about essential tau equivalence in the next section), and thus introduces negative bias in reliability estimation.
A large standard error of coefficient α (i.e., low precision of coefficient α) may indicate violation of the homogeneity assumption. The standard error of alpha is a function of the standard deviation of the item intercorrelations and the number of items (see Equation 4 in Cortina, 1993). A large standard deviation of the item intercorrelations may suggest that the test is multidimensional (i.e., more than one latent factor underlies the test). A positive relationship between the standard deviation of the item intercorrelations and the standard error of coefficient α can be seen from Cortina's equation. Hence, a large standard error of coefficient α may warn us that the homogeneity assumption is violated. This assumption can also be assessed by factor analysis. If a factor analysis ends up with a one-factor model, then the items on an instrument are homogeneous (Lord & Novick, 1968). Such a model is called a congeneric model. If more than one factor is specified in the model, the homogeneity assumption is violated.
Violation of the essential tau equivalence assumption. Under the assumption of essential tau equivalence for coefficient α, all items in the test measure the same latent variable and each item is allowed to have a unique error score. In addition, true scores of any two items can differ only by a constant (Graham, 2006; Lord & Novick, 1968; Miller, 1995; Raykov,
1997a; Yang & Green, 2011). When this assumption is not met, coefficient α is a lower bound of reliability (Cortina, 1993; Graham, 2006; Green & Yang, 2009; McDonald, 1999; Miller, 1995;
Lord & Novick, 1968; Raykov, 1997a, 1998a, 1998b; Sijtsma, 2009; Zimmerman, Zumbo, &
Lalonde, 1993). “The larger the violation of tau-equivalence that occurs, the more coefficient
10 underestimates score reliability” (Graham, 2006, p. 939-940). The bias of coefficient as a
reliability coefficient may be substantial if one factor loading is fairly different from the other
loadings (i.e., only one item doesn’t meet the requirement of essential tau equivalence) (Green &
Yang, 2009; Raykov, 1997b).
Several methods can be used to examine whether this assumption is violated. One is to check the differences in standard deviations across items (Graham, 2006). The larger the differences, the more likely the assumption is violated. Second, tests with different response formats across items are likely to violate this assumption (Graham, 2006). Third, CFA can be used to examine the assumption of essential tau equivalence (Graham, 2006; Green & Yang,
2009; Raykov, 1997a), given that “this assumption is mathematically identical to the assumption that all items have equal loadings on a single common factor with their unique variances composed entirely of error” (Miller, 1995, p. 265). To assess this assumption, two one-factor
CFA models need to be posited and fitted to the data. These two models are the same except for factor loadings. Factor loadings are constrained to be equal in one model (i.e., the essential tau equivalence model) but freely estimated in the other model (i.e., the congeneric model). Then, a chi-square difference test is conducted between these two models to test the essential tau equivalence assumption. If the chi-square difference test is non-significant, we do not have sufficient evidence to reject the null hypothesis that the essential tau equivalence assumption holds.
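The chi-square difference test just described amounts to comparing the difference in model chi-squares against a chi-square distribution with df equal to the difference in model dfs. The sketch below works through hypothetical fit statistics for a five-item scale; the chi-square values are invented for illustration, and the survival function is coded by hand (valid for even df) to keep the example self-contained:

```python
# Illustrative sketch of a chi-square difference test for essential tau
# equivalence. The fit statistics below are hypothetical, not real CFA output.
from math import exp

def chi2_sf(x, df):
    """Chi-square survival function P(X > x) for even df, via the
    closed-form series exp(-x/2) * sum_{i < df/2} (x/2)^i / i!."""
    term, total = 1.0, 0.0
    for i in range(df // 2):
        total += term
        term *= (x / 2) / (i + 1)
    return exp(-x / 2) * total

# Hypothetical fits for a 5-item scale:
chi2_tau, df_tau = 18.9, 9    # loadings constrained equal (tau-equivalent model)
chi2_cong, df_cong = 7.3, 5   # loadings free (congeneric model)

diff = chi2_tau - chi2_cong   # chi-square difference
df_diff = df_tau - df_cong    # df difference
p = chi2_sf(diff, df_diff)
print(round(diff, 1), df_diff, round(p, 4))
# A small p favors the congeneric model, i.e., evidence against tau equivalence.
```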
Violation of the uncorrelated errors assumption. The assumption of uncorrelated errors means that “error scores of all pairs of items are uncorrelated” (Miller, 1995, p. 266). If this assumption is violated, coefficient α will overestimate reliability (Graham, 2006; Green &
Hershberger, 2000; Green & Yang, 2009; Lord & Novick, 1968; McDonald, 1999; Miller, 1995;
Raykov, 1997a, 1998a; Sijtsma, 2009; Zimmerman, Zumbo, & Lalonde, 1993).
Green and Yang (2009) summarized how the assumption of uncorrelated errors may be
violated in practice. For example, correlated errors may be introduced by speeded tests with
right-wrong answers (Rozeboom, 1966), different stimulus materials for subgroups of items
(Steinberg, 2001; Steinberg & Thissen, 1996; Wainer & Kiely, 1987; Yen, 1993), effects of item
order (i.e., errors on one item affecting responses to the next item; Green & Hershberger, 2000),
consistent response sets, and transient errors (Becker, 2000; Green, 2003). In these cases,
correlated errors may result in positive bias in reliability estimation.
Misinterpretation of α as an index of homogeneity. Coefficient α has been mistakenly
interpreted as a homogeneity index. Coefficient α is an index of internal consistency reliability,
not an index of homogeneity (Miller, 1995). Internal consistency is a different concept from
homogeneity. It is defined as the degree of interrelatedness among items, while homogeneity
refers to unidimensionality (Schmitt, 1996). Homogeneity is a prerequisite for utilizing
coefficient α as a reliability index.
Arguments against the use of coefficient α have been increasing. Many researchers have
criticized its use as an index of internal consistency reliability (e.g., Cortina, 1993; Green et al.,
1997; Sijtsma, 2009; Thompson, Green, & Yang, 2010), because the assumptions of coefficient α
(e.g., essential tau equivalence and uncorrelated errors) are unlikely to hold, and the bias due to
violation of these assumptions is often substantial and can’t be ignored (Green & Yang, 2009).
Reliability Estimation Based on CFA within SEM Framework
As an alternative to coefficient α, reliability can be estimated based on a CFA model within the Structural Equation Modeling (SEM) framework. SEM is a frequently used technique for
investigation of relationships among a set of variables. Instead of designating a single statistical technique, SEM refers to a family of related procedures, among which path analytic and confirmatory factor analysis models are the core procedures (Kline, 2010). The aforementioned classical test models can be conceptualized as structural equation models (see Miller, 1995).
Below I briefly review the CFA model, the link between CFA and CTT, and coefficient omega.
Confirmatory Factor Analysis
CFA models are also referred to as measurement models. A CFA model depicts the relationship between latent factors and measured variables. A CFA model can be considered when researchers have an a priori hypothesis about the relationship among these variables (i.e., the number of factors and the correspondence between factors and measurement indicators)
(Kline, 2010). The CFA approach is superior to classical test models for the estimation of scale reliability for the following reasons. First, it can be applied to various complex models. Second, it can be used for unweighted as well as weighted composite scores. The proposed study focuses on unit-weighted composite scores.
An illustration of the one-factor CFA model. A one-factor CFA model with six items is presented as an illustration (Figure 1), and is also one of the models used for confidence interval estimation for coefficient ω in my study. Let F denote the score of the common latent
factor, X_i (i = 1, 2, …, 6) denote the item scores, and E_i the item error scores; λ_i denotes the factor
loading between the common factor and each item. The observed score for each item (here I use
the deviation score, which is the difference between a score and the mean) is a linear
combination of the factor score weighted by factor loading and error score:
X i i *F Ei . (2)
A CFA model is testable only if it is identified. “A model is said to be identified if it is theoretically possible to derive a unique estimate for each parameter” (Kline, 2010, p. 105). One necessary requirement for model identification is that the degrees of freedom (df) for the model
should be equal to or greater than zero. The model df is the number of observations (i.e., v(v+1)/2, where v is the number of indicators) minus the number of freely estimated parameters (i.e., the total number of unique variances and covariances of the factors and errors plus the number of factor loadings) (Kline, 2010). The other necessary requirement is that each latent variable
(including factors and measurement errors) in the model should be assigned a scale. Factors can be scaled via unit loading identification (ULI) or unit variance identification (UVI) methods. For a factor, ULI fixes the factor loading of one of its indicators as 1.0, while UVI fixes the factor
variance as 1.0 and freely estimates factor loadings for all indicators (Kline, 2010). Measurement
errors are usually scaled through ULI, by fixing path coefficients associated with errors to 1.0.
For this example, the model df=9 and every latent variable is assigned a scale through ULI. In addition to these two necessary requirements, a sufficient condition is that there are at least three
indicators if the model has only one factor or at least two indicators for each factor if the model
has two or more factors (Kline, 2010). Based on these requirements, the illustrated one-factor
model is identified.
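These counting rules can be checked directly for the six-indicator example. A minimal sketch (plain arithmetic; no CFA software assumed):

```python
# Degrees of freedom for a one-factor CFA model with v indicators.
v = 6
observations = v * (v + 1) // 2   # unique variances and covariances: 21

# Free parameters under ULI scaling: one loading fixed to 1.0, so v - 1 free
# loadings, plus 1 factor variance and v error variances.
free_params = (v - 1) + 1 + v     # 12

df = observations - free_params
print(df)  # 9, matching the df reported for this example
```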
Figure 1: An Example of a One-factor CFA Model.
The link between CTT and CFA for reliability estimation. A strong relationship exists between CTT and CFA models.
Suppose a set of items, X_1, X_2, …, X_k (k > 2), fit a one-factor CFA model. Let X (X = X_1 + X_2 + … + X_k) denote the composite observed score and T (T = T_1 + T_2 + … + T_k) the composite true score (Lord & Novick, 1968). In CTT, the observed score can be decomposed into a true score component and an error score component (Zimmerman, 1975):

X_1 = T_1 + E_1 , X_2 = T_2 + E_2 , …, X_k = T_k + E_k ,    (3)

where E_1, E_2, …, E_k are error terms associated with each item. From the CFA framework (refer to equation (2) above), item scores are expressed as:

X_1 = λ_1·F + E_1 , X_2 = λ_2·F + E_2 , …, X_k = λ_k·F + E_k .    (4)
By comparing equation (3) to equation (4), we may note that weighted factor scores and
measurement errors in CFA can be conceptualized as the true scores and error scores in CTT,
respectively. Correspondingly, reliability of the composite scores based on the one-factor model
is defined as:
ρ_XX′ = σ²_T / σ²_X
      = σ²(T_1 + T_2 + … + T_k) / σ²(X_1 + X_2 + … + X_k)
      = σ²(T_1 + T_2 + … + T_k) / σ²((T_1 + T_2 + … + T_k) + (E_1 + E_2 + … + E_k))
      = σ²(λ_1F + λ_2F + … + λ_kF) / σ²((λ_1F + λ_2F + … + λ_kF) + (E_1 + E_2 + … + E_k)) .    (5)

If the error terms are uncorrelated with each other and with the latent factor, and σ²_F is fixed to 1, equation (5) can be further simplified to:

ρ_XX′ = (λ_1 + λ_2 + … + λ_k)² / [(λ_1 + λ_2 + … + λ_k)² + (σ²_E1 + σ²_E2 + … + σ²_Ek)]
      = (Σλ_i)² / [(Σλ_i)² + Σσ²_Ei] .    (6)
Both λ_i and σ²_Ei are model parameters in a CFA model and can be estimated from sample data. This coefficient is labeled coefficient ω (McDonald, 1978). More details about coefficient ω are presented in the next section.
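Once loading and error-variance estimates are available, equation (6) is a one-line computation. A minimal sketch with hypothetical estimates for six items (the numbers are illustrative, not from any real dataset):

```python
import numpy as np

# Hypothetical ML estimates from a one-factor CFA with factor variance fixed at 1.
loadings = np.array([0.7, 0.6, 0.8, 0.5, 0.7, 0.6])          # lambda_i
error_vars = np.array([0.51, 0.64, 0.36, 0.75, 0.51, 0.64])  # sigma^2_Ei

def coefficient_omega(loadings, error_vars):
    """Equation (6): (sum lambda_i)^2 / ((sum lambda_i)^2 + sum sigma^2_Ei)."""
    num = loadings.sum() ** 2
    return num / (num + error_vars.sum())

print(round(coefficient_omega(loadings, error_vars), 3))  # 0.817
```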
Coefficient ω
Coefficient ω does not require assumptions as restrictive as those of coefficient α. It is regarded as a generalization of coefficient α and is recommended for estimating the reliability of homogeneous measurements (Kelley & Cheng, 2012). Coefficient ω was first introduced by McDonald (1978) as a reliability coefficient for the congeneric model. It was then extended to a more general form for the hierarchical model underlying multidimensional measures (McDonald, 1999). The generalized version of coefficient ω was referred to as ω_h (Zinbarg, Revelle, & Yovel, 2005) or ω_t (Revelle & Zinbarg, 2009). This study focuses only on coefficient ω as introduced by McDonald in 1978, that is, reliability based on the congeneric model. The following two sections introduce coefficient ω and its relationship with coefficient α.
Definition and formula for computing coefficient ω. McDonald (1978) proposed coefficient ω as an index of scale reliability for the congeneric model. Coefficient ω is defined as the ratio of the variance of the composite true score to the variance of the observed total score. Both the composite true score variance and the total score variance are functions of model parameters in a CFA model. The model parameters are estimated by fitting a one-factor CFA model to the data with an appropriate estimation method, such as maximum likelihood (ML) estimation. If we fix the factor variance to 1.0, coefficient ω is computed as in equation (6).
Relationship between coefficient ω and coefficient α. A relationship exists between coefficient ω and coefficient α. A key assumption for coefficient α is essential tau equivalence. In the one-factor CFA model, this assumption is equivalent to requiring equal factor loadings across items. Let λ denote the common factor loading. The essential tau equivalent model can be presented as:

X_i = λ·F + E_i .    (7)

However, coefficient ω does not require items to be tau equivalent; in other words, the factor loadings may vary across items. Therefore, coefficient α is a special case of coefficient ω (Kelley & Cheng, 2012) in which the essential tau equivalence assumption is satisfied, as I show in detail below.
With the factor variance fixed to 1.0, for any two items i and i′ (i ≠ i′) with factor loadings λ_i and λ_i′, respectively, the covariance between items i and i′, σ_ii′, can be expressed as:

σ_ii′ = λ_i λ_i′ .    (8)

The average of the covariances is:

σ̄_ii′ = Σ_{i≠i′} σ_ii′ / (k(k−1)) = Σ_{i≠i′} λ_i λ_i′ / (k(k−1)) ,    (9)

where k is the number of items. Given equations (6), (8), and (9), under essential tau equivalence (λ_i = λ for all i, so that σ_ii′ = λ² = σ̄_ii′), coefficient ω can be rewritten as:

ω = (Σλ_i)² / [(Σλ_i)² + Σσ²_Ei] = k²λ² / σ²_X = k²σ̄_ii′ / σ²_X = [k/(k−1)] · Σ_{i≠i′} σ_ii′ / σ²_X .    (10)
Because the variance of the composite score (σ²_X) consists of the sum of the item variances (Σ_i σ²_i) and the sum of the covariances between all pairs of items (Σ_{i≠i′} σ_ii′),

σ²_X = Σ_{i≠i′} σ_ii′ + Σ_i σ²_i ,    (11)

thus,

ω = [k/(k−1)] · Σ_{i≠i′} σ_ii′ / σ²_X = [k/(k−1)] · (1 − Σ_i σ²_i / σ²_X) = α .    (12)
Therefore, coefficient ω is identical to coefficient α when the tau equivalence assumption is met. However, this assumption is unlikely to hold in practice, so coefficient ω is a more accurate estimate of the reliability coefficient than coefficient α.
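This relationship is easy to verify numerically: computing α from the model-implied covariance matrix, α equals ω when the loadings are equal and falls below ω when they are not. A sketch with hypothetical values:

```python
import numpy as np

def omega_coef(lams, thetas):
    num = lams.sum() ** 2
    return num / (num + thetas.sum())          # equation (6)

def alpha_coef(lams, thetas):
    k = len(lams)
    sigma = np.outer(lams, lams)               # model-implied item covariances
    np.fill_diagonal(sigma, lams**2 + thetas)  # item variances on the diagonal
    # equation (12): alpha = k/(k-1) * (1 - sum of item variances / total variance)
    return k / (k - 1) * (1 - np.trace(sigma) / sigma.sum())

thetas = np.full(6, 0.5)
equal = np.full(6, 0.7)
unequal = np.array([0.9, 0.8, 0.7, 0.6, 0.5, 0.4])

print(np.isclose(alpha_coef(equal, thetas), omega_coef(equal, thetas)))  # True
print(alpha_coef(unequal, thetas) < omega_coef(unequal, thetas))         # True
```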
Confidence Interval
As an important index of measurement consistency, the reliability coefficient is routinely reported in empirical studies in the social and behavioral sciences. A point estimate of the reliability coefficient is a single value estimating the population reliability. Because the point estimate is obtained from a given sample, its value is subject to sampling fluctuation and inevitably varies from sample to sample.
Null Hypothesis Significance Testing
The issue of determining the accuracy of parameter estimates has received increasing attention in the literature. The most commonly used method is null hypothesis significance testing (NHST) (Neyman & Pearson, 1933), which has long dominated statistical analysis in social and behavioral studies (Cheung, 2009). A significance level, denoted as α, of .05 or .01 is routinely employed in tests of significance. The observed p value of a statistic is compared with the predetermined significance level and a statistical inference is made based on the result of the comparison. For example, if the p value of the significance test for the reliability estimate is larger than the significance level (e.g., .05), the reliability coefficient is judged as not significantly different from zero (assuming H0: ρ = 0) in the population, and the measure in question will then be concluded to lack reliability. A small p value is desired to reject a null hypothesis (Killeen, 2005).
Although significance testing for point estimates is commonly used, the appropriateness of its use is still a matter of debate (e.g., Chow, 1988; Cohen, 1994; Nickerson, 2000; Schmidt,
1996; Wainer, 1999). When conducting a null hypothesis significance test, we make conclusions by comparing the obtained p value to a predetermined arbitrary significance level such as .05.
However, it is not wise to make decisions based on such an arbitrary significance level (Neale &
Miller, 1997). An increasing number of researchers have found that it is unacceptable and unreasonable to evaluate scientific findings based on such a binary decision (Cheung, 2009b).
Another limitation of significance testing is its sensitivity to sample size: the test is more likely to be significant when the sample size is large. To avoid unwise decision-making for these reasons, a better alternative is interval estimation.
Interval Estimation
Point estimation provides only a single plausible value of the population parameter. In contrast to point estimation, interval estimation (Neyman, 1935, 1937) provides a range of values which is likely to capture the population value of the parameter of interest. A confidence interval is a frequently used type of interval estimate. A confidence level (often denoted as 1 − α, where α is the Type I error rate) tells how likely it is that the interval includes the population parameter (Kelley, 2007). A two-sided confidence interval includes a range of values defined by two limits (i.e., upper and lower limits) with the general form (Cheung, 2007; Cheung, 2009b; Kelley, 2007; Neyman, 1937):

P(θ̂_L ≤ θ_0 ≤ θ̂_U) = 1 − α ,    (13)
where P denotes probability; θ_0 is the population parameter of interest; and θ̂_L and θ̂_U are a pair of random variables which determine the random endpoints of a confidence interval (i.e., estimates of the lower and upper confidence limits based on samples). In the literature, a 95% confidence interval is most frequently reported, corresponding to the traditionally accepted α level of NHST (.05) (Šimundić, 2008). A 95% CI means that under repeated sampling, 95% of such intervals (e.g., 95 out of 100) would capture the population parameter. Interval estimation may also be affected by sample size. However, compared with significance testing, an interval estimate is more informative about the population parameter, providing a range of plausible values. Reporting a confidence interval along with the point estimate has been strongly recommended for empirical studies. It is said that "interval estimates should be given for any effect sizes involving principal outcomes" and that researchers should "provide intervals for correlations and other coefficients of association or variation whenever possible" (Wilkinson & Task Force on Statistical Inference, 1999, p. 599).
Even though interval estimation has received increasing attention, empirical studies reporting confidence intervals along with the corresponding parameter estimates are still limited. One possible reason is the limited availability of statistical packages and methods for CI construction for different statistics and psychometric indices (Cheung, 2009b; Steiger & Fouladi, 1997). Confidence interval estimation for coefficient ω also lacks development and investigation (Padilla & Divers, 2013). In the current study, three methods for CI estimation for coefficient ω are introduced first, and then the performance of these three methods is compared through a simulation study. The following sections review these three methods in detail.
Three Approaches to Interval Estimation for Coefficient ω
In this section, three methods for CI construction for coefficient ω are introduced: the Wald method, the likelihood method, and the BCa bootstrap method.
Wald Method
Confidence intervals constructed with the Wald method are referred to as Wald confidence intervals (Wald CIs). Wald CIs are constructed from the standard error estimate of the parameter of interest via the Wald statistic (Cheung, 2009b; Raykov, 2002). The implementation of the Wald CI is usually accomplished with ML estimation. ML estimation is based on the probability density function, which provides the likelihood (i.e., probability density) of an individual case under a normal distribution (Azzalini, 1996; Enders, 2010).
Wald CI for an individual parameter. Let θ be an individual parameter, θ̂ the maximum likelihood estimate of θ, and SE(θ̂) the standard error of the maximum likelihood estimate. The Wald statistic for θ, defined as θ̂/SE(θ̂), asymptotically follows a standard normal distribution with mean of zero and standard deviation of one (Cheung, 2009b; Enders, 2010) when the following requirements are met: (1) the sample size is sufficiently large; (2) the data meet the assumption of multivariate normality; and (3) θ is estimated using ML. With ML estimation methods, the standard error of the sample estimate of θ is determined by the second derivative of the log-likelihood function of the sample, which "quantifies the curvature of the log-likelihood function" (Enders, 2010, p. 66). The 100(1 − α)% Wald CI for θ can be constructed (Cheung, 2007, 2009a, 2009b; Raykov, 2002) as

θ̂ ± Z_{1−α/2} · SE(θ̂) ,    (14)

where Z_{1−α/2} is the standard normal score for the (1 − α/2)th percentile. For example, a 95% confidence interval for θ is θ̂ ± 1.96·SE(θ̂).
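As a sketch of equation (14) in the simplest case, a 95% Wald CI for a normal mean, using the ML-based standard error s/√n (the data are made up for illustration):

```python
from statistics import NormalDist, fmean, pstdev

data = [4.1, 5.3, 4.8, 5.9, 5.0, 4.6, 5.5, 4.9]
n = len(data)
theta_hat = fmean(data)                  # ML estimate of the mean
se = pstdev(data) / n ** 0.5             # ML-based SE (divisor n, not n - 1)
z = NormalDist().inv_cdf(0.975)          # Z_{1 - alpha/2}, about 1.96
lower, upper = theta_hat - z * se, theta_hat + z * se
print(round(lower, 3), round(upper, 3))  # 4.652 5.373
```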
Wald CI for a function of parameters. If the parameter of interest is a function of one or multiple parameters, the standard error estimate cannot be obtained directly from the second derivative of the likelihood function. In this case, we can use the delta method to obtain the
standard error estimate (Casella & Berger, 2002; Cheung, 2009b; Raykov, 2002; Raykov &
Marcoulides, 2004; Oehlert, 1992). The delta method is an analytic approach widely used to
obtain standard error estimates for functions, linear or non-linear, of model parameters (Raykov,
2009). This method has been commonly used to approximate standard error for parameters
which are functions of one or more other parameters, such as partial correlations (Olkin & Finn,
1995), mediating effects (Cheung, 2007), and reliability estimates (Cheung, 2009b, Kelley &
Cheng, 2012; Raykov, 2002, 2004, 2009; Raykov & Marcoulides, 2004).
A brief introduction to the application of the delta method. Under the assumption of multivariate normality of variables, the asymptotic standard errors of parametric functions can be derived with the delta method (Ogasawara, 1999). The application of the delta method is based on the linear "approximation of a smooth parametric function" (p. 623), which has the properties of local linearity and continuity (Raykov & Marcoulides, 2004). Within the framework of SEM, many functions of model parameters meet this requirement and can be regarded as smooth functions, such as indirect effects and the reliability coefficient (Raykov & Marcoulides, 2004). The essence of the delta method is to approximate the linear representation of a smooth function at a point of interest using the Taylor expansion (Hart, 1955) of the function.
Functions of a single model parameter. Suppose f(θ) is a function of a single model parameter θ and is differentiable (n+1) times in an interval that includes the population value θ_0. The nth-order Taylor expansion of f(θ) is (Hart, 1955):

f(θ) = f(θ_0) + (θ − θ_0)f′(θ_0) + [(θ − θ_0)²/2!]f″(θ_0) + … + [(θ − θ_0)ⁿ/n!]f⁽ⁿ⁾(θ_0) + R_n ,    (15)

where f⁽ʰ⁾(θ_0) denotes the hth derivative of f(θ) evaluated at θ_0, and R_n is the remainder.

In empirical applications, given the limited improvement provided by higher-order expansions over the first-order expansion, the first-order Taylor expansion is routinely used as the linear approximation of a particular function (e.g., Raykov & Marcoulides, 2004; Tan, 1990):

f(θ) ≈ f(θ_0) + (θ − θ_0)f′(θ_0) .    (16)

Let D stand for the first derivative of the function with respect to θ evaluated at θ_0. The linear approximation of f(θ) can then be expressed as:

f(θ) ≈ f(θ_0) + D(θ − θ_0) .    (17)
Taking the variance of both sides of equation (17) yields the variance of f(θ). The approximate standard error of the consistent ML estimate of f(θ) is obtained by taking the square root of that variance:

SE(f(θ̂)) = |D̂| · SE(θ̂) ,    (18)

where f(θ̂) is the consistent ML estimate of f(θ); D̂ is the estimate of the first derivative of f(θ); and SE(θ̂) stands for the standard error of the ML estimate of parameter θ. Once the approximate standard error of f(θ̂) is obtained, the Wald CI for the parameter f(θ) can be constructed as:

f(θ̂) ± Z_{1−α/2} · SE(f(θ̂)) .    (19)
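A quick numeric check of equation (18): for f(θ) = θ² with θ a normal mean, the delta-method SE |2θ_0|·SE(θ̂) closely matches the Monte Carlo standard deviation of f(θ̂). All numbers below are hypothetical:

```python
import random
from statistics import fmean, stdev

random.seed(1)
n, mu, sigma = 200, 2.0, 1.0

# Delta method, eq. (18): SE(f(theta_hat)) ~ |f'(theta_0)| * SE(theta_hat),
# here f(theta) = theta^2, so f'(theta_0) = 2 * mu.
se_theta = sigma / n ** 0.5
delta_se = abs(2 * mu) * se_theta            # about 0.283

# Monte Carlo check: SD of f(theta_hat) across 2000 simulated samples.
reps = [fmean(random.gauss(mu, sigma) for _ in range(n)) ** 2
        for _ in range(2000)]
print(round(delta_se, 3), round(stdev(reps), 3))
```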
Functions of multiple model parameters. If the parameter is a function of more than one parameter (i.e., f(θ_1, θ_2, …, θ_k), k > 1), the first-order Taylor approximation of the function takes the form of partial derivatives with respect to each individual parameter, evaluated at the population values of these parameters (e.g., Hart, 1955; Raykov & Marcoulides, 2004; Tan, 1990); specifically,

f(θ_1, θ_2, …, θ_k) ≈ f(θ_10, θ_20, …, θ_k0) + (∂f/∂θ_1)(θ_1 − θ_10) + (∂f/∂θ_2)(θ_2 − θ_20) + … + (∂f/∂θ_k)(θ_k − θ_k0) ,    (20)

where f(θ_1, θ_2, …, θ_k) is the parameter of interest; θ_10, θ_20, …, θ_k0 are the population values of θ_1, θ_2, …, θ_k; and ∂f/∂θ_j is the partial derivative of the function with respect to θ_j evaluated at θ_j0 (j = 1, 2, …, k). The approximate standard error of f(θ_1, θ_2, …, θ_k) can be derived by first taking the variance of both sides of the function and then taking the square root of the obtained variance. The resulting approximate standard error is shown below (e.g., Raykov & Marcoulides, 2004):

SE(f(θ̂_1, θ̂_2, …, θ̂_k)) = [D̂_1²σ²(θ̂_1) + D̂_2²σ²(θ̂_2) + … + D̂_k²σ²(θ̂_k)
                            + 2D̂_1D̂_2σ(θ̂_1, θ̂_2) + 2D̂_2D̂_3σ(θ̂_2, θ̂_3) + … + 2D̂_{k−1}D̂_kσ(θ̂_{k−1}, θ̂_k)]^(1/2) ,    (21)

where f(θ̂_1, θ̂_2, …, θ̂_k) is the consistent ML estimate of f(θ_1, θ_2, …, θ_k); θ̂_k is the ML estimate of θ_k; D̂_k is the estimate of the partial derivative of f(θ_1, θ_2, …, θ_k) with respect to θ_k evaluated at θ̂_1, θ̂_2, …, θ̂_k; σ²(θ̂_k) is the variance of the ML estimate of a particular parameter; and σ(θ̂_{k−1}, θ̂_k) denotes the covariance between the ML estimates of a pair of parameters. To this end, the Wald CI for f(θ_1, θ_2, …, θ_k) can be developed with the delta method.
Wald CI for coefficient ω based on the delta method. The parameter function of interest in the proposed study is the reliability coefficient based on the congeneric model, coefficient ω, which is computed as

ω = (Σλ_i)² / [(Σλ_i)² + Σσ²_Ei] .    (22)

By using two new notations, u and v, to denote Σλ_i and Σσ²_Ei, respectively, reliability can be simplified to (Kelley & Cheng, 2012; Raykov, 2002)

ω = u² / (u² + v) .    (23)

Based on the delta method, Kelley and Cheng (2012) and Raykov (2002) derived the approximate standard error of the ML estimate of coefficient ω:

SE(ω̂) = [D̂_u²σ²(û) + D̂_v²σ²(v̂) + 2D̂_uD̂_vσ(û, v̂)]^(1/2) ,    (24)

where ω̂ is the ML estimate of ω; û and v̂ denote the consistent ML estimates of u and v, respectively; σ²(û) and σ²(v̂) indicate the variances of û and v̂, respectively; and σ(û, v̂) is the covariance between û and v̂. D̂_u is the consistent estimate of the partial derivative of ω with respect to u, evaluated at the population value of u; D̂_v is the consistent estimate of the partial derivative of ω with respect to v, evaluated at the population value of v. The two partial derivatives can be obtained by applying rules of differentiation (Hart, 1955). Once the standard error estimate is obtained, the 100(1 − α)% Wald CI for ω can be constructed as:

ω̂ ± Z_{1−α/2} · SE(ω̂) .    (25)
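With û, v̂, their variances, and their covariance in hand (e.g., from the inverse of the information matrix), equations (24)–(25) reduce to a few lines. Differentiating ω = u²/(u² + v) gives D_u = 2uv/(u² + v)² and D_v = −u²/(u² + v)². A sketch with hypothetical estimates:

```python
import math
from statistics import NormalDist

# Hypothetical ML estimates: u = sum of loadings, v = sum of error variances,
# with made-up sampling (co)variances purely for illustration.
u, v = 3.9, 3.41
var_u, var_v, cov_uv = 0.04, 0.02, -0.005

omega = u**2 / (u**2 + v)
d_u = 2 * u * v / (u**2 + v) ** 2        # partial derivative w.r.t. u
d_v = -(u**2) / (u**2 + v) ** 2          # partial derivative w.r.t. v
se = math.sqrt(d_u**2 * var_u + d_v**2 * var_v + 2 * d_u * d_v * cov_uv)  # eq. (24)

z = NormalDist().inv_cdf(0.975)
print(round(omega - z * se, 3), round(omega + z * se, 3))  # 0.782 0.851
```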
One advantage of using SEM methods with Wald CI formation is that the parameter
estimates and their corresponding variances and covariances, which are needed for the
computation of the related components, are available in SEM applications (Raykov &
Marcoulides, 2004). Multiple approaches are available to obtain these components. The approach
proposed by Raykov (2002) is to introduce four auxiliary variables (e.g., Raykov, 2001; Raykov
& Shrout, 2002) to the congeneric model and impose nonlinear constraints on the model
parameters. In 2009 he outlined a simpler procedure to obtain point and interval estimates for the
reliability coefficient of congeneric measures by introducing the reliability coefficient as a new
parameter via imposing model constraints. Kelley and Cheng (2012) recommended a modified
procedure, which uses the inverse of the information matrix (see Casella & Berger, 2002) and
does not require nonlinear parameter constraints. Given that this approach is relatively simple to implement, the current study used Kelley and Cheng's approach to estimate the confidence interval for coefficient ω.
Likelihood Method
Likelihood-based CIs are constructed using the likelihood method. Unlike a Wald CI, which is based on the asymptotically normal distribution of the Wald statistic θ̂/SE(θ̂), a likelihood-based CI is developed from the asymptotic chi-square distribution of the likelihood ratio (LR) statistic (Venzon & Moolgavkar, 1988). The LR statistic and likelihood-based CI construction for a single parameter, and for a parameter that is a function of multiple parameters, are discussed below.
Likelihood ratio statistic. A likelihood ratio test is used to compare the fit of two nested models: the null model (i.e., the restricted model) and the alternative model (i.e., the unrestricted model). The LR statistic tells how many times more likely the data are under one model than under the other (i.e., which model is better supported by the data). Let θ be a parameter vector, LogL(θ̂) be the log-likelihood under the alternative model evaluated at the ML estimate θ̂, and LogL(θ̃) be the log-likelihood under the null model evaluated at θ̃. The LR statistic is defined as (Azzalini, 1996; Buse, 1982):

LR = −2LogL(θ̃) − (−2LogL(θ̂)) = 2LogL(θ̂) − 2LogL(θ̃) ~ χ²(g) ,    (26)

where LogL(.) denotes the natural logarithm of the likelihood function of the specified model and g is the number of parameters constrained in the null hypothesis, which is equal to the difference in degrees of freedom (df) between the two models. The deviance (i.e., the difference in −2LogL(.)) asymptotically follows a chi-square distribution with degrees of freedom equal to g (Buse, 1982).
Likelihood-based CI computed via the likelihood function of a single parameter. Suppose θ is a single parameter and the log-likelihood function contains only this parameter. Because the number of parameters constrained in the null hypothesis is 1, the LR statistic follows a chi-square distribution with one degree of freedom. To construct a two-sided 100(1 − α)% likelihood-based CI for θ, two interval limit estimates, θ̂_L and θ̂_U, need to be identified, where θ̂_L is a point to the left of the ML estimate and θ̂_U is a point to the right of the ML estimate. At both points the LR statistic is just statistically significant under the chi-square distribution with df = 1 at the predetermined significance level. The formal representations are:

2LogL(θ̂) − 2LogL(θ̂_L) = χ²(1, 1 − α)    (27)

2LogL(θ̂) − 2LogL(θ̂_U) = χ²(1, 1 − α)    (28)

where θ̂ denotes the ML estimate of θ, and χ²(1, 1 − α) is the critical value (i.e., the (1 − α)th percentile) of the LR statistic distribution with df = 1 at significance level α. Solving these equations, we can obtain solutions for θ̂_L and θ̂_U. Figure 2 illustrates the log-likelihood function with regard to the parameter θ (Azzalini, 1996), where

LogL(θ̂_L) = LogL(θ̂_U) = LogL(θ̂) − χ²(1, 1 − α)/2 .    (29)
Figure 2: An Illustration of Confidence Interval Computed via the Log-likelihood Function.
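Equations (27)–(29) can be illustrated with a binomial proportion, whose log-likelihood involves only the single parameter p. The limits are the two points where the LR statistic reaches the χ²(1) cutoff of 3.841; a sketch with made-up counts, using simple bisection:

```python
import math

n, k = 30, 9          # hypothetical data: 9 successes in 30 trials
p_hat = k / n         # ML estimate, 0.30

def loglik(p):
    return k * math.log(p) + (n - k) * math.log(1 - p)

def lr(p):
    """LR statistic of eqs. (27)-(28): 2[LogL(p_hat) - LogL(p)]."""
    return 2 * (loglik(p_hat) - loglik(p))

def solve(lo, hi, cutoff=3.841):
    """Bisection for lr(p) = cutoff on an interval where lr is monotone."""
    for _ in range(100):
        mid = (lo + hi) / 2
        if (lr(mid) - cutoff) * (lr(lo) - cutoff) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

lower, upper = solve(1e-9, p_hat), solve(p_hat, 1 - 1e-9)
print(round(lower, 3), round(upper, 3))
```

Note that the resulting interval is asymmetric about p̂ = .30, something a Wald interval, which is symmetric by construction, cannot reproduce.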
Likelihood-based CI computed via the likelihood function of multiple parameters. The likelihood method is also applicable to a likelihood function of multiple parameters. Suppose that the likelihood function contains a set of parameters, θ_1, θ_2, …, θ_t, where θ_s is the parameter of interest (1 ≤ s ≤ t). To construct the likelihood-based CI for θ_s, we can use the profile-likelihood method (Venzon & Moolgavkar, 1988).

Let us denote the parameter of interest θ_s by β, and let γ be the vector containing the remaining parameters in the model, θ_1, θ_2, …, θ_{s−1}, θ_{s+1}, …, θ_t. The profile likelihood method treats the remaining parameters as nuisance parameters and maximizes the likelihood function over them (Venzon & Moolgavkar, 1988). Through this method, the nuisance parameters (γ) are removed from the likelihood function, and only the single parameter of interest (β) remains. The resulting function is the profile likelihood function for β. This process can be illustrated by (Patterson, 2014; Venzon & Moolgavkar, 1988):

L_p(β) = max_γ L(β, γ) ,    (30)

where L_p(β) is the profile likelihood function containing only the parameter of interest β; L(β, γ) is the original likelihood function including both β and γ; and max_γ(.) is the maximum function, which returns the largest likelihood among the set of possible likelihood values. For each value of β, L_p(β) is "the maximum of the likelihood function over the remaining parameters" (Patterson, 2014, p. 1). A 100(1 − α)% likelihood-based CI for β can be obtained by applying the LR test:

2LogL(β̂, γ̂) − 2LogL_P(β̂_L) = χ²(1, 1 − α)    (31)

2LogL(β̂, γ̂) − 2LogL_P(β̂_U) = χ²(1, 1 − α)    (32)

where β̂ and γ̂ are the ML estimates of β and γ, respectively; β̂_L and β̂_U are the estimated lower and upper bounds of the 100(1 − α)% likelihood-based CI for β, at which the LR statistic reaches the chi-square critical value, χ²(1, 1 − α), with one degree of freedom at significance level α; LogL_P(β̂_L) is the profile log-likelihood when β is fixed at β̂_L; and LogL_P(β̂_U) is the profile log-likelihood when β is fixed at β̂_U. The likelihood-based CI with approximate confidence level (1 − α) is (β̂_L, β̂_U), where β̂_L and β̂_U can be obtained by solving equations (31) and (32).
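Equations (30)–(32) can be sketched for a normal mean μ with σ² as the nuisance parameter: for fixed μ the likelihood is maximized at σ̂²(μ) = Σ(x − μ)²/n, so (up to an additive constant) the profile log-likelihood is L_p(μ) = −(n/2)·log(Σ(x − μ)²/n). A minimal sketch with made-up data:

```python
import math

data = [4.1, 5.3, 4.8, 5.9, 5.0, 4.6, 5.5, 4.9]
n = len(data)
mu_hat = sum(data) / n

def profile_loglik(mu):
    """Eq. (30): log-likelihood maximized over the nuisance sigma^2."""
    ss = sum((x - mu) ** 2 for x in data)
    return -(n / 2) * math.log(ss / n)

def lr(mu):
    return 2 * (profile_loglik(mu_hat) - profile_loglik(mu))

def solve(lo, hi, cutoff=3.841):
    # bisection for lr(mu) = cutoff; lr is monotone on each side of mu_hat
    for _ in range(100):
        mid = (lo + hi) / 2
        if (lr(mid) - cutoff) * (lr(lo) - cutoff) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

lower, upper = solve(mu_hat - 5, mu_hat), solve(mu_hat, mu_hat + 5)
print(round(lower, 3), round(upper, 3))  # 4.604 5.421
```

Because σ² is profiled out rather than treated as known, this interval is wider than a normal-theory interval that ignores the uncertainty in σ².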
Likelihood-based CI for coefficient ω. With the profile likelihood method, the likelihood-based CI for coefficient ω is obtainable via SEM. If the parameter of interest is a function of multiple model parameters, a model reparameterization needs to be implemented before applying the profile likelihood method (Patterson, 2014). Because coefficient ω is a function of factor loadings and error variances, which are model parameters in the CFA model, reparameterization is needed before profiling the likelihood function. Three steps are followed to calculate the profile likelihood CI for coefficient ω. First, identify the log-likelihood function of the congeneric model, which consists of model parameters including factor loadings and error variances. Second, reparameterize the log-likelihood function of the congeneric model in terms of the target parameter ω. A set of new parameters may be introduced during the model reparameterization process. Last, apply the profile likelihood method to the reparameterized log-likelihood function, which contains ω and the other newly introduced parameters. In line with equations (31) and (32), treating ω as the parameter of interest and all other parameters in the log-likelihood function as nuisance parameters, the profile likelihood function for coefficient ω can be derived. Once the profile likelihood function is available, the 100(1 − α)% confidence interval for coefficient ω can be obtained via the LR test. The OpenMx package (Boker et al., 2011) in R can implement these procedures automatically.
Bias-corrected and Accelerated Bootstrap Method
The basic idea of the bootstrap method is to estimate the sampling distribution for the
parameter of interest by repeated sampling from an available sample. Bootstrap intervals can be
formed by using different bootstrap techniques, such as the percentile bootstrap, bias-corrected
(BC) bootstrap, bias-corrected and accelerated (BCa) bootstrap, and approximate BCa (ABC)
bootstrap. Among the various bootstrap intervals, the BCa bootstrap interval is generally preferred for two reasons. First, it takes asymmetry, bias, and nonconstant variance into consideration; second, it improves the interval via transformation, bias correction, and acceleration adjustment (DiCiccio & Efron, 1996; Efron, 1987; Kelley & Cheng, 2012). The BCa bootstrap method has been shown to be second-order accurate and correct (DiCiccio & Efron, 1996; Efron, 1987; Kelley & Cheng, 2012). The following section is a brief introduction to the BCa bootstrap method.
A brief introduction to bootstrap techniques. Suppose θ is the parameter of interest and N is the sample size of an available sample S. Generally, four steps are followed in applying the bootstrap technique. The first step is to obtain a bootstrap sample of size N by randomly sampling N times from S with replacement, and then calculate the ML estimate of θ based on the obtained bootstrap sample. On each random selection, every observation in S has an equal probability of being selected, 1/N. The second step is to replicate the aforementioned random sampling B times to get B bootstrap samples, each of size N, and B estimates of θ based on the B bootstrap samples. The number of bootstrap replications B is supposed to be relatively large (e.g., 1000) (Efron, 1987). The third step is to form the bootstrap estimate of the sampling distribution (i.e., the bootstrap sampling distribution) of θ from the B estimates yielded by the B bootstrap replications. The last step is to find the lower and upper confidence limits, which are the values of the bootstrap estimates corresponding to the lower and upper percentiles of the distribution, and form the bootstrap confidence interval. Different lower and upper percentiles may be identified as confidence limits by different bootstrap methods. For example, the (α/2)th and (1 − α/2)th percentiles of the bootstrap distribution are often taken as the lower and upper limits of the percentile bootstrap interval (Kelley & Cheng, 2012; Padilla & Divers, 2013), while the BCa bootstrap method uses different limits for CI construction (see equations (36) and (37) below).
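The four steps can be sketched for a simple statistic (the sample mean) with a percentile interval; the data and seed below are arbitrary:

```python
import random
from statistics import fmean

random.seed(7)
sample = [random.gauss(10, 2) for _ in range(50)]   # the available sample S
N, B = len(sample), 1000

# Steps 1-2: B bootstrap samples of size N drawn with replacement, each
# yielding one estimate of the parameter (here, the mean).
boot_ests = sorted(fmean(random.choices(sample, k=N)) for _ in range(B))

# Step 3: boot_ests now approximates the sampling distribution of the mean.
# Step 4 (percentile method): read limits off the (alpha/2)th and
# (1 - alpha/2)th percentiles of that distribution.
lower, upper = boot_ests[int(0.025 * B)], boot_ests[int(0.975 * B) - 1]
print(round(lower, 2), round(upper, 2))
```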
Construction of the bias-corrected and accelerated bootstrap CI. Let θ̂ be the maximum likelihood estimate of θ based on the original sample and θ̂* denote an ML estimate of θ based on a bootstrap replication. To calculate the two limits (θ̂*_L and θ̂*_U) of the 100(1 − α)% BCa bootstrap CI for θ, two estimates need to be obtained. One is the bias-corrected estimate and the other is the acceleration estimate.

The bias-corrected estimate. The bias-corrected estimate quantifies the degree of asymmetry of the bootstrap sampling distribution and is often denoted as ẑ_0 (DiCiccio & Efron, 1996; Efron, 1987; Kelley & Cheng, 2012):

ẑ_0 = Φ⁻¹(#(θ̂* < θ̂) / B) ,    (33)

where B is the number of bootstrap replications; #(θ̂* < θ̂) indicates the number of bootstrap estimates less than the original estimate, so that #(θ̂* < θ̂)/B is the proportion of bootstrap estimates less than the original estimate; and Φ⁻¹(.) is the inverse of the standard normal cumulative distribution function (c.d.f.) evaluated at a particular value. When the distribution of θ̂* is symmetric about θ̂, #(θ̂* < θ̂)/B = 1/2 and ẑ_0 = 0 (DiCiccio & Efron, 1996). For a positively skewed distribution, θ̂* is negatively biased relative to θ̂, #(θ̂* < θ̂)/B > 1/2, and ẑ_0 > 0; for a negatively skewed distribution, θ̂* is positively biased relative to θ̂, #(θ̂* < θ̂)/B < 1/2, and ẑ_0 < 0.
The acceleration estimate. The acceleration estimate, denoted â, quantifies the rate of change of the standard error of θ̂ and is measured on a normalized scale (Efron, 1987). It is calculated via the jackknife estimation procedure. The basic idea of the procedure is to compute the jackknife value θ̂₍ᵢ₎, which is the estimate of θ obtained when the ith observation is removed from the original sample (Miller, 1974; Kelley & Cheng, 2012). The estimation of the jackknife value is repeated N times, with each of the N observations deleted from the original sample one by one, and the mean of the N jackknife values is denoted θ̂₍·₎. Then â is computed as (DiCiccio & Efron, 1996; Efron, 1987; Kelley & Cheng, 2012):

â = Σᵢ₌₁ᴺ (θ̂₍·₎ − θ̂₍ᵢ₎)³ / { 6 [ Σᵢ₌₁ᴺ (θ̂₍·₎ − θ̂₍ᵢ₎)² ]^(3/2) }.  (34)
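The jackknife computation in equation (34) can be sketched in Python. For brevity, the statistic below is the sample mean rather than the ML estimate of coefficient ω, so the function and data are illustrative assumptions only:

```python
# Sketch of the acceleration estimate a-hat from equation (34),
# using the sample mean as a stand-in statistic for illustration.

def acceleration(data, stat):
    """Jackknife estimate of the BCa acceleration constant (equation 34)."""
    n = len(data)
    # Leave-one-out (jackknife) values of the statistic.
    jack = [stat(data[:i] + data[i + 1:]) for i in range(n)]
    jack_mean = sum(jack) / n
    num = sum((jack_mean - j) ** 3 for j in jack)
    den = 6.0 * (sum((jack_mean - j) ** 2 for j in jack)) ** 1.5
    return num / den if den != 0 else 0.0

mean = lambda xs: sum(xs) / len(xs)
# For a symmetric sample the cubed deviations cancel, so a-hat = 0.
print(acceleration([1, 2, 3, 4, 5], mean))  # 0.0
```

A skewed sample (e.g., `[1, 1, 1, 10]`) yields a positive â, reflecting a standard error that changes with the parameter value.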
BCa bootstrap confidence interval limits. Let Ĝ(.) be the cumulative distribution function of the bootstrap estimates, defined as (DiCiccio & Efron, 1996; Efron, 1987; Kelley & Cheng, 2012):

Ĝ(c) = #(θ̂* < c) / B,  (35)
where c is a constant. When ẑ₀ and â are available, the 100(1 − α)% BCa bootstrap confidence interval (θ̂*_L, θ̂*_U) can be calculated as follows (DiCiccio & Efron, 1996; Efron, 1987; Kelley & Cheng, 2012):

θ̂*_L = Ĝ⁻¹( Φ( ẑ₀ + (ẑ₀ + Z_{α/2}) / (1 − â(ẑ₀ + Z_{α/2})) ) ),  (36)

θ̂*_U = Ĝ⁻¹( Φ( ẑ₀ + (ẑ₀ + Z_{1−α/2}) / (1 − â(ẑ₀ + Z_{1−α/2})) ) ).  (37)

When ẑ₀ = 0 and â = 0, the BCa bootstrap CI equals the percentile CI.
BCa bootstrap CI for coefficient ω. Using the BCa bootstrap method, the general procedure for forming the CI for coefficient ω is: (a) obtain B bootstrap samples by sampling with replacement from the original random sample; (b) estimate coefficient ω via ML estimation for each bootstrap sample; (c) construct the bootstrap distribution of the coefficient ω estimates; (d) locate the two percentiles of the empirical bootstrap distribution that correspond to the two cumulative probabilities Φ(ẑ₀ + (ẑ₀ + Z_{α/2})/(1 − â(ẑ₀ + Z_{α/2}))) and Φ(ẑ₀ + (ẑ₀ + Z_{1−α/2})/(1 − â(ẑ₀ + Z_{1−α/2}))); and (e) form the 100(1 − α)% BCa bootstrap CI for coefficient ω with the two percentiles as interval limits.
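Steps (a) through (e) can be sketched end to end. The Python illustration below is my own, with the sample mean standing in for the ML estimate of coefficient ω, and mirrors equations (33), (34), (36), and (37):

```python
import random
from statistics import NormalDist

def bca_interval(data, stat, B=2000, alpha=0.05, seed=1):
    """Sketch of a BCa bootstrap CI following equations (33)-(37)."""
    rng = random.Random(seed)
    nd = NormalDist()
    n = len(data)
    theta_hat = stat(data)
    # Steps (a)-(c): bootstrap distribution of the statistic.
    boots = sorted(stat([rng.choice(data) for _ in range(n)]) for _ in range(B))
    # Equation (33): bias correction from the proportion of bootstrap
    # estimates below the original estimate (clamped to avoid inv_cdf(0)).
    prop = sum(b < theta_hat for b in boots) / B
    z0 = nd.inv_cdf(min(max(prop, 1.0 / B), 1 - 1.0 / B))
    # Equation (34): acceleration via the jackknife.
    jack = [stat(data[:i] + data[i + 1:]) for i in range(n)]
    jmean = sum(jack) / n
    den = 6.0 * sum((jmean - j) ** 2 for j in jack) ** 1.5
    a = sum((jmean - j) ** 3 for j in jack) / den if den else 0.0
    # Step (d): equations (36)-(37) give the adjusted cumulative probabilities.
    def adj(z):
        return nd.cdf(z0 + (z0 + z) / (1 - a * (z0 + z)))
    lo = boots[int(adj(nd.inv_cdf(alpha / 2)) * (B - 1))]
    hi = boots[int(adj(nd.inv_cdf(1 - alpha / 2)) * (B - 1))]
    return lo, hi  # step (e): the two percentiles are the interval limits

mean = lambda xs: sum(xs) / len(xs)
sample = [2.1, 1.8, 2.5, 2.2, 1.9, 2.4, 2.0, 2.3]
lo, hi = bca_interval(sample, mean)
print(lo < mean(sample) < hi)  # True
```

When ẑ₀ = 0 and â = 0, `adj(z)` reduces to Φ(z), and the limits collapse to the ordinary percentile interval, matching the remark above.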
Kelley and Cheng (2012) recommended the ci.reliability() function in the MBESS package in R for computing BCa bootstrap CIs for coefficient ω. This approach was applied in my study to construct the BCa bootstrap CI for ω.
A Comparison among Three Interval Estimation Methods
This section discusses the similarities and differences among the three methods for confidence interval construction, organized around the following six aspects.
Statistical Test
The Wald and likelihood methods involve statistical tests. The Wald CI is constructed based on the Wald statistic, θ̂/SE(θ̂), which asymptotically follows a normal distribution. The construction of the likelihood-based CI uses the LR statistic, which asymptotically follows a chi-square distribution with degrees of freedom equal to g. Bootstrapping procedures, in contrast, resemble randomization tests and do not require any parametric statistical test.
Statistics Applied
The Wald and likelihood methods are associated with statistical tests. Differences and relations exist between the Wald statistic and the likelihood ratio (LR) statistic. Squaring the Wald statistic, θ̂/SE(θ̂), yields another version of the Wald statistic, (θ̂/SE(θ̂))², which asymptotically follows a chi-square distribution (Enders, 2010). This version of the Wald statistic (W) is "the quadratic approximation of the LR statistic by using the second-order Taylor's expansion of the log-likelihood function around the MLE" (Cheung, 2009b, p. 270); specifically,

LogL(θ) ≈ LogL(θ̂) − (1/2) I(θ̂)(θ̂ − θ)²,  (38)

I(θ̂)(θ̂ − θ)² ≈ 2LogL(θ̂) − 2LogL(θ),  (39)

where I(θ̂) is the information matrix, obtained by adding a negative sign to the second derivative of LogL(θ) evaluated at θ̂. Taking the square root of the inverse of the information matrix yields the standard error of the parameter, so

I(θ̂)(θ̂ − θ)² = ((θ̂ − θ)/SE(θ̂))² = W,  (40)

LR = 2LogL(θ̂) − 2LogL(θ),  (41)

W ≈ LR.  (42)
Therefore, the two statistics are asymptotically equivalent, both following a chi-square distribution (Buse, 1982; Cheung, 2009b). In some special cases the Wald and LR statistics are exactly the same; for example, when the log-likelihood function is quadratic, as in linear models (Buse, 1982).
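This exact agreement in the quadratic case can be checked numerically. The following Python snippet (an illustration of mine, not part of the thesis) uses the mean of a normal distribution with known variance, for which the log-likelihood is exactly quadratic:

```python
import math

# Normal model with known sigma: logL(mu) is quadratic in mu, so the
# squared Wald statistic W and the LR statistic agree exactly (Buse, 1982).
x = [4.2, 5.1, 3.8, 4.9, 5.3, 4.4]
sigma = 1.0
mu0 = 4.0                      # null-hypothesis value
n = len(x)
mu_hat = sum(x) / n            # MLE of the mean

def loglik(mu):
    return sum(-0.5 * math.log(2 * math.pi * sigma**2)
               - (xi - mu) ** 2 / (2 * sigma**2) for xi in x)

se = sigma / math.sqrt(n)                       # SE of the MLE
W = ((mu_hat - mu0) / se) ** 2                  # squared Wald statistic, eq. (40)
LR = 2 * loglik(mu_hat) - 2 * loglik(mu0)       # LR statistic, eq. (41)
print(abs(W - LR) < 1e-9)  # True
```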
However, differences also exist between the Wald and LR statistics. First, as the log-likelihood function departs from a quadratic shape, inequalities between the two statistics appear in finite samples (Buse, 1982). The likelihood function for coefficient ω is a profile likelihood function. Because "each point on the profile likelihood function is the maximum value of a likelihood function" (Patterson, 2014, p. 1) and the form of the profile likelihood function for coefficient ω is not explicitly known, it is unclear how the estimated confidence intervals from the Wald and likelihood methods will differ. Second, the LR test requires the
estimation of two models, a constrained model and an unconstrained model, and the comparison
of model fit between them, while the Wald test is based on the estimation of one model and is
performed by testing the null hypothesis that a set of parameters of interest are simultaneously
equal to some constant in the population, usually zero.
Assumption of Multivariate Normality
The Wald and likelihood methods are parametric, while the BCa bootstrap method is
non-parametric and computation-intensive. One major assumption for parametric techniques (the
Wald and likelihood) is multivariate normality, which does not always hold in practice. A major
advantage of the BCa bootstrap method is that it does not depend on this assumption, because it
is a bootstrap resampling technique. Bootstrap confidence intervals are directly constructed using
the empirical distribution of the parameter estimates. When the multivariate normality
assumption is violated, the bootstrap CI might be more accurate than the Wald CI and likelihood-
based CI.
Symmetry
The Wald CI, which is constructed based on the asymptotically normal distribution, is
symmetric around the ML estimate. However, symmetry may be inappropriate, especially when parameter estimates are near the boundaries of the parameter space (Neale & Miller, 1997); in such cases, a symmetric confidence interval can include values outside the meaningful range of the parameter. The likelihood-based CI, by contrast, uses the log-likelihood function directly and can be asymmetric (Cheung, 2007, 2009b). Because the BCa bootstrap CI is based on the empirical distribution of the parameter estimates, it can also be asymmetric.
Sample Size
Given the nature of the associated test statistics, both the Wald and likelihood methods require a large sample size to estimate the confidence interval accurately. The BCa bootstrap method, which relies on resampling, has a relatively relaxed sample-size requirement and may therefore yield sounder confidence intervals than the other two methods for relatively small samples. However, bootstrap techniques do not work well (e.g., they consistently overestimate standard errors) if the sample size is too small (e.g., 100); for sample sizes larger than 200, bootstrap techniques are favored (Nevitt & Hancock, 2001).
Variance/Invariance to Parameter Transformation
CIs constructed with the Wald method are not invariant to transformations of parameters (i.e., equivalent reparameterizations of a model) (Neale & Miller, 1997), which is a major concern in SEM because many equivalent models with different parameterizations exist (Cheung, 2009). By contrast, the likelihood-based CI is invariant to
transformations (Neale & Miller, 1997). Several percentile bootstrap methods are
parameterization invariant, such as the simple percentile bootstrap method, the bias-corrected
(BC) percentile method, and the BCa bootstrap method (DiCiccio & Romano, 1988).
Below is a brief summary of the three methods based on the above discussion:
a. The Wald and likelihood methods depend on the Wald and likelihood ratio statistics, respectively, while the BCa bootstrap method does not use a parametric statistical test.
b. Both the Wald and likelihood methods assume multivariate normality of the variables, while the BCa bootstrap method does not have this requirement.
c. CIs constructed with the likelihood and BCa bootstrap methods can be asymmetric, while CIs formed with the Wald method are symmetric.
d. Both the Wald and likelihood techniques require a large sample size to accurately estimate the CI, while the BCa bootstrap technique has a relatively relaxed sample-size requirement, as long as the sample is not too small (> 100).
e. CIs constructed through the likelihood and BCa bootstrap methods are invariant to transformations of model parameters, while CIs constructed through the Wald method are not.
Previous Research
The three methods discussed above have been used to construct CIs for various statistics
and psychometric indices in empirical studies. They have also been implemented in several
statistical packages via the SEM approach. Some authors also have provided code and syntax for
these methods in SEM applications. However, a very limited number of studies could be found
on the comparison of the empirical performance of these methods. A few examples of such
studies are presented below.
Cheung (2007) used simulation studies to examine the performance of four methods for
CI construction for mediating effects. The four methods were the Wald, percentile bootstrap,
bias-corrected bootstrap, and likelihood-based methods. The results showed that, with large
mediating effects or large sample sizes, these methods performed equally well in terms of the
coverage probability. For small mediating effects and small sample sizes, the bootstrap CI and likelihood-based CI were recommended (Cheung, 2007). Cheung (2009b) illustrated how to form Wald CIs and likelihood-based CIs for many statistics and psychometric indices, including dependent correlation coefficients, squared multiple R, standardized regression coefficients, mediating effects, and reliability estimates. He first compared the likelihood-based and Wald CIs constructed with the SEM approach based on a real data set; the results of these empirical analyses suggested that the two methods were comparable.
In the same paper, Cheung (2009b) presented results from a simulation study on the
Pearson correlation and compared coverage probabilities and interval widths from these two methods. The simulation study showed that the likelihood-based CI outperformed the Wald CI in small samples (Cheung, 2009b). In another study, Cheung (2009a) compared six Wald CIs
(Sobel-fixed, Aroian-fixed, Sobel-random, Aroian-random, Bobko-Rieck, and SEM-Wald), three bootstrap CIs (naive bootstrap, percentile bootstrap, and BC bootstrap), the likelihood-based CI, and the PRODCLIN CI on standardized indirect effects through a simulation study. The results showed that the percentile bootstrap, BC bootstrap, and likelihood-based methods performed best in terms of coverage probability (Cheung, 2009a). Ye and Wen (2011) conducted a simulation study to compare three interval estimation methods for coefficient ω: bootstrap, delta (i.e., the Wald method), and direct use of the model parameters estimated by LISREL (i.e., the point estimate and its standard error). However, it is not clear which kind of bootstrap method was chosen in their study, and the method using parameter estimates directly from LISREL was not clearly described either. Kelley and Cheng (2012) compared the Wald CI and BCa bootstrap CI for coefficient ω using an empirical example and recommended the BCa bootstrap approach. Padilla and Divers (2013) investigated three different bootstrap CIs for coefficient ω in a simulation
study. The study assessed the performance of the normal theory bootstrap (NTB) CI, the percentile-based (PB) CI, and the BCa bootstrap CI for coefficient ω under four simulation factors (number of items, correlation type, number of item response categories, and sample size). Categorical items with symmetric distributions were investigated in their study. Although the study compared three bootstrap CIs for coefficient ω, no comparison was made between bootstrap and non-bootstrap methods.
The Rationale and Purpose of the Proposed Study
It is noteworthy that a gap exists in the current literature on the evaluation of the performance of the three methods (i.e., the Wald, likelihood-based, and bias-corrected and accelerated bootstrap methods): no simulation study has been conducted to evaluate these three approaches to interval construction for coefficient ω. Given that various factors may affect the performance of CI construction methods for coefficient ω (e.g., data distribution, sample size, and variable correlation), it is very hard to identify the practical differences among these methods with numerical illustrations alone; numerical examples are mainly used for illustrative purposes (Yung & Bentler, 1996). To investigate how these CI construction methods work in different settings, a simulation study was conducted.
The current simulation study aims to compare the three approaches to constructing CIs for coefficient ω. By comparing the performance of these approaches under different conditions, I hope the findings will help applied researchers choose an appropriate method for estimating reliability.
Based on my literature review, I have the following expectations:
1. When data are multivariate normally distributed, the three methods will perform equally well in CI estimation for coefficient ω.
2. When multivariate normality does not hold, the BCa bootstrap method will perform better than the other two methods, while the performance of the likelihood method will be comparable to that of the Wald method.
3. As the degree of nonnormality increases, the Wald and likelihood methods for CI estimation for coefficient ω will perform worse, while the BCa bootstrap method will be robust to departures from normality.
4. As sample size increases, the constructed CIs for coefficient ω should become more accurate for all three methods in terms of interval width.
5. Based on my review, the number of items and the factor loadings in a test should not affect the performance of the three interval estimation methods. However, I manipulated these two conditions to explore potential relationships.
METHODS
According to Yung and Bentler (1996), for the purpose of evaluating the appropriateness
of a method, simulation studies make more sense than empirical studies. A Monte Carlo study is an
alternative technique to investigate problems when analytical methods are not available
(Bandalos & Leite, 2013). It can also be used to “provide insight into the behavior of a statistic
even when mathematical proofs of the problem being studied are available”, because “theoretical
properties of estimators do not always hold under real data conditions" (Bandalos & Leite, 2013, p. 627). Given that the comparison of these three methods for constructing CIs for coefficient ω
cannot be achieved through analytical approaches directly and also because the assumptions
required by these methods may not be met in real settings, a simulation study is chosen to
investigate the performance of the three CI estimation methods.
Four factors were manipulated in the simulation design: sample size, number of items, factor loadings, and degree of nonnormality. As indicated by the derivation of coefficient ω, the number of items affects the magnitude of the reliability coefficient. Sample size is included in the study design because: (1) the ML estimator is based on asymptotic theory; (2) the Wald and LR statistics are sensitive to sample size; and (3) given that the performance of bootstrap techniques is also subject to sample size, the design allows me to examine the effect of sample size on the BCa bootstrap approach to interval estimation of coefficient ω. Degree of
nonnormality was considered in the study design because the Wald and likelihood methods are
supposed to work well under the multivariate normality assumption while the bootstrap
technique does not require this assumption. “Strictly speaking, test scores are seldom normally
distributed” (Nunnally, 1978, p. 160). Micceri (1989) investigated 440 distributions, which
represented most types of distributions found in applied settings. The results showed that all
distributions were significantly nonnormal at the alpha level of .01. Therefore, I am interested in examining the performance of the different CI construction methods for coefficient ω under different types of distributions. For the one-factor CFA model, the sizes of the factor loadings directly affect the population correlations among items as well as the reliability coefficient; therefore, factor loading size was also included as a design factor. Below I describe the design factors in the simulation study in detail.
Design Factors
Four design factors were manipulated in the study: sample size, number of items, factor loading, and degree of nonnormality.
a. Sample size. Three levels of sample size were investigated: 100, 300, and 500. These levels were chosen to represent small to moderately large samples in factor analysis (Comrey & Lee, 1992).
b. Number of items. Two levels were considered for the number of items: 6 and 12.
c. Factor loading. The three sets of factor loadings for the 6-item congeneric model were .3, .3, .3, .7, .7, .7; .4, .4, .4, .8, .8, .8; and .6, .6, .6, .9, .9, .9. The sizes of the factor loadings for the 12-item congeneric model were the same as those for the 6-item model, but the number of items with each loading was doubled. The error variance of each item was set to one minus the square of the factor loading, so the variance of each observed score was one. These levels of factor loadings were selected for applied relevance: the widely accepted level of reliability for measures in most applied settings is .70 or above (Lance et al., 2006; Nunnally, 1978), and the resulting population reliabilities under normal distributions with these loadings are very close to or higher than this cutoff (see Table 3).
d. Degree of nonnormality. Degree of nonnormality is quantified by two statistics: skewness and kurtosis. In my study, I considered the situation where the nonnormality of the observed scores is due to the nonnormality of the factor scores. Three combinations of skewness and kurtosis for the factor scores were considered: a normal distribution (Sk=0, K=0), a moderately nonnormal distribution (Sk=2, K=7), and a severely nonnormal distribution (Sk=3, K=21). More details about the selection of skewness and kurtosis are provided in the next section.
A total of 54 conditions were considered in data generation, created by fully crossing the four factors (3 levels of sample size × 2 levels of item count × 3 levels of factor loadings × 3 levels of nonnormality). For each condition, 2,000 data sets were generated, resulting in a total of 108,000 (2,000 × 54) data sets. For each dataset, the Wald, likelihood, and BCa bootstrap methods were used to construct a 95% Wald CI, a 95% likelihood-based CI, and a 95% BCa bootstrap CI for coefficient ω, respectively.
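As a quick check of the design arithmetic, the crossing can be enumerated directly (the level labels below are my own shorthand):

```python
from itertools import product

sample_sizes = (100, 300, 500)
item_counts = (6, 12)
loading_sets = ("low", "medium", "high")
nonnormality = ("Sk0K0", "Sk2K7", "Sk3K21")

# Fully crossed design: 3 x 2 x 3 x 3 = 54 conditions, 2000 replications each.
conditions = list(product(sample_sizes, item_counts, loading_sets, nonnormality))
print(len(conditions), len(conditions) * 2000)  # 54 108000
```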
Data Generation Procedure
Data were generated in R 2.15.0 (R Core Team, 2012) based on the specified conditions.
An example of the R code for generating data is provided in Appendix A. Specifically, factor scores following a standard normal distribution (i.e., a normal distribution with mean zero and standard deviation one, N(0, 1)) were first generated. Fleishman's (1978) power transformation method was then used to transform the normal factor scores into nonnormal scores with the prespecified skewness and kurtosis (e.g., Fan & Fan, 2005):

Y = a + bX + cX² + dX³,  (43)
where Y is the transformed nonnormal variable obtained via Fleishman's procedure; X is the normally distributed variable; and a, b, c, and d are transformation coefficients (with a = −c) corresponding to different degrees of skewness and kurtosis. Table 1 presents the transformation coefficients for the three combinations of skewness and kurtosis for factor scores. Fleishman (1978) provided these coefficients for various levels of skewness and kurtosis; SAS code for obtaining the coefficients is also available in Fan and Fan (2005).
Table 1: Transformation Coefficients Corresponding to Three Types of Distribution

Skewness   Kurtosis   b              c              d
0          0          1              0              0
2          7          0.761585275    0.260022598    0.053072274
3          21         -0.681632225   0.637118193    0.148741042

Note: the constant a equals −c.
After obtaining nonnormal factor scores with the above procedure, random error scores following a standard normal distribution were generated for each item. Observed item scores were then obtained as a linear combination of the common factor and error scores, weighted by the corresponding factor loading and the standard deviation of the error scores, according to the one-factor CFA model (e.g., Bernstein & Teng, 1989), via

X_ij = λ_i F_j + √(1 − λ_i²) E_ij,  (44)

where X_ij is the observed score for person j on item i; λ_i is the factor loading for item i; F_j denotes the factor score of person j; E_ij denotes the random error score for person j on item i; and √(1 − λ_i²) represents the standard deviation of the error score associated with item i. Item scores generated in this way were nonnormally distributed.
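The two generation steps, Fleishman's transformation in equation (43) followed by equation (44), can be sketched in Python. The loading value and sample size below are illustrative choices of mine; the coefficients are the Sk = 2, K = 7 row of Table 1:

```python
import random

# Fleishman (1978) coefficients for skewness 2, kurtosis 7 (Table 1; a = -c).
b, c, d = 0.761585275, 0.260022598, 0.053072274
a = -c

def item_scores(n, loading, seed=7):
    """Generate n scores for one item under the one-factor model (eq. 44)."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n):
        z = rng.gauss(0, 1)
        f = a + b * z + c * z**2 + d * z**3        # eq. (43): nonnormal factor score
        e = rng.gauss(0, 1)                         # standard normal error score
        scores.append(loading * f + (1 - loading**2) ** 0.5 * e)  # eq. (44)
    return scores

xs = item_scores(100_000, loading=0.7)
mean = sum(xs) / len(xs)
var = sum((x - mean) ** 2 for x in xs) / len(xs)
# Fleishman's coefficients preserve the mean 0 and variance 1 of the factor
# scores, so the observed scores also have mean ~0 and variance ~1.
print(round(mean, 1), round(var, 1))  # 0.0 1.0
```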
Based on the data generation procedure presented above, the skewness and kurtosis of
observed item scores are a function of the skewness and kurtosis of factor scores and
corresponding factor loadings, and thus they are different from the skewness and kurtosis of the
factor scores. Considering the distributions of observed scores most commonly encountered in
practical settings, three different combinations of skewness and kurtosis for factor scores were
carefully selected (i.e., Sk=0, K=0; Sk=2, K=7; and Sk=3, K=21). Skewness and kurtosis of
observed item scores corresponding to these three combinations were calculated and are
presented in Table 2. The skewness of observed item scores ranged from 0.06 to 1.45 and from
0.08 to 2.16, when the skewness of factor scores was 2 and 3, respectively. The kurtosis of
observed item scores ranged from 0.07 to 4.55 and from 0.16 to 12.79, when the kurtosis of
factor scores was 7 and 21, respectively. These values represented low level of skewness and low
to high level of kurtosis in the observed variables. To graphically demonstrate the population
distributions with these levels of skewness and kurtosis, a large dataset with sample size of
1,000,000 was generated for each condition. Histograms for nonnormal distributions were
plotted and displayed in Figure 3 and Figure 4 (see Appendix C).
Data Analysis
CI construction with the three approaches was implemented in R. For each generated dataset, a 95% Wald CI and a 95% BCa bootstrap CI for coefficient ω were estimated using the MBESS package, and a 95% likelihood-based CI was formed using the OpenMx package. To evaluate the performance of the three interval estimation methods, two criteria were used: coverage probability and interval width. The coverage probability and the mean of the interval widths were calculated for each condition.

Table 2: Skewness and Kurtosis of Observed Item Scores

                 Sk=0, K=0      Sk=2, K=7      Sk=3, K=21
Factor loading   Sk     K       Sk     K       Sk      K
.3               0      0       .05    .06     .08     .17
.4               0      0       .13    .18     .19     .50
.6               0      0       .43    .91     .63     2.54
.7               0      0       .68    1.69    1.00    4.66
.8               0      0       1.02   2.88    1.51    8.02
.9               0      0       1.45   4.60    2.15    13.03

Note: the column headers give the skewness and kurtosis of the factor scores.

Although the focus of this study was on
CI construction, point estimates were also reported as supplemental evidence for evaluating the
performance of the three CI construction methods. For each point estimate, I computed the
relative bias using the following formula:
Relative bias = (ω̄ − ω)/ω,  (45)

where ω̄ and ω denote the mean of the ω̂ point estimates and the population coefficient ω, respectively. To examine the accuracy of the point estimates of coefficient ω under the three methods, Hoogland and Boomsma's (1998) criterion was employed. According to their criterion, relative biases with absolute values smaller than .05 are acceptable (Hoogland & Boomsma, 1998).
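A minimal sketch of equation (45) and the acceptance check (the numbers below are hypothetical, not study results):

```python
def relative_bias(mean_estimate, population_value):
    """Equation (45): (mean of point estimates - population value) / population value."""
    return (mean_estimate - population_value) / population_value

# Hoogland and Boomsma's (1998) cutoff: |relative bias| < .05 is acceptable.
rb = relative_bias(0.770, 0.783)   # hypothetical mean omega-hat vs. population omega
print(abs(rb) < 0.05)  # True
```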
Coverage probability in this simulation study is defined as "the proportion of estimated CIs that contain the true population coefficient ω" (Padilla & Divers, 2013, p. 82). If the CIs show consistently acceptable coverage across the simulation conditions, the method used to form them performs well. Bradley's (1978) liberal criterion was used to determine acceptable coverage probability; it is defined as 0.5α ≤ α̂ ≤ 1.5α, where α̂ is the empirical Type I error rate and α is the predetermined Type I error rate. This criterion has been employed in previous simulation studies (e.g., Nevitt & Hancock, 2001; Padilla & Divers, 2013). Based on this criterion, the acceptable coverage for a 95% CI (i.e., a Type I error rate of .05) ranges from .925 to .975 (e.g., Padilla & Divers, 2013). Because the power transformation was performed only on the factor scores, before equation (44) was used to obtain the observed item scores, the correlations among the observed items under the normal distribution should be the same as those under the nonnormal distributions. Consequently, the population coefficient ω is the same across the conditions with different degrees of nonnormality. Therefore, the population coefficient ω under
of point estimates) for the conditions with nonnormal distribution. To empirically check it, three
large datasets with sample size equal to 1,000,000 were generated for each condition with a
nonnormal distribution (i.e., Sk=2 and K=7, and Sk=3 and K=21 for factor scores). Coefficient
was calculated based on each dataset and the mean of the coefficient estimates obtained from
the three large datasets was used to approximate the population under that condition. The
values of coefficient under nonnormal distributions were found to be identical to those under
the normal distribution to the third decimal place. Interval width, which is the difference between
the upper and lower CI limits ( , suggests “the precision of the parameter estimates”
(Cheung, 2009b, p. 282). The mean ̂ of ̂ interval widths for each condition was calculated. When
the coverage probabilities are the same, the CIs with a smaller interval width are more accurate
and desirable.
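The two evaluation criteria can be sketched together; the intervals below are made-up toy values, not results from the study:

```python
def evaluate(cis, omega_pop, alpha=0.05):
    """Coverage probability, Bradley's liberal-criterion check, and mean width."""
    contains = [lo <= omega_pop <= hi for lo, hi in cis]
    coverage = sum(contains) / len(cis)
    # Bradley's (1978) liberal criterion: empirical Type I error within
    # [0.5*alpha, 1.5*alpha], i.e., coverage in [.925, .975] for a 95% CI.
    acceptable = 1 - 1.5 * alpha <= coverage <= 1 - 0.5 * alpha
    mean_width = sum(hi - lo for lo, hi in cis) / len(cis)
    return coverage, acceptable, mean_width

# Toy example: 20 intervals around a population omega of .783; one misses.
cis = [(0.70, 0.85)] * 19 + [(0.80, 0.90)]
cov, ok, width = evaluate(cis, 0.783)
print(cov, ok, round(width, 4))  # 0.95 True 0.1475
```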
To further explore the effects of the four design factors (sample size, degree of nonnormality, factor loading, and number of items) on the performance of the three interval estimation methods (Wald, likelihood, and BCa bootstrap), two statistical tests were conducted in SAS version 9.3 (SAS Institute Inc., 2011): logistic regression and a 5-way ANOVA. The five independent variables for the two tests were sample size, degree of nonnormality, factor loading, number of items, and CI estimation method. The assumption of independence among observations required for regression and ANOVA was clearly violated in this study; however, I conducted the logistic regression and ANOVA for heuristic purposes. For the logistic regression, the binary outcome variable indicated whether an estimated CI covered the population coefficient ω ("1" = yes; "0" = no). Fifteen predictors were specified: sample size, degree of nonnormality, factor loading, number of items, CI estimation method, and the two-way interactions between each pair of them. The likelihood-based pseudo R-square for the full model (i.e., the model with all fifteen predictors) was requested from SAS. To examine the contribution of an individual predictor to the improvement of model fit, partialling out the other predictors, the partial pseudo R-square was calculated as the difference between the pseudo R-squares of the full model and the reduced model (i.e., the model excluding that predictor). For the ANOVA, the dependent variable was interval width.

Table 3: Population Coefficient ω under Different Conditions

Number of items   Factor loadings   Population coefficient ω
6                 .3/.7             .679
6                 .4/.8             .783
6                 .6/.9             .891
12                .3/.7             .809
12                .4/.8             .878
12                .6/.9             .942
Given that the sample size of each group in the statistical tests was considerably large (2,000 for most groups) and statistical tests tend to be significant with large samples (Alhija & Wisenbaker, 2006), I calculated partial eta squared to evaluate the importance of each individual predictor (i.e., the four design factors and interval estimation method, plus all two-way interaction effects). Partial eta squared is interpreted as the proportion of variance in the outcome variable attributable to a single predictor, partialling out the other predictors.
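As a sketch of this definition (the sums of squares below are made up, not values from the SAS output):

```python
def partial_eta_squared(ss_effect, ss_error):
    """Proportion of outcome variance attributable to one predictor,
    partialling out the others: SS_effect / (SS_effect + SS_error)."""
    return ss_effect / (ss_effect + ss_error)

# Hypothetical sums of squares for one design factor from an ANOVA table.
print(round(partial_eta_squared(12.0, 48.0), 2))  # 0.2
```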
RESULTS
Fifty-four conditions were yielded by fully crossing all factors: sample size (100, 300, and 500), number of items (6 and 12), factor loadings (small, medium, and large), and degree of nonnormality of the factor scores (Sk=0 and K=0, Sk=2 and K=7, and Sk=3 and K=21). Population coefficient ω values were calculated and are shown in Table 3. For each generated dataset, three types of CIs (the Wald CI, likelihood-based CI, and BCa bootstrap CI) were constructed for coefficient ω using ML estimation. To evaluate the performance of the three interval estimation methods (the Wald, likelihood, and BCa bootstrap methods), coverage probability, interval width, and relative bias of the point estimates were calculated for each condition. Logistic regression and a 5-way ANOVA were conducted to examine the effects of the design factors on the coverage probability and interval width of the constructed CIs, respectively. Results are reported and summarized in this chapter.
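The population values in Table 3 follow from the one-factor formula for coefficient ω, ω = (Σλᵢ)² / [(Σλᵢ)² + Σ(1 − λᵢ²)], with error variances of 1 − λᵢ² as specified in the data generation. A short Python check (my own sketch) reproduces Table 3:

```python
def omega(loadings):
    """Population coefficient omega for a one-factor model with
    error variances theta_i = 1 - lambda_i**2 (unit item variances)."""
    s = sum(loadings)
    theta = sum(1 - l ** 2 for l in loadings)
    return s ** 2 / (s ** 2 + theta)

for k in (3, 6):  # 6-item and 12-item models (k items per loading value)
    for low, high in ((.3, .7), (.4, .8), (.6, .9)):
        lam = [low] * k + [high] * k
        print(len(lam), round(omega(lam), 3))
```

Running this prints the six values in Table 3 (.679, .783, .891 for 6 items; .809, .878, .942 for 12 items).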
Non-convergence
The non-convergence issue occurred only in the condition with nonnormally distributed data (Sk=3 and K=21), small loadings (.3/.7), and small sample size (100), and only when the Wald and BCa bootstrap methods were used. Among the 2,000 replications, five failed to converge and were excluded from further data analysis.
Coverage Probability
Table 4 reports the coverage probabilities of the 95% CIs for coefficient ω yielded by the three methods. Several consistent observations can be made from this table.
When the data were multivariate normally distributed, all coverage probabilities were acceptable according to Bradley's (1978) liberal criterion (i.e., between 92.5% and 97.5%) and
very close to the prespecified .95 confidence level across all conditions. Figure 5 in Appendix D
Table 4: Coverage Probability for Each Condition

                                      Coverage probability (%)
                                  6 items                 12 items
Sk & K      Loading   N       Wald    LL      BCa     Wald    LL      BCa
Sk=0, K=0   .3/.7     100     94.90   95.40   94.10   93.90   94.80   94.05
                      300     94.70   94.80   94.40   95.20   95.60   95.10
                      500     95.70   95.60   95.20   93.65   93.60   93.35
            .4/.8     100     94.55   95.30   94.65   94.40   94.55   94.25
                      300     95.40   95.85   95.40   95.15   95.55   95.55
                      500     95.80   95.60   95.40   94.90   95.40   95.25
            .6/.9     100     96.25   95.70   94.60   94.25   94.50   94.35
                      300     94.65   94.65   94.80   95.10   95.05   94.50
                      500     94.15   94.30   94.30   94.65   94.60   94.65
Sk=2, K=7   .3/.7     100     80.75   80.40   89.80   78.35   78.25   89.80
                      300     80.95   81.00   92.00   74.45   74.65   92.35
                      500     81.00   80.75   93.55   75.10   75.10   92.80
            .4/.8     100     77.25   78.00   89.85   75.10   74.50   88.85
                      300     78.00   77.20   92.10   71.85   71.95   92.10
                      500     76.15   76.15   93.15   71.95   71.75   92.70
            .6/.9     100     76.95   75.75   89.75   70.25   68.80   88.50
                      300     73.15   72.50   91.10   70.40   70.45   91.75
                      500     72.90   72.20   91.30   70.40   68.60   91.85
Sk=3, K=21  .3/.7     100     77.20   75.45   87.30   70.30   67.65   84.95
                      300     69.00   68.20   87.90   62.15   61.15   85.95
                      500     64.60   63.55   88.05   59.60   59.15   86.70
            .4/.8     100     73.50   70.90   86.25   64.35   61.95   84.50
                      300     64.30   63.45   86.45   56.55   55.40   85.40
                      500     63.05   62.25   87.45   56.95   55.50   87.55
            .6/.9     100     67.25   64.05   84.70   61.65   59.25   83.35
                      300     57.25   56.85   85.05   53.65   52.60   85.30
                      500     58.15   57.75   87.30   53.70   53.20   88.00

Note: Sk & K denote the skewness and kurtosis of the factor scores. Wald, LL, and BCa stand for the Wald method, likelihood method, and BCa bootstrap method, respectively. In the original table, coverage probabilities below 92.50% were bolded and italicized.
52 shows that the three lines representing coverage probabilities for the three methods were almost overlapped with each other and were very close to the .95 reference line. This indicated that the three methods performed equally well with normal data, in terms of coverage probability.
When the data were nonnormally distributed, none of the coverage probabilities reached .95, and only a few were acceptable. These acceptable coverage probabilities occurred when the BCa bootstrap method was used, in conditions with a large sample size (500), small or moderate sets of factor loadings (.3/.7 and .4/.8), and factor-score skewness and kurtosis of 2 and 7, respectively. None of the three methods performed well with nonnormal data, given that most coverage probabilities fell below the lower bound of the acceptable range (92.5%). Nevertheless, the BCa bootstrap method clearly outperformed the other two methods: its coverage probabilities tended to be higher than those of the Wald and likelihood methods. As shown in Figure 6 and Figure 7 (see Appendix D), the line representing coverage probabilities for the BCa bootstrap method lay above the lines for the Wald and likelihood methods. In addition, coverage probabilities decreased as sample size increased when the Wald and likelihood methods were used, whereas they tended to increase with sample size when the BCa bootstrap method was used.
As evidenced by the results of the logistic regression for coverage probability in Table 5, the partial pseudo R-square associated with degree of nonnormality was .02 (the pseudo R-square of the full model was .183 and that of the reduced model, excluding degree of nonnormality, was .162). Although small in magnitude, this was the largest partial pseudo R-square among all the predictors (see Table 5). As degree of nonnormality increased, coverage probabilities decreased across all conditions, regardless of estimation method (see Table 4 and Figures 5, 6, and 7 in Appendix D). The second largest partial pseudo R-square, .015, was attributable to interval estimation method. The third largest, .008, was due to the interaction between interval estimation method and degree of nonnormality, indicating that the advantage of the BCa method over the other two methods became more pronounced as the data departed further from normality. The remaining predictors and interaction effects contributed little to the model (most had partial pseudo R-squares below .005).
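A partial pseudo R-square of this kind is the drop in the model's pseudo R-square when one predictor is removed. The thesis does not state which pseudo R-square was used; the sketch below assumes McFadden's definition, and the log-likelihood values are hypothetical numbers chosen only to reproduce the reported .183 and .162:

```python
def mcfadden_pseudo_r2(ll_model, ll_null):
    """McFadden's pseudo R-square: 1 - LL(model) / LL(null)."""
    return 1.0 - ll_model / ll_null

def partial_pseudo_r2(ll_full, ll_reduced, ll_null):
    """Drop in pseudo R-square when one predictor is removed from the full model."""
    return mcfadden_pseudo_r2(ll_full, ll_null) - mcfadden_pseudo_r2(ll_reduced, ll_null)

# Hypothetical log-likelihoods giving the reported pseudo R-squares of .183 and .162.
ll_null, ll_full, ll_reduced = -1000.0, -817.0, -838.0
print(round(partial_pseudo_r2(ll_full, ll_reduced, ll_null), 3))  # 0.021
```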
Table 5: Results from Logistic Regression and ANOVA.

                                Partial pseudo R-square      Partial eta squared
                                from logistic regression     from ANOVA for
Main and interaction effects    for coverage probability     interval width
Method                          .015                         .06
Sample size                     .001                         .24
Nonnormality                    .020                         .02
Loading                         .001                         .31
#Item                           .001                         .15
Method * Sample size            .003                         .00
Method * Nonnormality           .008                         .03
Method * Loading                .001                         .00
Method * #Item                  .001                         .00
Sample size * Nonnormality      .001                         .00
Sample size * Loading           0                            .04
Sample size * #Item             0                            .02
Nonnormality * Loading          .001                         .00
Nonnormality * #Item            .001                         .00
Loading * #Item                 0                            .01
Note: #Item means the number of items.
Interval Width
The mean interval width across all legitimate replications for each condition is reported in Table 6 and Table 7, along with the means of the upper and lower bounds of the CIs. When the data were multivariate normal, interval widths for the three CI construction methods were fairly similar across all conditions. Figures 8, 9, and 10 in Appendix E show that, with multivariate normally distributed data, the three lines indicating interval widths for the three methods almost overlapped, suggesting that the performance of the three methods was comparable for multivariate normal data in terms of interval width. When the data were nonnormally distributed, interval widths for the Wald and likelihood methods were comparable with each other but much narrower than those for the BCa bootstrap method.
Hence, although the BCa bootstrap method performed better than the other two methods in terms of coverage probability, its wider intervals reflected a greater degree of uncertainty in estimating the population parameter. To determine which method produced the more accurate interval width, I approximated population CIs for coefficient omega in the different conditions. For each condition, 20,000 datasets were generated and the sample coefficient omega was calculated for each dataset. The resulting point estimates were then sorted by size, and the 2.5th and 97.5th percentiles were taken as the estimated lower and upper limits of the population CI in that condition. Results showed that when data were nonnormally distributed, interval widths estimated with the BCa bootstrap method were closer to the population values, whereas interval widths from the Wald and likelihood methods were negatively biased. To conclude, although the intervals estimated with the BCa bootstrap method were wider than those from the other two methods, they were more accurate. Figure 11 (see Appendix F) exhibits the means of the CI bounds (the mean upper bound and the mean lower bound) and the corresponding population coefficient omega in the different conditions. The population coefficient omega was located equally well within the CIs for all three methods (i.e., close to the middle of the interval rather than near either bound).
Three general patterns can be observed from Table 6 and Table 7, regardless of the estimation method: interval widths became narrower as (1) factor loadings became higher (see Figure 10 in Appendix E); (2) sample size became larger (see Figure 8 in Appendix E); and (3) the number of items became larger (see Figure 9 in Appendix E). As evidenced by the 5-way ANOVA shown in Table 5, the largest proportion of variance in interval widths was attributable to the main effect of loading (partial eta squared of .31), followed by sample size and number of items, with partial eta squareds of .24 and .15, respectively. Interval construction method explained 6% of the variance in interval widths, controlling for the other independent variables. In contrast, the partial eta squared for degree of nonnormality was small, accounting for only 2% of the variation in interval widths. The interaction between CI construction method and degree of nonnormality accounted for 3% of the variation in interval widths, consistent with the finding that, for nonnormal data, the Wald and likelihood intervals were comparable with each other but much narrower than the BCa bootstrap intervals.
Although not the focus of this study, three two-way interactions had nonzero partial eta squareds: sample size by factor loading (4%), sample size by number of items (2%), and factor loading by number of items (1%). All other interaction effects had nearly zero partial eta squared values. Given that the interaction between interval estimation method and sample size was negligible, sample size had similar effects on the three methods.
In other words, the BCa bootstrap method showed similar sensitivity to sample size as the other two methods.
Relative Bias of Point Estimates
As supplemental evidence for evaluating the performance of the three CI construction methods for coefficient omega, I also computed the relative bias of the point estimates for each condition. Table 8 presents these results. The absolute values of the relative biases were very close to zero, and none exceeded .03. Based on Hoogland and Boomsma's (1998) criterion that relative biases smaller than .05 indicate negligible bias, I concluded that the point estimates of coefficient omega resulting from the three methods were comparable and acceptable. Because the three methods produced similarly accurate point estimates, the comparison of their interval estimates becomes all the more important.
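Relative bias is the deviation of the mean point estimate from the population value, scaled by the population value. A sketch with hypothetical sample omegas (the values below are illustrative, not taken from the simulation):

```python
def relative_bias(estimates, theta):
    """Relative bias: (mean of point estimates - population value) / population value."""
    mean_est = sum(estimates) / len(estimates)
    return (mean_est - theta) / theta

# Toy illustration against Hoogland and Boomsma's (1998) .05 cutoff.
est = [0.670, 0.675, 0.668, 0.681]   # hypothetical sample omegas
rb = relative_bias(est, 0.679)       # population omega = .679
print(abs(rb) < 0.05)                # True: negligible bias
```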
Table 6: Interval Widths for 6 Items.

Sk & K      Loading       Sample       Wald                Likelihood          BCa bootstrap
            (pop. omega)  size    Lower Upper Width   Lower Upper Width   Lower Upper Width
Sk=0, K=0   .3/.7 (.679)  100     .579  .771  .192    .568  .762  .194    .561  .757  .197
                          300     .624  .734  .110    .620  .731  .110    .619  .730  .111
                          500     .635  .721  .085    .633  .719  .086    .632  .718  .086
            .4/.8 (.783)  100     .717  .847  .130    .709  .841  .132    .705  .838  .134
                          300     .745  .820  .075    .742  .818  .075    .742  .817  .076
                          500     .754  .812  .058    .752  .810  .058    .751  .810  .059
            .6/.9 (.891)  100     .855  .923  .068    .850  .919  .069    .849  .918  .069
                          300     .871  .909  .039    .869  .908  .039    .869  .908  .039
                          500     .875  .905  .030    .874  .904  .030    .874  .904  .030
Sk=2, K=7   .3/.7 (.679)  100     .571  .766  .195    .559  .757  .198    .521  .779  .258
                          300     .621  .732  .111    .617  .729  .111    .595  .753  .158
                          500     .635  .720  .086    .633  .718  .086    .615  .740  .125
            .4/.8 (.783)  100     .706  .841  .135    .698  .835  .137    .665  .855  .190
                          300     .740  .817  .076    .738  .815  .077    .718  .835  .117
                          500     .751  .810  .059    .750  .808  .059    .734  .827  .093
            .6/.9 (.891)  100     .849  .919  .071    .844  .916  .072    .823  .928  .106
                          300     .869  .908  .039    .867  .907  .039    .855  .919  .064
                          500     .874  .904  .030    .874  .904  .030    .864  .914  .051
Sk=3, K=21  .3/.7 (.679)  100     .561  .760  .199    .549  .751  .202    .500  .776  .276
                          300     .614  .727  .113    .610  .723  .113    .577  .762  .184
                          500     .629  .716  .087    .627  .714  .087    .599  .752  .153
            .4/.8 (.783)  100     .692  .833  .141    .683  .826  .143    .641  .849  .207
                          300     .735  .813  .078    .733  .811  .078    .704  .841  .138
                          500     .747  .806  .060    .745  .805  .060    .721  .834  .113
            .6/.9 (.891)  100     .842  .916  .074    .837  .912  .075    .808  .926  .118
                          300     .865  .906  .040    .864  .904  .040    .846  .922  .076
                          500     .872  .902  .031    .871  .902  .031    .855  .919  .064
Note: The value in parentheses is the population coefficient omega for that condition. Sk & K denote skewness and kurtosis of factor scores. Lower, Upper, and Width denote the mean lower bound of the CIs, the mean upper bound of the CIs, and the mean interval width, respectively.
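The population omega values in parentheses follow from the one-factor composite reliability formula, omega = (sum of loadings)^2 / [(sum of loadings)^2 + sum of error variances], with error variance 1 - loading^2 for the standardized items used in data generation. A sketch reproducing the tabled values:

```python
def population_omega(loadings):
    """Coefficient omega for a one-factor model with standardized items:
    omega = (sum lambda)^2 / ((sum lambda)^2 + sum (1 - lambda^2))."""
    s = sum(loadings)
    errors = sum(1 - l * l for l in loadings)
    return s * s / (s * s + errors)

# Six-item conditions from Table 6 (three items per loading value).
print(round(population_omega([.3] * 3 + [.7] * 3), 3))  # 0.679
print(round(population_omega([.4] * 3 + [.8] * 3), 3))  # 0.783
print(round(population_omega([.6] * 3 + [.9] * 3), 3))  # 0.891
```

The same function with six items per loading value reproduces the 12-item values in Table 7 (e.g., .809 for the .3/.7 condition).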
Table 7: Interval Widths for 12 Items.

Sk & K      Loading       Sample       Wald                Likelihood          BCa bootstrap
            (pop. omega)  size    Lower Upper Width   Lower Upper Width   Lower Upper Width
Sk=0, K=0   .3/.7 (.809)  100     .751  .862  .111    .745  .857  .112    .742  .855  .113
                          300     .776  .840  .064    .774  .838  .064    .773  .838  .064
                          500     .783  .833  .049    .782  .832  .049    .782  .832  .050
            .4/.8 (.878)  100     .841  .912  .071    .837  .909  .072    .835  .908  .073
                          300     .858  .899  .041    .857  .897  .041    .856  .897  .041
                          500     .862  .894  .032    .861  .893  .032    .861  .893  .032
            .6/.9 (.942)  100     .924  .958  .034    .922  .957  .035    .921  .956  .035
                          300     .932  .952  .020    .931  .951  .020    .931  .951  .020
                          500     .934  .950  .015    .934  .949  .015    .934  .949  .015
Sk=2, K=7   .3/.7 (.809)  100     .740  .856  .116    .734  .851  .117    .701  .870  .170
                          300     .773  .837  .065    .771  .836  .065    .752  .855  .103
                          500     .781  .831  .050    .780  .830  .050    .765  .847  .082
            .4/.8 (.878)  100     .831  .907  .076    .827  .903  .076    .803  .918  .114
                          300     .854  .896  .042    .854  .896  .042    .839  .909  .069
                          500     .861  .893  .032    .860  .892  .032    .849  .904  .055
            .6/.9 (.942)  100     .920  .956  .036    .918  .954  .036    .905  .962  .057
                          300     .931  .951  .020    .930  .950  .020    .923  .957  .035
                          500     .931  .951  .020    .933  .949  .015    .928  .955  .028
Sk=3, K=21  .3/.7 (.809)  100     .729  .850  .121    .722  .844  .122    .681  .866  .185
                          300     .768  .834  .066    .766  .832  .066    .739  .860  .121
                          500     .778  .828  .051    .776  .827  .051    .754  .855  .101
            .4/.8 (.878)  100     .827  .904  .078    .822  .901  .078    .790  .917  .126
                          300     .852  .894  .042    .850  .893  .043    .830  .913  .082
                          500     .858  .891  .032    .857  .890  .033    .841  .910  .069
            .6/.9 (.942)  100     .916  .954  .038    .914  .952  .038    .896  .960  .064
                          300     .929  .949  .021    .928  .949  .021    .917  .959  .041
                          500     .932  .948  .016    .932  .948  .016    .923  .958  .035
Note: The value in parentheses is the population coefficient omega for that condition. Sk & K denote skewness and kurtosis of factor scores. Lower, Upper, and Width denote the mean lower bound of the CIs, the mean upper bound of the CIs, and the mean interval width, respectively.
Table 8: Relative Bias of Point Estimates for Coefficient Omega.

                                        Relative bias
                                6 items                  12 items
Sk & K   Loading  Sample   Wald   LL    BCa         Wald   LL    BCa
                  size
Sk=0     .3/.7    100     -.006  -.006  -.006      -.002  -.002  -.002
K=0               300      .000   .000   .000      -.001  -.001  -.001
                  500     -.001  -.001  -.001      -.001  -.001  -.001
         .4/.8    100     -.001  -.001  -.001      -.001  -.001  -.001
                  300      .000   .000   .000       .000   .000   .000
                  500      .000   .000   .000       .000   .000   .000
         .6/.9    100     -.002  -.002  -.002      -.001  -.001  -.001
                  300     -.001  -.001  -.001       .000   .000   .000
                  500     -.001  -.001  -.001       .000   .000   .000
Sk=2     .3/.7    100     -.016  -.016  -.016      -.012  -.012  -.012
K=7               300     -.004  -.004  -.004      -.004  -.004  -.004
                  500     -.001  -.001  -.001      -.002  -.002  -.002
         .4/.8    100     -.011  -.011  -.011      -.010  -.010  -.010
                  300     -.005  -.005  -.005      -.003  -.003  -.003
                  500     -.004  -.004  -.004      -.001  -.001  -.001
         .6/.9    100     -.007  -.007  -.007      -.004  -.004  -.004
                  300     -.002  -.002  -.002      -.001  -.001  -.001
                  500     -.001  -.001  -.001      -.001  -.001  -.001
Sk=3     .3/.7    100     -.028  -.028  -.027      -.025  -.025  -.025
K=21              300     -.013  -.013  -.013      -.010  -.010  -.010
                  500     -.009  -.009  -.009      -.007  -.007  -.007
         .4/.8    100     -.024  -.024  -.024      -.015  -.015  -.015
                  300     -.010  -.010  -.010      -.006  -.006  -.006
                  500     -.008  -.008  -.008      -.005  -.005  -.005
         .6/.9    100     -.013  -.013  -.013      -.007  -.007  -.007
                  300     -.007  -.007  -.007      -.003  -.003  -.003
                  500     -.004  -.004  -.004      -.002  -.002  -.002
Note: Sk & K denote skewness and kurtosis of factor scores. Wald, LL, and BCa stand for the Wald, likelihood, and BCa bootstrap methods, respectively.
DISCUSSION AND CONCLUSIONS
Increasing numbers of researchers have recommended coefficient omega as a reliability estimate for composite scores (Green & Yang, 2009; Kelley & Cheng, 2012; McDonald, 1978; Raykov, 1997a, 1997b, 1998a, 1998b, 2001). Given that the information provided by a point estimate of coefficient omega is very limited, empirical researchers have been encouraged to report the more informative interval estimate of coefficient omega in their studies. However, CI estimation for coefficient omega lacks development and investigation (Padilla & Divers, 2013). Several approaches are available for constructing CIs on coefficient omega, such as the Wald method, the likelihood method, and various bootstrap methods. All of these approaches can be implemented within SEM, but comparisons of their performance have not been well addressed in the literature.
The purpose of this study was to compare and evaluate three approaches (i.e., the Wald, likelihood, and BCa bootstrap methods) to CI construction for coefficient omega in the SEM framework through a simulation study. Four design factors were manipulated: (1) sample size, (2) the number of items, (3) degree of nonnormality, and (4) factor loading. Data were generated for 54 conditions, created by fully crossing the four factors, with 2,000 datasets generated under each condition. For each dataset, three CIs for coefficient omega were constructed using the three methods. The performance of the three methods was evaluated in terms of relative bias of point estimates, coverage probability, and interval width. Logistic regression and a 5-way ANOVA were conducted to examine the related main and interaction effects. In the following sections, I summarize the major findings from the simulation study relative to the expectations laid out in Chapter 3 and then discuss those findings. Finally, I list limitations of this study as well as a few directions for future research.
Major Findings from Simulation Study
Several important findings can be drawn from the results of the current simulation study. The major findings are presented below.
First, I expected that when the data were normally distributed, the three approaches to CI construction for coefficient omega would perform equally well across all conditions. Results from the simulation study support this expectation: as Tables 4, 6, and 7 illustrate, coverage probabilities were acceptable and comparable, and interval widths were fairly similar across the three methods for normally distributed data (see also Appendix D and Appendix E).
Second, I expected that when the data were nonnormally distributed, the BCa bootstrap method would outperform the other two methods, because the BCa bootstrap is a nonparametric resampling technique, whereas the Wald and likelihood methods are parametric techniques that require multivariate normality. This expectation is supported by the results. For the Wald and likelihood methods, all coverage probabilities were far below the prespecified .95 coverage level. For the BCa bootstrap method, coverage probabilities were higher than those of the Wald and likelihood methods across all conditions, although they too fell below the prespecified .95 level. These results indicate that the BCa bootstrap method is the better choice of the three but is not robust to departures from normality. Although CIs constructed with the BCa bootstrap method conveyed greater uncertainty about the population parameter than the Wald and likelihood CIs, as evidenced by their wider intervals, the BCa bootstrap method yielded more accurate interval widths than the other two methods.
Third, as I expected, the performance of the three methods worsened as the degree of nonnormality increased. Table 4, Table 6, and Table 7 (see also Appendix D and Appendix E) show that as the degree of nonnormality increased, coverage probabilities became lower and interval widths became wider. There were also small interactions between CI construction method and degree of nonnormality (partial eta squared of .03 for interval widths and partial pseudo R-square of .008 for coverage probabilities), revealing that the degree of nonnormality had a less detrimental impact on CI construction for the BCa bootstrap method than for the Wald and likelihood methods.
Fourth, I expected that the performance of the three interval estimation methods would improve as sample size increased. This expectation is partially supported by the results. As Figure 8 (see Appendix E) displays, interval widths became narrower as sample size increased; results from the ANOVA in Table 5 show that sample size explained 24% of the variance in interval widths. However, sample size did not affect the performance of the three methods in terms of coverage probability: the partial pseudo R-square for sample size in the logistic regression was only about .001.
Last, although I had no specific expectations about the effects of factor loading and number of items on the performance of the three interval estimation methods, I was interested in exploring these two factors. As evidenced by the ANOVA results in Table 5, loading explained 31% of the variance in interval widths, the largest percentage among all main and interaction effects. Tables 6 and 7 show that as loadings became larger, interval widths became narrower; this pattern is also observable in Figure 10 in Appendix E. However, loading did not seem to affect coverage probabilities (partial pseudo R-square of only .001). From the logistic regression, coverage probabilities for the three methods were likewise insensitive to the number of items (partial pseudo R-square of .001). However, the ANOVA results show that the number of items explained a medium proportion of the variability in interval widths (partial eta squared of .15). In Tables 6 and 7, as the number of items increased, interval widths decreased across all conditions (see Figure 9 in Appendix E for graphical illustrations). These results suggest that the number of items has a considerable effect on interval width.
Discussion and Suggestions
Findings from the current study can now be compared with those of previous studies. Cheung (2009b) illustrated, with empirical examples, that the Wald CI and the likelihood-based CI are similar for reliability estimates. This is consistent with the results of the current study, in which the Wald and likelihood methods were comparable, with an odds ratio of .97 (not reported in the table) in the logistic regression analysis for coverage probability. Cheung (2009b) also showed that the likelihood-based CI outperformed the Wald CI for the Pearson correlation in small samples, whereas the current study found that the two methods performed very similarly. One possible reason for this inconsistency is that coefficient omega is a complicated function of model parameters and the profile likelihood function is only an approximation of the true likelihood function. As Cheung discussed:
Although it is suggested in this article that likelihood-based CIs are better alternatives to
Wald CIs, readers should be cautioned the likelihood-based CIs were constructed via the
numerical approximation of the profile likelihood functions… so it is possible that the
numerical approximations might fail to find the likelihood-based CIs, especially when the
parameters are complicated functions with many nonlinear constraints. (Cheung, 2009b,
p. 286)
Kelley and Cheng (2012) compared CIs for coefficient omega from the Wald and BCa bootstrap methods based on a real dataset. The estimated CIs from the two methods were fairly similar, which they speculated was because "sample size may be sufficiently large and/or multivariate normality may hold approximately" (p. 46). This result is consistent with the current simulation study: when data are multivariate normal, CIs estimated with these two methods are comparable in terms of both coverage probability and interval width.
Findings of the simulation study suggest that: (1) when the assumption of multivariate normality holds, the three methods are comparable for CI estimation of coefficient omega; and (2) when data are nonnormally distributed, the BCa bootstrap method is better than the Wald and likelihood methods, although the uncertainty of its estimates (as evidenced by the wider intervals) is greater. The selection of an appropriate method for CI construction should rest on several considerations, such as time cost and the availability of statistical packages. The BCa bootstrap method is relatively time-consuming compared with the other two methods because of its computer-intensive resampling, and it can be inconsistent across bootstrapping runs because of random resampling variability (Preacher & Selig, 2012). The Wald method is available in most statistical programs, whereas the likelihood and BCa bootstrap techniques are less accessible.
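When no packaged implementation is at hand, the BCa interval can also be coded directly from its definition: a bias-correction constant z0 taken from the bootstrap distribution and an acceleration constant a taken from jackknife estimates, which together shift the percentile levels. The sketch below is a simplified illustration for an arbitrary statistic (the sample mean, to keep it self-contained), not the thesis's SEM-based omega estimator; all names are my own:

```python
import random
from statistics import NormalDist

def bca_interval(data, stat, b=2000, alpha=0.05, rng=random):
    """Bias-corrected and accelerated (BCa) bootstrap CI for stat(data)."""
    nd = NormalDist()
    n = len(data)
    theta_hat = stat(data)
    # Bootstrap distribution of the statistic.
    boots = sorted(stat([rng.choice(data) for _ in range(n)]) for _ in range(b))
    # Bias correction z0: from the share of bootstrap values below the estimate.
    prop = sum(t < theta_hat for t in boots) / b
    z0 = nd.inv_cdf(min(max(prop, 1.0 / b), 1.0 - 1.0 / b))
    # Acceleration a: from jackknife (leave-one-out) estimates.
    jack = [stat(data[:i] + data[i + 1:]) for i in range(n)]
    jbar = sum(jack) / n
    num = sum((jbar - j) ** 3 for j in jack)
    den = 6.0 * sum((jbar - j) ** 2 for j in jack) ** 1.5
    a = num / den if den else 0.0
    # Percentile levels adjusted for bias and acceleration.
    def adj(z):
        return nd.cdf(z0 + (z0 + z) / (1.0 - a * (z0 + z)))
    z_lo = nd.inv_cdf(alpha / 2)
    z_hi = nd.inv_cdf(1 - alpha / 2)
    lo = boots[min(int(adj(z_lo) * b), b - 1)]
    hi = boots[min(int(adj(z_hi) * b), b - 1)]
    return lo, hi
```

For the thesis's application, `stat` would be a function that fits the one-factor model to the resampled data and returns the omega estimate.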
Limitations and Future Research
In the current simulation study, three approaches to confidence interval construction were compared based on their empirical performance in terms of coverage probability and interval width. However, given the generalization issues that most simulation studies encounter, conclusions based on these results may not hold in other applications without examination of the conditions in each specific situation. Several limitations and concerns regarding the simulation design were identified; they suggest directions for my future research and are briefly summarized below.
First, the data generated in this study are continuous. Considering the common use of categorical data in empirical studies, it is worth assessing the three interval estimation methods with categorical data in future research.
Second, the design factors and the levels of those factors considered in the current study are limited. Other factors influencing the accuracy of interval estimation may exist and may be worth investigating. One potential factor is the estimation method: ML estimation was used throughout the current study, and ML is widely known as "a normal theory method" because it "assumes that the population distribution for the endogenous variables is multivariate normal" (Kline, 2010, p. 112). For nonnormally distributed data, alternative estimation methods should be considered, such as robust ML or weighted least squares. For example, Maydeu-Olivares, Coffman, and Hartmann (2007) conducted a Monte Carlo study comparing NT (normal theory) and ADF (asymptotic distribution free) confidence intervals for coefficient alpha.
It may also be meaningful to include more levels for each factor. For example, the number of sample sizes examined can be increased: the current study used 100, 300, and 500, and a future study might add a larger sample size, such as 1,000, particularly for the ADF estimation method, given that ADF requires large samples.
Last, the three interval estimation methods were evaluated for constructing CIs for coefficient omega, which is a reliability coefficient for the congeneric model. How these approaches perform for interval construction on reliability estimates under more general or more complicated models is unknown; future studies may be needed to address this question.
APPENDIX A
R-CODES FOR DATA GENERATION
1. Generate normally distributed data in R.

nd <- 2000                # number of datasets
list_data <- rep(0, nd)
for (dd in 1:nd) {
  nvar <- 6               # number of variables
  ss <- 100               # sample size
  f <- rnorm(ss, 0, 1)    # random factor scores
  f1 <- c(0.3, 0.3, 0.3, 0.7, 0.7, 0.7)   # factor loadings
  x <- matrix(0, ncol = nvar, nrow = ss)
  for (i in 1:nvar) {
    el <- rnorm(ss, 0, 1) # unique (error) scores
    for (j in 1:ss) {
      x[j, i] <- f1[i] * f[j] + sqrt(1 - f1[i]^2) * el[j]
    }
  }
  name1 <- paste("normal_data", dd, ".csv", sep = "")
  write.table(x, name1, sep = ",", quote = FALSE,
              col.names = FALSE, row.names = FALSE)
  list_data[dd] <- name1
}

2. Generate nonnormal data in R using Fleishman's method.

nd <- 2000                # number of datasets
list_data <- rep(0, nd)
h <- 10000 + nd
cond <- 1
name_file <- paste("cond=", cond, sep = "")
dir.create(name_file)
for (dd in 10001:h) {
  nvar <- 6               # number of variables
  ss <- 100               # sample size
  f1 <- c(0.3, 0.3, 0.3, 0.7, 0.7, 0.7)   # factor loadings
  y <- matrix(0, ncol = nvar, nrow = ss)
  f <- rnorm(ss, 0, 1)
  ft <- f
  for (j in 1:ss) {
    # Fleishman's power transformation: ft = a + b*f + c*f^2 + d*f^3
    # Constants yielding factor scores with Sk = 2 and K = 7:
    b <- 0.761585275
    c <- 0.260022598
    d <- 0.053072274
    a <- -c
    # Constants yielding factor scores with Sk = 3 and K = 21
    # (uncomment to use instead of the above):
    # b <- -0.681632225
    # c <- 0.637118193
    # d <- 0.148741042
    # a <- -c
    ft[j] <- a + b * f[j] + c * f[j]^2 + d * f[j]^3
    for (i in 1:nvar) {
      el <- rnorm(1, 0, 1)
      # resulting nonnormally distributed observed scores
      y[j, i] <- f1[i] * ft[j] + sqrt(1 - f1[i]^2) * el
    }
  }
  name2 <- paste(name_file, "/data", dd, ".csv", sep = "")
  write.table(y, name2, sep = ",", quote = FALSE,
              col.names = FALSE, row.names = FALSE)
  list_data[dd - 10000] <- paste("data", dd, ".csv", sep = "")  # index from 1
}
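The Fleishman constants above can be checked against the target moments analytically, without simulation. A sketch (in Python for illustration) using Fleishman's (1978) closed-form expressions for the variance, skewness, and excess kurtosis of a + b*z + c*z^2 + d*z^3 with z standard normal and a = -c:

```python
def fleishman_moments(b, c, d):
    """Variance, skewness, and excess kurtosis of a + b*z + c*z^2 + d*z^3
    for standard normal z and a = -c (Fleishman, 1978)."""
    var = b**2 + 6*b*d + 2*c**2 + 15*d**2
    skew = 2*c * (b**2 + 24*b*d + 105*d**2 + 2)
    kurt = 24 * (b*d + c**2 * (1 + b**2 + 28*b*d)
                 + d**2 * (12 + 48*b*d + 141*c**2 + 225*d**2))
    return var, skew, kurt

# Constants used above for factor-score skewness 2 and kurtosis 7.
var, skew, kurt = fleishman_moments(0.761585275, 0.260022598, 0.053072274)
print(round(var, 3), round(skew, 3), round(kurt, 3))  # 1.0 2.0 7.0
```

The second set of constants likewise yields variance 1, skewness 3, and excess kurtosis 21.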
APPENDIX B
CODES FOR CONFIDENCE INTERVAL ESTIMATION IN R
# Read a generated dataset into a data frame first, e.g.:
# data1 <- read.csv("data1.csv", header = FALSE)

# construct a 95% Wald CI for coefficient omega
ci.reliability(data = data1, type = "omega", analysis.type = "default",
               interval.type = "wald", conf.level = 0.95)

# construct a 95% likelihood-based CI for coefficient omega
ci.reliability(data = data1, type = "omega", analysis.type = "default",
               interval.type = "ll", conf.level = 0.95)

# construct a 95% BCa bootstrap CI for coefficient omega
ci.reliability(data = data1, type = "omega", analysis.type = "default",
               interval.type = "bca", B = 1000, conf.level = 0.95)
APPENDIX C
NONNORMAL DISTRIBUTIONS OF OBSERVED SCORES
Figure 3. Distributions of Observed Scores with Sk = 2 and K = 7 for Factor Scores.
[Histograms omitted; the summary statistics reported in the figure panels (N = 1,000,000 per item) are reproduced below.]

Item   Skewness   Kurtosis
1      0.054      0.059
2      0.128      0.174
3      0.430      0.910
4      0.684      1.692
5      1.024      2.877
6      1.452      4.602
Note: Factor loadings for items 1 to 6 are .3, .4, .6, .7, .8, .9, respectively. Sk and K denote skewness and kurtosis.

Figure 4. Distributions of Observed Scores with Sk = 3 and K = 21 for Factor Scores.
[Histograms omitted; the summary statistics reported in the figure panels (N = 1,000,000 per item) are reproduced below.]

Item   Skewness   Kurtosis
1      0.082      0.166
2      0.185      0.501
3      0.632      2.538
4      1.004      4.662
5      1.508      8.024
6      2.154      13.03
Note: Factor loadings for items 1 to 6 are .3, .4, .6, .7, .8, .9, respectively. Sk and K denote skewness and kurtosis.
APPENDIX D
EFFECTS OF DESIGN FACTORS ON COVERAGE PROBABILITY
Sk=0 K=0 Item=6 Loading=.3/.7 Sk=0 K=0 Item=12 Loading=.3/.7 100 100 95 95 90 90 85 85 80 80 75 75 70 70 65 65 60 60
Coverage probability(%) Coverage probability(%) Coverage 55 55 50 BCa Likelihood Wald high_ref low_ref ref 50 BCa Likelihood Wald high_ref low_ref ref 100 300 500 100 300 500 Sample size Sample size Sk=0 K=0 Item=6 Loading=.4/.8 Sk=0 K=0 Item=12 Loading=.4/.8 100 100 95 95 90 90 85 85 80 80 75 75 70 70 65 65 60 60
Coverage probability(%) Coverage probability(%) Coverage 55 55 50 BCa Likelihood Wald high_ref low_ref ref 50 BCa Likelihood Wald high_ref low_ref ref 100 300 500 100 300 500 Sample size Sample size Sk=0 K=0 Item=6 Loading=.6/.9 Sk=0 K=0 Item=12 Loading=.6/.9 100 100 95 95 90 90 85 85 80 80 75 75 70 70 65 65 60 60
Coverage probability(%) Coverage probability(%) Coverage 55 55 50 BCa Likelihood Wald high_ref low_ref ref 50 BCa Likelihood Wald high_ref low_ref ref 100 300 500 100 300 500 Sample size Sample size Figure 5. Coverage Probabilities for Normally Distributed Data
77
[Six panels (Sk=2, K=7; 6 or 12 items; loadings .3/.7, .4/.8, or .6/.9), each plotting coverage probability (%) against sample size (100, 300, 500) for the BCa, likelihood, and Wald methods, with reference lines ref, low_ref, and high_ref.]
Figure 6. Coverage Probabilities for Moderately Nonnormally Distributed Data
[Six panels (Sk=3, K=21; 6 or 12 items; loadings .3/.7, .4/.8, or .6/.9), each plotting coverage probability (%) against sample size (100, 300, 500) for the BCa, likelihood, and Wald methods, with reference lines ref, low_ref, and high_ref.]
Figure 7. Coverage Probabilities for Seriously Nonnormally Distributed Data
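The coverage probabilities plotted in Figures 5 through 7 are, within each design cell, the percentage of simulation replications whose 95% interval captures the population coefficient omega. A minimal Python sketch of that bookkeeping, using a normal-mean interval as a hypothetical stand-in for the omega intervals studied in this thesis:

```python
import numpy as np

rng = np.random.default_rng(42)

def coverage_probability(intervals, true_value):
    """Proportion of (lower, upper) intervals that contain true_value."""
    intervals = np.asarray(intervals, dtype=float)
    hits = (intervals[:, 0] <= true_value) & (true_value <= intervals[:, 1])
    return hits.mean()

# Toy replication study: 1000 nominal 95% CIs for a normal mean, n = 100.
true_mean, n, reps = 0.75, 100, 1000
cis = []
for _ in range(reps):
    sample = rng.normal(true_mean, 1.0, size=n)
    se = sample.std(ddof=1) / np.sqrt(n)
    cis.append((sample.mean() - 1.96 * se, sample.mean() + 1.96 * se))

# Empirical coverage, in percent; typically close to the nominal 95.
print(round(coverage_probability(cis, true_mean) * 100, 1))
```

The same proportion-of-hits computation underlies each plotted point, with the interval construction (Wald, likelihood, or BCa) being what varies across methods.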
APPENDIX E
EFFECTS OF DESIGN FACTORS ON INTERVAL WIDTH
[Eighteen panels (Sk/K = 0/0, 2/7, or 3/21; 6 or 12 items; loadings .3/.7, .4/.8, or .6/.9), each plotting interval width against sample size (100, 300, 500) for the BCa, likelihood, and Wald methods.]
Figure 8. Interval Widths on Different Levels of Sample Size
[Panels for each combination of distribution (Sk/K = 0/0, 2/7, or 3/21), sample size (100, 300, 500), and loadings (.3/.7, .4/.8, .6/.9), each plotting interval width against number of items (6, 12) for the BCa, likelihood, and Wald methods.]
Figure 9. Interval Widths on Different Levels of Item Count
[Panels for each combination of distribution (Sk/K = 0/0, 2/7, or 3/21), number of items (6, 12), and sample size (100, 300, 500), each plotting interval width against loading level (1, 2, 3) for the BCa, likelihood, and Wald methods.]
Figure 10. Interval Widths on Different Levels of Factor Loading
Note: The level of loading “1” indicates the set of loadings .3/.7; the level of loading “2” indicates the set of loadings .4/.8; and the level of loading “3” indicates the set of loadings .6/.9.
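The interval widths summarized in Figures 8 through 10 are simply the mean of (upper limit minus lower limit) across replications, and the figures show them shrinking as sample size grows. A short sketch of that computation, again with a hypothetical normal-mean interval whose width scales roughly as 1/sqrt(n):

```python
import numpy as np

rng = np.random.default_rng(1)

def mean_interval_width(intervals):
    """Average width of a set of (lower, upper) confidence intervals."""
    intervals = np.asarray(intervals, dtype=float)
    return (intervals[:, 1] - intervals[:, 0]).mean()

# Mean widths of nominal 95% CIs for a mean, across the thesis's three
# sample-size levels; widths decrease as n increases.
widths = {}
for n in (100, 300, 500):
    cis = []
    for _ in range(200):
        x = rng.normal(0.0, 1.0, size=n)
        se = x.std(ddof=1) / np.sqrt(n)
        cis.append((x.mean() - 1.96 * se, x.mean() + 1.96 * se))
    widths[n] = mean_interval_width(cis)
    print(n, round(widths[n], 3))
```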
APPENDIX F
MEANS OF CI BOUNDS AND POPULATION COEFFICIENT OMEGA
[Panels for each combination of distribution (Sk/K = 0/0, 2/7, or 3/21), loadings (.3/.7, .4/.8, .6/.9), and sample size (100, 300, 500), each showing the mean lower and upper confidence limits for the BCa, likelihood, and Wald methods against the population coefficient omega.]
Figure 11. Means of CI Bounds and Population Coefficient Omega for 6 Items
[Panels for each combination of distribution (Sk/K = 0/0, 2/7, or 3/21), loadings (.3/.7, .4/.8, .6/.9), and sample size (100, 300, 500), each showing the mean lower and upper confidence limits for the BCa, likelihood, and Wald methods against the population coefficient omega.]
Figure 12. Means of CI Bounds and Population Coefficient Omega for 12 Items
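The population coefficient omega displayed alongside the CI bounds in Figures 11 and 12 follows directly from the one-factor model parameters: omega = (sum of loadings)^2 / [(sum of loadings)^2 + sum of error variances]. A sketch of that computation, assuming standardized items (so each error variance is 1 minus the squared loading) and assuming, purely for illustration, that a ".3/.7" condition means half the loadings are .3 and half are .7 (the exact loading pattern is defined in the thesis's method section):

```python
def coefficient_omega(loadings, error_variances):
    """McDonald's omega for a unidimensional scale:
    (sum of loadings)^2 / ((sum of loadings)^2 + sum of error variances)."""
    lam = sum(loadings)
    theta = sum(error_variances)
    return lam ** 2 / (lam ** 2 + theta)

# Hypothetical 6-item ".3/.7" condition with standardized items.
loadings = [0.3] * 3 + [0.7] * 3
errors = [1 - l ** 2 for l in loadings]
print(round(coefficient_omega(loadings, errors), 3))  # 0.679
```

A perfectly reliable scale (all error variances zero) gives omega = 1, which is why the confidence limits in the figures are bounded above by 1.00.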
REFERENCES
Alhija, F. N., & Wisenbaker, J. (2006). A Monte Carlo study investigating the impact of item parceling strategies on parameter estimates and their standard errors in CFA. Structural Equation Modeling, 13, 204-228.
Azzalini, A. (1996). Statistical inference: Based on the likelihood. London: Chapman & Hall.
Bandalos, D. L., & Leite, W. (2013). Use of Monte Carlo studies in structural equation modeling research. In G. R. Hancock & R. O. Mueller (Eds.), Structural equation modeling: A second course (2nd ed., pp. 625-666). Information Age Publishing.
Becker, G. (2000). How important is transient error in estimating reliability? Going beyond simulation studies. Psychological Methods, 5, 370–379.
Bernstein, I. H., & Teng, G. (1989). Factoring items and factoring scales are different: Spurious evidence for multidimensionality due to item categorization. Psychological Bulletin, 105, 467-477.
Boker, S., Neale, M., Maes, H., Wilde, M., Spiegel, M., Brick, T. R., et al. (2011). OpenMx: An open source extended structural equation modeling framework. Psychometrika, 76, 306-317.
Bradley, J. V. (1978). Robustness? British Journal of Mathematical and Statistical Psychology, 31(2), 144-152.
Brownell, W. A. (1933). On the accuracy with which reliability may be measured by correlating test halves. Journal of Experimental Education, 1, 204-215.
Buse, A. (1982). The likelihood ratio, Wald, and Lagrange multiplier tests: An expository note. The American Statistician, 36(3), 153-157.
Casella, G., & Berger, R. L. (2002). Statistical inference (2nd ed.). Pacific Grove, CA: Duxbury/Thomson Learning.
Cheung, M. W.-L. (2009a). Comparison of methods for constructing confidence intervals of standardized indirect effects. Behavior Research Methods, 41(2), 425-438.
Cheung, M. W.-L. (2009b). Constructing approximate confidence intervals for parameters with structural equation models. Structural Equation Modeling: A Multidisciplinary Journal, 16(2), 267-294.
Cheung, M. W. L. (2007). Comparison of approaches to constructing confidence intervals for mediating effects using structural equation models. Structural Equation Modeling: A Multidisciplinary Journal, 14(2), 227-246.
Chow, S. L. (1988). Significance test or effect size? Psychological Bulletin, 103, 105–110.
Cohen, J. (1994). The world is round (p < .05). American Psychologist, 49, 997–1003.
Comrey, A. L., & Lee, H. B. (1992). A first course in factor analysis. Hillsdale, NJ: Erlbaum.
Cortina, J. M. (1993). What is coefficient alpha? An examination of theory and applications. Journal of Applied Psychology, 78(1), 98-104.
Crocker, L., & Algina, J. (1986). Introduction to classical and modern test theory. Toronto: Holt, Rinehart, and Winston, Inc.
Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16(3), 297-334.
Cronbach, L. J., & Shavelson, R. J. (2004). My current thoughts on coefficient alpha and successor procedures. Educational and Psychological Measurement, 64, 391-418.
DiCiccio, T. J., & Efron, B. (1996). Bootstrap confidence intervals. Statistical Science, 11(3), 189-212.
DiCiccio, T. J., & Romano, J. P. (1988). A review of bootstrap confidence intervals (with discussion). J. R. Statist. Soc., B, 50, 338-370.
Efron, B. (1987). Better bootstrap confidence intervals. Journal of the American Statistical Association, 82, 171-185.
Enders, C. K. (2010). Applied missing data analysis. Guilford Press.
Fan, X., & Fan, X. (2005). Using SAS for Monte Carlo simulation research in SEM. Structural Equation Modeling, 12(2), 299–333.
Feldt, L. S., & Brennan, R. L. (1989). Reliability. In R. L. Linn (Ed.), Educational measurement (3rd ed.), Washington, DC: American Council on Education.
Fleishman, A. I. (1978). A method for simulating non-normal distributions. Psychometrika, 43, 521-532.
Graham, J. M. (2006). Congeneric and (essentially) tau-equivalent estimates of score reliability: What they are and how to use them. Educational and Psychological Measurement, 66(6), 930-944.
Green, S.B. (2003). A coefficient alpha for test-retest data. Psychological Methods, 8, 88–101.
Green, S.B., Akey, T.M., Fleming, K.K., Hershberger, S.L., & Marquis, J.G. (1997). Effect of the number of scale points on chi-square fit indices in confirmatory factor analysis. Structural Equation Modeling, 4, 108–120.
Green, S.B., & Hershberger, S.L. (2000). Correlated errors in true score models and their effect on coefficient alpha. Structural Equation Modeling, 7, 251–270.
Green, S. B., & Yang, Y. (2009). Commentary on coefficient alpha: A cautionary tale. Psychometrika, 74, 121–135.
Guttman, L. (1945). A basis for analyzing test-retest reliability. Psychometrika, 10, 255-282.
Hart, W. L. (1955). Calculus. Boston: D. C. Heath and Company.
Hoogland, J. J., & Boomsma, A. (1998). Robustness studies in covariance structure modeling: An overview and a meta-analysis. Sociological Methods & Research, 26, 329-367.
Jöreskog, K. G. (1971). Statistical analysis of sets of congeneric tests. Psychometrika, 36, 109-133.
Kelley, K. (2007). Confidence intervals for standardized effect sizes: Theory, application, and implementation. Journal of Statistical Software, 20(8), 1-23.
Kelley, K., & Cheng, Y. (2012). Estimation of and confidence interval formation for reliability coefficients of homogeneous measurement instruments. Methodology, 8(2), 39–50.
Killeen, P. R. (2005). An alternative to null-hypothesis significance tests. Psychological Science, 16(5), 345-353.
Kline, R. B. (2010). Principles and practice of structural equation modeling (3rd ed.). The Guilford Press.
Kutner, M. H., Nachtsheim, C. J., Neter, J., & Li, W. (2005). Applied linear statistical models (5th ed.). New York: McGraw-Hill.
Lance, C. E., et al. (2006). The sources of four commonly reported cutoff criteria: What did they really say? Organizational Research Methods, 9, 202-220.
Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.
Maydeu-Olivares, A., Coffman, D. L., & Hartmann, W. M. (2007). Asymptotically distribution-free (ADF) interval estimation of coefficient alpha. Psychological Methods, 12(2), 157–176.
McDonald, R. P. (1978). Generalizability in factorable domains: Domain validity and generalizability. Educational and Psychological Measurement, 38, 75-79.
McDonald, R. P. (1999). Test theory: A unified treatment. Lawrence Erlbaum Associates.
Micceri, T. (1989). The unicorn, the normal curve, and other improbable creatures. Psychological Bulletin, 105, 156-166.
Miller, M. B. (1995). Coefficient alpha: A basic introduction from the perspectives of classical test theory and structural equation modeling. Structural Equation Modeling, 2(3), 255-273.
Miller, R. G. (1974). The jackknife – A review. Biometrika, 61, 1-15.
Neale, M. C., & Miller, M. B. (1997). The use of likelihood-based confidence intervals in genetic models. Behavior Genetics. 27(2), 113-120.
Nevitt, J., & Hancock, G. R. (2001). Performance of bootstrapping approaches to model test statistics and parameter standard error estimation in structural equation modeling. Structural Equation Modeling, 8(3), 353-377.
Neyman, J. (1935). On the problem of confidence intervals. The Annals of Mathematical Statistics, 6, 111–116.
Neyman, J. (1937). Outline of a theory of statistical estimation based on the classical theory of probability. Philosophical Transactions of the Royal Society of London. Series A, Mathematical and Physical Sciences, 236, 333–380.
Neyman, J., & Pearson, E. S. (1933). The testing of statistical hypotheses in relation to probabilities a priori. Mathematical Proceedings of the Cambridge Philosophical Society, 29, 492–510.
Nickerson, R. S. (2000). Null hypothesis significance testing: A review of an old and continuing controversy. Psychological Methods, 5, 241–301.
Nunnally, J. C. (1978). Psychometric theory (2nd ed.). New York: McGraw-Hill.
Oehlert, G. W. (1992). A note on the delta method. The American Statistician, 46(1), 27-29.
Ogasawara, H. (1999). Standard errors for matrix correlations. Multivariate Behavioral Research, 34(1), 103-122.
Olkin, I., & Finn, J. D. (1995). Correlation redux. Psychological Bulletin, 118, 155–164.
Padilla, M. A., & Divers, J. (2013). Bootstrap interval estimation of reliability via coefficient omega. Journal of Modern Applied Statistical Methods, 12(1), 78-89.
Patterson, D. (2014). Profile likelihood confidence intervals for GLM’s. Retrieved from http://www.math.umt.edu/patterson/ProfileLikelihoodCI.pdf on February 12, 2014.
Preacher, K. J., & Selig, J. P. (2012). Advantages of Monte Carlo confidence intervals for indirect effects. Communication Methods and Measures, 6, 77-98.
R Core Team (2012). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org/.
Raykov, T. (1997a). Estimation of composite reliability for congeneric measures. Applied Psychological Measurement, 21, 173-184.
Raykov, T. (1997b). Scale reliability, Cronbach’s coefficient alpha, and violations of essential tau equivalence with fixed congeneric components. Multivariate Behavioral Research, 32, 329-353.
Raykov, T. (1998a). Coefficient alpha and composite reliability with interrelated nonhomogeneous components. Applied Psychological Measurement, 22, 375-385.
Raykov, T. (1998b). A method for obtaining standard errors and confidence intervals of composite reliability for congeneric items. Applied Psychological Measurement, 22, 369-374.
Raykov, T. (2000). A method for examining stability in reliability. Multivariate Behavioral Research, 35(3), 289-305.
Raykov, T. (2001). Estimation of congeneric scale reliability using covariance structure analysis with nonlinear constraints. British Journal of Mathematical and Statistical Psychology, 54, 315-323.
Raykov, T. (2002). Analytic estimation of standard error and confidence interval for scale reliability. Multivariate Behavioral Research, 37(1), 89-103.
Raykov, T. (2004). Point and interval estimation of reliability for multiple-component measuring instruments via linear constraint covariance structure modeling. Structural Equation Modeling: A Multidisciplinary Journal, 11(3), 342-356.
Raykov, T. (2009). Evaluation of scale reliability for unidimensional measures using latent variable modeling. Measurement and Evaluation in Counseling and Development, 42(3), 223-232.
Raykov, T., & Marcoulides, G. A. (2004). Using the Delta method for approximate interval estimation of parameter functions in SEM. Structural Equation Modeling: A Multidisciplinary Journal, 11(4), 621-637.
Raykov, T., & Shrout, P. E. (2002). Reliability of scales with general structure: Point and interval estimation using a structural equation modeling approach. Structural Equation Modeling, 9(2), 195-212.
Revelle, W., & Zinbarg, R. E. (2009). Coefficients alpha, beta, omega, and the GLB: Comments on Sijtsma. Psychometrika, 74(1), 145–154.
Rozeboom, W.W. (1966). Foundations of the theory of prediction. Homewood: Dorsey.
Rulon, P. J. (1939). A simplified procedure for determining the reliability of a test by split-halves. Harvard Educational Review, 9, 99-103.
SAS Institute Inc. (2011). Base SAS® 9.3 procedures guide. Cary, NC: SAS Institute Inc.
Schmidt, F. L. (1996). Statistical significance testing and cumulative knowledge in psychology: Implications for training of researchers. Psychological Methods, 1, 115–129.
Sijtsma, K. (2009). On the use, the misuse, and the very limited usefulness of Cronbach’s alpha. Psychometrika, 74(1), 107-120. doi:10.1007/s11336-008-9101-0.
Šimundić, A.-M. (2008). Confidence interval. Biochemia Medica, 18(2), 154-161.
Steiger, J.H., & Fouladi, R.T. (1997). Noncentrality interval estimation and the evaluation of statistical models. In Harlow, L. L., Mulaik, S. A., & Steiger, J. H. (Eds.). What if there were no significance tests? Mahwah, NJ: Lawrence Erlbaum Associates.
Steinberg, L. (2001). The consequences of pairing questions: Context effects in personality measurement. Journal of Personality and Social Psychology, 81, 332–342.
Steinberg, L., & Thissen, D. (1996). Uses of item response theory and the testlet concept in the measurement of psychopathology. Psychological Methods, 1, 81–97.
Tan, S. J. (1990). Applied calculus. Boston: Kent.
Thompson, B. L., Green, S. B., & Yang, Y. (2010). Assessment of the maximal split-half coefficient to estimate reliability. Educational and Psychological Measurement, 70, 232- 251.
Venzon, D. J., & Moolgavkar, S. H. (1988). A method for computing profile-likelihood-based confidence intervals. Journal of the Royal Statistical Society, Series C (Applied Statistics), 37(1), 87-94.
Wainer, H. (1999). One cheer for null hypothesis significance testing. Psychological Methods, 4, 212–213.
Wainer, H., & Kiely, G. L. (1987). Item clusters and computerized adaptive testing: A case of testlets. Journal of Educational Measurement, 24, 185–201.
Wilkinson, L., & Task Force on Statistical Inference (1999). Statistical methods in psychology journals: Guidelines and explanations. American Psychologist, 54, 594-604.
Yang, Y., & Green, S. B. (2010). A note on structural equation modeling estimates of reliability. Structural Equation Modeling: A Multidisciplinary Journal, 17(1), 66-81.
Yang, Y., & Green, S. B. (2011). Coefficient alpha: A reliability coefficient for the 21st century? Journal of Psychoeducational Assessment, 29(4), 377-392.
Ye, Bao-J., & Wen, Zhong-L. (2011). A comparison of three confidence intervals of composite reliability of a unidimensional test. Acta Psychologica Sinica, 43(4), 453-461.
Yen, W. M. (1993). Scaling performance assessments: Strategies for managing local item dependence. Journal of Educational Measurement, 30, 187–214.
Yung, Y. F., & Bentler, P. M. (1996). Bootstrapping techniques in analysis of mean and covariance structures. In G. A. Marcoulides & R. E. Schumacker (Eds.), Advanced structural equation modeling: Issues and techniques. Mahwah, NJ: Lawrence Erlbaum Associates, Inc.
Zimmerman, D. W. (1975). Probability spaces, Hilbert spaces, and the axioms of test theory. Psychometrika, 40, 395–412.
Zimmerman, D. W., Zumbo, B. D., & Lalonde, C. (1993). Coefficient alpha as an estimate of test reliability under violation of two assumptions. Educational and Psychological Measurement, 53, 33–49.
Zinbarg, R. E., Revelle, W., Yovel, I., & Li, W. (2005). Cronbach’s α, Revelle’s β, and McDonald’s ωH: Their relations with each other and two alternative conceptualizations of reliability. Psychometrika, 70(1), 123–133.
BIOGRAPHICAL SKETCH
Jie Xu earned her Master’s degree in the School of Chinese as a Second Language from Sun Yat-sen University in China in 2012. In fall 2012, she joined the Measurement and Statistics master’s program in the Department of Educational Psychology and Learning Systems at Florida State University.
During her master’s study, she was a graduate assistant and teaching assistant for several graduate courses in the Department of Educational Psychology and Learning Systems. Her major research interests include methodological studies and applications in structural equation modeling and item response theory.