Florida State University Libraries

Electronic Theses, Treatises and Dissertations The Graduate School

2014
A Comparison of Three Approaches to Confidence Interval Estimation for Coefficient Omega
Jie Xu

FLORIDA STATE UNIVERSITY

COLLEGE OF EDUCATION

A COMPARISON OF THREE APPROACHES TO CONFIDENCE

INTERVAL ESTIMATION FOR COEFFICIENT OMEGA

By

JIE XU

A Thesis submitted to the Department of Educational Psychology and Learning Systems in partial fulfillment of the requirements for the degree of Master of Science

Degree Awarded: Fall Semester, 2014

Jie Xu defended this thesis on August 11, 2014. The members of the supervisory committee were:

Yanyun Yang Professor Directing Thesis

Betsy Becker Committee Member

Russell Almond Committee Member

The Graduate School has verified and approved the above-named committee members, and certifies that the thesis has been approved in accordance with university requirements.


TABLE OF CONTENTS

List of Tables ...... v
List of Figures ...... vi
Abstract ...... vii
INTRODUCTION ...... 1
LITERATURE REVIEW ...... 6
Reliability ...... 6
Reliability Estimation from CTT ...... 6
Test-retest ...... 7
Alternative-forms ...... 7
Internal Consistency ...... 8
Coefficient α ...... 9
Violation of the homogeneity assumption ...... 9
Violation of the essential tau equivalence assumption ...... 10
Violation of the uncorrelated errors assumption ...... 11
Misinterpretation of α as an index of homogeneity ...... 12
Reliability Estimation Based on CFA within SEM Framework ...... 12
Confirmatory Factor Analysis ...... 13
An illustration of the one-factor CFA Model ...... 13
The link between CTT and CFA for reliability estimation ...... 15
Coefficient ω ...... 16
Definition and formula for computing coefficient ω ...... 16
Relationship between coefficient ω and coefficient α ...... 17
Confidence Interval ...... 18
Null Hypothesis Significance Testing ...... 18
Interval Estimation ...... 19
Three Approaches to Interval Estimation for Coefficient ω ...... 21
Wald Method ...... 21
Wald CI for an individual parameter ...... 21
Wald CI for a function of parameters ...... 22
Wald CI for coefficient ω based on the delta method ...... 25
Likelihood Method ...... 26
Likelihood ratio statistic ...... 26
Likelihood-based CI computed via the likelihood function of a single parameter ...... 27
Likelihood-based CI computed via likelihood function of multiple parameters ...... 28
Likelihood-based CI for coefficient ω ...... 30
Bias-corrected and Accelerated Bootstrap Method ...... 30
A brief introduction to bootstrap technique ...... 31
Construction of bias-corrected and accelerated bootstrap CI ...... 32
BCa bootstrap CI for coefficient ω ...... 34
A Comparison among Three Interval Estimation Methods ...... 34
Statistical Test ...... 34
Applied ...... 35
Assumption of Multivariate Normality ...... 36

Symmetry ...... 36
Sample Size ...... 37
Invariance to Parameter Transformation ...... 37
Previous Research ...... 38
The Rationale and Purpose of the Proposed Study ...... 40
METHODS ...... 42
Design Factors ...... 43
Data Generation Procedure ...... 44
Data Analysis ...... 46
RESULTS ...... 51
Non-convergence ...... 51
Coverage Probability ...... 51
Interval Width ...... 55
Relative Bias of Point Estimates ...... 57
DISCUSSION AND CONCLUSIONS ...... 61
Major Findings from Simulation Study ...... 62
Discussion and Suggestions ...... 64
Limitations and Future Research ...... 65
APPENDICES ...... 68
A. R-CODES FOR DATA GENERATION ...... 68
B. CODES FOR CONFIDENCE INTERVAL ESTIMATION IN R ...... 70
C. NONNORMAL DISTRIBUTIONS OF OBSERVED SCORES ...... 71
D. EFFECTS OF DESIGN FACTORS ON COVERAGE PROBABILITY ...... 77
E. EFFECTS OF DESIGN FACTORS ON INTERVAL WIDTH ...... 80
F. MEANS OF CI BOUNDS AND POPULATION COEFFICIENT OMEGA ...... 91
REFERENCES ...... 101
BIOGRAPHICAL SKETCH ...... 108


LIST OF TABLES

1 Transformation Coefficients Corresponding to Three Types of Distribution ...... 45

2 Skewness and Kurtosis of Observed Item Scores ...... 47

3 Population Coefficient ω under Different Conditions ...... 49

4 Coverage Probability for Each Condition ...... 52

5 Results from Logistic Regression and ANOVA ...... 54

6 Interval Widths for 6 Items ...... 58

7 Interval Widths for 12 Items ...... 59

8 Relative Bias of Point Estimates for Coefficient ω ...... 60


LIST OF FIGURES

1 An Example of a One-factor CFA Model ...... 14

2 An Illustration of Confidence Interval Computed via the Log-likelihood Function ...... 28

3 Distributions of Observed Scores with Sk=2 and K=7 for Factor Scores ...... 71

4 Distributions of Observed Scores with Sk=3 and K=21 for Factor Scores ...... 74

5 Coverage Probabilities for Normally Distributed Data ...... 77

6 Coverage Probabilities for Moderately Nonnormally Distributed Data ...... 78

7 Coverage Probabilities for Seriously Nonnormally Distributed Data ...... 79

8 Interval Widths on Different Levels of Sample Size ...... 80

9 Interval Widths on Different Levels of Item Count ...... 83

10 Interval Widths on Different Levels of Factor Loading ...... 88

11 Means of CI Bounds and Population Coefficient Omega for 6 Items ...... 91

12 Means of CI Bounds and Population Coefficient Omega for 12 Items ...... 96


ABSTRACT

Coefficient ω was introduced by McDonald (1978) as a reliability coefficient of composite scores for the congeneric model. Interval estimation for coefficient ω provides a range of plausible values which is likely to capture the population reliability of composite scores. The

Wald method, likelihood method, and bias-corrected and accelerated (BCa) bootstrap method are three methods to construct confidence intervals for coefficient ω. Only a very limited number of studies evaluating these three methods can be found in the literature. No simulation study has been conducted to evaluate the performance of these three methods for interval construction for coefficient ω. In the current simulation study, I assessed these three methods by comparing their empirical performance on interval estimation for coefficient ω. Four factors were included in the simulation design: sample size, number of items, factor loading, and degree of nonnormality.

Two thousand datasets were generated in R 2.15.0 (R Core Team, 2012) for each condition. For each generated dataset, three approaches (i.e., the Wald method, likelihood method, and BCa bootstrap method) were used to construct a 95% confidence interval for coefficient ω in R 2.15.0. The results showed that when the data were multivariate normally distributed, the three methods performed equally well and coverage probabilities were very close to the prespecified .95 confidence level. When the data were nonnormally distributed, coverage probabilities decreased and interval widths became wider for all three methods as the degree of nonnormality increased. In general, when the data departed from multivariate normality, the BCa bootstrap method performed better than the other two methods, with relatively higher coverage probabilities, while the Wald and likelihood methods were comparable and yielded narrower interval widths than the BCa bootstrap method.


INTRODUCTION

Measurement consistency is an essential and unavoidable issue in social and behavior

sciences (Raykov, 2000). Reliability is an important index for measurement consistency. Scale

reliability originated from the framework of Classical Test Theory (CTT), and is defined as the

ratio of two variances: true score variance over observed score variance (Crocker & Algina,

1986; Lord & Novick, 1968). Several methods have been proposed for reliability estimation within CTT: test-retest, alternative-forms, and internal consistency (e.g., split-half and coefficient alpha). Coefficient alpha is used most commonly by researchers in applied studies.

However, coefficient alpha is an unbiased estimate of reliability only when the underlying assumptions are met (e.g., Green & Hershberger, 2000). The assumptions are homogeneity, essential tau equivalence, and uncorrelated errors. Researchers have argued against the use of coefficient alpha given its untenable assumptions in practice. For example, Green and Yang

(2009) discouraged the use of coefficient alpha to assess reliability, because the assumptions of coefficient alpha are unlikely to hold. The bias due to violation of these assumptions is often substantial and can’t be ignored. In addition, coefficient alpha is frequently misinterpreted as an index of homogeneity (Cortina, 1993; Miller, 1995; Sijtsma, 2009).

The true score model in CTT can be expressed using a confirmatory factor analysis

(CFA) model, also known as a measurement model. Scale reliability can then be assessed by proposing an interpretable CFA model given good model-data fit (Green & Yang, 2009). In a one-factor CFA model, a link between CTT and CFA models as regards reliability estimation is that the true score in CTT can be expressed as the factor score weighted by the factor loading in the CFA model. In the one-factor CFA model, factor loadings and error variances are model parameters (assuming the variance of the factor is fixed to one for identification purposes). The

reliability coefficient can be computed based on the parameter estimates from the CFA model.

The reliability estimate based on the one-factor, so-called congeneric CFA model, is known as

coefficient ω, which is defined as a ratio of composite true score variance to the total score

variance of a scale (McDonald, 1978). Coefficient ω is viewed as a generalization of coefficient

alpha in reliability estimation of homogeneous measurements (Kelley & Cheng, 2012).

A reliability coefficient computed from a sample is a point estimate of the population

reliability, and it is sample dependent. How to determine the accuracy of reliability estimates has

received increasing attention in the literature. Testing of statistical significance (Neyman &

Pearson, 1933) is the most commonly used method to determine the accuracy of the point

estimate. Based on this method, a point estimate is judged as either significantly or non-

significantly different from a value specified by the null hypothesis (e.g., zero) in the population,

by comparing an obtained p value to a predetermined significance level (e.g., .05). In spite of the

popularity of null hypothesis significance testing, the appropriateness of its use is still a matter of

debate (e.g., Chow, 1988; Cohen, 1994; Nickerson, 2000; Schmidt, 1996; Wainer, 1999).

Interval estimation (Neyman, 1935, 1937) is proposed as a more informative and preferable

alternative to null hypothesis significance testing. A confidence level (often denoted as 1-α,

where α is the Type I error rate) tells how likely the interval is to include the population

parameter (Kelley, 2007). The confidence interval (CI) is commonly reported at the .95

confidence level (i.e., Type I error rate of .05), which means that 95 out of 100 times under

repeated sampling the true population value of a parameter falls in the constructed CI. However,

empirical studies reporting confidence intervals along with corresponding parameter estimates

are still limited for two reasons (Cheung, 2009b; Steiger & Fouladi, 1997): (1) statistical

packages for interval estimation are not available; and (2) methods for CI construction on

different statistics and psychometric indices lack development. Interval estimation for coefficient ω also lacks development and investigation (Padilla & Divers, 2013).

This study reviews three methods for estimating confidence intervals for coefficient ω

within the Structural Equation Modeling (SEM) framework; they are the Wald method, the likelihood method, and the bias-corrected and accelerated bootstrap method. The Wald CI is constructed based on the standard error estimate of the parameter of interest (e.g., Cheung, 2007,

2009a, 2009b; Raykov, 2002). To construct a Wald CI on coefficient ω, which is a function of

multiple parameters, the delta method is applied. The delta method is an analytic method widely

used to approximate the asymptotic standard errors for parameters that are functions of a set of

parameters (Casella & Berger, 2002; Cheung, 2009b; Raykov, 2002; Raykov & Marcoulides,

2004; Oehlert, 1992). The application of this method is based on the linear approximation of

smooth parametric functions using the Taylor expansion of the function (Raykov & Marcoulides,

2004), under the assumption of multivariate normality of variables (Ogasawara, 1999). The

likelihood-based CI is developed from the asymptotic chi-square distribution of the likelihood

ratio test (Cheung, 2009b; Venzon & Moolgavkar, 1988). If the likelihood function has multiple

parameters, the profile likelihood method (Venzon & Moolgavkar, 1988) can be used to

construct the likelihood-based CI. The third method is the bias-corrected and accelerated

bootstrap method. This method has a distinct advantage over the other bootstrap methods;

that is, it improves interval estimates by taking asymmetry, bias, and nonconstant variance into

consideration (Efron, 1987; Kelley & Cheng, 2012).

Similarities and differences exist among the Wald, the likelihood, and the BCa bootstrap

methods. (1) The Wald method is based on the asymptotically normally distributed Wald statistic

(Cheung, 2009b; Enders, 2010), and the likelihood method is based on the likelihood ratio

statistic which asymptotically follows a chi-square distribution (Venzon & Moolgavkar, 1988).

However, the BCa bootstrap method doesn’t require parametric statistical testing. So, normality

of variables or homoscedasticity of error scores is not assumed. (2) Both Wald and likelihood

methods assume multivariate normality of variables, while the BCa bootstrap method, as a

nonparametric method, doesn’t require this assumption. (3) CIs constructed with the likelihood

and BCa bootstrap methods are asymmetric, while CIs constructed with the Wald method are

symmetric (Cheung, 2007, 2009b). An issue with symmetric CIs is that when the parameter

estimate is near to the limits, the estimated CIs may include some values that are outside of the

meaningful boundaries of the parameter (Neale & Miller, 1997). (4) Both the Wald and

likelihood methods require a large sample size to accurately estimate reliability, while the BCa

bootstrap method has a relatively relaxed requirement on sample size, although when the sample size is

smaller than 100, the performance of the bootstrap method is also poor (Nevitt & Hancock, 2001).

(5) CIs constructed using the likelihood and BCa bootstrap methods are invariant to

transformations on model parameters, while CIs constructed through the Wald method are

variant to transformation (Neale & Miller, 1997).

In the current literature, only a limited number of studies compare the performance of

these methods. Cheung (2007) compared the empirical performances of four methods to

construct CIs (the Wald, percentile bootstrap, bias-corrected bootstrap, and likelihood methods)

on indirect effects (e.g., mediating effects of variable A on variable C through variable B)

through Monte Carlo studies. Cheung (2009b) compared the likelihood-based CI and Wald CI

for several psychometric indices and statistics, such as the correlation coefficient, indirect effect,

and reliability estimate, with numerical examples. He also conducted a Monte Carlo study on

Pearson correlation CIs in the same study (Cheung, 2009b). In another study (Cheung, 2009a),

he compared six Wald CIs, three bootstrap CIs, a likelihood-based CI, and the PRODCLIN CI

(i.e., CIs constructed using a PRODCLIN program with the R interface) on standardized indirect effect through a simulation study. Ye and Wen (2011) conducted a simulation study to compare a bootstrap method, delta method, and the method of using the outputs (i.e., point estimate, standard error) directly from the LISREL program to construct CIs for coefficient ω. Kelley and

Cheng (2012) used an empirical example to examine the performance of the Wald, percentile bootstrap, and BCa bootstrap methods for CI construction for coefficient ω. Padilla and Divers

(2013) conducted a simulation study to assess the performance of the normal theory bootstrap

CI, percentile-based CI, and the BCa bootstrap CI for coefficient ω.

Based on my literature review, there is a gap in the literature on the performance of the three interval estimation methods (the Wald method, likelihood method, and BCa bootstrap method) for coefficient ω. No simulation study has been conducted to evaluate the performance of the three approaches to interval construction for coefficient ω. Given that numerical examples are mainly used for illustrative purposes (Yung & Bentler, 1996), I chose to conduct a simulation study to assess these three methods under different conditions. Specifically, four factors were manipulated: sample size, number of items, factor loading, and level of nonnormality. The purpose of the simulation study is to compare these three interval estimation methods for coefficient ω and to help applied researchers choose the appropriate method to obtain reliability interval estimates.


LITERATURE REVIEW

This chapter presents a literature review for the proposed study. First, reliability and the

estimation methods within the framework of classical test theory (CTT) and confirmatory factor

analysis (CFA) are introduced, respectively. Three approaches to constructing confidence

intervals for coefficient ω are then reviewed and compared. Last, the rationale behind the study

and the purpose of the study are provided.

Reliability

Measurement consistency is an essential and inevitable issue in social and behavior

sciences (Raykov, 2000). As an important index for measurement consistency, reliability is one

of the main concerns in these disciplines. Reliability refers to “the degree to which individuals’

deviation scores, or z-scores, remain relatively consistent over repeated administrations of the

same test or alternative test forms” (Crocker & Algina, 1986, p. 105). Reliability is a

characteristic of test scores for a group of examinees, not a characteristic of a test (Feldt &

Brennan, 1989; Wilkinson &The Task Force on Statistical Inference, 1999).

The concept of reliability originated from the framework of classical test theory (CTT).

In CTT, an observed test score (X) consists of two parts, a true score (T) and an error score (E).

The variance of X can be expressed in terms of the variances of T and E, as \sigma_X^2 = \sigma_T^2 + \sigma_E^2 under

the assumption that there is no correlation between true score and error score. The reliability of

observed test scores is defined as the ratio of true score variance to observed score variance, and

is expressed as \rho_{XX'} = \sigma_T^2 / \sigma_X^2 (Crocker & Algina, 1986; Lord & Novick, 1968).
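As a simple illustration (the numbers are hypothetical, not taken from the thesis), a composite whose true-score variance is 8 and whose error variance is 2 has reliability

\sigma_X^2 = \sigma_T^2 + \sigma_E^2 = 8 + 2 = 10, \qquad \rho_{XX'} = \frac{\sigma_T^2}{\sigma_X^2} = \frac{8}{10} = .80 .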

Reliability Estimation from CTT

Several methods have been proposed for scale reliability estimation within CTT: test-retest, alternative-forms, and internal consistency (e.g., split-half and coefficient alpha). The use

of each method requires several assumptions. Assumptions underlying applications of different methods and effects of the violation of specific assumptions on reliability estimation are discussed in detail in this section.

Test-retest

For the test-retest method, a group of examinees take the same test twice at two different points in time. The correlation between the two sets of test scores is used to estimate the reliability of the test scores. A critical issue with the test-retest method is how to determine the elapsed time between the two administrations so that the time period is “long enough to allow effects of memory or practice to fade but not so long as to allow maturational or historical changes to occur in the examinees ” (Crocker & Algina, 1986, p. 133). The test-retest method requires that error scores associated with the first test and the retest are uncorrelated. The two error scores are also required to be independent from true scores of the two test administrations.

Because the correlation between errors of the first test and of the retest tends to be positive due to possible memory effects, the estimated reliability using the test-retest method is likely to be greater than the true reliability (Miller, 1995).

Alternative-forms

The alternative-forms method follows the same procedures as the test-retest method, except for using an alternative form at the second administration. The correlation between the two sets of observed scores is the estimated reliability coefficient for the test scores. Compared to the test-retest method, the assumption of uncorrelated errors for the alternative-forms method is less likely to be violated. However, the alternative-forms method is not an optimal choice for several reasons.

For example, the time and monetary cost to develop alternative forms is considerable; the alternative-forms reliability coefficient may be affected by content sampling during test

construction (Crocker & Algina, 1986); the alternative-forms method requires essential tau equivalence of the two test forms, which requires that true-score variances of the alternative-forms are equivalent (i.e., true scores of the two test forms can differ only by a constant). This assumption is unlikely to hold in practice. The alternative-forms method tends to underestimate the true reliability if the assumption of essential tau equivalence of the two test forms is not met

(Miller, 1995).

Internal Consistency

The split-half method and coefficient alpha are widely used for estimation of internal consistency reliability. One assumption underlying internal consistency methods is uncorrelated errors. This assumption requires that error scores associated with different test components are uncorrelated with each other, and that error scores are uncorrelated with true scores of these components. For the split-half method, test components are two halves of a test; while for coefficient alpha test components refer to individual items in a test. Homogeneity of the test is another assumption of internal consistency methods, which requires only a single common latent factor underlying a set of items in a test. A third assumption associated with internal consistency estimation procedures is essential tau equivalency. For the split-half method, two halves of a test should be essential tau equivalent, while for coefficient alpha all items in a test are required to be essential tau equivalent.

The split-half method is used to estimate reliability through splitting a test into two halves,

computing the correlation of the two halves, and then applying the Spearman-Brown prophecy

formula to correct the correlation and obtain the reliability estimate for the full length test

(Crocker & Algina, 1986). A limitation for this method is the non-uniqueness of splitting a test

into two halves. For a test with k items, there are (1/2)k!/[(k/2)!]² different ways to split the test


(Brownell, 1933). Another way to estimate reliability with the split-half method is Rulon’s

formula (Rulon, 1939), in which the variance of the difference score between the two halves is

used as an estimate of the variance of error scores in reliability calculation. Rulon’s coefficient

is equal to coefficient alpha for a two-component test (Miller, 1995). Coefficient alpha is derived

as a general form of Rulon’s formula, that is, the number of test components can be more than

two. Coefficient alpha is also the average of split-half reliability coefficients (Cronbach, 1951;

Cronbach & Shavelson, 2004; Guttman, 1945). Coefficient alpha is routinely reported as an

estimate of reliability in applied studies. However, it requires restrictive assumptions that are

often violated in application, as summarized below.

Coefficient α

Coefficient alpha (α; Cronbach, 1951; Guttman, 1945) is defined as

\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k}\sigma_i^2}{\sigma_X^2}\right), \qquad (1)

where X is the test score, based on a set of items X_1, X_2, …, X_k; k is the number of items; σ_i² is the variance of item X_i; and σ_X² is the variance of the total test score.
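As a hedged illustration (this is not code from the thesis appendices), equation (1) can be computed in R from an n-by-k matrix of item scores:

# Coefficient alpha (Equation 1) from an n-by-k matrix of item scores (illustrative sketch)
coefficient_alpha <- function(items) {
  k <- ncol(items)
  item_variances <- apply(items, 2, var)   # variance of each item
  total_variance <- var(rowSums(items))    # variance of the total test score
  (k / (k - 1)) * (1 - sum(item_variances) / total_variance)
}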

Even though coefficient α is widely applied, misapplications and misinterpretations of

it are not rare in practice (e.g., Schmidt, 1996). Coefficient α is identical to reliability only when

the underlying assumptions are satisfied. The assumptions include homogeneity, essential tau

equivalence, as well as uncorrelated errors. Violation of these assumptions could lead to

substantial bias in the estimate of reliability (Graham, 2006; Green & Yang, 2009; Raykov,

1997a, 1998a, 1998b; Sijtsma, 2009; Zimmerman, Zumbo, & Lalonde, 1993).

Violation of the homogeneity assumption. The homogeneity assumption, also known as the unidimensionality assumption, requires that a test measure only a single common latent construct. If the test measures more than one underlying construct, the homogeneity assumption is violated. Violation of this assumption can be conceptualized as a specific case of the violation of the essential tau equivalency assumption (see more details about essential tau equivalency in the next section), and thus introduces negative bias in reliability estimation.

A large standard error of coefficient α (i.e., low precision of coefficient α) may indicate the violation of the homogeneity assumption. The standard error of alpha is a function of the standard deviation of item intercorrelations and the number of items (see equation 4 in Cortina,

1993). A large standard deviation of item intercorrelations may suggest that the test is multidimensional (i.e., more than one latent factor underlies the test). A positive relationship between the standard deviation of item intercorrelations and the standard error of coefficient α

can be seen from Cortina’s equation. Hence, a large standard error of coefficient α may warn us that the homogeneity assumption is violated. This assumption can also be assessed by factor analysis. If a factor analysis ends up with a one-factor model, then the items on an instrument are homogeneous (Lord & Novick, 1968). Such a model is called a congeneric model. If more than one factor is specified in the model, the homogeneity assumption is violated.

Violation of the essential tau equivalence assumption. Under the assumption of essential tau equivalence for coefficient α, all items in the test measure the same latent variable and each item is allowed to have a unique error score. In addition, true scores of any two items can differ only by a constant (Graham, 2006; Lord & Novick, 1968; Miller, 1995; Raykov,

1997a; Yang & Green, 2011). When this assumption is not met, coefficient α is a lower bound of reliability (Cortina, 1993; Graham, 2006; Green & Yang, 2009; McDonald, 1999; Miller, 1995;

Lord & Novick, 1968; Raykov, 1997a, 1998a, 1998b; Sijtsma, 2009; Zimmerman, Zumbo, &

Lalonde, 1993). “The larger the violation of tau-equivalence that occurs, the more coefficient α

underestimates score reliability” (Graham, 2006, p. 939-940). The bias of coefficient α as a

reliability coefficient may be substantial if one factor loading is fairly different from the other

loadings (i.e., only one item doesn’t meet the requirement of essential tau equivalence) (Green &

Yang, 2009; Raykov, 1997b).

Several methods can be used to examine whether this assumption is violated. One is to check the differences in standard deviations across items (Graham, 2006). The larger the differences, the more likely the assumption is violated. Second, tests with different response formats across items are likely to violate this assumption (Graham, 2006). Third, CFA can be used to examine the assumption of essential tau equivalence (Graham, 2006; Green & Yang,

2009; Raykov, 1997a), given that “this assumption is mathematically identical to the assumption that all items have equal loadings on a single common factor with their unique variances composed entirely of error” (Miller, 1995, p. 265). To assess this assumption, two one-factor

CFA models need to be posited and fitted to the data. These two models are the same except for factor loadings. Factor loadings are constrained to be equal in one model (i.e., the essential tau equivalence model) but freely estimated in the other model (i.e., the congeneric model). Then, a chi-square difference test is conducted between these two models to test the essential tau equivalence assumption. If the chi-square difference test is non-significant, we do not have sufficient evidence to reject the null hypothesis that the essential tau equivalence assumption holds.
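A hedged sketch of this model comparison using the lavaan R package (lavaan is not used in the thesis itself; the item names x1 through x6 and the data frame mydata are hypothetical):

library(lavaan)

# Congeneric model: factor loadings freely estimated
congeneric_model <- 'F =~ x1 + x2 + x3 + x4 + x5 + x6'
# Essential tau equivalence model: loadings constrained equal via the shared label "a"
tau_equivalent_model <- 'F =~ a*x1 + a*x2 + a*x3 + a*x4 + a*x5 + a*x6'

fit_congeneric     <- cfa(congeneric_model, data = mydata)
fit_tau_equivalent <- cfa(tau_equivalent_model, data = mydata)

# Chi-square difference test of the essential tau equivalence assumption
anova(fit_tau_equivalent, fit_congeneric)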

Violation of the uncorrelated errors assumption. The assumption of uncorrelated errors means “error scores of all pairs of items are uncorrelated” (Miller, 1995, p. 266). If this assumption is violated, coefficient α will overestimate reliability (Graham, 2006; Green &


Hershberger, 2000; Green & Yang, 2009; Lord & Novick, 1968; McDonald, 1999; Miller, 1995;

Raykov, 1997a, 1998a; Sijtsma, 2009; Zimmerman, Zumbo, & Lalonde, 1993).

Green and Yang (2009) summarized how the assumption of uncorrelated errors may be

violated in practice. For example, correlated errors may be introduced by speeded tests with

right-wrong answers (Rozeboom, 1966), different stimulus materials for subgroups of items

(Steinberg, 2001; Steinberg & Thissen, 1996; Wainer & Kiely, 1987; Yen, 1993), effects of item

order (i.e., errors on one item affecting responses to the next item; Green & Hershberger, 2000),

consistent response sets, and transient errors (Becker, 2000; Green, 2003). In these cases,

correlated errors may result in positive bias in reliability estimation.

Misinterpretation of α as an index of homogeneity. Coefficient α has been mistakenly

interpreted as a homogeneity index. Coefficient α is an index of internal consistency reliability,

not an index of homogeneity (Miller, 1995). Internal consistency is a different concept from

homogeneity. It is defined as the degree of interrelatedness among items, while homogeneity

refers to unidimensionality (Schmitt, 1996). Homogeneity is a prerequisite for utilizing

coefficient α as a reliability index.

Arguments against the use of coefficient α have been increasing. Many researchers

criticized using it as an index of internal consistency reliability (e.g., Cortina, 1993; Green et al.,

1997; Sijtsma, 2009; Thompson, Green & Yang, 2010), because the assumptions of coefficient α

(e.g., essential tau equivalence and uncorrelated errors) are unlikely to hold, and the bias due to

violation of these assumptions is often substantial and can’t be ignored (Green & Yang, 2009).

Reliability Estimation Based on CFA within SEM Framework

As an alternative to coefficient α, reliability can be estimated based on a CFA model within the Structural Equation Modeling (SEM) framework. SEM is a frequently used technique for

investigation of relationships among a set of variables. Instead of designating a single statistical technique, SEM refers to a family of related procedures, among which path analytic and confirmatory factor analysis models are the core procedures (Kline, 2010). The aforementioned classical test models can be conceptualized as structural equation models (see Miller, 1995).

Below I briefly review the CFA model, the link between CFA and CTT, and coefficient omega.

Confirmatory Factor Analysis

CFA models are also referred to as measurement models. A CFA model depicts the relationship between latent factors and measured variables. A CFA model can be considered when researchers have an a priori hypothesis about the relationship among these variables (i.e., the number of factors and the correspondence between factors and measurement indicators)

(Kline, 2010). The CFA approach is superior to classical test models for the estimation of scale reliability for the following reasons. First, it can be applied to various complex models. Second, it can be used for unweighted as well as weighted composite scores. The proposed study focuses on unit weighted composite scores.

An illustration of the one-factor CFA model. A one-factor CFA model with six items is presented as an illustration (Figure 1), and is also one of the models used for confidence interval estimation for coefficient ω in my study. Let F denote the score of the common latent

factor, X_i (i = 1, 2, …, 6) denote item scores, and E_i be the item error scores. λ_i denotes the factor

loading between the common factor and each item. The observed score for each item (here I use

the deviation score, which is the difference between a score and the mean) is a linear

combination of the factor score weighted by the factor loading plus the error score:

X_i = \lambda_i F + E_i . \qquad (2)


A CFA model is testable only if it is identified. “A model is said to be identified if it is theoretically possible to derive a unique estimate for each parameter” (Kline, 2010, p. 105). One necessary requirement for model identification is that the degrees of freedom (df) for the model

should be equal to or greater than zero. The model df is the number of observations (i.e., v(v+1)/2, where v is the number of indicators) minus the number of freely estimated parameters (i.e., the total number of unique variances and covariances of the factors and errors plus the number of factor loadings) (Kline, 2010). The other necessary requirement is that each latent variable

(including factors and measurement errors) in the model should be assigned a scale. Factors can be scaled via unit loading identification (ULI) or unit variance identification (UVI) methods. For a factor, ULI fixes the factor loading of one of its indicators as 1.0, while UVI fixes the factor

variance as 1.0 and freely estimates factor loadings for all indicators (Kline, 2010). Measurement

errors are usually scaled through ULI, by fixing path coefficients associated with errors to 1.0.

For this example, the model df=9 and every latent variable is assigned a scale through ULI. In addition to these two necessary requirements, a sufficient condition is that there are at least three

indicators if the model has only one factor or at least two indicators for each factor if the model

has two or more factors (Kline, 2010). Based on these requirements, the illustrated one-factor

model is identified.
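For the six-item model in Figure 1, the count works out as follows (my own arithmetic, applying the rules just described):

\text{observations} = \frac{v(v+1)}{2} = \frac{6(6+1)}{2} = 21, \qquad \text{free parameters} = 5 \text{ loadings} + 1 \text{ factor variance} + 6 \text{ error variances} = 12, \qquad df = 21 - 12 = 9 .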

Figure 1: An Example of a One-factor CFA Model.


The link between CTT and CFA for reliability estimation. A strong relationship exists between CTT and CFA models.

Suppose a set of items, X1, X2, …, Xk (k >2), fit a one-factor CFA model. Let X

(X=X1+X2+…+Xk) denote the composite observed score and T (T=T1+T2+…+Tk) be the composite true score (Lord & Novick, 1968). In CTT, the observed score can be decomposed into a true score component and error score component (Zimmerman, 1975):

X_1 = T_1 + E_1, \; X_2 = T_2 + E_2, \; \ldots, \; X_k = T_k + E_k , \qquad (3)

where E_1, E_2, …, E_k are error terms associated with each item. From the CFA framework (refer to equation (2) above), item scores are expressed as:

X_1 = \lambda_1 F + E_1, \; X_2 = \lambda_2 F + E_2, \; \ldots, \; X_k = \lambda_k F + E_k . \qquad (4)

By comparing equation (3) to equation (4), we may note that weighted factor scores and

measurement errors in CFA can be conceptualized as the true scores and error scores in CTT,

respectively. Correspondingly, reliability of the composite scores based on the one-factor model

is defined as:

\rho_{XX'} = \frac{\sigma_T^2}{\sigma_X^2} = \frac{\sigma^2_{T_1 + T_2 + \cdots + T_k}}{\sigma^2_{X_1 + X_2 + \cdots + X_k}} = \frac{\sigma^2_{T_1 + T_2 + \cdots + T_k}}{\sigma^2_{(T_1 + T_2 + \cdots + T_k) + (E_1 + E_2 + \cdots + E_k)}} = \frac{\sigma^2_{\lambda_1 F + \lambda_2 F + \cdots + \lambda_k F}}{\sigma^2_{(\lambda_1 F + \lambda_2 F + \cdots + \lambda_k F) + (E_1 + E_2 + \cdots + E_k)}} . \qquad (5)

If error terms are uncorrelated with each other and with the latent factor, and σ_F² is fixed to 1,

equation (5) can be further simplified to:

\rho_{XX'} = \frac{(\lambda_1 + \lambda_2 + \cdots + \lambda_k)^2}{(\lambda_1 + \lambda_2 + \cdots + \lambda_k)^2 + (\sigma^2_{E_1} + \sigma^2_{E_2} + \cdots + \sigma^2_{E_k})} = \frac{\left(\sum \lambda_i\right)^2}{\left(\sum \lambda_i\right)^2 + \sum \sigma^2_{E_i}} . \qquad (6)

Both the factor loadings λ_i and the error variances σ_Ei² are model parameters in a CFA model and can be estimated based on sample

data. This coefficient is labeled coefficient ω (McDonald, 1978). More details about coefficient ω are presented in the next section.
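As a hedged illustration of equation (6) (not code from the thesis), coefficient ω can be computed in R from one-factor CFA estimates obtained with the factor variance fixed to 1:

# Coefficient omega (Equation 6) from estimated loadings and error variances (illustrative sketch)
coefficient_omega <- function(loadings, error_variances) {
  sum(loadings)^2 / (sum(loadings)^2 + sum(error_variances))
}
# Hypothetical estimates: six items with loadings of .7 and error variances of .51
coefficient_omega(loadings = rep(0.7, 6), error_variances = rep(0.51, 6))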

Coefficient ω

Coefficient doesn’t require as restrictive assumptions as coefficient does. It is regarded as a generalization of coefficient and is recommended for estimating reliability of homogeneous measurements (Kelley & Cheng, 2012). Coefficient was first introduced by

McDonald (1978) as a reliability coefficient for the congeneric model. It was then extended to a more general form for the hierarchical model underlying multidimensional measures (McDonald,

1999). The generalized version of coefficient ω was referred to as ω_h (Zinbarg, Revelle, &

Yovel, 2005) or ω_t (Revelle & Zinbarg, 2009). This study focuses only on coefficient ω as introduced by McDonald in 1978, that is, reliability based on the congeneric model. The following two sections introduce coefficient ω and its relationship with coefficient α.

Definition and formula for computing coefficient ω. McDonald (1978) proposed coefficient ω as an index of scale reliability for the congeneric model. Coefficient ω is defined as a ratio of the variance of the composite true score to the variance of the observed total score.

Both composite true score variance and total score variance are a function of model parameters in a CFA model. The model parameters are estimated by fitting a one-factor CFA model to the data with an appropriate estimation method, such as maximum likelihood (ML) estimation. If we fix the factor variance to 1.0, coefficient ω is computed as shown in equation (6).


Relationship between coefficient ω and coefficient α. A relationship exists between

coefficient ω and coefficient α. A key assumption for coefficient α is essential tau equivalency. In

the one-factor CFA model, this assumption is equivalent to requiring equal factor loadings across

items. Let λ denote the common factor loading. The essential tau equivalent model can be

presented as:

X_i = \lambda F + E_i . \qquad (7)

However, coefficient ω does not require items to be tau equivalent. In other words, the factor loading λ_i could vary across items. Therefore, coefficient α is a special case of coefficient ω

(Kelley & Cheng, 2012) where the essential tau equivalency assumption is satisfied as I show in detail below.

With UVI identification (i.e., the factor variance fixed to 1.0), for any two items i and i′ (i ≠ i′) with factor loadings λ_i and λ_i′, respectively, the covariance between items i and i′, σ_ii′, can be expressed as:

\sigma_{ii'} = \lambda_i \lambda_{i'} = \lambda^2 . \qquad (8)

The average of the covariances is:

\bar{\sigma}_{ii'} = \frac{\sum_{i \ne i'} \sigma_{ii'}}{k(k-1)} = \frac{\sum_{i \ne i'} \lambda_i \lambda_{i'}}{k(k-1)} = \frac{k(k-1)\lambda^2}{k(k-1)} = \lambda^2 \qquad (9)

where k is the number of items. Given equations (6), (8), and (9), coefficient ω can be rewritten

as:

\omega = \frac{\left(\sum \lambda_i\right)^2}{\left(\sum \lambda_i\right)^2 + \sum \sigma^2_{E_i}} = \frac{\left(\sum \lambda_i\right)^2}{\sigma_X^2} = \frac{k^2 \lambda^2}{\sigma_X^2} = \frac{k^2 \bar{\sigma}_{ii'}}{\sigma_X^2} = \frac{k}{k-1} \cdot \frac{\sum_{i \ne i'} \sigma_{ii'}}{\sigma_X^2} . \qquad (10)

Because the variance of the composite score (σ_X²) consists of the sum of the item variances

(Σ_i σ_i²) and the sum of the covariances between all pairs of items (Σ_{i≠i′} σ_ii′),

\sigma_X^2 = \sum_{i \ne i'} \sigma_{ii'} + \sum_i \sigma_i^2 , \qquad (11)

thus,

\omega = \frac{k}{k-1} \cdot \frac{\sum_{i \ne i'} \sigma_{ii'}}{\sigma_X^2} = \frac{k}{k-1}\left(1 - \frac{\sum_i \sigma_i^2}{\sigma_X^2}\right) = \alpha . \qquad (12)

Therefore, coefficient ω is identical to coefficient α when the tau equivalency assumption is met.

However, this assumption is not likely to hold in practice. So coefficient ω is a more accurate estimate of reliability than coefficient α.
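A hedged numerical check of this point (the loading and error-variance values are hypothetical): with equal loadings, ω from equation (6) matches α from equation (12) computed on the model-implied covariance matrix; with unequal loadings, α falls below ω.

# Compare omega and alpha on a model-implied covariance matrix (illustrative sketch)
compare_omega_alpha <- function(loadings, error_variances) {
  k <- length(loadings)
  Sigma <- tcrossprod(loadings) + diag(error_variances)   # model-implied item covariance matrix
  omega <- sum(loadings)^2 / (sum(loadings)^2 + sum(error_variances))
  alpha <- (k / (k - 1)) * (1 - sum(diag(Sigma)) / sum(Sigma))
  c(omega = omega, alpha = alpha)
}
compare_omega_alpha(rep(0.7, 6), rep(0.51, 6))                       # equal loadings: omega equals alpha
compare_omega_alpha(c(0.9, 0.8, 0.7, 0.6, 0.5, 0.4), rep(0.51, 6))   # unequal loadings: alpha < omega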

Confidence Interval

As an important index of measurement consistency, the reliability coefficient is routinely reported in empirical studies in social and behavior sciences. A point estimate of the reliability coefficient is a single value of the estimate of the population reliability. Because the point estimate is obtained from a given sample, the value is subject to sampling fluctuations and inevitably varies from sample to sample.

Null Hypothesis Significance Testing

The issue of determining the accuracy of estimates of parameters has received increasing attention in the literature. The most commonly used method is null hypothesis significance testing (NHST) (Neyman & Pearson, 1933), which has dominated statistical analysis for a long time in social and behavior studies (Cheung, 2009). A significance level, denoted as α, of .05 or

.01 is routinely employed in tests of significance. The observed p value of a statistic is compared with the predetermined significance level and statistical inference is made based on the result of the comparison. For example, if the p value of the significance testing for the reliability estimate is larger than the significance level (e.g., .05), this reliability coefficient is judged as not significantly different from zero (assuming H0: ρ=0) in the population, and the measure in

question will then be concluded to lack reliability. A small p value is desired to reject a null

Although significance testing for point estimates is commonly used, the appropriateness of its use is still a matter of debate (e.g., Chow, 1988; Cohen, 1994; Nickerson, 2000; Schmidt,

1996; Wainer, 1999). When conducting a null hypothesis significance test, we make conclusions by comparing the obtained p value to a predetermined arbitrary significance level such as .05.

However, it is not wise to make decisions based on such an arbitrary significance level (Neale &

Miller, 1997). An increasing number of researchers have found that it is unacceptable and unreasonable to evaluate scientific findings based on such a binary decision (Cheung, 2009b).

Another limitation of significance testing is that it is sensitive to sample size. The test is more likely to be significant when the sample size is large. To avoid unwise decision-making for these reasons, a better alternative is interval estimation.

Interval Estimation

Point estimation provides only a single possible value of the population parameter. In contrast to point estimation, interval estimation (Neyman, 1935, 1937) provides a range of values which is likely to capture the population value of the parameter of interest. A confidence interval is a type of frequently used interval estimation. A confidence level (often denoted as 1-α, where

α is the Type I error rate) tells how likely it is that the interval includes the population parameter

(Kelley, 2007). A two-sided confidence interval includes a range of values defined by two limits

(i.e., upper and lower limits) with the general form of (Cheung, 2007; Cheung, 2009b; Kelley,

2007; Neyman, 1937):

P(\hat{\theta}_L \le \theta_0 \le \hat{\theta}_U) = 1 - \alpha \qquad (13)

where P denotes probability; θ_0 is the population parameter of interest; and θ̂_L and θ̂_U are a pair of

random variables which determine the random endpoints of a confidence interval (i.e., estimates

of the lower and upper confidence limits based on samples). In the literature, a 95% confidence

interval is most frequently reported, which corresponds to the traditionally accepted level of

NHST (.05) (Šimundić, 2008). A 95% CI means there is a 95% probability (e.g., 95 times out of

100) that the population parameter falls in the interval under repeated sampling. Interval estimation may also be impacted by sample size. However, compared with significance testing, an interval estimate is more informative about the population parameter with a range of plausible values. Reporting a confidence interval along with the point estimate has been strongly recommended for empirical studies. It is said that “Interval estimates should be given for any effect sizes involving principle outcomes” and we should “provide intervals for correlations and other coefficients of associations or variation whenever possible” (Wilkinson & Task Force on

Statistical Inference, 1999, p. 599).

Even though interval estimation has received increasing attention, empirical studies reporting confidence intervals along with their corresponding parameter estimates in general are still limited. One possible reason is the lack of availability in statistical packages and methods for

CI construction for different statistics and psychometric indices (Cheung, 2009b; Steiger &

Fouladi, 1997). Confidence interval estimation for coefficient ω also lacks development and investigation (Padilla & Divers, 2013). In the current study, three methods for CI estimation for coefficient ω are introduced first, and then a comparison of performance among these three methods is conducted through a simulation study. The following sections review these three methods in detail.


Three Approaches to Interval Estimation for Coefficient ω

In this section, three methods for CI construction for coefficient ω are introduced: the

Wald method, the likelihood method, and the BCa bootstrap method.

Wald Method

Confidence intervals constructed with the Wald method are referred to as the Wald

confidence intervals (Wald CIs). Wald CIs are constructed based on the standard error estimate

of the parameter of interest via the Wald statistic (Cheung, 2009b; Raykov, 2002). The

implementation of the Wald CI is usually accomplished with ML estimation. ML estimation is

based on the probability density function which provides the likelihood (i.e., probability) of an

individual case with a normal distribution (Azzalini, 1996; Enders, 2010).

Wald CI for an individual parameter. Let θ be an individual parameter, θ̂ be the

maximum likelihood estimate of θ, and SE(θ̂) be the standard error of the maximum likelihood

estimate. The Wald statistic for θ, which is defined as (θ̂ − θ)/SE(θ̂), asymptotically follows a standard

normal distribution with mean of zero and standard deviation of one (Cheung, 2009b; Enders,

2010), when the following requirements are met: (1) the sample size is sufficiently large; (2) the data meet the assumption of multivariate normality; and (3) θ is estimated using ML. With ML estimation methods, the standard error of the sample estimate of θ is determined by the second derivative of the log-likelihood function of the sample, which “quantifies the curvature of the log-likelihood function” (Enders, 2010, p. 66). The 100(1-α)% Wald CI for θ can be constructed

(Cheung, 2007, 2009a, 2009b; Raykov, 2002) as

\hat{\theta} \pm Z_{1-\alpha/2}\, SE(\hat{\theta}) \qquad (14)

where Z_{1-α/2} is the standard normal score for the (1-α/2)th percentile. For example, a 95%

confidence interval for θ is θ̂ ± 1.96 × SE(θ̂).
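A minimal sketch of equation (14) in R (my own helper function, not code from the thesis):

# Generic Wald confidence interval: estimate plus or minus z times its standard error
wald_ci <- function(estimate, se, level = 0.95) {
  z <- qnorm(1 - (1 - level) / 2)
  c(lower = estimate - z * se, upper = estimate + z * se)
}
wald_ci(estimate = 0.85, se = 0.03)   # hypothetical estimate and SE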

Wald CI for a function of parameters. If the parameter of interest is a function of one

or multiple parameters, the standard error estimate cannot be obtained directly from the second-

derivative of the likelihood function. In this case, we can use the delta method to obtain the

standard error estimate (Casella & Berger, 2002; Cheung, 2009b; Raykov, 2002; Raykov &

Marcoulides, 2004; Oehlert, 1992). The delta method is an analytic approach widely used to

obtain standard error estimates for functions, linear or non-linear, of model parameters (Raykov,

2009). This method has been commonly used to approximate standard error for parameters

which are functions of one or more other parameters, such as partial correlations (Olkin & Finn,

1995), mediating effects (Cheung, 2007), and reliability estimates (Cheung, 2009b, Kelley &

Cheng, 2012; Raykov, 2002, 2004, 2009; Raykov & Marcoulides, 2004).

A brief introduction to the application of the delta method. Under the assumption of

multivariate normality of variables, the asymptotic standard errors of parametric functions can be

derived with the delta method (Ogasawara, 1999). The application of the delta method is

based on the linear “approximation of a smooth parametric function” (p. 623), which has the

properties of local linearity and continuity (Raykov & Marcoulides, 2004). Within the

framework of SEM, many functions of model parameters meet this requirement and can be

regarded as smooth functions, such as indirect effects and the reliability coefficient (Raykov &

Marcoulides, 2004). The essence of the application of the delta method is to approximate the

linear representation of a smooth function at a point of interest using the Taylor expansion (Hart,

1955) of the function.


Functions of a single model parameter. Suppose f(θ) is a function of a single model

parameter θ and it is differentiable (n+1) times in an interval in which the population

value θ_0 is included, then the nth-order Taylor expansion of f(θ) is (Hart, 1955):

f(\theta) = f(\theta_0) + (\theta - \theta_0) f'(\theta_0) + \frac{(\theta - \theta_0)^2}{2!} f''(\theta_0) + \cdots + \frac{(\theta - \theta_0)^n}{n!} f^{(n)}(\theta_0) + R_n \qquad (15)

where f^{(h)}(θ_0) denotes the hth derivative of f(θ) evaluated at θ_0; R_n is the remainder term.

In empirical applications, given the limited improvement provided by higher order expansions over the first-order expansion, the first-order Taylor expansion is routinely used as the linear approximation for a particular function (e.g., Raykov & Marcoulides, 2004; Tan,

1990):

f(\theta) \approx f(\theta_0) + (\theta - \theta_0) f'(\theta_0) . \qquad (16)

Let D stand for the first derivative of the function with regard to θ evaluated at θ_0. The linear approximation of f(θ) can be expressed as:

f(\theta) \approx f(\theta_0) + (\theta - \theta_0) \frac{\partial f(\theta_0)}{\partial \theta} = f(\theta_0) + D(\theta - \theta_0) . \qquad (17)

Taking variances of both sides of equation (17) yields the variance of f(θ). The approximate

standard error of the consistent ML estimate of f(θ) is obtained by taking the square root of the

variance:

SE\big(f(\hat{\theta})\big) = \hat{D} \cdot SE(\hat{\theta}) \qquad (18)

where f(θ̂) is the consistent ML estimate of f(θ); D̂ is the estimate of the first derivative of f(θ); and SE(θ̂) stands for the standard error of the ML estimate of parameter θ. Once the approximate

standard error of f(θ̂) is obtained, the Wald CI for f(θ) can be constructed as:

f(\hat{\theta}) \pm Z_{1-\alpha/2}\, SE\big(f(\hat{\theta})\big) . \qquad (19)
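As a hedged worked instance (my own example, not one used in the thesis), take the squared parameter f(θ) = θ²; its first derivative is D = 2θ, so the delta-method standard error and Wald CI are

SE\big(f(\hat{\theta})\big) \approx \left|2\hat{\theta}\right| SE(\hat{\theta}), \qquad f(\hat{\theta}) \pm Z_{1-\alpha/2} \left|2\hat{\theta}\right| SE(\hat{\theta}) .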

Functions of multiple model parameters. If the parameter is a function of more than one

parameter (i.e., θ_1, θ_2, …, θ_k, k > 1), the approximation of the first-order Taylor expansion of the function should be in the form of partial derivatives with regard to each individual parameter, evaluated at the population values of these parameters (e.g., Hart, 1955; Raykov & Marcoulides,

2004; Tan, 1990), specifically,

f (10,20,...,k 0 ) f  21 ,,( ,...,k )  f (10,20,...,k 0  () 1 10) 1 , (20)

f (10,20,...,k 0 ) f (10,20,...,k 0 )  (2 20) ... (k k 0 ) 2 k

where f (1,2 ,,...,k ) is the parameter of interest; 10,  200 ,..., k are the population values of

f (10,20,...,k 0 ) 1,2 ,,...,k ; and is the partial derivative of the function with regard to  j  j

when evaluated at θ_j0 (θ_j0 = θ_10, θ_20, …, θ_k0). The approximate standard error of f(θ_1, θ_2, …, θ_k) can

be derived by first taking the variance of both sides of the function, and then taking the square

root of the obtained variance. The resulting approximate standard error is shown below (e.g.,

Raykov & Marcoulides, 2004):

SE\big(f(\hat{\theta}_1, \hat{\theta}_2, \ldots, \hat{\theta}_k)\big) = \big[\hat{D}_1^2 \sigma^2(\hat{\theta}_1) + \hat{D}_2^2 \sigma^2(\hat{\theta}_2) + \cdots + \hat{D}_k^2 \sigma^2(\hat{\theta}_k) + 2\hat{D}_1 \hat{D}_2 \sigma(\hat{\theta}_1, \hat{\theta}_2) + 2\hat{D}_2 \hat{D}_3 \sigma(\hat{\theta}_2, \hat{\theta}_3) + \cdots + 2\hat{D}_{k-1} \hat{D}_k \sigma(\hat{\theta}_{k-1}, \hat{\theta}_k)\big]^{1/2} \qquad (21)

where f(θ̂_1, θ̂_2, …, θ̂_k) is the consistent ML estimate of f(θ_1, θ_2, …, θ_k); θ̂_k is the ML estimate

of θ_k; D̂_k is the estimate of the partial derivative of f(θ_1, θ_2, …, θ_k) with regard to θ_k, evaluated at

θ̂_1, θ̂_2, …, θ̂_k; σ²(θ̂_k) is the variance of the ML estimate of a particular parameter; and σ(θ̂_{k-1}, θ̂_k) denotes the covariance between the ML estimates of a pair of parameters. To this end, the Wald CI

for f (1,2 ,,...,k ) can be developed with the delta method.

Wald CI for coefficient ω based on the delta method. The parameter function of interest in the proposed study is the reliability coefficient based on the congeneric model, coefficient ω, which is computed as

\omega = \frac{\left(\sum \lambda_i\right)^2}{\left(\sum \lambda_i\right)^2 + \sum \sigma^2_{E_i}} . \qquad (22)

By using two new notations u and v to denote Σλ_i and Σσ²_Ei, respectively, reliability can be

simplified to (Kelley & Cheng, 2012; Raykov, 2002)

\omega = \frac{u^2}{u^2 + v} . \qquad (23)

Based on the delta method, Kelley and Cheng (2012) and Raykov (2002) derived the

approximate standard error of the ML estimate of coefficient ω:

SE(\hat{\omega}) = \sqrt{\hat{D}_u^2 \sigma^2(\hat{u}) + \hat{D}_v^2 \sigma^2(\hat{v}) + 2 \hat{D}_u \hat{D}_v \sigma(\hat{u}, \hat{v})} , \qquad (24)

where ω̂ is the ML estimate of ω; û and v̂ denote the consistent ML estimates of u and v, respectively; σ²(û) and σ²(v̂) indicate the variances of û and v̂, respectively; and σ(û, v̂) is the covariance between û and v̂. D̂_u is the consistent estimate of the partial derivative of ω with respect to u, evaluated at the population value of u; D̂_v is the consistent estimate of the partial derivative of ω with respect to v, evaluated at the population value of v. The two partial derivatives can be obtained by applying rules of differentiation (Hart, 1955). Once the standard error estimate is obtained, the 100(1-α)% Wald CI for ω can be constructed as:

\hat{\omega} \pm Z_{1-\alpha/2}\, SE(\hat{\omega}) . \qquad (25)
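A hedged sketch of equations (23) through (25) in R. It assumes u_hat and v_hat are the summed loading and summed error-variance estimates from the fitted congeneric model, and acov is their 2-by-2 asymptotic covariance matrix (obtainable from the ACOV of the CFA parameter estimates); none of these objects come from the thesis code.

# Delta-method Wald CI for coefficient omega (illustrative sketch)
omega_wald_ci <- function(u_hat, v_hat, acov, level = 0.95) {
  omega_hat <- u_hat^2 / (u_hat^2 + v_hat)
  D_u <-  2 * u_hat * v_hat / (u_hat^2 + v_hat)^2   # partial derivative of omega with respect to u
  D_v <- -u_hat^2 / (u_hat^2 + v_hat)^2             # partial derivative of omega with respect to v
  se <- sqrt(D_u^2 * acov[1, 1] + D_v^2 * acov[2, 2] + 2 * D_u * D_v * acov[1, 2])
  z <- qnorm(1 - (1 - level) / 2)
  c(estimate = omega_hat, lower = omega_hat - z * se, upper = omega_hat + z * se)
}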

One advantage of using SEM methods with Wald CI formation is that the parameter

estimates and their corresponding variances and covariances, which are needed for the

computation of the related components, are available in SEM applications (Raykov &

Marcoulides, 2004). Multiple approaches are available to obtain these components. The approach

proposed by Raykov (2002) is to introduce four auxiliary variables (e.g., Raykov, 2001; Raykov

& Shrout, 2002) to the congeneric model and impose nonlinear constraints on the model

parameters. In 2009 he outlined a simpler procedure to obtain point and interval estimates for the

reliability coefficient of congeneric measures by introducing the reliability coefficient as a new

parameter via imposing model constraints. Kelley and Cheng (2012) recommended a modified

procedure, which uses the inverse of the information matrix (see Casella & Berger, 2002) and

doesn’t require nonlinear parameter constraints. Given that this approach is relatively simple to

implement, the current study used Kelley and Cheng’s approach to estimate the confidence

interval for coefficient ω.

Likelihood Method

Likelihood-based CIs are constructed using the likelihood method. Unlike a Wald CI,

which is based on the asymptotically normal distribution of the Wald statistic, (θ̂ − θ)/SE(θ̂), a

likelihood-based CI is developed from the asymptotic chi-square distribution of the likelihood

ratio (LR) statistic (Venzon & Moolgavkar, 1988). The LR statistic and likelihood-based CI

construction for a single parameter, and a parameter that is a function of multiple parameters, are

discussed below.

Likelihood ratio statistic. A likelihood ratio test is used to compare the fit of two nested

models, the null model (i.e., restricted model) and the alternative model (i.e., unrestricted model).

The LR statistic tells how many times more likely the data are under one model than the other

(i.e., one model is more likely to be supported by the data than the other). Let θ be a parameter

vector, LogL(θ̂) be the log likelihood statistic under the alternative model evaluated at the ML

estimate θ̂, and LogL(θ̃) be the log likelihood statistic under the null model evaluated at θ̃. The

LR statistic is defined as (Azzalini, 1996; Buse, 1982):

LR = -2\, LogL(\tilde{\theta}) - \big(-2\, LogL(\hat{\theta})\big) = 2\, LogL(\hat{\theta}) - 2\, LogL(\tilde{\theta}) \sim \chi^2(g) \qquad (26)

where LogL(.) denotes the natural logarithm likelihood function of the specified model and g is the number of parameters constrained in the null hypothesis, which is equal to the difference in degrees of freedom (df) between the two models. The deviance (i.e., difference in -2LogL(.)) asymptotically follows a chi-square distribution with the degrees of freedom equal to g (Buse,

1982).

Likelihood-based CI computed via the likelihood function of a single parameter.

Suppose θ is a single parameter and the log likelihood function contains only this single parameter. Because the number of parameters constrained in the null hypothesis is 1, the LR statistic follows a chi-square distribution with one degree of freedom. To construct a two-sided

100(1-α)% likelihood-based CI for θ, two interval limit estimates, θ̂_L and θ̂_U, need to be

identified, where θ̂_L is a point to the left of the ML estimate and θ̂_U is a point to the right of the

ML estimate. At both points the LR statistic is just statistically significant under the chi-square

distribution with df=1 at the predetermined significance level. The formula representations are:

2\, LogL(\hat{\theta}) - 2\, LogL(\hat{\theta}_L) = \chi^2(1, 1-\alpha) \qquad (27)

2\, LogL(\hat{\theta}) - 2\, LogL(\hat{\theta}_U) = \chi^2(1, 1-\alpha) \qquad (28)


where θ̂ denotes the ML estimate of θ; χ²(1, 1-α) is the critical value (i.e., the (1-α)th percentile) of

the LR statistic distribution with df=1 at the significance level of α. Solving the equation below,

we can get solutions for θ̂_L and θ̂_U. Figure 2 illustrates the log likelihood function with regard to

the parameter θ (Azzalini, 1996), where

LogL(\hat{\theta}_L) = LogL(\hat{\theta}_U) = LogL(\hat{\theta}) - \chi^2(1, 1-\alpha)/2 . \qquad (29)

LogL()

LogL()

2 LogL( ) (1,1 ) / 2

L   U

Figure 2: An Illustration of Confidence Interval Computed via the Log-likelihood Function.
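A minimal base-R sketch of equations (27) through (29) follows. The parameter is an exponential rate, chosen only because its log likelihood is one-dimensional; the two interval limits are found by locating the points where the log likelihood drops qchisq(.95, 1)/2 units below its maximum.

# Likelihood-based CI for a single parameter (exponential rate), per eqs. 27-29
set.seed(2)
x <- rexp(50, rate = 2)
loglik <- function(lambda) sum(dexp(x, rate = lambda, log = TRUE))

mle    <- 1 / mean(x)                       # ML estimate of the rate
cutoff <- loglik(mle) - qchisq(.95, df = 1) / 2

# the limits are where the log likelihood crosses the cutoff on either side of the MLE
lower <- uniroot(function(l) loglik(l) - cutoff, c(1e-6, mle))$root
upper <- uniroot(function(l) loglik(l) - cutoff, c(mle, 50))$root
c(lower = lower, upper = upper)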

Likelihood-based CI computed via likelihood function of multiple parameters. The likelihood method is also applicable for a likelihood function of multiple parameters. Suppose that the likelihood function contains a set of parameters, θ1, θ2, …, θt, where θs is the parameter of interest (1 ≤ s ≤ t). To construct the likelihood-based CI for θs, we can use the profile-likelihood method (Venzon & Moolgavkar, 1988).

Let us denote the parameter of interest θs by θ, and let ψ be the vector containing the remaining parameters in the model, θ1, θ2, …, θs−1, θs+1, …, θt. The profile likelihood method treats the remaining parameters as nuisance parameters and maximizes the likelihood function over them (Venzon & Moolgavkar, 1988). Through this method, the nuisance parameters (ψ) are removed from the likelihood function, and only the single parameter of interest (θ) remains. The resulting function is the profile likelihood function for θ. This process can be illustrated by (Patterson, 2014; Venzon & Moolgavkar, 1988):

L_P(θ) = max_ψ L(θ, ψ),   (30)

where L_P(θ) is the profile likelihood function containing only the parameter of interest θ; L(θ, ψ) is the original likelihood function including both θ and ψ; and max_ψ(.) is the maximum function, which returns the largest likelihood among the set of possible likelihood values. For each value of θ, L_P(θ) is "the maximum of the likelihood function over the remaining parameters" (Patterson, 2014, p. 1). A 100(1 − α)% likelihood-based CI for θ can be obtained by applying the LR test,

2LogL(θ̂, ψ̂) − 2LogL_P(θ̂_L) = χ²(1, 1 − α),   (31)

2LogL(θ̂, ψ̂) − 2LogL_P(θ̂_U) = χ²(1, 1 − α),   (32)

where θ̂ and ψ̂ are the ML estimates of θ and ψ, respectively; θ̂_L and θ̂_U are the estimated lower and upper bounds of the 100(1 − α)% likelihood-based CI for θ, at which the LR statistic reaches the chi-square critical value, χ²(1, 1 − α), with one degree of freedom at the α significance level; LogL_P(θ̂_L) is the profile log likelihood for θ when θ is fixed at θ̂_L; and LogL_P(θ̂_U) is the profile log likelihood for θ when θ is fixed at θ̂_U. The likelihood-based CI with approximate confidence level (1 − α) is (θ̂_L, θ̂_U), where θ̂_L and θ̂_U can be obtained by solving equations (31) and (32).
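The profile likelihood idea in equation (30) can be sketched in a few lines of base R. Here the parameter of interest is the mean of a normal sample and the variance is the nuisance parameter that is maximized out for every fixed value of the mean; the function name profile_loglik is illustrative.

# Profile log likelihood for a normal mean, with the variance as nuisance (eq. 30)
set.seed(3)
y <- rnorm(40, mean = 5, sd = 2)

profile_loglik <- function(mu) {
  # for a fixed mu, the ML estimate of the variance is mean((y - mu)^2),
  # so the maximization over the nuisance parameter has a closed form here
  s2 <- mean((y - mu)^2)
  sum(dnorm(y, mean = mu, sd = sqrt(s2), log = TRUE))
}

mle    <- mean(y)
cutoff <- profile_loglik(mle) - qchisq(.95, df = 1) / 2
lower  <- uniroot(function(m) profile_loglik(m) - cutoff, c(mle - 10, mle))$root
upper  <- uniroot(function(m) profile_loglik(m) - cutoff, c(mle, mle + 10))$root
c(lower = lower, upper = upper)   # likelihood-based CI for the mean (eqs. 31-32)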

Likelihood-based CI for coefficient ω. With the profile likelihood method, the likelihood-based CI for coefficient ω is obtainable via SEM. If the parameter of interest is a function of multiple model parameters, a model reparameterization needs to be implemented before applying the profile likelihood method (Patterson, 2014). Because coefficient ω is a function of factor loadings and error variances, which are model parameters in the CFA model, the reparameterization is needed before profiling the likelihood function. Three steps are followed to calculate the profile likelihood CI for coefficient ω. First, identify the log likelihood function of the congeneric model, which consists of model parameters including factor loadings and error variances. Second, reparameterize the log likelihood function of the congeneric model in terms of the target parameter ω. A set of new parameters may be introduced during the model reparameterization process. Last, apply the profile likelihood method to the reparameterized log likelihood function, which contains ω and the other newly introduced parameters. In line with equations (31) and (32), taking ω as θ (i.e., the parameter of interest) and all other parameters in the log likelihood function as ψ (i.e., nuisance parameters), the profile likelihood function for coefficient ω can be derived. Once the profile likelihood function is available, the 100(1 − α)% confidence interval for coefficient ω can be obtained via the LR test. The OpenMx package (Boker et al., 2011) in R can implement these procedures automatically.
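Below is a minimal sketch of how such a likelihood-based CI might be requested in OpenMx, defining ω as an mxAlgebra over the loading and error-variance matrices and asking mxRun() for profile-likelihood intervals. The matrix names, starting values, and the data object dat are illustrative, and argument details may differ across OpenMx versions, so the code should be checked against the package documentation.

# Likelihood-based CI for omega via OpenMx (illustrative sketch)
library(OpenMx)
p    <- 6                                   # number of items
vars <- paste0("x", 1:p)                    # dat is assumed to be an N x p data set

lambda <- mxMatrix("Full", nrow = p, ncol = 1, free = TRUE, values = .5,
                   name = "lambda")                     # factor loadings
theta  <- mxMatrix("Diag", nrow = p, ncol = p, free = TRUE, values = .5,
                   name = "theta")                      # error variances
expCov <- mxAlgebra(lambda %*% t(lambda) + theta, name = "expCov")   # implied covariance
omega  <- mxAlgebra(sum(lambda)^2 / (sum(lambda)^2 + tr(theta)), name = "omega")

model <- mxModel("omegaLL", lambda, theta, expCov, omega,
                 mxData(cov(dat), type = "cov", numObs = nrow(dat)),
                 mxExpectationNormal(covariance = "expCov", dimnames = vars),
                 mxFitFunctionML(),
                 mxCI("omega"))                         # request a profile-likelihood CI

fit <- mxRun(model, intervals = TRUE)
summary(fit)$CI                                         # lower and upper limits for omega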

Bias-corrected and Accelerated Bootstrap Method

The basic idea of the bootstrap method is to estimate the sampling distribution for the

parameter of interest by repeated sampling from an available sample. Bootstrap intervals can be

formed by using different bootstrap techniques, such as the percentile bootstrap, bias-corrected

(BC) bootstrap, bias-corrected and accelerated (BCa) bootstrap, and approximate BCa (ABC)

bootstrap. Among the various bootstrap intervals, the BCa bootstrap interval is generally preferred for two reasons. First, it takes asymmetry, bias, and nonconstant variance into consideration; second, it improves the interval via transformation, bias correction, and acceleration adjustment (DiCiccio & Efron, 1996; Efron, 1987; Kelley & Cheng, 2012). The BCa bootstrap method has been shown to be second-order accurate and correct (DiCiccio & Efron, 1996; Efron, 1987; Kelley & Cheng, 2012). The following section is a brief introduction to the BCa bootstrap method.

A brief introduction to bootstrap techniques. Suppose θ is the parameter of interest and N is the sample size of an available sample S. Generally, four steps are followed to implement the bootstrap technique. The first step is to obtain a bootstrap sample with sample size equal to N, which is achieved by randomly sampling N times from S with replacement, and then to calculate the ML estimate of θ based on the obtained bootstrap sample. On each random selection, each observation in S has an equal probability of being selected, namely 1/N. The second step is to replicate the aforementioned random sampling B times to get B bootstrap samples, each with the same sample size of N, and B estimates of θ based on the B bootstrap samples. The number of bootstrap replications B is supposed to be relatively large (e.g., 1000) (Efron, 1987). The third step is to form the bootstrap estimate of the sampling distribution (i.e., the bootstrap sampling distribution) of θ based on its B estimates yielded from the B bootstrap replications. The last step is to find the upper and lower confidence limits that form the bootstrap confidence interval based on the bootstrap sampling distribution; these are the values of the bootstrap estimates corresponding to the lower and upper percentiles of the distribution. Different lower and upper percentiles may be identified as confidence limits by different bootstrap methods. For example, the (α/2)th percentile and the (1 − α/2)th percentile of the bootstrap distribution are often identified as the lower and upper limits for the percentile bootstrap interval (Kelley & Cheng, 2012; Padilla & Divers, 2013), while the BCa bootstrap method uses different limits for CI construction (see equations (36) and (37) below).
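These four steps can be written out directly in base R. In the sketch below, estimate_omega() stands in for whatever routine fits the one-factor model and returns the ω estimate for a data set (e.g., a wrapper around an SEM package); it is a placeholder, not an existing function, and dat is an assumed N × p matrix of item scores.

# Generic bootstrap of coefficient omega (steps 1-4), with a placeholder estimator
B <- 1000
boot_est <- numeric(B)
N <- nrow(dat)

for (b in 1:B) {
  idx <- sample(1:N, size = N, replace = TRUE)   # step 1: resample N rows with replacement
  boot_est[b] <- estimate_omega(dat[idx, ])      # ML estimate of omega for this sample
}                                                # steps 2-3: the B estimates form the
                                                 # bootstrap sampling distribution
# step 4 (percentile version): limits are the alpha/2 and 1 - alpha/2 percentiles
quantile(boot_est, probs = c(.025, .975))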

Construction of bias-corrected and accelerated bootstrap CI. Let θ̂ be the maximum likelihood estimate of θ based on the original sample and θ̂* denote the ML estimate of θ based on a bootstrap replication. To calculate the two limits (θ̂*_L and θ̂*_U) of the 100(1 − α)% BCa bootstrap CI for θ, two estimates need to be obtained. One is the bias-corrected estimate and the other is the acceleration estimate.

The bias-corrected estimate. The bias-corrected estimate quantifies the degree of asymmetry of the bootstrap sampling distribution and is often denoted as ẑ0 (DiCiccio & Efron, 1996; Efron, 1987; Kelley & Cheng, 2012):

ẑ0 = Φ⁻¹( #(θ̂* < θ̂) / B ),   (33)

where B is the number of bootstrap replications; #(θ̂* < θ̂) indicates the number of bootstrap estimates less than the original estimate, so that #(θ̂* < θ̂)/B is the proportion of bootstrap estimates less than the original estimate; and Φ⁻¹(.) is the inverse of the standard normal cumulative distribution function (c.d.f.) evaluated at a particular value. When the distribution of θ̂* is symmetric, #(θ̂* < θ̂)/B = 1/2 and ẑ0 = 0 (DiCiccio & Efron, 1996). For a positively skewed distribution, θ̂* is negatively biased relative to θ̂, #(θ̂* < θ̂)/B > 1/2, and ẑ0 > 0; for a negatively skewed distribution, θ̂* is positively biased relative to θ̂, #(θ̂* < θ̂)/B < 1/2, and ẑ0 < 0.

The acceleration estimate. The acceleration estimate, denoted as â, quantifies the rate of change of the standard error of θ̂ measured on a normalized scale (Efron, 1987). It is calculated via the jackknife estimation procedure. The basic idea of the procedure is to compute the jackknife value θ̂_(i), which is the estimate of θ when the ith observation is removed from the original sample (Miller, 1974; Kelley & Cheng, 2012). The estimation of the jackknife value is repeated N times, with each of the N observations deleted from the original sample one by one, and the mean of the N jackknife values θ̂_(i) is denoted θ̂_(·). Then, â is computed as (DiCiccio & Efron, 1996; Efron, 1987; Kelley & Cheng, 2012):

â = Σ_{i=1}^{N} (θ̂_(·) − θ̂_(i))³ / { 6 [ Σ_{i=1}^{N} (θ̂_(·) − θ̂_(i))² ]^{3/2} }.   (34)

BCa bootstrap confidence interval limits. Let Ĝ(.) be the bootstrap cumulative distribution function, defined as (DiCiccio & Efron, 1996; Efron, 1987; Kelley & Cheng, 2012):

Ĝ(c) = #(θ̂* ≤ c) / B,   (35)

where c is a constant. When ẑ0 and â are available, the limits (θ̂*_L, θ̂*_U) of the 100(1 − α)% BCa bootstrap confidence interval can be calculated as follows (DiCiccio & Efron, 1996; Efron, 1987; Kelley & Cheng, 2012):

θ̂*_L = Ĝ⁻¹( Φ( ẑ0 + (ẑ0 + Z_{α/2}) / (1 − â(ẑ0 + Z_{α/2})) ) ), and   (36)

θ̂*_U = Ĝ⁻¹( Φ( ẑ0 + (ẑ0 + Z_{1−α/2}) / (1 − â(ẑ0 + Z_{1−α/2})) ) ).   (37)

When ẑ0 = 0 and â = 0, the BCa bootstrap CI equals the percentile CI.
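A minimal base-R sketch of equations (33) through (37) is given below. It again relies on the placeholder estimator estimate_omega() introduced above and assumes the vector boot_est of B bootstrap estimates has already been computed.

# BCa limits from a vector of bootstrap estimates (eqs. 33-37)
alpha     <- .05
omega_hat <- estimate_omega(dat)                      # estimate from the original sample

# bias correction z0 (eq. 33): inverse-normal of the proportion below omega_hat
z0 <- qnorm(mean(boot_est < omega_hat))

# acceleration a-hat (eq. 34) via the jackknife (leave one observation out at a time)
jack <- sapply(1:nrow(dat), function(i) estimate_omega(dat[-i, ]))
d    <- mean(jack) - jack
a    <- sum(d^3) / (6 * sum(d^2)^1.5)

# adjusted cumulative probabilities (the arguments of G-inverse in eqs. 36-37)
z_lo <- qnorm(alpha / 2); z_hi <- qnorm(1 - alpha / 2)
p_lo <- pnorm(z0 + (z0 + z_lo) / (1 - a * (z0 + z_lo)))
p_hi <- pnorm(z0 + (z0 + z_hi) / (1 - a * (z0 + z_hi)))

# BCa limits: the corresponding percentiles of the bootstrap distribution
quantile(boot_est, probs = c(p_lo, p_hi))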

BCa bootstrap CI for coefficient ω. Using the BCa bootstrap method, the general procedure for forming the CI on coefficient ω is: (a) obtain B bootstrap samples by sampling with replacement from the original random sample; (b) estimate coefficient ω via the ML estimation method for each bootstrap sample; (c) construct the bootstrap distribution of the coefficient ω estimates; (d) locate the two percentiles of the empirical bootstrap distribution that correspond to the two cumulative probabilities given in equations (36) and (37); and (e) form the 100(1 − α)% BCa bootstrap CI on coefficient ω with these two percentiles as interval limits.

Kelley and Cheng (2012) recommended the ci.reliability() function in the MBESS package in R to carry out the BCa bootstrap procedure for coefficient ω. This approach was applied in my study to construct the BCa bootstrap CI for ω.
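A hedged usage example is shown below; the argument names follow my reading of the MBESS documentation for ci.reliability() and should be verified against the installed package version.

# BCa bootstrap CI for omega with MBESS (argument names assumed; check ?ci.reliability)
library(MBESS)
ci.reliability(data = dat,            # N x p data frame of item scores
               type = "omega",        # coefficient omega from the congeneric model
               interval.type = "bca", # bias-corrected and accelerated bootstrap
               B = 1000,              # number of bootstrap replications
               conf.level = .95)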

A Comparison among Three Interval Estimation Methods

This section discusses the similarities and differences among the three types of methods

for confidence interval construction. The comparison is discussed from the following six aspects.

Statistical Test

The Wald and likelihood methods involve statistical tests. The Wald CI is constructed based on the Wald statistic, (θ̂ − θ)/SE(θ̂), which asymptotically follows a normal distribution. The construction of the likelihood-based CI uses the LR statistic, which asymptotically follows a chi-square distribution with degrees of freedom equal to g. In contrast, bootstrapping procedures rely on resampling and do not require any parametric statistical test.

Statistics Applied

The Wald and likelihood methods are associated with statistical tests. Differences and relations exist between the Wald statistic and the likelihood ratio (LR) statistic. Squaring the Wald statistic, (θ̂ − θ)/SE(θ̂), yields the other version of the Wald statistic, ((θ̂ − θ)/SE(θ̂))², which asymptotically follows a chi-square distribution (Enders, 2010). This version of the Wald statistic (W) is "the quadratic approximation of the LR statistic by using the second-order Taylor's expansion of the log-likelihood function around the MLE" (Cheung, 2009b, p. 270), specifically,

LogL(θ) ≈ LogL(θ̂) − (1/2) I(θ̂)(θ̂ − θ)²,   (38)

I(θ̂)(θ̂ − θ)² ≈ 2LogL(θ̂) − 2LogL(θ),   (39)

where I(θ̂) is the information matrix, which is obtained by adding a negative sign to the second derivative of LogL(θ) evaluated at θ̂. Taking the square root of the inverse of the information matrix yields the standard error of the parameter, so that

I(θ̂)(θ̂ − θ)² = ((θ̂ − θ)/SE(θ̂))² = W,   (40)

LR = 2LogL(θ̂) − 2LogL(θ),   (41)

W ≈ LR.   (42)

Therefore, these two statistics are asymptotically equivalent, each following a chi-square distribution (Buse, 1982; Cheung, 2009b). In some special cases, the Wald and LR statistics are exactly the same, for example, when the log-likelihood function is quadratic, as for linear models (Buse, 1982).
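The asymptotic equivalence in equation (42) is easy to verify numerically. The base-R sketch below fits a logistic regression, squares the Wald z statistic for one coefficient, and compares it with the LR statistic obtained by dropping that coefficient; the variable names are arbitrary.

# Numerical comparison of the squared Wald statistic and the LR statistic
set.seed(4)
x <- rnorm(300)
y <- rbinom(300, 1, plogis(0.4 * x))

fit_alt  <- glm(y ~ x, family = binomial)
fit_null <- glm(y ~ 1, family = binomial)

W  <- (coef(summary(fit_alt))["x", "z value"])^2            # squared Wald statistic
LR <- as.numeric(2 * (logLik(fit_alt) - logLik(fit_null)))  # LR statistic (eq. 41)
c(W = W, LR = LR)   # close but not identical in a finite sample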


However, differences also exist between the Wald and LR statistics. First, as the log-likelihood function departs from a quadratic shape, the two statistics diverge when the sample size is finite (Buse, 1982). The likelihood function for coefficient ω is a profile likelihood function. Because "each point on the profile likelihood function is the maximum value of a likelihood function" (Patterson, 2014, p. 1) and the form of the profile likelihood function for coefficient ω is not explicitly known, it is unclear how large the difference in estimated confidence intervals between the Wald and likelihood methods will be. Second, the LR test involves the estimation of two models, a constrained model and an unconstrained model, and the comparison of model fit between them, while the Wald test is based on the estimation of one model and is performed by testing the null hypothesis that a set of parameters of interest are simultaneously equal to some constant in the population, usually zero.

Assumption of Multivariate Normality

The Wald and likelihood methods are parametric, while the BCa bootstrap method is

non-parametric and computation-intensive. One major assumption for parametric techniques (the

Wald and likelihood) is multivariate normality, which does not always hold in practice. A major

advantage of the BCa bootstrap method is that it does not depend on this assumption, because it

is a bootstrap resampling technique. Bootstrap confidence intervals are directly constructed using

the empirical distribution of the parameter estimates. When the multivariate normality

assumption is violated, the bootstrap CI might be more accurate than the Wald CI and likelihood-

based CI.

Symmetry

The Wald CI, which is constructed based on the asymptotically normal distribution, is symmetric around the ML estimate. However, symmetry may be inappropriate, especially when parameter estimates are near their boundary limits (Neale & Miller, 1997); in such cases, the confidence interval may include values outside the meaningful boundaries of the parameter. In contrast, the likelihood-based CI, which uses the log-likelihood function directly, can be asymmetric (Cheung, 2007, 2009b). Because the BCa bootstrap CI is based on the empirical distribution of the parameter estimate, it can also be asymmetric.

Sample Size

Given the nature of the related test statistics, both the Wald method and the likelihood method require a large sample size to estimate the confidence interval accurately. The BCa bootstrap method, which relies on resampling, has a relatively relaxed requirement on sample size. Therefore, the BCa bootstrap method may yield sounder confidence intervals than the other two methods for relatively small sample sizes. However, bootstrap techniques do not work well (i.e., they consistently overestimate standard errors) if the sample size is too small (e.g., 100); for sample sizes larger than 200, bootstrap techniques are favored (Nevitt & Hancock, 2001).

Variance/Invariance to Parameter Transformation

CIs constructed with the Wald method are not invariant to transformations on parameters

(i.e., equivalent reparameterizations of a model) (Neale & Miller, 1997), which is a big concern

in SEM because of the existence of many possible equivalent models with different model

parameterizations (Cheung, 2009). By contrast, the likelihood-based CI is invariant to

transformations (Neale & Miller, 1997). Several percentile bootstrap methods are

parameterization invariant, such as the simple percentile bootstrap method, the bias-corrected

(BC) percentile method, and the BCa bootstrap method (DiCiccio & Romano, 1988).

Below is a brief summary for these three methods based on the above discussion:

a. The Wald and likelihood methods depend on the Wald and likelihood ratio statistics, respectively, while the BCa bootstrap method does not use a parametric statistical test.
b. Both the Wald and likelihood methods assume multivariate normality of the variables, while the BCa bootstrap method does not have this requirement.
c. CIs constructed with the likelihood and BCa bootstrap methods can be asymmetric, while CIs formed with the Wald method are symmetric.
d. Both the Wald and likelihood techniques require a large sample size to accurately estimate the CI, while the BCa bootstrap technique has a relatively relaxed requirement on sample size, as long as the sample size is not too small (> 100).
e. CIs constructed through the likelihood and BCa bootstrap methods are invariant to transformations on model parameters, while CIs constructed through the Wald method are not.

Previous Research

The three methods discussed above have been used to construct CIs for various statistics

and psychometric indices in empirical studies. They have also been implemented in several

statistical packages via the SEM approach. Some authors also have provided code and syntax for

these methods in SEM applications. However, a very limited number of studies could be found

on the comparison of the empirical performance of these methods. A few examples of such

studies are presented below.

Cheung (2007) used simulation studies to examine the performance of four methods for

CI construction for mediating effects. The four methods were the Wald, percentile bootstrap,

bias-corrected bootstrap, and likelihood-based methods. The results showed that, with large

mediating effects or large sample sizes, these methods performed equally well in terms of the

coverage probability. For small mediating effects and sample sizes, the bootstrap CI and likelihood-based CI were recommended (Cheung, 2007). Cheung (2009b) illustrated how to form the Wald CI and likelihood-based CI for many statistics and psychometric indices, including dependent correlation coefficients, squared multiple R, standardized regression coefficients, mediating effects, and reliability estimates. He first compared the likelihood-based and Wald CIs constructed with the SEM approach based on a real data set; the results of these empirical data analyses suggested that the two methods were comparable.

In the same paper, Cheung (2009b) presented results from a simulation study on the Pearson correlation and compared coverage probabilities and interval widths from these two methods. The simulation study showed that the likelihood-based CI outperformed the Wald CI in small samples (Cheung, 2009b). In another study, Cheung (2009a) compared six Wald CIs (Sobel-fixed, Aroian-fixed, Sobel-random, Aroian-random, Bobko-Rieck, and SEM-Wald), three bootstrap CIs (naive bootstrap, percentile bootstrap, and BC bootstrap), the likelihood-based CI, and the PRODCLIN CI on standardized indirect effects through a simulation study. The results showed that the percentile bootstrap, BC bootstrap, and likelihood-based methods performed best according to coverage probability (Cheung, 2009a). Ye and Wen (2011) conducted a simulation study to compare three interval estimation methods for coefficient ω: bootstrap, delta (i.e., the Wald method), and a method using the model parameters estimated in LISREL (i.e., point estimate and standard error). However, it is not clear which kind of bootstrap method was chosen in their study; the method using parameter estimates directly from LISREL was not clearly described either. Kelley and Cheng (2012) compared the Wald CI and BCa bootstrap CI on coefficient ω using an empirical example and recommended the BCa bootstrap approach. Padilla and Divers (2013) investigated three different bootstrap CIs for coefficient ω in a simulation study. The study assessed the performance of the normal-theory bootstrap (NTB) CI, percentile-based (PB) CI, and BCa bootstrap CI on coefficient ω under four simulation factors (number of items, correlation type, number of item response categories, and sample size). Categorical items with symmetric distributions were investigated in their study. Although the study compared three bootstrap CIs for coefficient ω, no comparison was made between bootstrap methods and non-bootstrap methods.

The Rationale and Purpose of the Proposed Study

It is noteworthy that a gap exists in the current literature on the evaluation of the

performance of the three methods (i.e., Wald method, likelihood-based method, and bias-

corrected and accelerated method). That is, no simulation study has been conducted to evaluate

the performance of the three approaches to interval construction on coefficient ω. Given that various factors may affect the performance of CI construction methods on coefficient ω (e.g., data distribution, sample size, and variable correlation), it is very hard to identify the practical differences among these CI construction methods with only numerical illustrations. Numerical examples are mainly used for illustrative purposes (Yung & Bentler, 1996). In order to investigate how these CI construction methods would work in different settings, a simulation study is needed.

The current simulation study aims to compare three approaches to constructing CIs for coefficient ω. By comparing the performance of these three approaches in different conditions, I hope that the findings will benefit applied researchers in choosing an appropriate method to estimate reliability.

Based on my literature review, I have the following expectations:

1. When data are multivariate normally distributed, the three methods would perform equally well on CI estimation for coefficient ω.
2. When multivariate normality does not hold, the BCa bootstrap method will perform better than the other two methods, while the performance of the likelihood method will be comparable to the Wald method.
3. As the degree of nonnormality increases, the Wald and likelihood methods for CI estimation for coefficient ω will perform worse, while the BCa bootstrap method will be robust to the departures from normality.
4. As sample size increases, the constructed CIs for coefficient ω should be more accurate for all three methods in terms of interval width.

5. Based on my review, the number of items and factor loadings in a test should not

have an effect on the performance of the three interval estimation methods. However,

in my study I manipulated these two conditions to explore the potential relationship.


METHODS

According to Yung and Bentler (1996), for the purpose of evaluating the appropriateness

of a method, simulation studies make more sense than empirical studies. A Monte Carlo study is an alternative technique for investigating problems when analytical methods are not available (Bandalos & Leite, 2013). It can also be used to "provide insight into the behavior of a statistic even when mathematical proofs of the problem being studied are available", because "theoretical properties [...] do not always hold under real data conditions" (Bandalos & Leite, 2013, p. 627). Given that the comparison of these three methods for constructing a CI on coefficient ω

cannot be achieved through analytical approaches directly and also because the assumptions

required by these methods may not be met in real settings, a simulation study is chosen to

investigate the performance of the three CI estimation methods.

Four factors were manipulated in the simulation design: sample size, number of items,

factor loadings, and degree of nonnormality. As indicated by the derivation of coefficient ω, the number of items affects the magnitude of the reliability coefficient. Sample size is included in the study design because (1) the ML estimator is based on asymptotic theory; (2) the Wald and LR statistics are sensitive to sample size; and (3) given that the performance of bootstrap techniques is also subject to sample size, the design makes it possible to examine the effect of sample size on the BCa bootstrap approach to interval estimation of coefficient ω. Degree of

nonnormality was considered in the study design because the Wald and likelihood methods are

supposed to work well under the multivariate normality assumption while the bootstrap

technique does not require this assumption. “Strictly speaking, test scores are seldom normally

distributed” (Nunnally, 1978, p. 160). Micceri (1989) investigated 440 distributions, which

represented most types of distributions found in applied settings. The results showed that all

distributions were significantly nonnormal at the alpha level of .01. Therefore, I was interested in examining the performance of the different methods for CI construction for coefficient ω under different types of distributions. For the one-factor CFA model, the sizes of the factor loadings directly affect the population correlations among items as well as the reliability coefficient. Therefore, the size of the factor loadings was considered as a design factor. Below I describe the design factors in the simulation study in detail.

Design Factors

Four design factors were manipulated in the study: sample size, number of items, factor loading, and degree of nonnormality.
a. Sample size. Three levels of sample size were investigated: 100, 300, and 500. These three levels were chosen to represent small to moderately large sample sizes in factor analysis (Comrey & Lee, 1992).
b. Number of items. Two levels were considered for the number of items: 6 and 12.
c. Factor loading. The three sets of factor loadings for the 6-item congeneric model were .3, .3, .3, .7, .7, .7; .4, .4, .4, .8, .8, .8; and .6, .6, .6, .9, .9, .9. The sizes of the factor loadings for the 12-item congeneric model were the same as those for the 6-item model, but the number of items with each loading was doubled. The error variance of each item was set to one minus the square of the factor loading; consequently, the variance of the observed scores was one. These levels of factor loadings were selected for application purposes, because the widely accepted level of reliability for measures is .70 or above in most applied settings (Lance et al., 2006; Nunnally, 1978) and the resulting population reliabilities under normal distributions with these levels of factor loadings are very close to or higher than this cutoff point (see Table 3).
d. Degree of nonnormality. Degree of nonnormality is quantified by two statistics, skewness

and kurtosis. In my study, I considered the situation where the source of nonnormality of the

observed scores is due to the nonnormality of the factor scores. Three different combinations

of skewness and kurtosis for factor scores were considered: normal distribution with

skewness of 0 and kurtosis of 0 (Sk=0, K=0), moderately nonnormal distribution with

skewness of 2 and kurtosis of 7 (Sk=2, K=7), and seriously nonnormal distribution with

skewness of 3 and kurtosis of 21 (Sk=3, K=21). More details about the selection of skewness

and kurtosis are provided in the next section.

A total of 54 conditions were considered in data generation, created by fully crossing the four factors (3 levels of sample size × 2 levels of item count × 3 levels of factor loadings × 3 levels of nonnormality). For each condition, 2,000 data sets were generated, resulting in a total of 108,000 (2,000 × 54) data sets. For each dataset, the Wald method, likelihood method, and BCa bootstrap method were used to construct the 95% Wald CI, likelihood-based CI, and BCa bootstrap CI on coefficient ω, respectively.

Data Generation Procedure

Data were generated in R 2.15.0 (R Core Team, 2012) based on the specified conditions.

An example of R code for generating data is provided in Appendix A. Specifically, factor scores following a standard normal distribution (i.e., normal distribution with mean of zero and standard

deviation of one {N (0, 1)}) were first generated for each item. After obtaining standard

normally distributed factor scores, Fleishman’s (1978) power transformation method was used to

transform the normal factor scores to nonnormal scores with the prespecified skewness and

kurtosis (e.g., Fan & Fan, 2005):

Y = a + bX + cX² + dX³,   (43)

where Y is the transformed nonnormal variable obtained via Fleishman's procedure; X is the normally distributed variable; and a, b, c, and d are transformation coefficients (with a = −c) relating to

different degrees of skewness and kurtosis. Table 1 presents the transformation coefficients

corresponding to these three combinations of skewness and kurtosis for factor scores. Fleishman

(1978) provided these coefficients for various skewness and kurtosis. SAS code for obtaining

these coefficients is also available in Fan and Fan (2005).

Table 1: Transformation Coefficients Corresponding to Three Types of Distribution

                                Transformation coefficients
Skewness   Kurtosis     b              c              d
0          0            1              0              0
2          7            0.761585275    0.260022598    0.053072274
3          21           -0.681632225   0.637118193    0.148741042

After obtaining nonnormal factor scores with the above procedure, random error scores

following a standard normal distribution were generated for each item. Observed item scores

were then obtained as a linear combination of the common factor and error scores, weighted by

the corresponding factor loading and standard deviation of error scores, according to the one-

factor CFA model (e.g., Bernstein & Teng, 1989), via

X_ij = λ_i F_j + √(1 − λ_i²) E_ij,   (44)

where X_ij is the observed score for person j on item i; λ_i is the factor loading for item i; F_j denotes the factor score of person j; E_ij denotes the random error score for person j on item i; and √(1 − λ_i²) represents the standard deviation of the error score associated with item i. Item scores

generated in this way were nonnormally distributed.
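A condensed base-R sketch of this generation scheme (a simplified stand-in for the code in Appendix A, not the exact script used in the study) is shown below; the Fleishman coefficients are taken from Table 1 (Sk=2, K=7), and the loading vector matches the 6-item, .3/.7 condition.

# Generate one dataset under the one-factor congeneric model (eqs. 43-44)
set.seed(5)
N   <- 300
lam <- c(.3, .3, .3, .7, .7, .7)                            # factor loadings
b   <- 0.761585275; c_ <- 0.260022598; d <- 0.053072274     # Fleishman coefficients
a   <- -c_

f <- rnorm(N)                                 # standard normal factor scores
f <- a + b * f + c_ * f^2 + d * f^3           # eq. 43: nonnormal factor scores
e <- matrix(rnorm(N * length(lam)), N)        # standard normal error scores
x <- f %*% t(lam) + e %*% diag(sqrt(1 - lam^2))   # eq. 44: observed item scores
colnames(x) <- paste0("x", seq_along(lam))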

Based on the data generation procedure presented above, the skewness and kurtosis of

observed item scores are a function of the skewness and kurtosis of factor scores and

corresponding factor loadings, and thus they are different from the skewness and kurtosis of the

factor scores. Considering the distributions of observed scores most commonly encountered in

practical settings, three different combinations of skewness and kurtosis for factor scores were

carefully selected (i.e., Sk=0, K=0; Sk=2, K=7; and Sk=3, K=21). Skewness and kurtosis of

observed item scores corresponding to these three combinations were calculated and are

presented in Table 2. The skewness of observed item scores ranged from 0.06 to 1.45 and from

0.08 to 2.16, when the skewness of factor scores was 2 and 3, respectively. The kurtosis of

observed item scores ranged from 0.07 to 4.55 and from 0.16 to 12.79, when the kurtosis of

factor scores was 7 and 21, respectively. These values represented low level of skewness and low

to high level of kurtosis in the observed variables. To graphically demonstrate the population

distributions with these levels of skewness and kurtosis, a large dataset with sample size of

1,000,000 was generated for each condition. Histograms for nonnormal distributions were

plotted and displayed in Figure 3 and Figure 4 (see Appendix C).

Data Analysis

CI construction with the three approaches was implemented using the R program. For each generated dataset, a 95% Wald CI and a 95% BCa bootstrap CI for coefficient ω were estimated using the MBESS package, and a 95% likelihood-based CI was formed using the OpenMx package. To evaluate the performance of the three interval estimation methods, two

criteria were used: coverage probability and interval width.


Table 2: Skewness and Kurtosis of Observed Item Scores.

                         Skewness and kurtosis of factor scores
Factor loading   Sk=0, K=0        Sk=2, K=7        Sk=3, K=21
                 Sk      K        Sk      K        Sk      K
.3               0       0        .05     .06      .08     .17
.4               0       0        .13     .18      .19     .5
.6               0       0        .43     .91      .63     2.54
.7               0       0        .68     1.69     1.00    4.66
.8               0       0        1.02    2.88     1.51    8.02
.9               0       0        1.45    4.6      2.15    13.03

Coverage probability and the mean of the interval widths were calculated for each condition. Although the focus of this study was on

CI construction, point estimates were also reported as supplemental evidence for evaluating the

performance of the three CI construction methods. For each point estimate, I computed the

relative bias using the following formula:

Relative bias= ˆ  , (45)  whereˆ and  denote the mean of point estimates and the population coefficient , respectively.

To examine the accuracy of point estimates for coefficient with these three methods, Hoogland and Boomsma’s (1998) criterion was employed. According to their criterion, biases for point estimates with absolute values smaller than .05 are acceptable (Hoogland & Boomsma, 1998).

Coverage probability in this simulation study is defined as "the proportion of estimated CIs that contain the true population coefficient ω" (Padilla & Divers, 2013, p. 82). If the CIs show consistently acceptable coverage under the simulation conditions, the method used to form these CIs performs well. Bradley's (1978) liberal criterion was used to determine acceptable coverage probability; it is defined as 0.5α ≤ α* ≤ 1.5α, where α* is the true Type I error rate and α is the predetermined Type I error rate. This criterion has been employed in simulation studies (e.g., Nevitt & Hancock, 2001; Padilla & Divers, 2013). Based on this criterion, the acceptable coverage for a 95% CI (i.e., a Type I error rate of .05) ranges from .925 to .975 (e.g., Padilla & Divers, 2013). Because the power transformation was performed only on

factor scores and was applied before using equation (44) to obtain the observed item scores, the correlations among the observed items with normal distributions should be the same as those with nonnormal distributions. Consequently, the population coefficient ω is the same across the conditions with different degrees of nonnormality. Therefore, the population coefficient ω under the normal distribution was used to compute the coverage probability (as well as the relative bias of point estimates) for the conditions with nonnormal distributions. To check this empirically, three large datasets with sample size equal to 1,000,000 were generated for each condition with a nonnormal distribution (i.e., Sk=2 and K=7, and Sk=3 and K=21 for factor scores). Coefficient ω was calculated based on each dataset, and the mean of the coefficient ω estimates obtained from the three large datasets was used to approximate the population ω under that condition. The values of coefficient ω under nonnormal distributions were found to be identical to those under the normal distribution to the third decimal place. Interval width, which is the difference between the upper and lower CI limits (ω̂_U − ω̂_L), suggests "the precision of the parameter estimates" (Cheung, 2009b, p. 282). The mean of the interval widths for each condition was calculated. When

the coverage probabilities are the same, the CIs with a smaller interval width are more accurate

and desirable.
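The two evaluation criteria (plus the relative bias in equation (45)) amount to a few summary statistics per condition; a minimal sketch is shown below, assuming lower, upper, and est are hypothetical vectors holding the CI limits and point estimates across the replications of one condition and pop_omega is the population value.

# Summarizing one simulation condition (hypothetical result vectors assumed)
coverage <- mean(lower <= pop_omega & pop_omega <= upper)   # coverage probability
width    <- mean(upper - lower)                             # mean interval width
relbias  <- (mean(est) - pop_omega) / pop_omega             # relative bias, eq. (45)

# Bradley's (1978) liberal criterion for a nominal 95% interval: .925 to .975
acceptable <- coverage >= .925 & coverage <= .975
c(coverage = coverage, width = width, relbias = relbias, acceptable = acceptable)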

To further explore the effects of the four design factors (sample size, degree of nonnormality, factor loading, and number of items) on the performance of the three interval estimation methods (Wald, likelihood, and BCa bootstrap), two statistical tests were conducted in SAS version 9.3 (SAS Institute Inc., 2011): logistic regression and 5-way ANOVA.


Table 3: Population Coefficient ω under Different Conditions

Number of items   Factor loading   Population coefficient ω
6                 .3 .7            .679
6                 .4 .8            .783
6                 .6 .9            .891
12                .3 .7            .809
12                .4 .8            .878
12                .6 .9            .942
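The entries in Table 3 can be reproduced directly from the definition of ω as (Σλ)² / [(Σλ)² + Σθ] with error variances of 1 − λ²; a short base-R check is given below.

# Reproduce the population omega values in Table 3 from the design loadings
pop_omega <- function(lam) sum(lam)^2 / (sum(lam)^2 + sum(1 - lam^2))

loadings6 <- list(c(.3, .3, .3, .7, .7, .7),
                  c(.4, .4, .4, .8, .8, .8),
                  c(.6, .6, .6, .9, .9, .9))

round(sapply(loadings6, pop_omega), 3)                            # .679 .783 .891 (6 items)
round(sapply(loadings6, function(l) pop_omega(rep(l, 2))), 3)     # .809 .878 .942 (12 items)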


The five independent variables for the two tests were sample size, degree of nonnormality, factor

loading, number of items, and CI estimation methods. It is obvious that the assumption of

independence among observed scores for performing regression and ANOVA was violated in

this study. However, I conducted the logistic regression and ANOVA for heuristic purposes. For the logistic regression, the binary outcome variable indicated whether the estimated CI covered the population coefficient ω ("1" = "yes"; "0" = "no"). Fifteen predictors in the logistic regression were identified: sample size, degree of nonnormality, factor loading, number of items, CI estimation method, and the interaction effects between any pair of them. The likelihood-based pseudo R-square for the full model (i.e., the model with all fifteen predictors) was requested from the SAS program. To examine the contribution of an individual predictor to the improvement of

model fit, partialling out other predictors, the partial pseudo R-square was calculated as the

difference between the pseudo R-square of the full model and the reduced model (i.e., the model

excluding that particular predictor). For ANOVA, the dependent variable was interval width.


Given that the sample size of each group in the statistical tests was considerably large (2000 for most of the groups) and statistical tests tend to be significant with large samples (Alhija &

Wisenbaker, 2006), I calculated partial eta squared to evaluate the importance of each individual

predictor (i.e., four different design factors and interval estimation method plus all the two-way

interaction effects). The partial eta squared is interpreted as the proportion of variance in the

outcome variable attributable to a single predictor, partialling out the other predictors.


RESULTS

Fifty-four conditions were yielded by fully crossing all factors: sample size (100, 300,

and 500), number of items (6 and 12), factor loadings (small, medium, and large), and degree of nonnormality of factor scores (Sk=0 and K=0, Sk=2 and K=7, and Sk=3 and K=21). Population coefficient ω values were calculated and are shown in Table 3. For each generated dataset, three types of CIs (the Wald CI, likelihood-based CI, and BCa bootstrap CI) were constructed for coefficient ω using ML. To evaluate the performance of the three interval estimation methods (the Wald, likelihood, and BCa bootstrap methods), coverage probability, interval width, and relative bias of point estimates were calculated for each condition. Logistic regression and 5-way ANOVA were conducted to examine the effects of the design factors on coverage probability and interval width of the constructed CIs, respectively. Results are reported and summarized in this chapter.

Non-convergence

Non-convergence occurred only in the condition with nonnormally distributed data (Sk=3 and K=21), small loadings (.3/.7), and small sample size (100), when the Wald and BCa bootstrap methods were used. Among the 2,000 replications in that condition, five non-convergent replications were found, and they were excluded from further data analysis.

Coverage Probability

Table 4 reports coverage probabilities of the 95% CIs for coefficient ω yielded by the

three methods. Several consistent observations were made from this table.

When the data were multivariate normally distributed, all coverage probabilities were acceptable, according to Bradley’s liberal criterion (1978) (i.e., between 92.5% and 97.5%) and

very close to the prespecified .95 confidence level across all conditions.


Table 4: Coverage Probability for Each Condition.

                                         Coverage probability (%)
                                         6 items                      12 items
Sk & K       Loading   Sample size       Wald    LL      BCa          Wald    LL      BCa
Sk=0, K=0    .3 .7     100               94.90   95.40   94.10        93.90   94.80   94.05
                       300               94.70   94.80   94.40        95.20   95.60   95.10
                       500               95.70   95.60   95.20        93.65   93.60   93.35
             .4 .8     100               94.55   95.30   94.65        94.40   94.55   94.25
                       300               95.40   95.85   95.40        95.15   95.55   95.55
                       500               95.80   95.60   95.40        94.90   95.40   95.25
             .6 .9     100               96.25   95.70   94.60        94.25   94.50   94.35
                       300               94.65   94.65   94.80        95.10   95.05   94.50
                       500               94.15   94.30   94.30        94.65   94.60   94.65
Sk=2, K=7    .3 .7     100               80.75   80.40   89.80        78.35   78.25   89.80
                       300               80.95   81.00   92.00        74.45   74.65   92.35
                       500               81.00   80.75   93.55        75.10   75.10   92.80
             .4 .8     100               77.25   78.00   89.85        75.10   74.50   88.85
                       300               78.00   77.20   92.10        71.85   71.95   92.10
                       500               76.15   76.15   93.15        71.95   71.75   92.70
             .6 .9     100               76.95   75.75   89.75        70.25   68.80   88.50
                       300               73.15   72.50   91.10        70.40   70.45   91.75
                       500               72.90   72.20   91.30        70.40   68.60   91.85
Sk=3, K=21   .3 .7     100               77.20   75.45   87.30        70.30   67.65   84.95
                       300               69.00   68.20   87.90        62.15   61.15   85.95
                       500               64.60   63.55   88.05        59.60   59.15   86.70
             .4 .8     100               73.50   70.90   86.25        64.35   61.95   84.50
                       300               64.30   63.45   86.45        56.55   55.40   85.40
                       500               63.05   62.25   87.45        56.95   55.50   87.55
             .6 .9     100               67.25   64.05   84.70        61.65   59.25   83.35
                       300               57.25   56.85   85.05        53.65   52.60   85.30
                       500               58.15   57.75   87.30        53.70   53.20   88.00

Note: Sk & K denote the skewness and kurtosis of factor scores. Wald, LL, and BCa stand for the Wald method, likelihood method, and BCa bootstrap method, respectively. Coverage probabilities lower than 92.50% were considered too low.

Figure 5 in Appendix D shows that the three lines representing coverage probabilities for the three methods almost overlapped with each other and were very close to the .95 reference line. This indicated that the three methods performed equally well with normal data in terms of coverage probability.

When the data were nonnormally distributed, none of the coverage probabilities reached .95 and only a few were acceptable. The acceptable coverage probabilities occurred when the BCa bootstrap method was used, in conditions with a large sample size (500), small or moderate sets of factor loadings (.3/.7 and .4/.8), and skewness and kurtosis of factor scores equal to 2 and 7. None of the three methods performed well with nonnormal data, given that most coverage probabilities were below the lower bound of the acceptable range (92.5%). However, the BCa bootstrap method clearly outperformed the other two methods, because the coverage probabilities with this method tended to be higher than those with the other two methods. As shown in Figure 6 and Figure 7 (see Appendix D), the line representing coverage probabilities for the BCa bootstrap method was above the lines for the Wald and likelihood methods. In addition, coverage probabilities decreased as sample size

increased when the Wald and likelihood methods were used, while they tended to increase as

sample size increased when the BCa bootstrap method was used.

As evidenced from the results of logistic regression for coverage probability in Table 5,

the partial pseudo R-square associated with degree of nonnormality was .02 (the pseudo R-square

of the full model was .183 and the pseudo R-square of the reduced model, excluding degree of

nonnormality, was .162). Although the magnitude of the partial pseudo R-square was small,

degree of nonnormality has the largest pseudo R-square among all the predictors (see Table 5).

As degree of nonnormality increased, coverage probabilities decreased across all conditions,

regardless of estimation method (see Table 4 and Figures 5, 6, and 7 in Appendix D). The second

largest partial pseudo R-square was attributable to interval estimation method, which was .015. The third largest partial pseudo R-square was due to the interaction effect between interval estimation method and degree of nonnormality (.008), indicating that the advantage of the BCa method over the other two methods became more pronounced as the data departed further from normality. The other predictors and interaction effects did not appear to contribute much to the model (most with a partial

pseudo R-square lower than .005).

Table 5: Results from Logistic Regression and ANOVA.

Main and interaction effects        Partial pseudo R-square from         Partial eta squared from
                                    logistic regression analysis         ANOVA for interval width
                                    for coverage probability
Method                              .015                                 .06
Sample size                         .001                                 .24
Nonnormality                        .020                                 .02
Loading                             .001                                 .31
#Item                               .001                                 .15
Method * Sample size                .003                                 .00
Method * Nonnormality               .008                                 .03
Method * Loading                    .001                                 .00
Method * #Item                      .001                                 .00
Sample size * Nonnormality          .001                                 .00
Sample size * Loading               0                                    .04
Sample size * #Item                 0                                    .02
Nonnormality * Loading              .001                                 .00
Nonnormality * #Item                .001                                 .00
Loading * #Item                     0                                    .01

Note: #Item means the number of items.


Interval Width

The mean of the interval widths across all legitimate replications for each condition is reported in Table 6 and Table 7, along with the means of the upper and lower bounds of the CIs. When the data were multivariate normal, interval widths for the three CI construction methods were fairly similar across all conditions. Figures 8, 9, and 10 in Appendix E show that, with normally distributed data, the three lines indicating interval widths for the three methods almost overlapped, suggesting that the performance of the three methods was comparable for multivariate normal data in terms of interval width. When data were nonnormally distributed, interval widths using the Wald and likelihood methods were comparable with each other but much narrower than those using the BCa bootstrap method.

Hence, although the BCa bootstrap method performed better than the other two methods according to coverage probability, it yielded a greater degree of uncertainty in estimating the population parameter due to the wider interval width. To find out which method produced the more accurate estimate of interval width, I approximated population CIs for ω in different conditions. For each condition, 20,000 datasets were generated and the sample coefficient ω was calculated for each dataset. The obtained point estimates were then sorted by size, and the 2.5th and 97.5th percentiles were located as the estimated lower and upper limits of the population CI in that condition. Results showed that when data were nonnormally distributed, interval widths estimated using the BCa bootstrap method were closer to the population values, whereas interval widths using the Wald and likelihood methods were negatively biased. To conclude, although the intervals estimated using the BCa bootstrap method were wider than those from the other two methods, they were more accurate. Figure 11 (see Appendix F) exhibits the means of the CI bounds (the mean of the upper bounds and the mean of the lower bounds) and the corresponding population coefficient ω in different conditions. It can be observed that the population coefficient ω was located equally well within the CIs for all three methods (i.e., almost in the middle of the two CI bounds, not near the extremes).

Three general patterns can be observed from Table 6 and Table 7, regardless of the estimation methods: interval widths became narrower as (1) factor loadings became higher

(graphical illustrations were provided in Figure 10 in Appendix E); (2) sample size became larger (see Figure 8 in Appendix E); and (3) the number of items became larger (see Figure 9 in

Appendix E). As evidenced by the 5-way ANOVA shown in Table 5, the largest proportion of variance in interval widths was attributable to the main effect of loading (partial eta squared of .31), followed by sample size and number of items, with partial eta squared values of .24 and .15, respectively. Interval construction method explained 6% of the variance in interval widths, controlling for the other independent variables. On the other hand, the partial eta squared for degree of nonnormality was small, accounting for only 2% of the variation in interval widths. When the data were nonnormal, interval widths using the Wald and likelihood methods were comparable but much narrower than those using the BCa bootstrap method. The interaction effect between the CI construction method and the degree of nonnormality accounted for 3% of the variation in interval widths.

Although not the focus of this study, I found that these three two-way interaction effects had non-zero partial eta squared: sample size by factor loading (4%), sample size by number of items (2%), and factor loading by number of items (1%). All other interaction effects had nearly zero partial eta squared values. Given that the interaction effect between the interval estimation method and sample size was negligible, sample size had the same effects on the three methods.


In other words, the BCa bootstrap method showed similar sensitivity to sample size as the other two methods.

Relative Bias of Point Estimates

As supplemental evidence for evaluating the performance of the three CI construction methods on coefficient ω, I also computed the relative bias of the point estimates for each condition. Table 8 presents these results. The absolute values of the relative biases were very close to zero, and none of them was greater than .03. Based on Hoogland and Boomsma's (1998) criterion that relative biases with absolute values smaller than .05 indicate negligible bias, I concluded that the point estimates of coefficient ω resulting from the three methods were comparable and acceptable. Given that the performance of point estimation for coefficient ω with the three methods was similar, the comparison of interval estimation among these methods becomes all the more important.


Table 6: Interval Widths for 6 Items.

Sk & K       Loading (ω)      Sample   Wald                    Likelihood              BCa bootstrap
                              size     Lower  Upper  Width     Lower  Upper  Width     Lower  Upper  Width
Sk=0, K=0    .3 .7 (.679)     100      .579   .771   .192      .568   .762   .194      .561   .757   .197
                              300      .624   .734   .110      .620   .731   .110      .619   .730   .111
                              500      .635   .721   .085      .633   .719   .086      .632   .718   .086
             .4 .8 (.783)     100      .717   .847   .130      .709   .841   .132      .705   .838   .134
                              300      .745   .820   .075      .742   .818   .075      .742   .817   .076
                              500      .754   .812   .058      .752   .810   .058      .751   .810   .059
             .6 .9 (.891)     100      .855   .923   .068      .850   .919   .069      .849   .918   .069
                              300      .871   .909   .039      .869   .908   .039      .869   .908   .039
                              500      .875   .905   .030      .874   .904   .030      .874   .904   .030
Sk=2, K=7    .3 .7 (.679)     100      .571   .766   .195      .559   .757   .198      .521   .779   .258
                              300      .621   .732   .111      .617   .729   .111      .595   .753   .158
                              500      .635   .720   .086      .633   .718   .086      .615   .740   .125
             .4 .8 (.783)     100      .706   .841   .135      .698   .835   .137      .665   .855   .190
                              300      .74    .817   .076      .738   .815   .077      .718   .835   .117
                              500      .751   .810   .059      .750   .808   .059      .734   .827   .093
             .6 .9 (.891)     100      .849   .919   .071      .844   .916   .072      .823   .928   .106
                              300      .869   .908   .039      .867   .907   .039      .855   .919   .064
                              500      .874   .904   .030      .874   .904   .030      .864   .914   .051
Sk=3, K=21   .3 .7 (.679)     100      .561   .760   .199      .549   .751   .202      .500   .776   .276
                              300      .614   .727   .113      .610   .723   .113      .577   .762   .184
                              500      .629   .716   .087      .627   .714   .087      .599   .752   .153
             .4 .8 (.783)     100      .692   .833   .141      .683   .826   .143      .641   .849   .207
                              300      .735   .813   .078      .733   .811   .078      .704   .841   .138
                              500      .747   .806   .060      .745   .805   .060      .721   .834   .113
             .6 .9 (.891)     100      .842   .916   .074      .837   .912   .075      .808   .926   .118
                              300      .865   .906   .040      .864   .904   .040      .846   .922   .076
                              500      .872   .902   .031      .871   .902   .031      .855   .919   .064

Note: (ω) gives the population ω in a specific condition. Sk & K denote the skewness and kurtosis of factor scores. Lower, Upper, and Width are the mean lower bound of the CIs, the mean upper bound of the CIs, and the mean interval width, respectively.


Table 7: Interval Widths for 12 Items.

Sk & K       Loading (ω)      Sample   Wald                    Likelihood              BCa bootstrap
                              size     Lower  Upper  Width     Lower  Upper  Width     Lower  Upper  Width
Sk=0, K=0    .3 .7 (.809)     100      .751   .862   .111      .745   .857   .112      .742   .855   .113
                              300      .776   .840   .064      .774   .838   .064      .773   .838   .064
                              500      .783   .833   .049      .782   .832   .049      .782   .832   .050
             .4 .8 (.878)     100      .841   .912   .071      .837   .909   .072      .835   .908   .073
                              300      .858   .899   .041      .857   .897   .041      .856   .897   .041
                              500      .862   .894   .032      .861   .893   .032      .861   .893   .032
             .6 .9 (.942)     100      .924   .958   .034      .922   .957   .035      .921   .956   .035
                              300      .932   .952   .020      .931   .951   .020      .931   .951   .020
                              500      .934   .950   .015      .934   .949   .015      .934   .949   .015
Sk=2, K=7    .3 .7 (.809)     100      .740   .856   .116      .734   .851   .117      .701   .870   .170
                              300      .773   .837   .065      .771   .836   .065      .752   .855   .103
                              500      .781   .831   .050      .780   .830   .050      .765   .847   .082
             .4 .8 (.878)     100      .831   .907   .076      .827   .903   .076      .803   .918   .114
                              300      .854   .896   .042      .854   .896   .042      .839   .909   .069
                              500      .861   .893   .032      .860   .892   .032      .849   .904   .055
             .6 .9 (.942)     100      .920   .956   .036      .918   .954   .036      .905   .962   .057
                              300      .931   .951   .020      .930   .950   .020      .923   .957   .035
                              500      .931   .951   .020      .933   .949   .015      .928   .955   .028
Sk=3, K=21   .3 .7 (.809)     100      .729   .850   .121      .722   .844   .122      .681   .866   .185
                              300      .768   .834   .066      .766   .832   .066      .739   .860   .121
                              500      .778   .828   .051      .776   .827   .051      .754   .855   .101
             .4 .8 (.878)     100      .827   .904   .078      .822   .901   .078      .790   .917   .126
                              300      .852   .894   .042      .850   .893   .043      .830   .913   .082
                              500      .858   .891   .032      .857   .890   .033      .841   .910   .069
             .6 .9 (.942)     100      .916   .954   .038      .914   .952   .038      .896   .960   .064
                              300      .929   .949   .021      .928   .949   .021      .917   .959   .041
                              500      .932   .948   .016      .932   .948   .016      .923   .958   .035

Note: (ω) gives the population ω in a specific condition. Sk & K denote the skewness and kurtosis of factor scores. Lower, Upper, and Width are the mean lower bound of the CIs, the mean upper bound of the CIs, and the mean interval width, respectively.


Table 8: Relative Bias of Point Estimates for Coefficient ω.

                                          Relative bias
                                          6 items                      12 items
Sk & K       Loading   Sample size        Wald    LL      BCa          Wald    LL      BCa
Sk=0, K=0    .3 .7     100                -.006   -.006   -.006        -.002   -.002   -.002
                       300                .000    .000    .000         -.001   -.001   -.001
                       500                -.001   -.001   -.001        -.001   -.001   -.001
             .4 .8     100                -.001   -.001   -.001        -.001   -.001   -.001
                       300                .000    .000    .000         .000    .000    .000
                       500                .000    .000    .000         .000    .000    .000
             .6 .9     100                -.002   -.002   -.002        -.001   -.001   -.001
                       300                -.001   -.001   -.001        .000    .000    .000
                       500                -.001   -.001   -.001        .000    .000    .000
Sk=2, K=7    .3 .7     100                -.016   -.016   -.016        -.012   -.012   -.012
                       300                -.004   -.004   -.004        -.004   -.004   -.004
                       500                -.001   -.001   -.001        -.002   -.002   -.002
             .4 .8     100                -.011   -.011   -.011        -.010   -.010   -.010
                       300                -.005   -.005   -.005        -.003   -.003   -.003
                       500                -.004   -.004   -.004        -.001   -.001   -.001
             .6 .9     100                -.007   -.007   -.007        -.004   -.004   -.004
                       300                -.002   -.002   -.002        -.001   -.001   -.001
                       500                -.001   -.001   -.001        -.001   -.001   -.001
Sk=3, K=21   .3 .7     100                -.028   -.028   -.027        -.025   -.025   -.025
                       300                -.013   -.013   -.013        -.010   -.010   -.010
                       500                -.009   -.009   -.009        -.007   -.007   -.007
             .4 .8     100                -.024   -.024   -.024        -.015   -.015   -.015
                       300                -.010   -.010   -.010        -.006   -.006   -.006
                       500                -.008   -.008   -.008        -.005   -.005   -.005
             .6 .9     100                -.013   -.013   -.013        -.007   -.007   -.007
                       300                -.007   -.007   -.007        -.003   -.003   -.003
                       500                -.004   -.004   -.004        -.002   -.002   -.002

Note: Sk & K denote the skewness and kurtosis of factor scores. Wald, LL, and BCa stand for the Wald method, likelihood method, and BCa bootstrap method, respectively.


DISCUSSION AND CONCLUSIONS

Increasing numbers of researchers have recommended coefficient ω as a reliability estimate of composite scores (Green & Yang, 2009; Kelley & Cheng, 2012; McDonald, 1978; Raykov, 1997a, 1997b, 1998a, 1998b, 2001). Given that the information provided by a point estimate of coefficient ω is very limited, empirical researchers have been encouraged to report the more informative interval estimate for coefficient ω in their studies. However, CI estimation for coefficient ω lacks development and investigation (Padilla & Divers, 2013). Several approaches are available for constructing CIs on coefficient ω, such as the Wald method,

approaches are available for constructing CIs on coefficient , such as the Wald method,

likelihood method, and various types of bootstrap methods. All of these approaches can be

implemented within SEM. But the comparison of the performance of these approaches has not

been well addressed in the literature until now.

The purpose of this study is to compare and evaluate three approaches (i.e., the Wald,

likelihood, and BCa bootstrap) to CI construction on coefficient ω in the SEM framework through a simulation study. Four design factors were manipulated in the simulation study: (1) sample size, (2) number of items, (3) degree of nonnormality, and (4) factor loading. Data were generated for 54 conditions, which were created by fully crossing the four factors, and 2,000 datasets were generated under each condition. For each dataset, three types of CIs for coefficient ω were constructed using the three methods. The performance of the three methods was evaluated in terms of relative bias of point estimates, coverage probability, and interval width. Logistic regression and 5-way ANOVA were conducted to examine the related main and interaction effects. In the following sections, I summarize the major findings from the simulation study based on my expectations of the results laid out in Chapter 3, and then discuss the major findings. Finally, I list limitations of this study as well as a few directions for future

Major Findings from Simulation Study

Several important findings can be concluded from the results of the current simulation

study. Major findings are presented below.

First, I expected that when the data were normally distributed, the three approaches to CI construction of coefficient would perform equally well across all conditions. Results from the simulation study support my expectation. As Table 4, Table 6, and Table 7 illustrated, coverage probabilities were acceptable and comparable, and interval widths were also fairly similar across the three methods for normally distributed data (see also Appendix D and Appendix E).

Second, I expected that when the data were nonnormally distributed, the BCa bootstrap method would outperform the other two methods because the BCa bootstrap is a non-parametric

resampling technique, while both Wald and likelihood methods are parametric techniques and

require multivariate normality of variables. This expectation is supported by the results. For

the Wald and likelihood methods, all coverage probabilities were far below the prespecified .95 coverage level. For the BCa bootstrap method, the coverage probabilities were higher than those of the Wald and likelihood methods across all conditions; however, they were also below the prespecified .95 coverage level. These results indicate that the BCa bootstrap method is a better choice than the other two, but it is not robust to departures from normality. Although CIs constructed with the BCa bootstrap method reflected greater uncertainty in estimating the population parameter than the Wald and likelihood methods, as evidenced by their wider intervals, the BCa bootstrap method yielded more accurate intervals than the other two methods.


Third, as I expected, the performance of the three methods became worse as the degree of nonnormality increased. It can be observed from Table 4, Table 6, and Table 7 (see also

Appendix D and Appendix E) that as the degree of nonnormality increased, coverage probabilities became lower and interval widths became wider. There were also small interaction effects between CI construction method and degree of nonnormality (the partial eta squared was .03 for interval widths, and the partial pseudo R-square was .008 for coverage probabilities). This indicates that the degree of nonnormality had a less detrimental impact on CI construction for the BCa bootstrap method than for the Wald and likelihood methods.

Fourth, I expected that as sample size increased, the performance of the three interval estimation methods would improve. My expectation is partially supported by the results. As Figure 8 (see Appendix E) displays, as sample size increased, interval widths became narrower. Results from the ANOVA in Table 5 show that sample size explained 24% of the variance in interval widths. However, sample size did not affect the performance of the three methods in terms of coverage probability; the partial pseudo R-square for sample size in the logistic regression was only about .001.

Last, although I did not have specific expectations about the effects of factor loading and the number of items on the performance of the three interval estimation methods, I was interested in exploring these two factors. As evidenced by the ANOVA results in Table 5, loading explained 31% of the variance in interval widths, the largest percentage of variance explained among all main and interaction effects. It can be seen from Table 6 and

Table 7 that as loadings became larger, interval widths became narrower. This pattern is also observable in Figure 10 in Appendix E. However, loading did not seem to affect coverage probabilities (the partial pseudo R-square was only .001). From the logistic regression, coverage probabilities for the three methods were insensitive to the number of items, with a partial pseudo

R-square of .001. However, results from the ANOVA show that the number of items explained a moderate proportion of the variability in interval widths (partial eta squared of .15). In Tables 6 and 7, as the number of items increased, interval widths decreased across all conditions (see Figure 9 in Appendix E for graphical illustrations). These results suggest that the number of items has a considerable effect on interval width.

Discussion and Suggestions

Findings from the current study are now compared with those of previous studies.

Cheung (2009b) illustrated, with empirical examples, that the Wald CI and likelihood-based CI

are similar for the reliability estimate. This is consistent with the results of the current study, in which the Wald and likelihood methods were comparable, with an odds ratio of .97 (not reported in a table) in

logistic regression analysis for coverage probability. Cheung (2009b) showed in the same study

that the likelihood-based CI outperformed the Wald CI for the Pearson correlation in small

samples. However, the current study found that these two methods performed very similarly.

One possible reason for the inconsistency in findings between previous studies and my study is

that coefficient omega is a complicated function of model parameters and the profile likelihood

function is only an approximation of the real likelihood function. As Cheung discussed:

Although it is suggested in this article that likelihood-based CIs are better alternatives to

Wald CIs, readers should be cautioned the likelihood-based CIs were constructed via the

numerical approximation of the profile likelihood functions… so it is possible that the

numerical approximations might fail to find the likelihood-based CIs, especially when the

parameters are complicated functions with many nonlinear constraints. (Cheung, 2009b,

p. 286)


Kelley and Cheng (2012) compared the CIs for coefficient omega obtained from the Wald and

BCa bootstrap methods, based on a real dataset. The results showed that the CIs estimated with the two methods were fairly similar. They speculated that this was because "sample size may be sufficiently large and/or multivariate normality may hold approximately" (p. 46). This result is consistent with the current simulation study: when data are multivariate normal, CIs estimated with these two methods are comparable in terms of both coverage probability and interval width.

Findings of the simulation study suggest that: (1) when the assumption of multivariate normality holds for the data, the three methods are comparable for CI estimation on coefficient omega; (2) when data are nonnormally distributed, the BCa bootstrap is better than the Wald and likelihood methods, although the uncertainty of the point estimates (as evidenced by the wider interval widths) is greater. The selection of an appropriate method for CI construction should be based on several considerations, such as time cost and the availability of statistical packages. The

BCa bootstrap method is more time-consuming than the other two methods because of its computer-intensive resampling, and it also has potential inconsistency across bootstrap replications due to random resampling variability (Preacher & Selig, 2012). The

Wald method is available in most statistical programs, while the likelihood and BCa bootstrap techniques are less accessible.
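To illustrate the resampling step, the sketch below computes a BCa bootstrap CI for coefficient omega with the boot package. It is only an illustration, not the implementation used in this study: it assumes a one-factor model fitted to standardized items with factanal(), and the loadings and sample size are hypothetical values chosen to resemble the simulation conditions. Fixing the seed is what makes a particular set of resamples reproducible, which is one way to manage the replication-to-replication variability noted above.

    library(boot)

    # Simulate one illustrative dataset from a one-factor congeneric model (hypothetical values)
    set.seed(1234)
    lam0 <- c(.3, .4, .6, .7, .8, .9)                        # hypothetical factor loadings
    eta  <- rnorm(300)                                       # factor scores
    dat  <- sapply(lam0, function(l) l*eta + sqrt(1 - l^2)*rnorm(300))

    # Coefficient omega from a one-factor ML factor analysis of a resampled dataset
    omega_stat <- function(d, idx) {
      l <- as.numeric(factanal(d[idx, ], factors = 1)$loadings)   # standardized loadings
      sum(l)^2 / (sum(l)^2 + sum(1 - l^2))                        # omega for standardized items
    }

    boot_out <- boot(dat, omega_stat, R = 1000)              # 1000 bootstrap resamples
    boot.ci(boot_out, conf = 0.95, type = "bca")             # BCa bootstrap CI for coefficient omega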

Limitations and Future Research

In the current simulation study, three approaches to confidence interval construction were compared based on their empirical performance in terms of coverage probability and interval width. However, given the generalization issue that most simulation studies encounter, conclusions based on the results of the current study may not be valid in other applications without examination of the conditions in that specific situation. Several limitations and concerns regarding the simulation design were identified; these suggest directions for further research. Below is a brief summary of the limitations and concerns for the simulation study.

First, the data generated in this study were continuous. Considering the common use of categorical data in empirical studies, it would be worth assessing the three interval estimation methods with categorical data in future research.

Second, the design factors and the levels of these factors considered in the current study are limited. Other factors influencing the accuracy of interval estimation may exist and may be worth investigating. One potential factor is the estimation method. ML estimation was used throughout the current study. It is widely known that ML estimation is "a normal theory method" because it "assumes that the population distribution for the endogenous variables is multivariate normal" (Kline, 2010, p. 112). For nonnormally distributed data, alternative estimation methods should be considered, such as robust ML, weighted least squares, and so on. For example, Maydeu-Olivares, Coffman, and Hartmann (2007) conducted a Monte Carlo study to investigate NT (normal theory) versus ADF (asymptotically distribution-free) confidence intervals for coefficient alpha.
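As an illustration of what such an alternative might look like, the sketch below requests a robust ML fit of the one-factor model in the lavaan package. This is not the software used in the current study, and the data frame and variable names (y1-y6) are hypothetical; for example, the matrix simulated in the earlier bootstrap sketch could be converted and renamed for this purpose.

    library(lavaan)

    # dat: a hypothetical data frame of six item scores with columns named y1-y6
    # (e.g., dat <- as.data.frame(dat); names(dat) <- paste0("y", 1:6))
    model <- 'F =~ y1 + y2 + y3 + y4 + y5 + y6'        # one-factor congeneric model
    fit <- cfa(model, data = dat, estimator = "MLR")   # ML point estimates with robust standard errors
    summary(fit, fit.measures = TRUE)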

It may also be meaningful to include more levels of each factor in the current study. For example, the number of sample sizes examined could be increased. In the current study, this factor had values of 100, 300, and 500. Future studies may consider adding a larger sample size, such as 1000, particularly for the ADF estimation method, given that ADF requires large samples.

Last, the three interval estimation methods were evaluated for constructing CIs for coefficient omega, which is a reliability coefficient for the congeneric model. However, the performance of these approaches to interval construction on reliability estimates for more general or more complicated models is unknown. Future studies may be needed to address this research question.


APPENDIX A

R-CODES FOR DATA GENERATION

1. Generate normally distributed data in R.

nd<-2000                                 #Number of data set
list_data<-rep(0,nd)
for(dd in 1:nd){
  nvar<-6                                #Number of variables
  ss<-100                                #sample size
  f<-rnorm(ss,0,1)                       #random factor scores
  f1<-c(0.3,0.3,0.3,0.7,0.7,0.7)         #factor loadings
  x<-matrix(0,ncol=nvar,nrow=ss)
  for(i in c(1,2,3,4,5,6)) {
    el<-rnorm(ss,0,1)
    for(j in 1:ss) {
      x[j,i]<-f1[i]*f[j]+sqrt(1-f1[i]^2)*el[j]
    }
  }
  name1<-paste("normal_data", dd, ".csv", sep="")
  name1_list<-paste("normal_data", dd, ".csv", sep="")
  write.table(x, name1, sep=",", quote=FALSE, col.names=FALSE, row.names=FALSE)
  list_data[dd]<-name1_list
}

2. Generate nonnormal data in R using Fleishman's method.

nd<-2000                                 #Number of data set
list_data<-rep(0,nd)
h<-10000+nd
cond<-1
name_file<-paste("cond=",cond, sep="")
dir.create(name_file)
for(dd in 10001:h) {
  nvar<-6                                #Number of variables
  ss<-100                                #sample size
  f1<-c(0.3,0.3,0.3,0.7,0.7,0.7)         #factor loadings
  y<-matrix(0,ncol=nvar,nrow=ss)
  f<-rnorm(ss,0,1)
  ft <- f
  for(j in 1:ss) {
    #Fleishman's power transformation formula: ft[j] <- a+b*f[j]+c*f[j]^2+d*f[j]^3
    #SK=2, K=7
    b<-0.761585275
    c<-0.260022598
    d<-0.053072274
    a<--c                                #resulting nonnormally distributed factor scores with SK=2 and K=7
    #SK=3, K=21
    #b<--0.681632225
    #c<-0.637118193
    #d<-0.148741042
    #a<--c                               #resulting nonnormally distributed factor scores with SK=3 and K=21
    ft[j] <- a+b*f[j]+c*f[j]^2+d*f[j]^3
    for(i in c(1,2,3,4,5,6)) {
      el<-rnorm(1,0,1)
      y[j,i]<-f1[i]*ft[j]+sqrt(1-f1[i]^2)*el   #resulting nonnormally distributed observed scores
    }
  }
  name2<-paste(name_file,"/data", dd, ".csv", sep="")
  name2_list<-paste("data", dd, ".csv", sep="")
  write.table(y, name2, sep=",", quote=FALSE, col.names=FALSE, row.names=FALSE)
  list_data[dd]<-name2_list
}
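As a check on the generating model, the population coefficient omega implied by a set of loadings can be computed directly, because each item has unit variance and error variance 1 - (loading)^2 under this model. The short sketch below is added here only for illustration and is not part of the original generation code.

#Population coefficient omega implied by the generating model:
#omega = (sum of loadings)^2 / [(sum of loadings)^2 + sum of error variances]
pop_omega <- function(lambda) {
  sum(lambda)^2 / (sum(lambda)^2 + sum(1 - lambda^2))
}
pop_omega(c(0.3,0.3,0.3,0.7,0.7,0.7))    #loadings used in the generation code above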


APPENDIX B

CODES FOR CONFIDENCE INTERVAL ESTIMATION IN R

library(MBESS)                                   #the ci.reliability() function below is provided by the MBESS package (Kelley, 2007)

data1 <- read.csv("data1.csv", header=FALSE)     #read one generated dataset into a data frame

ci.reliability(data = data1, type = "omega", analysis.type = "default",
               interval.type = "wald", conf.level = 0.95)             #construct a 95% Wald CI for coefficient omega

ci.reliability(data = data1, type = "omega", analysis.type = "default",
               interval.type = "ll", conf.level = 0.95)               #construct a 95% likelihood-based CI for coefficient omega

ci.reliability(data = data1, type = "omega", analysis.type = "default",
               interval.type = "bca", B = 1000, conf.level = 0.95)    #construct a 95% BCa bootstrap CI for coefficient omega
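In the full simulation these calls were repeated for every generated dataset. The loop below is a minimal sketch of how that could be organized, assuming the normally distributed datasets written by the Appendix A code (normal_data1.csv, ..., normal_data2000.csv) sit in the working directory and that each returned object is simply stored in a list; it is an illustration rather than the exact code used in this study.

#assumes the MBESS package has already been loaded as above
nd <- 2000
wald_results <- vector("list", nd)               #container for the returned CI objects
for (dd in 1:nd) {
  dat <- read.csv(paste("normal_data", dd, ".csv", sep=""), header=FALSE)
  wald_results[[dd]] <- ci.reliability(data = dat, type = "omega", analysis.type = "default",
                                       interval.type = "wald", conf.level = 0.95)
}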


APPENDIX C

NONNORMAL DISTRIBUTIONS OF OBSERVED SCORES

[Figure 3 presented histograms of the observed scores for Items 1 through 6 (N = 1,000,000 per item) generated from factor scores with Sk = 2 and K = 7. The sample skewness/kurtosis of the observed scores were: Item 1 = 0.054/0.059, Item 2 = 0.128/0.174, Item 3 = 0.430/0.910, Item 4 = 0.684/1.692, Item 5 = 1.024/2.877, and Item 6 = 1.452/4.602.]

Figure 3. Distributions of Observed Scores with Sk=2 and K=7 for Factor Scores
Note: Factor loadings for items 1 to 6 are .3, .4, .6, .7, .8, .9, respectively. Sk and K denote skewness and kurtosis, respectively.

[Figure 4 presented histograms of the observed scores for Items 1 through 6 (N = 1,000,000 per item) generated from factor scores with Sk = 3 and K = 21. The sample skewness/kurtosis of the observed scores were: Item 1 = 0.082/0.166, Item 2 = 0.185/0.501, Item 3 = 0.632/2.538, Item 4 = 1.004/4.662, Item 5 = 1.508/8.024, and Item 6 = 2.154/13.03.]

Figure 4. Distributions of Observed Scores with Sk=3 and K=21 for Factor Scores
Note: Factor loadings for items 1 to 6 are .3, .4, .6, .7, .8, .9, respectively. Sk and K denote skewness and kurtosis, respectively.

APPENDIX D

EFFECTS OF DESIGN FACTORS ON COVERAGE PROBABILITY

[Figures 5 through 7 plotted coverage probability (%) against sample size (100, 300, 500) for the BCa, likelihood, and Wald methods, along with upper, lower, and nominal reference lines (high_ref, low_ref, ref). Panels were arranged by number of items (6 vs. 12) and factor loading condition (.3/.7, .4/.8, .6/.9).]

Figure 5. Coverage Probabilities for Normally Distributed Data
Figure 6. Coverage Probabilities for Moderately Nonnormally Distributed Data
Figure 7. Coverage Probabilities for Seriously Nonnormally Distributed Data

APPENDIX E

EFFECTS OF DESIGN FACTORS ON INTERVAL WIDTH

[Figure 8 plotted interval width against sample size (100, 300, 500), Figure 9 plotted interval width against the number of items (6 vs. 12), and Figure 10 plotted interval width against the factor loading condition, in each case for the BCa, likelihood, and Wald methods across the nonnormality conditions (Sk/K = 0/0, 2/7, 3/21).]

Figure 8. Interval Widths on Different Levels of Sample Size
Figure 9. Interval Widths on Different Levels of Item Count
Figure 10. Interval Widths on Different Levels of Factor Loading
Note: The level of loading "1" indicates the set of loadings .3/.7; the level of loading "2" indicates the set of loadings .4/.8; and the level of loading "3" indicates the set of loadings .6/.9.

APPENDIX F

MEANS OF CI BOUNDS AND POPULATION COEFFICIENT OMEGA

[Figures 11 and 12 displayed the means of the lower and upper CI bounds for the BCa, likelihood, and Wald methods, together with the population coefficient omega, for each combination of nonnormality (Sk/K), factor loading condition (.3/.7, .4/.8, .6/.9), and sample size (100, 300, 500).]

Figure 11. Means of CI Bounds and Population Coefficient Omega for 6 Items
Figure 12. Means of CI Bounds and Population Coefficient Omega for 12 Items

REFERENCES

Alhija, F. N., & Wisenbaker, J. (2006). A Monte Carlo study investigating the impact of item parceling strategies on parameter estimates and their standard errors in CFA. Structural Equation Modeling, 13, 204-228.

Azzalini, A. (1996). Statistical inference: Based on the likelihood. London: Chapman & Hall.

Bandalos, D. L., & Leite, W. (2013). Use of Monte Carlo studies in structural equation modeling research. In Hancock, G. R., & Mueller, R. O. (Eds.), Structural equation modeling: A second course (2nd ed., pp. 625-666). Information Age Publishing.

Becker, G. (2000). How important is transient error in estimating reliability? Going beyond simulation studies. Psychological Methods, 5, 370–379.

Bernstein, I. H., & Teng, G. (1989). Factoring items and factoring scales are different: Spurious evidence for multidimensionality due to item categorization. Psychological Bulletin, 105, 467-477.

Boker, S., Neale, M., Maes, H., Wilde, M., Spiegel, M., Brick, T. R., et al. (2011). OpenMx: An open source extended structural equation modeling framework. Psychometrika, 76, 306- 317.

Bradley, J. V. (1978). Robustness? British Journal of Mathematical and Statistical Psychology, 31(2), 144-152.

Brownell, W. A. (1933). On the accuracy with which reliability may be measured by correlating test halves. Journal of Experimental Education, 1, 204-215.

Buse, A. (1982). The likelihood ratio, Wald, and Lagrange multiplier tests: An expository note. The American Statistician, 36(3), 153-157.

Casella, G., & Berger, R. L. (2002). Statistical inference (2nd ed.). Pacific Grove, CA: Duxbury/Thomson Learning.

Cheung, M. W.-L. (2009a). Comparison of methods for constructing confidence intervals of standardized indirect effects. Behavior Research Methods, 41(2), 425-438.

Cheung, M. W.-L. (2009b). Constructing approximate confidence intervals for parameters with structural equation models. Structural Equation Modeling: A Multidisciplinary Journal,16(2), 267-294.

Cheung, M. W. L. (2007). Comparison of approaches to constructing confidence intervals for mediating effects Using structural equation models. Structural Equation Modeling: A Multidisciplinary Journal, 14(2), 227-246.


Chow, S. L. (1988). Significance test or effect size? Psychological Bulletin, 103, 105–110.

Cohen, J. (1994). The world is round (p < .05). American Psychologist, 49, 997–1003.

Comrey, A. L., & Lee, H. B. (1992). A first course in factor analysis. Hillsdale, NJ: Erlbaum.

Cortina, J. M. (1993). What is coefficient alpha? An examination of theory and applications. Journal of Applied Psychology, 78(1), 98-104.

Crocker, L., & Algina, J. (1986). Introduction to classical and modern test theory. Toronto: Holt, RineHart, and Winston, Inc.

Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16(3), 297-334.

Cronbach, L. J., & Shavelson, R. J. (2004). My current thoughts on coefficient alpha and successor procedures. Educational and Psychological Measurement, 64, 391-418.

DiCiccio, T. J., & Efron, B. (1996). Bootstrap confidence intervals. Statistical Science, 11(3), 189-212.

DiCiccio, T. J., & Romano, J. P. (1988). A review of bootstrap confidence intervals (with discussion). J. R. Statist. Soc., B, 50, 338-370.

Efron, B. (1987). Better bootstrap confidence intervals. Journal of the American Statistical Association, 82, 171-185.

Enders, C. K. (2010). Applied missing data analysis. Guilford Press.

Fan, X., & Fan, X. (2005). Using SAS for Monte Carlo simulation research in SEM. Structural Equation Modeling, 12(2), 299–333.

Feldt, L. S., & Brennan, R. L. (1989). Reliability. In R. L. Linn (Ed.), Educational measurement (3rd ed.), Washington, DC: American Council on Education.

Fleishman, A. I. (1978). A method for simulating non-normal distributions. Psychometrika, 43, 521-532.

Graham, J. M. (2006). Congeneric and (essentially) tau-equivalent estimates of score reliability: What they are and how to use them. Educational and Psychological Measurement, 66(6), 930-944.

Green, S.B. (2003). A coefficient alpha for test-retest data. Psychological Methods, 8, 88–101.


Green, S.B., Akey, T.M., Fleming, K.K., Hershberger, S.L., & Marquis, J.G. (1997). Effect of the number of scale points on chi-square fit indices in confirmatory factor analysis. Structural Equation Modeling, 4, 108–120.

Green, S.B., & Hershberger, S.L. (2000). Correlated errors in true score models and their effect on coefficient alpha. Structural Equation Modeling, 7, 251–270.

Green, S. B., & Yang, Y. (2009). Commentary on coefficient alpha: A cautionary tale. Psychometrika, 74, 121–135.

Guttman, L. (1945). A basis for analyzing test-retest reliability. Psychometrika, 10, 255-282.

Hart, W. L. (1955). Calculus. Boston: D. C. Heath and Company.

Hoogland, J. J., & Boomsma, A. (1998). Robustness studies in covariance structure modeling: An overview and a meta-analysis. Sociological Methods & Research, 26, 329-367.

Jöreskog, K. G. (1971). Statistical analysis of sets of congeneric tests. Psychometrika, 36, 109-133.

Kelley, K. (2007). Confidence intervals for standardized effect sizes: Theory, application, and implementation. Journal of Statistical Software, 20(8), 1-23.

Kelley, K., & Cheng, Y. (2012). Estimation of and confidence interval formation for reliability coefficients of homogeneous measurement instruments. Methodology, 8(2), 39–50.

Killeen, P. R. (2005). An alternative to null-hypothesis significance tests. Psychological Science, 16(5), 345-353.

Kline, R. B. (2010). Principles and practice of structural equation modeling (3rd ed.). The Guilford Press.

Kutner, M. H., Nachtsheim, C. J., Neter, J., & Li, W. (2005). Applied linear statistical models (5th ed.). New York: The McGraw-Hill Companies, Inc.

Lance, C. E., Butts, M. M., & Michels, L. C. (2006). The sources of four commonly reported cutoff criteria: What did they really say? Organizational Research Methods, 9, 202-220.

Lord, F. M., & Novick. M. R. (1968). Statistical theories of mental test scores. Reading MA: Addison-Wesley.

Maydeu-Olivares, A., Coffman, D. L., & Hartmann, W. M. (2007). Asymptotically distribution-free (ADF) interval estimation of coefficient alpha. Psychological Methods, 12(2), 157–176.

McDonald, R. P. (1978). Generalizability in factorable domains: Domain validity and generalizability. Educational and Psychological Measurement, 38, 75-79.


McDonald, R. P. (1999). Test theory: Unified treatment. Lawrence Erlbaum Associates.

Micceri, T. (1989). The unicorn, the normal curve, and other improbable creatures. Psychological Bulletin, 105, 156-166.

Miller, M. B. (1995). Coefficient alpha: A basic introduction from the perspectives of classical test theory and structural equation modeling. Structural Equation Modeling, 2(3), 255-273.

Miller, R. G. (1974). The jackknife – A review. Biometrika, 61, 1-15.

Neale, M. C., & Miller, M. B. (1997). The use of likelihood-based confidence intervals in genetic models. Behavior Genetics. 27(2), 113-120.

Nevitt, J., & Hancock, G. R. (2001). Performance of bootstrapping approaches to model test statistics and parameter standard error estimation in structural equation modeling. Structural Equation Modeling, 8(3), 353-377.

Neyman J. (1935). On the problem of confidence intervals. The Annals of Mathematical Statistics, 6, 111–116.

Neyman J. (1937). Outline of a theory of statistical estimation based on the classical theory of probability. Philosophical Transaction of the Royal Society of London. Series A, Mathematical and Physical Sciences, 236, 333–380.

Neyman, J., & Pearson, E. S. (1933). The testing of statistical hypotheses in relation to probabilities a priori. Mathematical Proceedings of the Cambridge Philosophical Society, 29, 492–510.

Nickerson, R. S. (2000). Null hypothesis significance testing: A review of an old and continuing controversy. Psychological Methods, 5, 241–301.

Nunnally, J. C. (1978). Psychometric theory (2nd ed.). New York: McGraw-Hill.

Oehlert, G. W. (1992). A note on the delta method. The American Statistician, 46(1), 27-29.

Ogasawara, H. (1999). Standard errors for matrix correlations. Multivariate Behavioral Research, 34(1), 103-122.

Olkin, I., & Finn, J. D. (1995). Correlation redux. Psychological Bulletin, 118, 155–164.

Padilla, M. A., & Divers, J. (2013). Bootstrap interval estimation of reliability via coefficient omega. Journal of Modern Applied Statistical Methods, 12(1), 78-89.

Patterson, D. (2014). Profile likelihood confidence intervals for GLM’s. Retrieved from http://www.math.umt.edu/patterson/ProfileLikelihoodCI.pdf on February 12, 2014.


Preacher, K. J., & Selig, J. P. (2012). Advantages of Monte Carlo confidence intervals for indirect effects. Communication Methods and Measures, 6, 77-98.

R Core Team (2012). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org/.

Raykov, T. (1997a). Estimation of composite reliability for congeneric measures. Applied Psychological Measurement, 21, 173-184.

Raykov, T. (1997b). Scale reliability, Cronbach’s coefficient alpha, and violations of essential tau equivalence with fixed congeneric components. Multivariate Behavioral Research, 32, 329-353.

Raykov, T. (1998a). Coefficient alpha and composite reliability with interrelated nonhomogeneous components. Applied Psychological Measurement, 22, 375-385.

Raykov, T. (1998b). A method for obtaining standard errors and confidence intervals of composite reliability for congeneric items. Applied Psychological Measurement, 22, 369- 374.

Raykov, T. (2000). A method for examining stability in reliability. Multivariate Behavioral Research, 35(3), 289-305.

Raykov, T. (2001). Estimation of congeneric scale reliability using covariance structure analysis with nonlinear constraints. British Journal of Mathematical and Statistical Psychology, 54, 315-323.

Raykov, T. (2002). Analytic estimation of standard error and confidence interval for scale reliability. Multivariate Behavioral Research, 37(1), 89-103.

Raykov, T. (2004). Point and interval estimation of reliability for multiple- component measuring instruments via linear constraint covariance structure modeling. Structural Equation Modeling: A Multidisciplinary Journal, 11(3), 342-356.

Raykov, T. (2009). Evaluation of scale reliability for unidimensional measures using latent variable modeling. Measurement and Evaluation in Counseling and Development, 42(3), 223-232.

Raykov, T., & Marcoulides, G. A. (2004). Using the Delta method for approximate interval estimation of parameter functions in SEM. Structural Equation Modeling: A Multidisciplinary Journal, 11(4), 621-637.

Raykov, T., & Shrout, P. E. (2002). Reliability of scales with general structure: Point and interval estimation using a structural equation modeling approach. Structural Equation Modeling, 9(2), 195-212.


Revelle, W., & Zinbarg, R. E. (2009). Coefficients alpha, beta, omega, and the GLB: Comments on Sijtsma. Psychometrika, 74(1), 145–154.

Rozeboom, W.W. (1966). Foundations of the theory of prediction. Homewood: Dorsey.

Rulon, P. J. (1939). A simplified procedure for determining the reliability of a test by split- halves. Harvard Education Review, 9, 99-103.

SAS Institute Inc. (2011). Base SAS® 9.3 procedures guide. Cary, NC: SAS Institute Inc.

Schmidt, F. L. (1996). Statistical significance testing and cumulative knowledge in psychology: Implications for training of researchers. Psychological Methods, 1, 115–129.

Sijtsma, K. (2009). On the use, the misuse, and the very limited usefulness of Cronbach's alpha. Psychometrika, 74, 107–120. doi:10.1007/s11336-008-9101-0

Šimundić, A.-M. (2008). Confidence interval. Biochemia Medica, 18(2), 154-161.

Steiger, J.H., & Fouladi, R.T. (1997). Noncentrality interval estimation and the evaluation of statistical models. In Harlow, L. L., Mulaik, S. A., & Steiger, J. H. (Eds.). What if there were no significance tests? Mahwah, NJ: Lawrence Erlbaum Associates.

Steinberg, L. (2001). The consequences of pairing questions: Context effects in personality measurement. Journal of Personality and Social Psychology, 81, 332–342.

Steinberg, L., & Thissen, D. (1996). Uses of item response theory and the testlet concept in the measurement of psychopathology. Psychological Methods, 1, 81–97.

Tan, S. J. (1990). Applied calculus. Boston: Kent.

Thompson, B. L., Green, S. B., & Yang, Y. (2010). Assessment of the maximal split-half coefficient to estimate reliability. Educational and Psychological Measurement, 70, 232- 251.

Venzon, D. J., & Moolgavkar, S. H. (1988). A method for computing profile-likelihood-based confidence intervals. Journal of the Royal Statistical Society, Series C (Applied Statistics), 37(1), 87-94.

Wainer, H. (1999). One cheer for null hypothesis significance testing. Psychological Methods, 4, 212–213.

Wainer, H., & Kiely, G. L. (1987). Item clusters and computerized adaptive testing: A case for testlets. Journal of Educational Measurement, 24, 185–201.


Wilkinson, L., & the Task Force on Statistical Inference. (1999). Statistical methods in psychology journals: Guidelines and explanations. American Psychologist, 54, 594-604.

Yang, Y., & Green, S. B. (2010). A note on structural equation modeling estimates of reliability. Structural Equation Modeling: A Multidisciplinary Journal, 17(1), 66-81.

Yang, Y., & Green, S. B. (2011). Coefficient alpha: A reliability coefficient for the 21st century?. Journal of Psychoeducational Assessment, 29(4), 377-392.

Ye, Bao-J., & Wen, Zhong-L. (2011). A comparison of three confidence intervals of composite reliability of a unidimensional test. Acta Psychologica Sinica, 43(04), 453-461.

Yen, W. M. (1993). Scaling performance assessments: Strategies for managing local item dependence. Journal of Educational Measurement, 30, 187–214.

Yung, Y. F., & Bentler, P. M. (1996). Bootstrapping techniques in analysis of mean and covariance structures. In G. A. Marcoulides & R. E. Schumacker (Eds.), Advanced structural equation modeling: Issues and techniques. Mahwah, NJ: Lawrence Erlbaum Associates, Inc.

Zimmerman, D. W. (1975). Probability spaces, hilbert spaces, and the axioms of test theory. Psychometrika, 40, 395–412.

Zimmerman, D.W., Zumbo, R.D., & Lalonde, C. (1993). Coefficient alpha as an estimate of test reliability under violation of two assumptions. Educational and Psychological Measurement, 53, 33–49.

Zinbarg, R. E., Revelle, W., Yovel, I., & Li, W. (2005). Cronbach's α, Revelle's β, and McDonald's ωH: Their relations with each other and two alternative conceptualizations of reliability. Psychometrika, 70(1), 123–133.


BIOGRAPHICAL SKETCH

Jie Xu earned her Master’s degree in the School of Chinese as a Second Language from

Sun Yat-sen University in China in 2012. In fall 2012, she joined the Measurement and Statistics master's program in the Department of Educational Psychology and Learning Systems at Florida State University.

During her master’s study, she has been a graduate assistant and teaching assistant for several graduate courses in the Department of Educational Psychology and Learning System.

Her major research interests include methodological studies and applications in structural equation modeling and item response theory.
