
Factor Analysis

Arnab Maity ~ NCSU Department of Statistics ~ 5240 SAS Hall ~ 919-515-1937 ~ amaity[at]ncsu.edu

Contents

Introduction
  Exploratory Factor Analysis (EFA)
  Confirmatory Factor Analysis (CFA)
Exploratory Factor Analysis
  Single factor model
  The k-factor model
  Estimation methods for EFA
    The principal factor analysis
    Maximum likelihood factor analysis
    Other methods
  How to choose the number of factors?
    Hypothesis testing
    Parallel analysis
    Fit indices
  Predicting the Factor Scores
  Factor Rotation
PCA vs. EFA
  Similarities
  Differences
Confirmatory Factor Analysis (CFA)
  Model fitting in R

Introduction

The primary purpose of factor analysis is to describe the structure of multiple variables in terms of a few underlying, unobservable, random variables called factors. Karl Pearson, Charles Spearman, and others were the proponents of modern factor analysis models in the early 20th century.[1] Spearman proposed his "single factor" theory of intelligence in 1904. Specifically, Spearman considered several measures of the mental ability of children (examination scores in several subjects such as classics, French, English, mathematics, and music) and proposed that a single unobserved factor can explain the relationships among these variables.[2]

[1] See Johnson and Wichern (2007); also Fabrigar and Wegener (2011), Exploratory Factor Analysis, Oxford University Press.

[2] Spearman called the factor "general intelligence," g.

In general, suppose the observable variables can be grouped by their correlations.[3] Then it might be that a single underlying factor controls the variables in each group. In other words, variables or factors like "intelligence" cannot be measured directly; we can only observe their impact through some observable variables, or manifest variables, and infer the underlying factors by inspecting the covariance among the manifest variables. Factor analysis formally attempts to find and confirm such structures. There are two types of factor analysis, as we discuss below.

[3] All variables in a group are highly correlated among themselves but have small correlations with variables in other groups.

Exploratory Factor Analysis (EFA)

Exploratory factor analysis is used to investigate whether any factors underlie the covariance structure among the manifest variables, without making assumptions about which factors are related to which of the manifest variables. The main reasons for performing such an analysis are to

• investigate the structure of the covariance relationships among the manifest variables,

• reduce the number of variables/dimension reduction, and

• score different attributes via so-called factor scores.

The dimension reduction aspect of factor analysis often plays an essential role in practice. In the situation where one observes data on a large number of variables but only has a few observations (small sample size), factor analysis can help reduce the number of variables by grouping highly correlated variables.

The model used in EFA is called the common factor model; that is, a set of common factors contributes to the covariance among the manifest variables. In the case of Spearman's single factor model, suppose the manifest variables are X = (X1, ..., Xp).[4] Then the single factor model is

X1 − µ1 = λ1 F + u1
  ⋮
Xp − µp = λp F + up.

[4] This model can be generalized to accommodate multiple factors as well.

Here F is an unobserved random variable called the common factor, λi (called a loading; unknown) quantifies the strength of the association between the common factor and the i-th manifest variable Xi, and ui is an unobserved random variable that describes the part of Xi not explained by the common factor F. Our goal is to estimate/predict all the unknown components on the right-hand side of the equations.

As a quick example, consider the Hemangioma data.[5] The dataset contains age (in days) at the time of surgery and the expression of seven genetic markers (RB, p16, DLK, Nanog, C.Myc, EZH2, IGF.2) for infants who were surgically treated for hemangioma. We can see from the correlation plot in Figure 1 that there is some grouping among the variables. At this point, we do not have any concrete hypothesis about how many factors there might be, what the factors are, or which of the factors are related to which of the manifest variables. This is a situation where EFA can be applied.

[5] Table 8.2 of Applied Multivariate Statistics with R by Daniel Zelterman. New York: Springer.

[Figure 1: Correlation plot of the hemangioma data.]

Confirmatory Factor Analysis (CFA)

Exploratory factor analysis is typically used in preliminary/pilot studies to find whether a factor analysis is useful for a given multivariate dataset. EFA is used to determine how many factors there might be and how they are related to the manifest variables. The second type of factor analysis, confirmatory factor analysis, seeks to formally test whether a pre-specified factor model fits the covariance among the manifest variables well enough.

For example, let us consider the ability data (actually a correlation matrix) in the MVA package (Figure 7.1 in Everitt and Hothorn). Six variables were recorded [Calsyn and Kenny (1977)] for 556 eighth-grade students. The variables are self-concept of ability (SCA), perceived parental evaluation (PPE), perceived teacher evaluation (PTE), perceived friend's evaluation (PFE), educational aspiration (EA), and college plans (CP).

Calsyn and Kenny (1977) postulated that there are two factors, and that they relate to the manifest variables as shown in Figure 2. The variables in the ellipses are factors, and the variables in the squares are manifest variables. Confirmatory factor analysis can be used here to formally test whether this specific factor model fits the data well.

[Figure 2: Postulated ability factor model.]


Exploratory Factor Analysis

Single factor model

Let us begin with the single factor model. Suppose we observe the variables (manifest variables) X1, ..., Xp for each individual. Since the covariances of the manifest variables are central to factor analysis, we can assume that the manifest variables all have mean zero. The single factor model is as follows.[6]

[6] Thus the X's are related to each other only through the common factor, F.

X1 = λ1 F + u1,
X2 = λ2 F + u2,
  ⋮
Xp = λp F + up.

The main components of this model are:

• F is the latent (unobservable) variable, called the common factor. The common factor is shared among the observed variables.

• Xi's are observed variables (through the sample), called the manifest variables.

• ui's are unique to each Xi, and are called the specific factors.

• λi's are called factor loadings. The loadings determine the strength of the relationship between the common factor and the observed variables.

To estimate the loadings[7] and interpret the factor properly, we use the following assumptions.

[7] While the single factor model looks like a regression model, the difficulty is that all the quantities on the right-hand side (F, λi, and ui) are unknown.

• The common factor has zero mean and unit variance:

E(F) = 0 and var(F) = 1.

• The common factor and the specific factors are uncorrelated:

cov(F, ui) = 0 for all i = 1, ..., p.

• The specific factors have mean zero but unknown variances:

E(ui) = 0 and var(ui) = ψi for all i.

• The specific factors are uncorrelated:[8]

cov(uj, uk) = 0 when j ≠ k.

[8] Conditional on F, the X variables are uncorrelated with one another. This property is sometimes referred to as conditional linear independence.


According to the assumptions, the single factor model imposes a specific variance-covariance structure among the observed variables,

var(Xi) = λi^2 + ψi,

where λi^2 is called the communality and ψi the specific variance of Xi.

Formally, we define the communality of Xi as the part of the variance of Xi that is captured by the common factor F. The remaining variance is called the specific variance of Xi. In addition, we can see that, for i ≠ j, cov(Xi, Xj) = λiλj. Thus the covariances among the variables are completely determined by the common factor; they do not involve the specific factors in any way. Combining the results presented above, we can write

cov(X) = Σ = ΛΛ^T + Ψ,

where Λ = (λ1, ..., λp)^T and Ψ = diag(ψ1, ..., ψp).[9] As a simple (but artificial) example, consider the factorization below.

[9] In other words, the matrix Σ − Ψ has rank 1 (since only one factor is needed to describe the covariance structure).

    [ 1    0.42 0.48 ]   [ 0.6 ]                     [ 0.64 0    0    ]
    [ 0.42 1    0.56 ] = [ 0.7 ] [ 0.6  0.7  0.8 ] + [ 0    0.51 0    ]
    [ 0.48 0.56 1    ]   [ 0.8 ]                     [ 0    0    0.36 ]
            Σ                      ΛΛ^T                       Ψ

In real data, we will not be able to exactly factor a covariance matrix like the one above; instead, the single factor model will attempt to find the best rank-one approximation. Given a dataset, we will replace Σ by the sample covariance matrix S, and try to find the best approximation S ≈ Λ̂Λ̂^T + Ψ̂.
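A quick numerical check of this factorization in R (the object names are ours):

```r
# Verify that Lambda Lambda^T + Psi reproduces Sigma exactly
Lambda <- c(0.6, 0.7, 0.8)
Psi <- diag(c(0.64, 0.51, 0.36))
Lambda %*% t(Lambda) + Psi
```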

The k-factor model

We can generalize the single factor model to a multiple factor model:

X1 = λ11 F1 + λ12 F2 + ... + λ1k Fk + u1
X2 = λ21 F1 + λ22 F2 + ... + λ2k Fk + u2
  ⋮
Xp = λp1 F1 + λp2 F2 + ... + λpk Fk + up.

Here we have k common factors: F1, ..., Fk. The loading λij quantifies the strength of the linear relationship between the j-th factor and the i-th manifest variable. The random errors u1, ..., up are the specific factors. We use the following assumptions about the factors:


• Each of the common factors has mean zero and unit variance:

E(Fj) = 0 and var(Fj) = 1.

• The common factors are uncorrelated:

cov(Fi, Fj) = 0 if i ≠ j.

With this assumption, the factor model is known as the orthogonal factor model.[10]

[10] There are more general models in which the factors are allowed to be correlated, called oblique factor models. We will, however, primarily focus on the orthogonal factor model in this discussion.

• The specific factors have mean zero and unknown variances:

E(uj) = 0 and var(uj) = ψj.

• The specific factors are uncorrelated among themselves and with the common factors:

cov(ui, uj) = 0 if i ≠ j,

cov(ui, Fj) = 0 for all i, j.

According to the orthogonal k-factor model, we can write

var(Xi) = λi1^2 + ... + λik^2 + ψi,

where λi1^2 + ... + λik^2 is the communality and ψi the specific variance of Xi, and that, for i ≠ j,

cov(Xi, Xj) = λi1λj1 + ... + λikλjk.

As in the single factor model, the covariances among the variables are completely determined by the common factors; they do not involve the specific factors in any way. We can still write the covariance matrix of X as Σ = ΛΛ^T + Ψ, where the matrix Λ contains all the loadings:

        [ λ11  λ12  ...  λ1k ]
        [ λ21  λ22  ...  λ2k ]
    Λ = [  :    :         :  ]
        [ λp1  λp2  ...  λpk ]

Notice that

• The i-th row of Λ contains the loadings of all factors for the i-th variable Xi. Thus, the sum of squares of the i-th row gives the communality of the i-th variable.


• The j-th column of Λ contains loadings of the j-th common factor, Fj, for all the variables.
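To make the orthogonal k-factor model concrete, here is a minimal simulation sketch (the loadings, specific variances, and object names below are our own illustration) checking that the sample covariance of the generated data approaches ΛΛ^T + Ψ:

```r
set.seed(1)
n <- 100000; p <- 4; k <- 2
# Hypothetical loadings and specific variances, chosen for illustration
Lambda <- matrix(c(0.8, 0.7, 0.2, 0.1,
                   0.1, 0.2, 0.7, 0.8), nrow = p, ncol = k)
psi <- c(0.3, 0.4, 0.4, 0.3)
FF <- matrix(rnorm(n * k), n, k)                      # uncorrelated common factors
U  <- matrix(rnorm(n * p), n, p) %*% diag(sqrt(psi))  # specific factors
X  <- FF %*% t(Lambda) + U
round(cov(X) - (Lambda %*% t(Lambda) + diag(psi)), 2) # approximately zero
```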

Given a dataset, we will use the sample covariance matrix S, and find the best approximation of the form S ≈ Λ̂Λ̂^T + Ψ̂.

Consider again the Hemangioma data discussed earlier.[11] The dataset contains age (in days) at the time of surgery and the expression of seven genetic markers for infants who were surgically treated for hemangioma.

[11] Table 8.2 of Applied Multivariate Statistics with R by Daniel Zelterman. New York: Springer.

```r
# read the data
hemangioma <- read.table("data/hemangioma.txt", header = TRUE)
# snapshot
hemangioma[1:3, ]
```

```
##   Age       RB      p16      DLK     Nanog    C.Myc     EZH2     IGF.2
## 1  81 2.046149 3.067127 308974.7  94.17336 6.489601 2.764101 11175.689
## 2  95 6.540000 1.900000  70988.3 381.83000 1.000000 7.090000  5340.170
## 3  95 3.610000 3.820000 153060.6 237.28000 0.000000 5.570000  6310.240
```

```r
# visualize the correlation matrix.
# The option order = "hclust" arranges the variables
# according to their correlations
library(corrplot)
corrplot(cor(hemangioma[, -1]), order = "hclust")
```

[Figure 3: Correlation plot of the hemangioma data.]

For now, let us not worry about the exact method of obtaining Λ, and just inspect the result of using the function factanal() in R. For this demonstration, we will fit a three-factor model (k = 3). We do not include Age in the analysis. To put all the variables on the same scale, we first standardize them.

```r
Z <- scale(hemangioma[, -1], scale = TRUE)
out <- factanal(Z, factors = 3)
out
```

```
## 
## Call:
## factanal(x = Z, factors = 3)
## 
## Uniquenesses:
##    RB   p16   DLK Nanog C.Myc  EZH2 IGF.2 
## 0.050 0.293 0.005 0.609 0.005 0.490 0.249 
## 
## Loadings:
##       Factor1 Factor2 Factor3
## RB     0.141  -0.144   0.954 
## p16    0.366   0.757         
## DLK   -0.163   0.961  -0.211 
## Nanog  0.559           0.275 
## C.Myc  0.841   0.295  -0.448 
## EZH2   0.682           0.193 
## IGF.2  0.780   0.377         
## 
##                Factor1 Factor2 Factor3
## SS loadings      2.274   1.757   1.269
## Proportion Var   0.325   0.251   0.181
## Cumulative Var   0.325   0.576   0.757
## 
## Test of the hypothesis that 3 factors are sufficient.
## The chi square statistic is 1.86 on 3 degrees of freedom.
## The p-value is 0.603
```

The part of the output marked Uniquenesses gives the specific variances of the variables.[12] This is the part of var(Xi) not captured by the common factors. Ideally, we would like these values to be small. The part of the output marked Loadings gives the Λ matrix (in this case a 7 × 3 matrix, since we have 7 variables and 3 common factors). The blanks correspond to near-zero loadings. It seems the first common factor mostly contributes to Nanog, C.Myc, EZH2, and IGF.2. The second factor has a strong relationship with p16 and DLK. Finally, the third factor essentially relates to RB.

[12] Recall, the specific variances are var(ui), the variances of the specific factors.

The communalities of the variables can be computed by taking the sums of squares of the rows of the loading matrix. In this example, we compute the communalities as below.[13]

[13] Note that we standardized each variable prior to performing the factor analysis. Thus each variable has variance one. If this three-factor model is a good fit to the data, we should expect the communality of each variable to be close to 1.

```r
# Loading matrix Lambda
L <- out$loadings
# communalities are row sums of squares
round(rowSums(L^2), 3)
```

```
##    RB   p16   DLK Nanog C.Myc  EZH2 IGF.2 
## 0.950 0.707 0.995 0.391 0.995 0.510 0.751 
```
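Equivalently, since each standardized variable has variance one, the uniquenesses reported by factanal() are just one minus these communalities:

```r
# uniqueness = 1 - communality for standardized variables;
# compare with the Uniquenesses block in the factanal() output
round(1 - rowSums(L^2), 3)
```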

Overall, the variables have moderate to high communalities, except for Nanog. Ideally, we would also look for biological information about the genes loading highly on each factor, to see whether they share similar functionality.


Estimation methods for EFA

Recall that we want the loadings matrix Λ so that the approximation S ≈ ΛΛ^T + Ψ is as precise as possible. Two methods to estimate the loadings matrix Λ are discussed below.

The principal factor analysis

The procedure of estimating Λ typically involves the following steps:

(1) Obtain initial estimates of the communalities and uniquenesses, Ψ̂.

(2) Determine the reduced covariance matrix S* = S − Ψ̂. Obtain an estimate of the loading matrix Λ̂ using PCA on S*.[14]

[14] The loadings will be eigenvectors, as in PCA.

(3) Update the estimates of the communalities and Ψ̂ using the estimated loadings.

(4) Repeat steps (2)-(3) until convergence: we can use a convergence criterion such as requiring the maximum absolute difference between the new Ψ̂ and the old Ψ̂ to be smaller than a pre-specified tolerance.[15]

[15] Some researchers also prefer not to iterate the steps at all; they stop after step (3) and take the results as the final estimates.

A real-life computational difficulty is that sometimes the communality estimates may exceed the observed sample variances of some variables, resulting in negative estimates of the specific variances. This is clearly an unacceptable solution. This phenomenon is called a Heywood case.[16]

[16] Attributed to Heywood, H. (1931), "On finite sequences of real numbers," Proceedings of the Royal Society of London, Series A, Containing Papers of a Mathematical and Physical Character, 134, 486-501.

Maximum likelihood factor analysis

Maximum likelihood factor analysis assumes that the data being analyzed have a multivariate normal distribution. Under the multivariate normality assumption, one can write a likelihood function for the loading matrix Λ and the uniquenesses Ψ.[17] We can think of the negative of the likelihood function as a criterion that measures the discrepancy between S and the covariance structure posited by the factor model, ΛΛ^T + Ψ. Thus we minimize the negative likelihood (or, equivalently, maximize the likelihood) to obtain the estimates Λ̂ and Ψ̂. This optimization is done iteratively. Such estimates are called maximum likelihood estimates (MLE). The function factanal() computes the MLE by default.

[17] See Section 5.5.2 of Everitt and Hothorn (2011), An Introduction to Applied Multivariate Analysis with R, for more details.
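For intuition, a minimal sketch of this discrepancy in R (ml_discrepancy is our own illustrative helper under the normality setup above, not the exact internal objective of factanal()):

```r
# Normal-theory discrepancy: log|Sigma| + tr(S Sigma^{-1}) - log|S| - p,
# which equals zero exactly when Sigma = S
ml_discrepancy <- function(S, Lambda, psi) {
  Sigma <- Lambda %*% t(Lambda) + diag(psi)
  p <- ncol(S)
  as.numeric(determinant(Sigma)$modulus - determinant(S)$modulus +
               sum(diag(solve(Sigma, S))) - p)
}

# Evaluate at the three-factor factanal() fit from before
ml_discrepancy(cor(Z), unclass(out$loadings), out$uniquenesses)
```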

Other methods

There are many other methods to estimate the loadings. Some of them are mentioned below.


• Minimum Residual Factor Analysis: [Harman, Harry and Jones, Wayne (1966) Factor analysis by minimizing residuals (minres), Psychometrika, 31, 3, 351-368.]

• Alpha Factor Analysis: [Kaiser, Henry F. and Caffrey, John. Alpha factor analysis, Psychometrika, (30) 1-14.]

• Weighted/Generalized Factor Analysis

• Minimum rank factor analysis: [Shapiro, A. and ten Berge, Jos M. F. (2002) Statistical inference of minimum rank factor analysis. Psychometrika, (67) 79-84.]

Many of these options are available in the psych package; the command for factor analysis is fa().
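For example, the estimation method is selected through fa()'s fm argument (a sketch; the method labels follow the psych documentation, and Z is the standardized data from before):

```r
library(psych)
# Minimum residual (the fa() default) and principal-axis fits
fit_minres <- fa(Z, nfactors = 3, fm = "minres", rotate = "none")
fit_pa     <- fa(Z, nfactors = 3, fm = "pa", rotate = "none")
```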

How to choose the number of factors?

Hypothesis testing

We can formally determine the number of factors using the MLE approach. For any k (number of factors), we can test the hypothesis

H0 : k common factors are sufficient.

If we reject this hypothesis (i.e., obtain a small p-value), we need to add more factors. In EFA, we typically cannot pre-specify k. Thus we perform sequential tests, starting from k = 1, and gradually add more factors until we can no longer reject the hypothesis. For large sample sizes, the test statistic approximately follows a χ²_v distribution with v = (p − k)^2/2 − (p + k)/2.[18]

[18] If at some point the degrees of freedom, v, of the test become zero, then it might be that no non-trivial solution is appropriate, or the factor model itself is questionable.

In the Hemangioma data, let us test k = 1, 2, and 3.

```r
# p-values
pv <- rep(NA, 3)
for (k in 1:3) {
  out <- factanal(Z, factors = k)
  pv[k] <- out$PVAL
}
names(pv) <- c("k=1", "k=2", "k=3")
pv
```

```
##         k=1         k=2         k=3 
## 0.003946155 0.060689832 0.602708042 
```
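The degrees of freedom of these tests follow the formula above; for our p = 7 variables:

```r
# v = (p - k)^2/2 - (p + k)/2 for k = 1, 2, 3
p <- 7; k <- 1:3
(p - k)^2 / 2 - (p + k) / 2   # 14, 8, 3 degrees of freedom
```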

The p-value for k = 1 is very small and clearly suggests that the one-factor model is not sufficient. While k = 2 produces a p-value larger than 0.05, the value is still quite close to 0.05. However, k = 3 provides a p-value of 0.6. Using this criterion, we can go with a two- or three-factor model. Ultimately, it will be important to interpret the factors as well.[19]

[19] If we attempt to increase the number of factors further, say k = 4, factanal() produces an error saying 4 factors are too many for the dataset.

Parallel analysis

Another way to determine the appropriate number of factors is to compare the results of a factor analysis of the data at hand to those from a randomly generated data matrix of the same size and same number of variables as the original data. Suppose we fit a k-factor model. Parallel analysis is based on the idea that the k largest (sample) eigenvalues of the estimated correlation matrix (ΛΛ^T + Ψ) should be quite a bit larger than the k largest eigenvalues from a random dataset of the same size.[20] We can use the function fa.parallel() here.

[20] Fabrigar and Wegener (2011). Exploratory Factor Analysis, Oxford University Press.

```r
library(psych)
fa.parallel(Z, n.iter = 100, fm = "ml", fa = "fa")
```

[Figure: Parallel Analysis Scree Plots, showing eigenvalues of principal factors against factor number for the actual, simulated, and resampled data.]

```
## Parallel analysis suggests that the number of factors = 2 and the number of components = NA
```

The argument n.iter specifies the number of simulated datasets to generate. By default, the 95th percentile[21] of the eigenvalues from the simulated data is compared to those from the real data. One can use resampled data (rows of the original data permuted) instead of simulated data as well. Both results are presented in the plot above; the chosen quantiles of the random solutions are shown. Another function that performs parallel analysis is parallel() from the nFactors package.

[21] We can specify the level using the argument quant.
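A sketch of the same analysis with nFactors, following the canonical usage from that package's documentation (treat the argument names as an assumption on our part; cent = 0.05 corresponds to the 95th percentile):

```r
library(nFactors)
# Eigenvalues of the observed correlation matrix
ev <- eigen(cor(Z))$values
# Eigenvalue quantiles from random data of the same size
ap <- parallel(subject = nrow(Z), var = ncol(Z), rep = 100, cent = 0.05)
nS <- nScree(x = ev, aparallel = ap$eigen$qevpea)
plotnScree(nS)
```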


Fit indices

The factanal() function by default reports the result of a chi-squared test. This test is based on comparing the sample and fitted covariance matrices. There are several other indices or measures we can use to assess how well the factor model fits the observed data. Two such indices are shown below.

• Tucker-Lewis Index (TLI): compares the k-factor model to a baseline model. A model with index > 0.95 is considered good.

• Root Mean Square Error of Approximation (RMSEA): small values (e.g., < 0.05) are considered good.

These indices are also available in the psych package. We present the MLE approach using the fa() function below, for k = 1, 2, and 3.[22]

[22] The argument nfactors specifies the number of factors to extract, the rotate argument specifies how to rotate the factor loadings, and fm = "ml" produces the MLE solution. We will learn about rotation later.

```r
library(psych)
# List to store outputs
fa.mle <- vector(mode = "list", length = 3)
# Fit for k = 1, 2, 3
for (k in 1:3) {
  fa.mle[[k]] <- fa(Z, nfactors = k, rotate = "none", fm = "ml")
}
```

Tucker-Lewis Index (TLI) values are shown below.[23]

[23] Use names(fa.mle[[1]]) to see what other fields are in the fa() output.

```r
TLI <- c(fa.mle[[1]]$TLI, fa.mle[[2]]$TLI, fa.mle[[3]]$TLI)
names(TLI) <- c("k=1", "k=2", "k=3")
TLI
```

```
##       k=1       k=2       k=3 
## 0.2793121 0.4779667 1.2497563 
```

Root Mean Square Error of Approximation (RMSEA) values are displayed below.

```r
tab <- rbind(fa.mle[[1]]$RMSEA, fa.mle[[2]]$RMSEA, fa.mle[[3]]$RMSEA)
round(tab, 3)
```

```
##      RMSEA lower upper confidence
## [1,] 0.326 0.141 0.380        0.1
## [2,] 0.287    NA 0.379        0.1
## [3,] 0.000    NA 0.322        0.1
```


Based on these indices, perhaps the three-factor model is preferred. We need to keep in mind that a factor model is only useful if it provides a representation of the data that we can interpret. Thus the loadings for the factors must be readily interpretable. Purely statistical approaches may sometimes give ambiguous results (in our example, we have either a two- or a three-factor model); the interpretability of the candidate models will have to be used to make the final choice.

Predicting the Factor Scores

Once we estimate the loadings λ̂ij, the next step is to predict the factor scores[24] for each individual.

[24] Typically, we cannot "estimate" the factor scores, as they are random variables. Instead, we "predict" them.

• The factor scores are low-dimensional summaries of the individual data vectors;

• The scores can be used in further analysis, e.g., regression,

• The scores are often regarded as more reliable measures of the underlying latent factors compared to the observed variables.

Define the vector of factors as F = (F1, ..., Fk)^T for a k-factor model. Under the MLE approach (i.e., under the assumption of normality of the data), we can show that, conditional on the observed X's, F follows a multivariate normal distribution:[25]

F | X ~ N( Λ^T Σ^{-1} X,  I − Λ^T Σ^{-1} Λ ).

[25] Recall we have assumed that X has mean zero, and that we are working with standardized data.

Thus we can predict the factor scores by the mean of the distribution above, Λ^T Σ^{-1} X. Specifically, suppose we have an observed data vector xi. Then the corresponding vector of predicted factor scores is[26]

F̂i = (F̂i1, ..., F̂ik)^T = Λ̂^T S^{-1} xi,

[26] Johnson and Wichern (2007) present this method as the "regression" method.

where Λ̂ contains the estimated loadings and S is the sample covariance matrix (since we do not know Σ). Thus we get a vector of k scores (one for each factor) for each individual in the dataset. Clearly, we need the full dataset (since we need xi) to obtain the scores.

Using the fa() function on the Hemangioma data with the three-factor model, we can extract the scores as follows.[27]

[27] The argument scores = "regression" specifies the method of computing the scores.

```r
# Z is the standardized hemangioma data
fa.out <- fa(Z, nfactors = 3, n.obs = nrow(Z), fm = "ml",
             rotate = "none", scores = "regression")
```


```r
# extract the scores
scores <- fa.out$scores
dim(scores)
```

```
## [1] 19  3
```

```r
head(scores)
```

```
##             ML1         ML2        ML3
## [1,]  0.1859661  0.03059981 -0.6971633
## [2,] -0.7774303  0.16341701  0.9082747
## [3,] -0.6272077 -0.15944232 -0.2501564
## [4,]  0.2869953 -1.35047906 -0.6106099
## [5,]  0.7243978  0.39645441 -0.2773025
## [6,]  1.9730696 -1.90916509  0.5998125
```

Since we fit three factors, and the dataset has 19 individuals, we obtain 3 scores for each individual (each row above). A pairs plot of the scores is shown in Figure 4. We can see that perhaps there is one individual with very different scores from the rest of the dataset.
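Figure 4 can be reproduced directly from the scores, and we can also check the regression formula by hand for the first individual (the comparison is our own sanity check; small numerical differences may occur):

```r
# Scatterplot matrix of the three factor scores (Figure 4)
pairs(scores)

# Hand-computed regression scores for individual 1: Lambda^T S^{-1} x_1
Lhat <- unclass(fa.out$loadings)
t(Lhat) %*% solve(cor(Z), Z[1, ])   # compare with scores[1, ]
```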

[Figure 4: Pairs plot of the factor scores of the Hemangioma data.]

Factor Rotation

An interesting aspect of factor analysis is that the loadings matrix is not unique. In other words, two or more different loadings matrices can produce the exact same fit.[28] We show such an example below.

[28] Specifically, there will be two loading matrices Λ1 and Λ2 such that Λ1Λ1^T = Λ2Λ2^T. In fact, there are infinitely many such Λ matrices.

Consider the Harman23.cor dataset in the datasets package.[29] The list contains a correlation matrix of eight physical measurements on n = 305 girls between ages seven and seventeen. The correlation matrix is stored in Harman23.cor$cov, and the sample size in Harman23.cor$n.obs. Figure 5 shows a correlation plot of the dataset.

[29] Harman, H. H. (1976) Modern Factor Analysis, Third Edition Revised, University of Chicago Press, Table 2.3.

```r
library(corrplot)
par(mar = c(2, 3, 4, 3))
corrplot(Harman23.cor$cov)
```

[Figure 5: Correlation plot of the Harman23 data of physical measurements.]

Let us fit a two-factor model to this correlation matrix. Below we see two different loadings matrices. In fa() (in the psych library) the argument rotate controls the rotation we want to apply; in factanal() the argument is rotation.

```r
# Model fit-1
fa.out.none <- fa(r = Harman23.cor$cov, nfactors = 2,
                  n.obs = Harman23.cor$n.obs, fm = "ml", rotate = "none")
# Model fit-2
fa.out.varimax <- fa(r = Harman23.cor$cov, nfactors = 2,
                     n.obs = Harman23.cor$n.obs, fm = "ml", rotate = "varimax")
# Loadings
load <- cbind(fa.out.none$loadings, fa.out.varimax$loadings)
colnames(load) <- c("None F1", "None F2", "Varimax F1", "Varimax F2")
round(load, 3)
```

```
##                None F1 None F2 Varimax F1 Varimax F2
## height           0.880  -0.237      0.865      0.287
## arm.span         0.874  -0.360      0.927      0.181
## forearm          0.846  -0.344      0.895      0.179
## lower.leg        0.855  -0.263      0.859      0.252
## weight           0.705   0.644      0.233      0.925
## bitro.diameter   0.589   0.538      0.194      0.774
## chest.girth      0.527   0.554      0.134      0.752
## chest.width      0.574   0.365      0.278      0.621
```

These are two different loadings matrices: the first two columns correspond to one set of loadings, and the last two columns to another. However, they produce the same communalities and uniquenesses.[30]

[30] Recall, the communality is the sum of squares of a row of the loading matrix. Also, for standardized variables (i.e., using the correlation matrix), uniqueness = 1 − communality.

```r
# communalities for model-1 and model-2
com.1 <- rowSums(fa.out.none$loadings^2)
com.2 <- rowSums(fa.out.varimax$loadings^2)
comboth <- cbind(com.1, com.2)
colnames(comboth) <- c("None", "Varimax")
round(comboth, 3)
```

```
##                 None Varimax
## height         0.830   0.830
## arm.span       0.893   0.893
## forearm        0.834   0.834
## lower.leg      0.801   0.801
## weight         0.911   0.911
## bitro.diameter 0.636   0.636
## chest.girth    0.584   0.584
## chest.width    0.463   0.463
```

Clearly the communalities are identical; the uniquenesses are identical as well. It can be shown that there are infinitely many such loadings matrices.[31]

[31] Specifically, given a loadings matrix Λ1, we can multiply it by any matrix P such that PP^T = I. So we get another loadings matrix Λ2 = Λ1P. Clearly, Λ2Λ2^T = Λ1PP^TΛ1^T = Λ1Λ1^T.

Given a particular loadings matrix, we can obtain other loadings matrices by applying a method called rotation. Often, after performing EFA, we find the resulting loadings matrix difficult to interpret. Then we might apply a rotation to obtain other loadings, which provide identical communalities and uniquenesses, in the hope of getting a set that is easy to interpret. Specifically, loadings are easy to interpret if each variable loads highly on at most one factor, and all factor loadings are either large or close to zero. There are two types of rotation (a quick numerical check of the rotation invariance follows the list):

• orthogonal rotation: the factors remain uncorrelated after rotation; the varimax rotation shown before is an example of such a rotation.

• oblique rotation: the rotated factors can be correlated.
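Here is the promised numerical check of the invariance in the footnote above, using an arbitrary 2 × 2 rotation matrix P of our own choosing:

```r
# Any orthogonal P (here a rotation by 30 degrees) gives new loadings
# with the same Lambda Lambda^T, hence the same communalities
theta <- pi / 6
P <- matrix(c(cos(theta), sin(theta), -sin(theta), cos(theta)), 2, 2)
L1 <- unclass(fa.out.none$loadings)
L2 <- L1 %*% P
max(abs(L1 %*% t(L1) - L2 %*% t(L2)))   # essentially zero
```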

We will mostly focus on orthogonal rotation.[32] Among the many available techniques for orthogonal rotation, the following two are popular:

[32] For more details on both types of rotations, see Chapter 5.7 of Everitt and Hothorn (2011).

• Varimax rotation: tries to produce factors with only a few large loadings and as many near-zero loadings as possible. This is perhaps one of the most popular rotation methods.

• Quartimax rotation: tries to force each variable to correlate highly with at most one factor, and to have zero or small correlations with the rest of the factors.

Estimated EFA models using the Harman23.cor data for different rotations are shown in Figure 6, drawn using the fa.diagram() function in the psych package.[33] For better interpretation, we regard any loading with absolute value below 0.3 as insignificant.

[33] The argument cut = 0.3 specifies that any loading less than 0.3 (in absolute value) will not be drawn; simple = FALSE draws all the loadings (not just the largest loading per variable).

```r
par(mfrow = c(1, 3))
# Model fit-3 using quartimax
fa.out.quartimax <- fa(r = Harman23.cor$cov, nfactors = 2,
                       n.obs = Harman23.cor$n.obs, fm = "ml",
                       rotate = "quartimax")
fa.diagram(fa.out.none, cut = 0.3, simple = FALSE,
           main = "No rotation")
fa.diagram(fa.out.varimax, cut = 0.3, simple = FALSE,
           main = "Varimax rotation")
fa.diagram(fa.out.quartimax, cut = 0.3, simple = FALSE,
           main = "Quartimax rotation")
```

We can see that the unrotated loadings (left panel in Figure 6) do not group the variables clearly: there are multiple variables that are correlated with both factors. In contrast, the varimax-rotated loadings (middle panel in Figure 6) clearly separate the variables into two groups. The quartimax-rotated loadings are in between, if we set the cutoff at 0.3. Thus, for better interpretation, we might use the varimax rotation to get the final result.

[Figure 6: Factor rotations for the Harman23 data: no rotation (left), varimax rotation (middle), quartimax rotation (right).]

We should note that while factor rotation gives the same communalities and uniquenesses for the variables, the factor scores will be different depending on our choice of loadings. Thus we should first fix a set of loadings that we can interpret, and only then compute the corresponding factor scores.
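As an aside, the varimax solution can also be reproduced after the fact from the unrotated loadings with base R's varimax() (a sketch; the result should agree up to column order and sign):

```r
# Apply varimax rotation to the unrotated ML loadings
vm <- varimax(unclass(fa.out.none$loadings))
round(unclass(vm$loadings), 3)
```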

PCA vs. EFA

Similarities

• Both techniques look for hidden structures in the data

• can be used for data reduction/scoring

Differences

• PCA does not assume any model for the covariance matrix (i.e., it is an unstructured fit). EFA posits a specific model (of the form ΛΛ^T + Ψ, i.e., a "low rank" matrix plus a diagonal matrix) and estimates its parameters.

• In PCA, the PCs are linear combinations of the observed variables. In a factor model, the observed variables are modeled as linear combinations of latent factors.

• PCA attempts to capture most of the total variance. EFA tries to maximize the variance due to the common factors.

• PCA, as it is, does not account for possible measurement errors in the observed variables. EFA can accommodate measurement errors in the variables through the specific factors.

• In PCA, the PCs are orthogonal by construction. In EFA, there are oblique factor models that allow for correlation among factors.


Confirmatory Factor Analysis (CFA)

We typically apply exploratory factor analysis in a pilot study to determine any grouping among the manifest variables and form possible hypotheses. Then we test these hypotheses (or any other pre-specified hypotheses we might already have) using a separate dataset. We can do so using confirmatory factor analysis (CFA). To be specific, CFA attempts to formally test a particular factor model, in which particular manifest variables are allowed to relate to particular factors.

For example, let us consider the ability data (a correlation matrix) in the MVA package.[34] Six variables were recorded [Calsyn and Kenny (1977)] for 556 eighth-grade students:

[34] Figure 7.1 in Everitt and Hothorn (2011), An Introduction to Applied Multivariate Analysis with R.

SCA: self-concept of ability;
PPE: perceived parental evaluation;
PTE: perceived teacher evaluation;
PFE: perceived friend's evaluation;
EA: educational aspiration;
CP: college plans.

The ability dataset shows the correlations among these variables.

```r
library(MVA)
## code taken from demo("Ch-SEM")
ab <- c(0.73, 0.70, 0.68, 0.58, 0.61, 0.57,
        0.46, 0.43, 0.40, 0.37,
        0.56, 0.52, 0.48, 0.41, 0.72)
ability <- diag(6) / 2
ability[upper.tri(ability)] <- ab
ability <- ability + t(ability)
rownames(ability) <- c("SCA", "PPE", "PTE", "PFE", "EA", "CP")
colnames(ability) <- c("SCA", "PPE", "PTE", "PFE", "EA", "CP")
ability
```

```
##      SCA  PPE  PTE  PFE   EA   CP
## SCA 1.00 0.73 0.70 0.58 0.46 0.56
## PPE 0.73 1.00 0.68 0.61 0.43 0.52
## PTE 0.70 0.68 1.00 0.57 0.40 0.48
## PFE 0.58 0.61 0.57 1.00 0.37 0.41
## EA  0.46 0.43 0.40 0.37 1.00 0.72
## CP  0.56 0.52 0.48 0.41 0.72 1.00
```

[Figure 7: Postulated ability factor model.]

Calsyn and Kenny (1977) postulated that there are two factors, and they relate to the manifest variables as shown in Figure 7.


The variables in the ellipses are factors, and the variables in the squares are manifest variables. Confirmatory factor analysis can be used here to formally test whether this specific factor model fits the data well.

Recall that in exploratory factor analysis the loadings matrix is not unique.[35] In general, the overall EFA model

Σ = ΛΛ^T + Ψ

cannot be appropriately identified, as we need to estimate all of the parameters in Λ. In contrast, in CFA, by imposing a specific structure (specified by a hypothesis) on Λ, we fix some parameters to be zero (e.g., in the example above, there is no arrow from Ability to EA, and thus the corresponding loading is set to zero), and decrease the number of parameters we need to estimate. The remaining parameters are estimated by the maximum likelihood (MLE) approach, and a chi-squared test is used to assess the goodness of fit.

[35] We can rotate the loadings to get the same communalities and uniquenesses, but a different interpretation.
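As a quick check of the degrees of freedom we will see in the sem() output: the model has 6 loadings, 6 uniquenesses, and 1 factor correlation, against the 21 distinct entries of the 6 × 6 covariance (correlation) matrix.

```r
# Distinct covariance entries minus free parameters (6 lambdas + 6 psis + rho)
p <- 6
p * (p + 1) / 2 - (6 + 6 + 1)   # = 8, the Df reported by sem()
```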

Model fitting in R

Let us fit the model in the example above in R. We will use the package sem. The proposed factor model is shown below. The factors f1 and f2 are Ability and Aspiration, respectively.

SCA = λ11 f1 + u1
PPE = λ21 f1 + u2
PTE = λ31 f1 + u3
PFE = λ41 f1 + u4
EA = λ52 f2 + u5
CP = λ62 f2 + u6

In addition, E(f1) = E(f2) = 0 and var(f1) = var(f2) = 1. We also assume that the specific variances of the six variables are ψ1, ..., ψ6, respectively, and that they are unknown. The postulated model also allows correlated factors, that is, cor(f1, f2) = ρ, where ρ needs to be estimated.[36]

[36] So this is not an orthogonal factor model.

First we use the specifyModel() function to read the model in text format.[37]

[37] We can also save the model text in a file and read it from the file.

```r
library(sem)
ability_model <- specifyModel(text = "
## Specification of Ability factor
Ability -> SCA, lambda11, NA
Ability -> PPE, lambda21, NA
Ability -> PTE, lambda31, NA
Ability -> PFE, lambda41, NA
## Specification of Aspiration factor
Aspiration -> EA, lambda52, NA
Aspiration -> CP, lambda62, NA
## Uniquenesses for each variable
SCA <-> SCA, psi1, NA
PPE <-> PPE, psi2, NA
PTE <-> PTE, psi3, NA
PFE <-> PFE, psi4, NA
EA <-> EA, psi5, NA
CP <-> CP, psi6, NA
## Fixed variances for the two factors
Ability <-> Ability, NA, 1
Aspiration <-> Aspiration, NA, 1
## Correlation between two factors
Ability <-> Aspiration, rho, NA")
```

Each line in the code represents one arrow/one equation in the factor model.

• Lines starting with ## are comment lines. These are optional, used only by the programmer to document the code (but still highly recommended).

• Each of the other lines represents one arrow/one equation. For example, the line "Ability -> SCA, lambda11, NA" represents the equation SCA = λ11 f1 + u1; the last NA tells the program that λ11 is not fixed and needs to be estimated.

• The last few lines represent the assumptions on the factors. For example, the line "Aspiration <-> Aspiration, NA, 1" corresponds to the assumption var(f2) = 1. The NA tells the program that the variance of the factor is not a parameter that needs to be estimated; the last entry 1 says that the variance is fixed at 1.

The model, as R reads it, is shown below.

```r
ability_model
```

```
##    Path                      Parameter  StartValue
## 1  Ability -> SCA            lambda11             
## 2  Ability -> PPE            lambda21             
## 3  Ability -> PTE            lambda31             
## 4  Ability -> PFE            lambda41             
## 5  Aspiration -> EA          lambda52             
## 6  Aspiration -> CP          lambda62             
## 7  SCA <-> SCA               psi1                 
## 8  PPE <-> PPE               psi2                 
## 9  PTE <-> PTE               psi3                 
## 10 PFE <-> PFE               psi4                 
## 11 EA <-> EA                 psi5                 
## 12 CP <-> CP                 psi6                 
## 13 Ability <-> Ability                  1        
## 14 Aspiration <-> Aspiration            1        
## 15 Ability <-> Aspiration    rho                  
```

In general, each line has the following structure:

Path                         Parameter   Value
Ability -> SCA               lambda11    NA
SCA <-> SCA                  psi1        NA
Aspiration <-> Aspiration    NA          1

Then the model is fitted using the sem() function. At a minimum, it needs the model, the correlation matrix, and the sample size.[38]

[38] The argument model specifies the model we defined before, S is the correlation matrix, and N is the sample size.

```r
ability_sem <- sem(model = ability_model, S = ability, N = 556)
summary(ability_sem)
```

```
## 
##  Model Chisquare = 9.255732   Df = 8   Pr(>Chisq) = 0.3211842
##  AIC = 35.25573
##  BIC = -41.31041
## 
##  Normalized Residuals
##       Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
## -0.4409685 -0.1870306 -0.0000018 -0.0130992  0.2107128  0.5333068 
## 
##  R-square for Endogenous Variables
##    SCA    PPE    PTE    PFE     EA     CP 
## 0.7451 0.7213 0.6482 0.4834 0.6008 0.8629 
## 
##  Parameter Estimates
##          Estimate  Std Error  z value   Pr(>|z|)
## lambda11 0.8632049 0.03514508 24.561188 3.284552e-133 SCA <--- Ability
## lambda21 0.8493226 0.03545022 23.958178 7.593661e-127 PPE <--- Ability
## lambda31 0.8050861 0.03640470 22.114892 2.272503e-108 PTE <--- Ability
## lambda41 0.6952671 0.03863370 17.996387 2.079489e-72  PFE <--- Ability
## lambda52 0.7750850 0.04035675 19.205834 3.307658e-82  EA <--- Aspiration
## lambda62 0.9289304 0.03940959 23.571177 7.615270e-123 CP <--- Aspiration
## psi1     0.2548772 0.02336722 10.907470 1.061704e-27  SCA <--> SCA
## psi2     0.2786512 0.02412754 11.549097 7.460043e-31  PPE <--> PPE
## psi3     0.3518366 0.02691875 13.070321 4.865973e-39  PTE <--> PTE
## psi4     0.5166036 0.03472534 14.876847 4.659431e-50  PFE <--> PFE
## psi5     0.3992432 0.03819583 10.452535 1.426604e-25  EA <--> EA
## psi6     0.1370884 0.04350459  3.151126 1.626425e-03  CP <--> CP
## rho      0.6663697 0.03095414 21.527645 8.578257e-103 Aspiration <--> Ability
## 
##  Iterations = 29
```

We can graph the results by using the DiagrammeR package. The first function, pathDiagram() (from sem), creates a graph description that can then be plotted.

```r
pathDiagram(ability_sem,            # output from the sem() fit
            ignore.double = FALSE,  # whether to suppress the variances
            edge.labels = "both",   # put both the name and the estimated value on edge labels
            file = "ability_seb_fitted",  # output file name
            output.type = "dot",          # output file extension
            node.colors = c("steelblue", "transparent"))  # node colors
```

```r
# Create the plot
# library(DiagrammeR)
# grViz("ability_seb_fitted.dot")
```

[Figure 8: Estimated ability factor model.]

From the output, we see that the p-value is 0.32. This is the test with null hypothesis H0: the model is sufficient to describe the data. Since we have a large p-value, we cannot reject H0, and we conclude that the posited two-factor model is plausible for these data.

One of the parameters of interest is the correlation between the two factors, ρ.[39] It is estimated to be 0.666 with a standard error of 0.031. Thus we can obtain an approximate large-sample 95% confidence interval as 0.666 ± (1.96)(0.031) = [0.605, 0.727].

[39] This is called the disattenuated correlation; it is uncontaminated by possible measurement errors in the manifest variables.

Other than examining the p-value from the chi-squared test, we can also examine the following, when possible.[40]

[40] Everitt and Hothorn (2011).

• Parameter estimates: if the estimated values are unreasonable (e.g., variances of the specific factors become negative, or correlations fall outside the −1 to +1 range), the model fit is suspect. There might be something fundamentally wrong with the data. In our data example, we do not see any such patterns.

• Standard errors and correlations: if the correlations among the parameter estimates are large (close to 1 or −1), the model might be considered almost unidentifiable. In our data example, we can extract the covariance matrix of the estimates from the vcov field, and then convert it into a correlation matrix. Figure 9 shows the correlation plot of the estimates.

```r
Vhat <- ability_sem$vcov
library(corrplot)
corrplot(cov2cor(Vhat))
```

[Figure 9: Correlation among the estimated parameters.]

• Residual covariances: we should expect the sample covariance matrix to be close to the model-based covariance matrix (from the fitted factor model). In our data example, we stored the original covariance matrix in ability. The fitted covariance matrix can be obtained from the field C. Let us look at the difference.

```r
round(ability_sem$S - ability_sem$C, 3)
```

```
##        SCA    PPE    PTE    PFE     EA     CP
## SCA  0.000 -0.003  0.005 -0.020  0.014  0.026
## PPE -0.003  0.000 -0.004  0.019 -0.009 -0.006
## PTE  0.005 -0.004  0.000  0.010 -0.016 -0.018
## PFE -0.020  0.019  0.010  0.000  0.011 -0.020
## EA   0.014 -0.009 -0.016  0.011  0.000  0.000
## CP   0.026 -0.006 -0.018 -0.020  0.000  0.000
```

Overall, for our data example, the two-factor model seems to be a good fit.
