Factor Analysis

Rachael Smyth and Andrew Johnson
Introduction

For this lab, we are going to explore the factor analysis technique, looking at both principal axis and principal components extraction methods, two different methods of identifying the correct number of factors to extract (the scree plot and parallel analysis), and two different methods of rotating factors to facilitate interpretation.

Load Libraries

The libraries we’ll need for this lab are psych and GPArotation.

library(psych)
library(GPArotation)

The Data

The dataset that we’ll use for this demonstration is called bfi and comes from the psych package. It is made up of 25 self-report personality items from the International Personality Item Pool, along with gender, education level, and age, for 2800 subjects, and was used in the Synthetic Aperture Personality Assessment. The personality items are split into 5 categories: Agreeableness (A), Conscientiousness (C), Extraversion (E), Neuroticism (N), and Openness (O). Each item was answered on a six-point scale: 1 Very Inaccurate, 2 Moderately Inaccurate, 3 Slightly Inaccurate, 4 Slightly Accurate, 5 Moderately Accurate, 6 Very Accurate.

data("bfi")

Describing the Data

It’s a good idea to look at your data before you run any analysis. Any participant who is missing any piece of data will be fully excluded from the analysis, so it’s important to keep that in mind.
describe(bfi[1:25])

## vars n mean sd median trimmed mad min max range skew kurtosis se
## A1 1 2784 2.41 1.41 2 2.23 1.48 1 6 5 0.83 -0.31 0.03
## A2 2 2773 4.80 1.17 5 4.98 1.48 1 6 5 -1.12 1.05 0.02
## A3 3 2774 4.60 1.30 5 4.79 1.48 1 6 5 -1.00 0.44 0.02
## A4 4 2781 4.70 1.48 5 4.93 1.48 1 6 5 -1.03 0.04 0.03
## A5 5 2784 4.56 1.26 5 4.71 1.48 1 6 5 -0.85 0.16 0.02
## C1 6 2779 4.50 1.24 5 4.64 1.48 1 6 5 -0.85 0.30 0.02
## C2 7 2776 4.37 1.32 5 4.50 1.48 1 6 5 -0.74 -0.14 0.03
## C3 8 2780 4.30 1.29 5 4.42 1.48 1 6 5 -0.69 -0.13 0.02
## C4 9 2774 2.55 1.38 2 2.41 1.48 1 6 5 0.60 -0.62 0.03
## C5 10 2784 3.30 1.63 3 3.25 1.48 1 6 5 0.07 -1.22 0.03
## E1 11 2777 2.97 1.63 3 2.86 1.48 1 6 5 0.37 -1.09 0.03
## E2 12 2784 3.14 1.61 3 3.06 1.48 1 6 5 0.22 -1.15 0.03
## E3 13 2775 4.00 1.35 4 4.07 1.48 1 6 5 -0.47 -0.47 0.03
## E4 14 2791 4.42 1.46 5 4.59 1.48 1 6 5 -0.82 -0.30 0.03
## E5 15 2779 4.42 1.33 5 4.56 1.48 1 6 5 -0.78 -0.09 0.03
## N1 16 2778 2.93 1.57 3 2.82 1.48 1 6 5 0.37 -1.01 0.03
## N2 17 2779 3.51 1.53 4 3.51 1.48 1 6 5 -0.08 -1.05 0.03
## N3 18 2789 3.22 1.60 3 3.16 1.48 1 6 5 0.15 -1.18 0.03
## N4 19 2764 3.19 1.57 3 3.12 1.48 1 6 5 0.20 -1.09 0.03
## N5 20 2771 2.97 1.62 3 2.85 1.48 1 6 5 0.37 -1.06 0.03
## O1 21 2778 4.82 1.13 5 4.96 1.48 1 6 5 -0.90 0.43 0.02
## O2 22 2800 2.71 1.57 2 2.56 1.48 1 6 5 0.59 -0.81 0.03
## O3 23 2772 4.44 1.22 5 4.56 1.48 1 6 5 -0.77 0.30 0.02
## O4 24 2786 4.89 1.22 5 5.10 1.48 1 6 5 -1.22 1.08 0.02
## O5 25 2780 2.49 1.33 2 2.34 1.48 1 6 5 0.74 -0.24 0.03

This suggests that most items are missing only 20 or 30 participants’ worth of data, which is no big deal in a dataset with 2800 observations. It is possible, however, that these missing values are non-overlapping; it could be a different 20 or 30 individuals missing from each of the variables.
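To see why non-overlapping missingness matters, we can count the missing values per variable and compare that with the number of rows that survive listwise deletion. Here is a minimal sketch on a small made-up data frame (the variables x1 and x2 are hypothetical, not part of bfi):

```r
# Toy data frame: each variable is missing 2 of 6 values,
# but the missing rows do not overlap
df <- data.frame(
  x1 = c(1, NA, 3, 4, NA, 6),
  x2 = c(1, 2, NA, 4, 5, NA)
)

# Missing values per variable: 2 and 2
colSums(is.na(df))

# Rows with no missing data at all: only 2 of 6 survive,
# because the four missing values fall in four different rows
nrow(na.omit(df))
```

Applying colSums(is.na(bfi[1:25])) in the same way would give the per-item missing counts implied by the n column above.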
We can, however, determine the number of “complete cases” within the data: individuals who are missing no data whatsoever on the questionnaire.

sum(complete.cases(bfi[1:25]))

## [1] 2436

The complete.cases function generates a Boolean vector, where a value of TRUE means that the case is complete, and a value of FALSE means that the case is missing at least one value. Summing across this vector gives us the total number of complete cases. There are 2436 cases with no missing data, which means that 13% of participants are missing at least one response. There is no magic number for the amount of missing data that is acceptable, but sample size is important for factor analysis. Some authors suggest that you need at least 10 observations for each variable in the factor analysis; our sample size is, therefore, adequate for our purposes.

Assessing the Factorability of the Data

Before we go too far down the road with this analysis, we should evaluate the “factorability” of our data. In other words, are there meaningful latent factors to be found within the data? We can check two things: (1) Bartlett’s test of sphericity, and (2) the Kaiser-Meyer-Olkin measure of sampling adequacy.

Bartlett’s Test of Sphericity

The most liberal test is Bartlett’s test of sphericity. It evaluates whether or not the variables intercorrelate at all, by testing the observed correlation matrix against an “identity matrix” (a matrix with ones along the principal diagonal, and zeroes everywhere else). If this test is not statistically significant, you should not employ a factor analysis.

cortest.bartlett(bfi[1:25])

## R was not square, finding R from data
## $chisq
## [1] 20163.79
##
## $p.value
## [1] 0
##
## $df
## [1] 300

Bartlett’s test was statistically significant, suggesting that the observed correlation matrix among the items is not an identity matrix.
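For intuition, the statistic that cortest.bartlett reports can be computed by hand: it is a function of the determinant of the correlation matrix, which is 1 for an identity matrix and shrinks toward 0 as the variables intercorrelate. A sketch in base R, using a small made-up correlation matrix rather than the bfi items:

```r
# Bartlett's test of sphericity by hand:
# chisq = -((n - 1) - (2p + 5) / 6) * log(det(R)), with p(p - 1)/2 df
bartlett_sphericity <- function(R, n) {
  p <- ncol(R)
  chisq <- -((n - 1) - (2 * p + 5) / 6) * log(det(R))
  df <- p * (p - 1) / 2
  list(chisq = chisq, df = df,
       p.value = pchisq(chisq, df, lower.tail = FALSE))
}

# An illustrative 3 x 3 correlation matrix, with n = 100 observations
R <- matrix(c(1.0, 0.5, 0.3,
              0.5, 1.0, 0.4,
              0.3, 0.4, 1.0), nrow = 3)
bartlett_sphericity(R, n = 100)

# For a true identity matrix, det(R) = 1, so chisq = 0:
# no evidence of any intercorrelation
bartlett_sphericity(diag(3), n = 100)$chisq
```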
This really isn’t a particularly powerful indication that you have a factorable dataset, though; all it really tells you is that at least some of the variables are correlated with each other.

KMO

The Kaiser-Meyer-Olkin (KMO) measure of sampling adequacy is a better measure of factorability. The KMO tests to see if the partial correlations within your data are close enough to zero to suggest that there is at least one latent factor underlying your variables. The minimum acceptable value is 0.50, but most authors recommend a value of at least 0.60 before undertaking a factor analysis. The KMO function in the psych package produces an overall Measure of Sampling Adequacy (MSA, as it is labelled in the output), and an MSA for each item. Theoretically, if your overall MSA is too low, you could look at the item MSAs and drop items that are too low. This should be done with caution, of course, as is the case with any atheoretical, empirical method of item selection.

KMO(bfi[1:25])

## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = bfi[1:25])
## Overall MSA = 0.85
## MSA for each item =
## A1 A2 A3 A4 A5 C1 C2 C3 C4 C5 E1 E2 E3 E4 E5 N1 N2 N3
## 0.74 0.84 0.87 0.87 0.90 0.83 0.79 0.85 0.82 0.86 0.83 0.88 0.89 0.87 0.89 0.78 0.78 0.86
## N4 N5 O1 O2 O3 O4 O5
## 0.88 0.86 0.85 0.78 0.84 0.76 0.76

The overall KMO for our data is 0.85, which is excellent; this suggests that we can go ahead with our planned factor analysis.

Determining the Number of Factors to Extract

The first decision that we will face in our factor analysis is the number of factors that we will need to extract, in order to achieve the most parsimonious (but still interpretable) factor structure. There are a number of methods that we could use, but the two most commonly employed methods are the scree plot and parallel analysis. The simplest technique is the scree plot.
Scree Plot

Eigenvalues are a measure of the amount of variance accounted for by a factor, and so they can be useful in determining the number of factors that we need to extract. In a scree plot, we simply plot the eigenvalues for all of our factors, and then look to see where they drop off sharply. Let’s take a look at the scree plot for the bfi dataset:

scree(bfi[,1:25])

[Scree plot: eigenvalues of the principal components (PC) and factors (FA), plotted against factor or component number]

The scree plot technique involves drawing a straight line through the plotted eigenvalues, starting with the largest one.
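The values behind a scree plot are simply the eigenvalues of the correlation matrix, which base R can compute directly with eigen(). A minimal sketch using a small made-up correlation matrix (not the bfi items):

```r
# A small illustrative correlation matrix for four variables
R <- matrix(c(1.0, 0.6, 0.5, 0.1,
              0.6, 1.0, 0.4, 0.2,
              0.5, 0.4, 1.0, 0.1,
              0.1, 0.2, 0.1, 1.0), nrow = 4)

# Eigenvalues, returned in decreasing order; plotting these against
# 1:4 and looking for the sharp drop is exactly the scree technique
ev <- eigen(R)$values
ev

# The eigenvalues of a correlation matrix always sum to the
# number of variables (the total standardized variance)
sum(ev)
```

Applying the same idea to cor(bfi[1:25], use = "pairwise.complete.obs") should closely reproduce the PC line that scree() draws.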