Correspondence Analysis
Total Page:16
File Type:pdf, Size:1020Kb
NCSS Statistical Software NCSS.com Chapter 430 Correspondence Analysis Introduction Correspondence analysis (CA) is a technique for graphically displaying a two-way table by calculating coordinates representing its rows and columns. These coordinates are analogous to factors in a principal components analysis (used for continuous data), except that they partition the Chi-square value used in testing independence instead of the total variance. For those of you who are new to CA, we suggest that you obtain Greenacre (1993). This is an excellent introduction to the subject, is very readable, and is suitable for self-study. If you want to understand the technique in detail, you should obtain this (paperback) book. Discussion We will explain CA using the following example. Suppose an aptitude survey consisting of eight yes or no questions is given to a group of tenth graders. The instructions on the survey allow the students to answer only those questions that they want to. The results of the survey are tabulated as follows. Aptitude Survey Results – Counts Question Yes No Total Q1 155 938 1093 Q2 19 63 82 Q3 395 542 937 Q4 61 64 125 Q5 1336 876 2212 Q6 22 14 36 Q7 864 354 1218 Q8 920 185 1105 Total 3772 3036 6808 Take a few moments to study this table and see what you can discover. The most obvious pattern is that many of the students did not answer all the questions. This makes response patterns between rows difficult to analyze. To solve this problem of differential response rates, we create a table of row percents (or row profiles as they are called in CA). 430-1 © NCSS, LLC. All Rights Reserved. NCSS Statistical Software NCSS.com Correspondence Analysis Aptitude Survey Results – Row Profiles Question Yes No Total Q1 14.18 85.82 100.00 Q2 23.17 76.83 100.00 Q3 42.16 57.84 100.00 Q4 48.80 51.20 100.00 Q5 60.40 39.60 100.00 Q6 61.11 38.89 100.00 Q7 70.94 29.06 100.00 Q8 83.26 16.74 100.00 Total 55.41 44.59 100.00 This table allows us to see the underlying patterns within the table. We note that only 14% answered yes to question one while 83% answered yes to question eight. Although we can inspect this table directly, a picture of the data will allow us to find patterns much more quickly. This is done in the following scatter plot. The plot shows the row profiles with the questions on the horizontal axis and the row percents on the vertical axis. Notice that there are two possible responses (yes or no) and two corresponding plotting symbols. Aptitude Survey Results – Row Percents versus Questions Question vs Row_Pct 100 Answer Yes No 75 50 Row_Pct 25 0 0 1 2 3 4 5 6 7 8 Question We notice a steady gain in the percent of students answering yes as we move from question one to question eight. Also, we can see the obvious relationship between the percent answering yes and the percent answering no. In fact, if you think about it for a moment you will realize that we really only need to plot the yes’s or the no’s, but not both since they both relate the same information. Another way of plotting this data is to plot the percentage of each possible answer on a different axis. Since, in this example, we have two possible answers, we plot the yes percentage on one axis and the no percentage on the other axis. 430-2 © NCSS, LLC. All Rights Reserved. NCSS Statistical Software NCSS.com Correspondence Analysis Aptitude Survey Results – Scatter Plots Pct_Yes vs Pct_No 100 1 75 2 3 50 4 Pct_No 56 25 7 8 0 0 25 50 75 100 Pct_Yes Spend some time analyzing this plot. Can you see the connection between the last plot and this plot? In the previous plot, each possible answer was a horizontal set of points. In this plot, each answer is an axis. Hence, if our survey was made up of multiple-choice questions each with three possible answers, this plot would need to have been three dimensional. This plot is an example of a correspondence map, the primary output of CA. It is important to understand the features of this plot. Each axis of the plot represents a column and each point represents a row of the original table. If you were to draw a bar chart for this data, you would create bars representing the distances from each point to each axis. Since there are eight points and two axes, the bar chart would have sixteen bars. Notice also how you interpret the plot. Each point represents a specific yes or no combination. For example, question eight had about 83% answering yes and 17% answering no. This high proportion of yes respondents positions the point very close to the Pct_Yes (horizontal) axis. Conversely, question one had a large percentage of no answers and is very near the Pct_No axis. We see that points relatively close to the end of a particular axis have a high percentage value on that axis. Also notice that points that are close to each other have very similar patterns of yes or no answers. Look at the row profiles (percents) for questions five and six (two points that are almost in the same position). Notice that they differ by only one percentage point. The following figure shows the CA plot of this data generated by the program. Although the orientation of the plot is different, the distances between the points are the same. 430-3 © NCSS, LLC. All Rights Reserved. NCSS Statistical Software NCSS.com Correspondence Analysis Aptitude Survey Results – CA Plot Correspondence Plot 0.60 Q8 Q7 Q5Q6 Q4 Q3 Factor1 (100%) Q2 Q1 -1.00-1.00 0.60 Factor1 (100%) This plot and the last plot appear the same because, in this example, the CA plot reproduces the plot of the row percents. This occurs because there were only two possible answers to each question. If the questions in our survey had been multiple choice so that there were four or five answers, we would have needed to create a four- or five-dimensional plot to reproduce the Row Profile Plot. To summarize, then, a CA plot is a plot of the row profiles (percentages) constructed so that each column category becomes a different dimension. Since it is only possible to view two dimensions at a time, we must project this high dimensional space onto a two-dimensional subspace. This projection is constructed so as to maintain as much of the original information (or variation) as possible. Technical Details We will now present an outline of the computational methods used to perform the analysis. We will use standard matrix terminology to present the steps. 1. Read in the n (rows) by m (columns) data matrix, K. Note that the elements of K must be non-negative and that none of the row or column totals is zero. 2. Compute the proportion matrix, P, by dividing the elements of K by the total of all numbers in K. Mathematically, we write P = {pij } = {kij / k..} 3. Compute the totals of the rows of P and the columns of P, putting the results in the vectors r and c. Using standard matrix notation, we write r = P1 c = P'1 where 1 is an appropriately dimensioned vector of ones. 430-4 © NCSS, LLC. All Rights Reserved. NCSS Statistical Software NCSS.com Correspondence Analysis 4. Change the square roots of the vectors r and c into diagonal matrices and take the inverse of the resulting square matrices. −1/2 Dr = [diag(r)] −1/2 Dc = [diag(c)] 5. Compute the scaled matrix, A. A = Dr PDc 6. Compute the Singular Value Decomposition (SVD) of A. B, W,C = SVD(A) 7. Compute the coordinate matrices, F and G, as follows: F = Dr BW G = DcCW′ 8. Compute the eigenvalues, V. V = WW′ 9. Compute the row distances, di, and the column distances, dj. 2 p 1 ij di = ∑ − p. j j p. j pi. 2 p 1 ij d j = ∑ − pi. i pi. p. j 10. Note that the weights, wi and w j , come from the vectors r and c that were formed in step 3. wi = {ri } w j = {c j } 11. Compute the reported statistics as follows: Statistic Formula Mass wi w d 2 Inertia i i 2 2 ∑ wk dk k 2 Distance di Row Factor fij Column Factor gij 2 fij Row COR 2 di 2 gij Column COR 2 d j 430-5 © NCSS, LLC. All Rights Reserved. NCSS Statistical Software NCSS.com Correspondence Analysis Statistic Formula 2 wi fij Row CTR vi w g 2 Column CTR j ij v j Angle ArcCos( CORij ) Data Structure We will use a set of data from Greenacre (1993) in the tutorial that follows. The table below shows the results of a survey relating the smoking habits of the employees of a fictitious company to their position within the company. These data are contained in the Corres1 dataset. The entries in the table are the counts of the number of employees falling into each cell. Corres1 dataset None Light Medium Heavy Staff 4 2 3 2 (SM) Senior Managers 4 3 7 4 (JM) Junior Managers 25 10 12 4 (SE) Senior Employees 18 24 33 13 (JE) Junior Employees 10 6 7 2 (SE) Secretaries 430-6 © NCSS, LLC. All Rights Reserved.