Stat 315c: Transposable Data Correspondence Analysis
Art B. Owen
Stanford Statistics
Correspondence Analysis

It plots both variables and cases in the same plane.
Clearest motivation is for contingency table data.
It gets used elsewhere too.
Emphasis is on presenting the data themselves as opposed to illuminating an underlying model.
This is an old and classical statistical technique pioneered by Jean-Paul Benzécri in the 1960s.
The treatment by Greenacre is particularly clear.
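Contingency tables

I × J table of counts
\[
\begin{array}{cccc}
n_{11} & n_{12} & \cdots & n_{1J} \\
n_{21} & n_{22} & \cdots & n_{2J} \\
\vdots & \vdots & \ddots & \vdots \\
n_{I1} & n_{I2} & \cdots & n_{IJ}
\end{array}
\]

Nomenclature
Correspondence matrix $P$: $p_{ij} = n_{ij}/n_{\bullet\bullet}$
Row masses: $r_i = p_{i\bullet} = n_{i\bullet}/n_{\bullet\bullet}$
Column masses: $c_j = p_{\bullet j} = n_{\bullet j}/n_{\bullet\bullet}$
Row profiles: $\mathbf{r}_i = (p_{i1}/r_i, \ldots, p_{iJ}/r_i)' \in \mathbb{R}^J$
Column profiles: $\mathbf{c}_j = (p_{1j}/c_j, \ldots, p_{Ij}/c_j)' \in \mathbb{R}^I$

The profiles are conditional distributions and the masses are marginal distributions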
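A minimal NumPy sketch of the nomenclature may help fix the notation. The 3 × 3 table of counts is made up for illustration and is not from the slides; the variable names are likewise assumptions.

```python
import numpy as np

# Hypothetical 3 x 3 table of counts (not from the slides).
N = np.array([[20.0, 10.0, 5.0],
              [10.0, 15.0, 10.0],
              [5.0, 10.0, 25.0]])

P = N / N.sum()                    # correspondence matrix, p_ij = n_ij / n..
r = P.sum(axis=1)                  # row masses r_i = p_i.
c = P.sum(axis=0)                  # column masses c_j = p_.j
row_profiles = P / r[:, None]      # row i holds (p_i1/r_i, ..., p_iJ/r_i)
col_profiles = (P / c[None, :]).T  # row j holds (p_1j/c_j, ..., p_Ij/c_j)
```

Each profile sums to one, which is the sense in which profiles are conditional distributions, and the masses themselves sum to one.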
First moments: centroids

Row centroid
\[
\sum_{i=1}^{I} r_i \mathbf{r}_i
  = \sum_{i=1}^{I} r_i \Bigl(\frac{p_{i1}}{r_i}, \ldots, \frac{p_{iJ}}{r_i}\Bigr)'
  = (c_1, \ldots, c_J)' \equiv \mathbf{c}
\]
Column centroid
\[
\sum_{j=1}^{J} c_j \mathbf{c}_j = (r_1, \ldots, r_I)' \equiv \mathbf{r}
\]
Upshot: the ‘mass’ weighted average of row profiles is the marginal distribution over columns
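The centroid identities can be checked numerically. This is a small sketch on a made-up table (an assumption, not data from the slides): the mass-weighted average of the row profiles should reproduce the column masses, and vice versa.

```python
import numpy as np

# Hypothetical 3 x 3 table of counts (not from the slides).
N = np.array([[20.0, 10.0, 5.0],
              [10.0, 15.0, 10.0],
              [5.0, 10.0, 25.0]])

P = N / N.sum()
r, c = P.sum(axis=1), P.sum(axis=0)
row_profiles = P / r[:, None]       # I x J, one row profile per row
col_profiles = (P / c[None, :]).T   # J x I, one column profile per row

row_centroid = r @ row_profiles     # sum_i r_i * (row profile i)
col_centroid = c @ col_profiles     # sum_j c_j * (column profile j)
```

Here `row_centroid` agrees with `c` and `col_centroid` agrees with `r` up to floating point error.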
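Second moments: inertias

Chi-square for independence as weighted Euclidean distance
\[
\begin{aligned}
X^2 &= \sum_i \sum_j \frac{(n_{ij} - n_{i\bullet} n_{\bullet j}/n_{\bullet\bullet})^2}{n_{i\bullet} n_{\bullet j}/n_{\bullet\bullet}} \\
    &= \sum_i \sum_j n_{i\bullet} \frac{(n_{ij}/n_{i\bullet} - n_{\bullet j}/n_{\bullet\bullet})^2}{n_{\bullet j}/n_{\bullet\bullet}} \\
    &= n_{\bullet\bullet} \sum_i \frac{n_{i\bullet}}{n_{\bullet\bullet}} \sum_j \frac{(n_{ij}/n_{i\bullet} - n_{\bullet j}/n_{\bullet\bullet})^2}{n_{\bullet j}/n_{\bullet\bullet}} \\
    &= n_{\bullet\bullet} \sum_i r_i (\mathbf{r}_i - \mathbf{c})' \operatorname{diag}(\mathbf{c})^{-1} (\mathbf{r}_i - \mathbf{c}) \\
    &= n_{\bullet\bullet} \times \text{Inertia}
\end{aligned}
\]
This is the total inertia of the row profiles.
It equals the total inertia of the column profiles.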
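As a numerical check of the derivation, the sketch below computes the Pearson chi-square statistic directly and compares it with n•• times the inertia, computed from row profiles and again from column profiles. The table is hypothetical and the variable names are assumptions.

```python
import numpy as np

# Hypothetical 3 x 3 table of counts (not from the slides).
N = np.array([[20.0, 10.0, 5.0],
              [10.0, 15.0, 10.0],
              [5.0, 10.0, 25.0]])

n = N.sum()
P = N / n
r, c = P.sum(axis=1), P.sum(axis=0)

# Pearson chi-square for independence, straight from the counts.
expected = np.outer(N.sum(axis=1), N.sum(axis=0)) / n
chi2 = ((N - expected) ** 2 / expected).sum()

# Inertia of the row profiles: sum_i r_i (r_i - c)' diag(c)^{-1} (r_i - c).
row_profiles = P / r[:, None]
row_inertia = (r[:, None] * (row_profiles - c) ** 2 / c).sum()

# Inertia of the column profiles, with the roles of r and c swapped.
col_profiles = (P / c).T
col_inertia = (c[:, None] * (col_profiles - r) ** 2 / r).sum()
```

Both `n * row_inertia` and `n * col_inertia` should match `chi2` up to rounding.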
Geometry

For J = 3

12  8  8
 7  7  6
 8  8 10
 6  9  8
 9  8  7

We can plot profiles in $\mathbb{R}^3$
Low inertia

Geometry

This example has higher inertia

Geometry

Still higher inertia.
The $\chi^2$ statistic describes variation of row profiles
Similarly for column profiles
Rescale Distances

Euclidean distance in plot ignores column values
Replace $\mathbf{r}_i$ by $\tilde{\mathbf{r}}_i$ with $\tilde r_{ij} = r_{ij}/\sqrt{c_j}$, where $r_{ij} = p_{ij}/r_i$ is the $j$th entry of the row profile
Euclidean dist between $\tilde{\mathbf{r}}_i$ and $\tilde{\mathbf{r}}_{i'}$ is “$\chi^2$ dist” between $\mathbf{r}_i$ and $\mathbf{r}_{i'}$.
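The rescaling can be illustrated with a short sketch, again on a made-up table. The helper `chi2_dist` is a hypothetical name introduced here; it weights squared profile differences by 1/c_j.

```python
import numpy as np

# Hypothetical 3 x 3 table of counts (not from the slides).
N = np.array([[20.0, 10.0, 5.0],
              [10.0, 15.0, 10.0],
              [5.0, 10.0, 25.0]])

P = N / N.sum()
r, c = P.sum(axis=1), P.sum(axis=0)
profiles = P / r[:, None]

def chi2_dist(a, b, c):
    """Chi-square distance between two row profiles with column masses c."""
    return np.sqrt(((a - b) ** 2 / c).sum())

scaled = profiles / np.sqrt(c)                     # r~_ij = r_ij / sqrt(c_j)
d_euclid = np.linalg.norm(scaled[0] - scaled[1])   # plain Euclidean, rescaled profiles
d_chi2 = chi2_dist(profiles[0], profiles[1], c)    # chi-square, original profiles
```

The two distances coincide, so an ordinary plot of the rescaled profiles displays chi-square distances.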
Reason for $\chi^2$

Invariance
Suppose rows $i$ and $i'$ are proportional
$n_{ij}/n_{i'j} = \alpha$ for all $j = 1, \ldots, J$
Suppose also that we pool these rows
New $n_{i^*j} = n_{ij} + n_{i'j}$ and delete originals
Then
New $\chi^2$ distance between cols $j$ and $j'$ equals old dist
Principle of distributional equivalence
Common profile, summed mass

Role of $\chi^2$ in statistical significance is not considered important in this literature
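Distributional equivalence is easy to check numerically. In this sketch the table is hypothetical, with its first two rows built to be proportional (α = 2); `column_chi2_distances` is a helper name introduced here, not something from the slides.

```python
import numpy as np

def column_chi2_distances(N):
    """J x J matrix of chi-square distances between column profiles."""
    P = N / N.sum()
    r, c = P.sum(axis=1), P.sum(axis=0)
    col_profiles = (P / c).T                                    # J x I
    diff = col_profiles[:, None, :] - col_profiles[None, :, :]  # J x J x I
    return np.sqrt((diff ** 2 / r).sum(axis=2))                 # weights 1 / r_i

# Hypothetical table: row 0 is exactly twice row 1.
N = np.array([[10.0, 20.0, 30.0],
              [5.0, 10.0, 15.0],
              [8.0, 2.0, 6.0],
              [4.0, 9.0, 1.0]])

# Pool the two proportional rows into one and drop the originals.
N_pooled = np.vstack([N[0] + N[1], N[2:]])

before = column_chi2_distances(N)
after = column_chi2_distances(N_pooled)
```

The matrices `before` and `after` agree, even though the pooled table has one fewer row.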
Dimension reduction

Now we have a plot
With rows and cols both in $\mathbb{R}^{J-1}$
If $J$ is too big reduce dimension by principal components of
\[
\frac{r_{ij} - c_j}{\sqrt{c_j}}
\]
plot in reduced dimension along with images of corners
Duality

Rows lie in $\min(I-1, J-1)$ dimensional space
So do columns
In PC of row profiles ... columns are outside
In PC of column profiles ... rows are outside
Symmetric correspondence analysis: overlap the points after rescaling

More notation
$D_r = \operatorname{diag}(\mathbf{r}) = \operatorname{diag}(r_1, \ldots, r_I)$
$D_c = \operatorname{diag}(\mathbf{c}) = \operatorname{diag}(c_1, \ldots, c_J)$
Symmetric analysis

Uses SVD $S = U \Sigma V'$ where
\[
s_{ij} = \frac{p_{ij} - r_i c_j}{\sqrt{r_i c_j}}
\]
Total inertia is $\|S\|_F^2$
‘principal inertias’ are $\lambda_j^2$, with $\lambda_j$ the singular values on the diagonal of $\Sigma$

Coordinates
Rows (1st k cols of)
\[
F = (D_r^{-1} P - \mathbf{1}\mathbf{c}') D_c^{-1/2} V = D_r^{-1/2} U \Sigma
\]
Columns (1st k cols of)
\[
G = (D_c^{-1} P' - \mathbf{1}\mathbf{r}') D_r^{-1/2} U = D_c^{-1/2} V \Sigma'
\]
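The symmetric analysis can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the author's code: the table is made up, and the coordinates use the scaling that goes with an ordinary SVD of S, namely F = D_r^{-1/2} U Σ and G = D_c^{-1/2} V Σ.

```python
import numpy as np

# Hypothetical 3 x 3 table of counts (not from the slides).
N = np.array([[20.0, 10.0, 5.0],
              [10.0, 15.0, 10.0],
              [5.0, 10.0, 25.0]])

P = N / N.sum()
r, c = P.sum(axis=1), P.sum(axis=0)

# Standardized residuals s_ij = (p_ij - r_i c_j) / sqrt(r_i c_j).
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
U, sing, Vt = np.linalg.svd(S, full_matrices=False)

k = 2
F = (U * sing)[:, :k] / np.sqrt(r)[:, None]     # row coordinates, D_r^{-1/2} U Sigma
G = (Vt.T * sing)[:, :k] / np.sqrt(c)[:, None]  # column coordinates, D_c^{-1/2} V Sigma

total_inertia = (S ** 2).sum()                  # squared Frobenius norm of S
principal_inertias = sing ** 2

# Same row coordinates from centered profiles: (D_r^{-1} P - 1 c') D_c^{-1/2} V.
F_alt = ((P / r[:, None] - c) / np.sqrt(c)) @ Vt.T[:, :k]
```

For a 3 × 3 table S has rank at most 2, so k = 2 already captures all of the inertia; for larger tables the first k principal inertias give the share of inertia shown in the plot.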
Symmetric analysis

Interpretation is tricky/controversial
✓ $\mathbf{r}_i$ near $\mathbf{r}_{i'}$
✓ $\mathbf{c}_j$ near $\mathbf{c}_{j'}$
$\mathbf{r}_i$ near $\mathbf{c}_j$ ??
Rows and columns are not in the same space

Biplots
Due to Gabriel (1971) Biometrika
For matrix $X_{ij}$
plot rows as $u_i \in \mathbb{R}^2$
cols as $v_j \in \mathbb{R}^2$
with $u_i' v_j \doteq X_{ij}$
A biplot interpretation applies to asymmetric plots
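One common way to obtain biplot markers is a rank-2 truncated SVD, sketched below. The data matrix is hypothetical, and putting all of the singular values on the row markers is one choice among several, taken here as an assumption.

```python
import numpy as np

# Hypothetical 4 x 3 data matrix (not from the slides).
X = np.array([[2.0, 0.0, 1.0],
              [1.0, 3.0, 0.0],
              [0.0, 1.0, 4.0],
              [3.0, 1.0, 1.0]])

U, s, Vt = np.linalg.svd(X, full_matrices=False)

rows = U[:, :2] * s[:2]     # u_i in R^2, one marker per row of X
cols = Vt.T[:, :2]          # v_j in R^2, one marker per column of X

X_hat = rows @ cols.T                   # inner products u_i' v_j
residual = np.linalg.norm(X - X_hat)    # Frobenius norm of the approximation error
```

The inner products reproduce the best rank-2 approximation of X, so the residual equals the third singular value `s[2]`.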
Some finer points

Ghost points
Apply projection to point not in table
E.g. hypothetical row entity,
1. impute a president’s ‘senate voting record’
2. compare a state’s economy to those of countries
Treat as fixed profile with mass ↓ 0

Merged points
Add linear combination or sum of rows, e.g.
1. pool columns for math and statistics into “math sciences”
2. pool rows for EU countries into an EU point
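A ghost point can be placed by pushing its profile through the same map that produces the fitted row coordinates. The sketch below is an assumption-laden illustration: the table and the ghost row are made up, `project_row` is a hypothetical helper, and the merged point is projected onto the existing axes rather than refitting the table.

```python
import numpy as np

# Hypothetical 4 x 3 table of counts (not from the slides).
N = np.array([[20.0, 10.0, 5.0],
              [10.0, 15.0, 10.0],
              [5.0, 10.0, 25.0],
              [8.0, 12.0, 6.0]])

P = N / N.sum()
r, c = P.sum(axis=1), P.sum(axis=0)
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
U, sing, Vt = np.linalg.svd(S, full_matrices=False)

k = 2
V_k = Vt.T[:, :k]
F = (U * sing)[:, :k] / np.sqrt(r)[:, None]   # fitted row coordinates

def project_row(counts, c, V_k):
    """Coordinates of a row of counts, treated as a fixed profile with zero mass."""
    profile = np.asarray(counts, dtype=float)
    profile = profile / profile.sum()
    return ((profile - c) / np.sqrt(c)) @ V_k

ghost = project_row([3.0, 9.0, 3.0], c, V_k)   # a row that is not in the table
merged = project_row(N[0] + N[1], c, V_k)      # rows 0 and 1 pooled, axes unchanged
```

Projecting a row that is already in the table returns its fitted coordinates, and the merged point lands at the mass-weighted average of the two rows it pools.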
Data types

Counts are straightforward
Other ‘near measures’ are reasonable
    rainfalls, heights, volumes, temperatures in Kelvin
    dollars spent
    parts per million
Reweight cols to equalize inertia ≈ standardizing to equalize variance
Requires iteration

Fuzzy coding
$x \in \mathbb{R}$ becomes two columns
(1, 0) for small x, say x < L
(0, 1) for large x, say x > U
(1 − t, t) for intermediate x, with t = (x − L)/(U − L)
Generalizations to > 2 columns
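The two-column fuzzy coding rule translates directly into code. This is a minimal sketch; the function name `fuzzy_code` and the sample values are assumptions for illustration.

```python
import numpy as np

def fuzzy_code(x, L, U):
    """Map each x to (1 - t, t), with t = (x - L) / (U - L) clipped to [0, 1]."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    t = np.clip((x - L) / (U - L), 0.0, 1.0)   # t = 0 below L, t = 1 above U
    return np.column_stack([1.0 - t, t])

coded = fuzzy_code([0.0, 2.0, 3.0, 5.0, 9.0], L=2.0, U=4.0)
```

Every coded row is nonnegative and sums to one, so a continuous variable enters the table looking like a pair of counts.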
Puzzlers

Does it scale? (e.g. $10^8$ points in the plane)
Is there a tensor version? (Beyond all pairs of two way versions)
Distributional equivalence vs Poisson models

Further reading

“Correspondence Analysis in Practice”, M.J. Greenacre, 1993. Emphasizes geometry with examples.
“Theory and Applications of Correspondence Analysis”, M.J. Greenacre, 1984. Good coverage of theory with examples.
“Correspondence Analysis and Data Coding with Java and R”, F. Murtagh, 2005. Code and worked examples.