Stat 315c: Transposable Data Correspondence Analysis

Art B. Owen

Stanford

Art B. Owen (Stanford Statistics) Correspondence Analysis 1 / 17 It plots both variables and cases in the same plane. Clearest motivation is for data. It gets used elsewhere too. Emphasis is on presenting the data themselves as opposed to illuminating an underlying model. This is an old and classical statistical technique pioneered by Jean-Paul Benz´ecri in the 1960s. The treatment by Greenacre is particularly clear.

Correspondence Analysis

Art B. Owen (Stanford Statistics) Correspondence Analysis 2 / 17 Clearest motivation is for contingency table data. It gets used elsewhere too. Emphasis is on presenting the data themselves as opposed to illuminating an underlying model. This is an old and classical statistical technique pioneered by Jean-Paul Benz´ecri in the 1960s. The treatment by Greenacre is particularly clear.

Correspondence Analysis

It plots both variables and cases in the same plane.

Art B. Owen (Stanford Statistics) Correspondence Analysis 2 / 17 Emphasis is on presenting the data themselves as opposed to illuminating an underlying model. This is an old and classical statistical technique pioneered by Jean-Paul Benz´ecri in the 1960s. The treatment by Greenacre is particularly clear.

Correspondence Analysis

It plots both variables and cases in the same plane. Clearest motivation is for contingency table data. It gets used elsewhere too.

Art B. Owen (Stanford Statistics) Correspondence Analysis 2 / 17 This is an old and classical statistical technique pioneered by Jean-Paul Benz´ecri in the 1960s. The treatment by Greenacre is particularly clear.

Correspondence Analysis

It plots both variables and cases in the same plane. Clearest motivation is for contingency table data. It gets used elsewhere too. Emphasis is on presenting the data themselves as opposed to illuminating an underlying model.

Art B. Owen (Stanford Statistics) Correspondence Analysis 2 / 17 The treatment by Greenacre is particularly clear.

Correspondence Analysis

It plots both variables and cases in the same plane. Clearest motivation is for contingency table data. It gets used elsewhere too. Emphasis is on presenting the data themselves as opposed to illuminating an underlying model. This is an old and classical statistical technique pioneered by Jean-Paul Benz´ecri in the 1960s.

Art B. Owen (Stanford Statistics) Correspondence Analysis 2 / 17 Correspondence Analysis

It plots both variables and cases in the same plane. Clearest motivation is for contingency table data. It gets used elsewhere too. Emphasis is on presenting the data themselves as opposed to illuminating an underlying model. This is an old and classical statistical technique pioneered by Jean-Paul Benz´ecri in the 1960s. The treatment by Greenacre is particularly clear.

Art B. Owen (Stanford Statistics) Correspondence Analysis 2 / 17 Nomenclature

Correspondence P pij = nij/n••

Row masses ri = pi• = ni•/n••

Column masses cj = pi• = ni•/n•• 0 J Row profiles ri = (pi1/ri, . . . , piJ /ri) ∈ R 0 I Column profiles cj = (p1j/cj, . . . , pIj/cj) ∈ R

These are conditional and marginal distributions

Contingency tables

I × J table of counts

n11 n12 ··· n1J n21 n22 ··· n2J ...... nI1 nI2 ··· nIJ

Art B. Owen (Stanford Statistics) Correspondence Analysis 3 / 17 These are conditional and marginal distributions

Contingency tables

I × J table of counts

n11 n12 ··· n1J n21 n22 ··· n2J ...... nI1 nI2 ··· nIJ

Nomenclature

Correspondence matrix P pij = nij/n••

Row masses ri = pi• = ni•/n••

Column masses cj = pi• = ni•/n•• 0 J Row profiles ri = (pi1/ri, . . . , piJ /ri) ∈ R 0 I Column profiles cj = (p1j/cj, . . . , pIj/cj) ∈ R

Art B. Owen (Stanford Statistics) Correspondence Analysis 3 / 17 Contingency tables

I × J table of counts

n11 n12 ··· n1J n21 n22 ··· n2J ...... nI1 nI2 ··· nIJ

Nomenclature

Correspondence matrix P pij = nij/n••

Row masses ri = pi• = ni•/n••

Column masses cj = pi• = ni•/n•• 0 J Row profiles ri = (pi1/ri, . . . , piJ /ri) ∈ R 0 I Column profiles cj = (p1j/cj, . . . , pIj/cj) ∈ R

These are conditional and marginal distributions

Art B. Owen (Stanford Statistics) Correspondence Analysis 3 / 17 Column centroid J X 0 cjcj = (r1, . . . , rI ) ≡ r j=1

Upshot ‘Mass’ weighted average of row profiles is marginal distribution over columns

First moments: centroids

Row centroid I I X X pi1 piJ 0 r r = r ,..., = (c , . . . , c )0 ≡ c i i i r r 1 J i=1 i=1 i i

Art B. Owen (Stanford Statistics) Correspondence Analysis 4 / 17 Upshot ‘Mass’ weighted average of row profiles is marginal distribution over columns

First moments: centroids

Row centroid I I X X pi1 piJ 0 r r = r ,..., = (c , . . . , c )0 ≡ c i i i r r 1 J i=1 i=1 i i

Column centroid J X 0 cjcj = (r1, . . . , rI ) ≡ r j=1

Art B. Owen (Stanford Statistics) Correspondence Analysis 4 / 17 First moments: centroids

Row centroid I I X X pi1 piJ 0 r r = r ,..., = (c , . . . , c )0 ≡ c i i i r r 1 J i=1 i=1 i i

Column centroid J X 0 cjcj = (r1, . . . , rI ) ≡ r j=1

Upshot ‘Mass’ weighted average of row profiles is marginal distribution over columns

Art B. Owen (Stanford Statistics) Correspondence Analysis 4 / 17 This is the total inertia of the row profiles. It equals total inertia of column profiles.

Second moments: inertias

Chisquare for independence as weighted Euclidean distance 2 X X (nij − ni•n•j/n••) X2 = n n /n i j i• •j •• 2 X X (nij/ni• − n•j/n••) = n i• n /n i j •j •• 2 X ni• X (nij/ni• − n•j/n••) = n •• n n /n i •• j •j •• X 0 −1 = n•• ri(ri − c) diag(c) (ri − c) i

= n•• × Inertia

Art B. Owen (Stanford Statistics) Correspondence Analysis 5 / 17 Second moments: inertias

Chisquare for independence as weighted Euclidean distance 2 X X (nij − ni•n•j/n••) X2 = n n /n i j i• •j •• 2 X X (nij/ni• − n•j/n••) = n i• n /n i j •j •• 2 X ni• X (nij/ni• − n•j/n••) = n •• n n /n i •• j •j •• X 0 −1 = n•• ri(ri − c) diag(c) (ri − c) i

= n•• × Inertia

This is the total inertia of the row profiles. It equals total inertia of column profiles.

Art B. Owen (Stanford Statistics) Correspondence Analysis 5 / 17 Geometry

For J = 3 12 8 8 7 7 6 8 8 10 6 9 8 9 8 7 3 We can plot profiles in R Low inertia

Art B. Owen (Stanford Statistics) Correspondence Analysis 6 / 17 Geometry

This example has higher inertia

Art B. Owen (Stanford Statistics) Correspondence Analysis 7 / 17 Geometry

Still higher inertia. χ2 statistics describes variation of row profiles Similarly for col profiles

Art B. Owen (Stanford Statistics) Correspondence Analysis 8 / 17 Rescale Distances Euclidean distance in plot ignores column values rij Replace ri by ri with rij = √ e e cj 2 Euclidean dist between rei and rei0 is “χ dist” between ri and ri0 .

Art B. Owen (Stanford Statistics) Correspondence Analysis 9 / 17 Rescale Distances Euclidean distance in plot ignores column values rij Replace ri by ri with rij = √ e e cj 2 Euclidean dist between rei and rei0 is “χ dist” between ri and ri0 .

Art B. Owen (Stanford Statistics) Correspondence Analysis 9 / 17 Then New χ2 distance between cols j and j0 equals old dist Principle of distributional equivalence Common profile, summed mass

Role of χ2 in statistical significance is not considered important in this literature

Reason for χ2

Invariance Suppose rows i and i0 are proportional

nij/ni0j = α all j = 1,...,J Suppose also that we pool these rows

New ni∗j = nij + ni0j and delete originals

Art B. Owen (Stanford Statistics) Correspondence Analysis 10 / 17 Role of χ2 in statistical significance is not considered important in this literature

Reason for χ2

Invariance Suppose rows i and i0 are proportional

nij/ni0j = α all j = 1,...,J Suppose also that we pool these rows

New ni∗j = nij + ni0j and delete originals

Then New χ2 distance between cols j and j0 equals old dist Principle of distributional equivalence Common profile, summed mass

Art B. Owen (Stanford Statistics) Correspondence Analysis 10 / 17 Reason for χ2

Invariance Suppose rows i and i0 are proportional

nij/ni0j = α all j = 1,...,J Suppose also that we pool these rows

New ni∗j = nij + ni0j and delete originals

Then New χ2 distance between cols j and j0 equals old dist Principle of distributional equivalence Common profile, summed mass

Role of χ2 in statistical significance is not considered important in this literature

Art B. Owen (Stanford Statistics) Correspondence Analysis 10 / 17 If J is too big reduce dimension by principal components of

rij − cj √ cj

plot in reduced dimension along with images of corners

Dimension reduction

Now we have a plot J−1 With rows and cols both in R

Art B. Owen (Stanford Statistics) Correspondence Analysis 11 / 17 Dimension reduction

Now we have a plot J−1 With rows and cols both in R

If J is too big reduce dimension by principal components of

rij − cj √ cj

plot in reduced dimension along with images of corners

Art B. Owen (Stanford Statistics) Correspondence Analysis 11 / 17 More notation

Dr = diag(r) = diag(r1, . . . , rI )

Dc = diag(c) = diag(c1, . . . , cJ )

Duality

Rows lie in min(I − 1,J − 1) dimensional space So do columns In PC of row profiles ... columns are outside In PC of column profiles ... rows are outside Symmetric correspondence analysis overlap the points after rescaling

Art B. Owen (Stanford Statistics) Correspondence Analysis 12 / 17 Duality

Rows lie in min(I − 1,J − 1) dimensional space So do columns In PC of row profiles ... columns are outside In PC of column profiles ... rows are outside Symmetric correspondence analysis overlap the points after rescaling

More notation

Dr = diag(r) = diag(r1, . . . , rI )

Dc = diag(c) = diag(c1, . . . , cJ )

Art B. Owen (Stanford Statistics) Correspondence Analysis 12 / 17 Coordinates Rows (1st k cols of)

−1 0 −1 −1 F = (Dr P − 1c )Dc V1:k = Dr UΣ

Columns (1st k cols of)

−1 0 −1 −1 0 G = (Dc P − 1r )Dr U1:k = Dc V Σ

Symmetric analysis

Uses SVD S = UΣV 0 where

pij − ricj sij = √ ricj

2 Total inertia is kSkF 2 ’principal inertias’ are λj

Art B. Owen (Stanford Statistics) Correspondence Analysis 13 / 17 Symmetric analysis

Uses SVD S = UΣV 0 where

pij − ricj sij = √ ricj

2 Total inertia is kSkF 2 ’principal inertias’ are λj

Coordinates Rows (1st k cols of)

−1 0 −1 −1 F = (Dr P − 1c )Dc V1:k = Dr UΣ

Columns (1st k cols of)

−1 0 −1 −1 0 G = (Dc P − 1r )Dr U1:k = Dc V Σ

Art B. Owen (Stanford Statistics) Correspondence Analysis 13 / 17 Symmetric analysis

Interpretation is tricky/controversial √ ri near ri0 √ cj near cj0

ri near cj ?? Rows and columns are not in the same space

Biplots Due to Gabriel (1971) Biometrika

For matrix Xij 2 plot rows as ui ∈ R 2 cols as vj ∈ R 0 . with uivj = Xij A interpretation applies to asymmetric plots

Art B. Owen (Stanford Statistics) Correspondence Analysis 14 / 17 Merged points Add linear combination or sum of rows, E.G. 1 pool columns for math and statistics into “math sciences” 2 pool rows for EU countries into an EU point

Some finer points

Ghost points Apply projection to point not in table E.G. hypothetical row entity, 1 impute a president’s ’senate voting record’ 2 compare a state’s economy to those of countries Treat as fixed profile with mass ↓ 0

Art B. Owen (Stanford Statistics) Correspondence Analysis 15 / 17 Some finer points

Ghost points Apply projection to point not in table E.G. hypothetical row entity, 1 impute a president’s ’senate voting record’ 2 compare a state’s economy to those of countries Treat as fixed profile with mass ↓ 0

Merged points Add linear combination or sum of rows, E.G. 1 pool columns for math and statistics into “math sciences” 2 pool rows for EU countries into an EU point

Art B. Owen (Stanford Statistics) Correspondence Analysis 15 / 17 Fuzzy coding x ∈ R becomes two columns (1, 0) for small x say x < L (0, 1) for large x say x > U (1 − t, t) for intermediate x t = (x − L)/(U − L) Generalizations to > 2 columns

Data types Counts are straightforward Other ’near measures’ are reasonable

I rainfalls, heights, volumes, temperatures Kelvin I dollars spent I parts per million Reweight cols to equalize inertia ≈ standardizing to equalize Requires iteration

Art B. Owen (Stanford Statistics) Correspondence Analysis 16 / 17 Data types Counts are straightforward Other ’near measures’ are reasonable

I rainfalls, heights, volumes, temperatures Kelvin I dollars spent I parts per million Reweight cols to equalize inertia ≈ standardizing to equalize variance Requires iteration

Fuzzy coding x ∈ R becomes two columns (1, 0) for small x say x < L (0, 1) for large x say x > U (1 − t, t) for intermediate x t = (x − L)/(U − L) Generalizations to > 2 columns

Art B. Owen (Stanford Statistics) Correspondence Analysis 16 / 17 Further reading “Correspondence Analysis in Practice” M.J. Greenacre, 1993 Emphasizes geometry with examples “Theory and Applications of Correspondence Analysis” M.J. Greenacre, 1984 Good coverage of theory with examples “Correspondence Analysis and Data Coding with Java and R” F. Murtagh, 2005 Code and worked examples

Puzzlers Does it scale? (eg 108 points in the plane) Is there a tensor version? (Beyond all pairs of two way versions) Distributional equivalence vs Poisson models

Art B. Owen (Stanford Statistics) Correspondence Analysis 17 / 17 Puzzlers Does it scale? (eg 108 points in the plane) Is there a tensor version? (Beyond all pairs of two way versions) Distributional equivalence vs Poisson models

Further reading “Correspondence Analysis in Practice” M.J. Greenacre, 1993 Emphasizes geometry with examples “Theory and Applications of Correspondence Analysis” M.J. Greenacre, 1984 Good coverage of theory with examples “Correspondence Analysis and Data Coding with Java and R” F. Murtagh, 2005 Code and worked examples

Art B. Owen (Stanford Statistics) Correspondence Analysis 17 / 17