
Lecture 3: Estimating Parameters and Assessing Normality

Måns Thulin, Department of Mathematics, Uppsala University, [email protected]

Multivariate Methods • 1/4 2011

Homeworks

- To pass the course (grade 3), all mandatory problems must be solved, you must hold an oral presentation of a clustering method and you must pass the exam.
- For grade 4, you must present satisfactory solutions to at least 4 bonus problems (at least one from each homework).
- For grade 5, you must present satisfactory solutions to at least 8 bonus problems (at least two from each homework).
- Bonus problems on the exam can be counted as belonging to the corresponding homework.

Outline

- Sample moments
  - Unbiasedness
  - Asymptotics
- Estimation for the multivariate normal distribution
  - Maximum likelihood estimation
  - Distributions of estimators
- Assessing normality
  - How to investigate the validity of the assumption of normality
- Outliers
- Transformations to normality

Multivariate data

We study a p-dimensional data set consisting of n observations. The data are stored in an n × p matrix

X = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \end{pmatrix}

Row j contains the p measurements for subject j: x_{jk} = measurement k for subject j.

Sample moments

Sample mean: \bar{X} = (\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_p)', where \bar{x}_k = \frac{1}{n} \sum_{j=1}^{n} x_{jk}.

Sample covariance matrix:

S = \begin{pmatrix} s_{11} & s_{12} & \cdots & s_{1p} \\ s_{12} & s_{22} & \cdots & s_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ s_{1p} & s_{2p} & \cdots & s_{pp} \end{pmatrix}

where s_{k\ell} = s_{\ell k} = \frac{1}{n-1} \sum_{j=1}^{n} (x_{jk} - \bar{x}_k)(x_{j\ell} - \bar{x}_\ell), for k, \ell = 1, 2, \ldots, p.
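Both quantities are one-liners in R. A minimal sketch with simulated data (the variable names are ours):

    # n = 100 observations of p = 3 variables, stored as the rows of X
    set.seed(1)
    n <- 100; p <- 3
    X <- matrix(rnorm(n * p), nrow = n, ncol = p)

    xbar <- colMeans(X)   # sample mean vector
    S <- cov(X)           # sample covariance matrix (uses the 1/(n - 1) scaling)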

Sample moments: Unbiasedness

Theorem. Let X_1, \ldots, X_n be an i.i.d. sample from a joint distribution with mean µ and covariance matrix Σ. Then

E(\bar{X}) = µ   and   Cov(\bar{X}) = \frac{1}{n} Σ.

Furthermore, E(S) = Σ. Thus \bar{X} is an unbiased estimator of µ and S is an unbiased estimator of Σ.
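The unbiasedness of S is easy to check by simulation. A small sketch, assuming the MASS package is available for mvrnorm:

    library(MASS)
    Sigma <- matrix(c(1, 0.5, 0.5, 2), 2, 2)
    S_sum <- matrix(0, 2, 2)
    for (b in 1:5000) {
      Y <- mvrnorm(n = 10, mu = c(0, 0), Sigma = Sigma)  # small normal sample
      S_sum <- S_sum + cov(Y)
    }
    S_sum / 5000   # close to Sigma, illustrating E(S) = Sigma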

Sample moments: Asymptotics

Let X_1, \ldots, X_n be i.i.d. observations with mean µ. Then we have

The multivariate law of large numbers:

\bar{X} \overset{p}{\longrightarrow} µ as n → ∞,

that is, P(|\bar{X} − µ| > ε) → 0 as n → ∞ for all ε > 0.

If we further assume that the observations have a finite covariance matrix Σ, then we also have

The multivariate central limit theorem:

\sqrt{n}(\bar{X} − µ) \overset{d}{\longrightarrow} N(0, Σ) as n → ∞,

that is, the distribution function of \sqrt{n}(\bar{X} − µ) converges to the distribution function of the N(0, Σ) distribution.

Estimation

We would like to be able to estimate the parameters µ and Σ for the multivariate normal distribution.

- \bar{X} and S seem like natural (unbiased!) estimators. What are their properties?
- We'll find the maximum likelihood estimators of µ and Σ and study their distributions.

Estimation: ML principle

Let X_1, \ldots, X_n be observations with densities f_{X_i, θ} depending on an unknown parameter θ. The maximum likelihood estimate of θ is the value of θ that maximizes the likelihood function

L(θ) = \prod_{i=1}^{n} f_{X_i, θ}(x_i).

In general, maximum likelihood estimators have desirable properties, such as consistency and asymptotic efficiency.

Estimation: MLE for MVN

For i.i.d. X_1, \ldots, X_n from N_p(µ, Σ), the likelihood function is

L(µ, Σ) = \frac{1}{(2π)^{np/2} |Σ|^{n/2}} \exp\Big( -\frac{1}{2} \sum_{i=1}^{n} (x_i − µ)' Σ^{-1} (x_i − µ) \Big)

- Taking \hat{µ} = \bar{X} maximizes L(µ, Σ) with respect to µ.
- Taking \hat{Σ} = \frac{n-1}{n} S maximizes L(µ, Σ) with respect to Σ.

Some further remarks:

- Functions of parameters: if \hat{θ} is the ML estimator of θ, then h(\hat{θ}) is the ML estimator of h(θ).
- For the multivariate normal distribution:
  - The ML estimator of µ'Σ^{-1}µ is \hat{µ}'\hat{Σ}^{-1}\hat{µ}.
  - The ML estimator of \sqrt{σ_{ii}} is \sqrt{\hat{σ}_{ii}}.
- For the multivariate normal distribution, \bar{X} and S_n are sufficient statistics.
- Thus all the information about µ and Σ in the data matrix X is contained in \bar{X} and S_n.

Estimation: Distribution of \bar{X}

Theorem. Let X_1, \ldots, X_n be i.i.d. observations from N_p(µ, Σ). Then

\bar{X} ∼ N_p\big(µ, \tfrac{1}{n} Σ\big).
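In R, both ML estimates are simple functions of the sample moments (a sketch, reusing the simulated X and n from the first example):

    mu_hat <- colMeans(X)               # MLE of mu
    Sigma_hat <- (n - 1) / n * cov(X)   # MLE of Sigma: the rescaled sample covariance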


Estimation: Distribution of S_n

In the univariate setting,

(n − 1)s² = \sum_{i=1}^{n} (X_i − \bar{X})² ∼ σ² · χ²(n − 1).

By the definition of the χ²-distribution, this means that \sum_{i=1}^{n} (X_i − \bar{X})² is distributed as

σ²(Z_1² + \cdots + Z_{n−1}²) = (σZ_1)² + \cdots + (σZ_{n−1})²,

where the Z_i are i.i.d. N(0, 1), so that σZ_i ∼ N(0, σ²).

Returning to the multivariate setting, let Z_1, \ldots, Z_m be i.i.d. N_p(0, Σ). The distribution of the matrix \sum_{i=1}^{m} Z_i Z_i' is called the Wishart distribution, denoted W_m(Σ), where m is called the degrees of freedom and Σ is called the scale matrix.

Theorem. Let X_1, \ldots, X_n be i.i.d. observations from N_p(µ, Σ). Then

(n − 1)S ∼ W_{n−1}(Σ)

(see the simulation sketch below).

Properties of the Wishart distribution:

- If A_1 ∼ W_{m_1}(Σ), A_2 ∼ W_{m_2}(Σ) and A_1 and A_2 are independent, then A_1 + A_2 ∼ W_{m_1+m_2}(Σ), i.e. their sum is Wishart distributed with m_1 + m_2 degrees of freedom.
- If A ∼ W_{m_1}(Σ) then CAC' ∼ W_{m_1}(CΣC').
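The theorem can be checked empirically with base R's rWishart() (a quick sketch, reusing Sigma from the earlier example):

    W <- rWishart(1000, df = 9, Sigma = Sigma)  # 1000 draws from W_9(Sigma)
    apply(W, c(1, 2), mean) / 9                 # approximately Sigma, since E(W_m) = m * Sigma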



Assessing normality

- The assumption of normality is fundamental for many methods of multivariate statistics.
- Due to the multivariate central limit theorem, methods based on the normal distribution can often be used as approximations for large n, but it is often better to use other (perhaps non-parametric) methods if the data is non-normal or if n is small.
- For multivariate data, the ways in which distributions can deviate from normality are many and varied.
- Using univariate normality tests on the marginal distributions may miss departures in multivariate combinations of variables.
- Using multivariate tests may dilute the effects of a single non-normal variable.

Assessing normality: Graphical methods

Graphical presentations of the data can be very useful for detecting deviations from normality. Useful methods include:

- Scatter plots (a one-liner in R; see the sketch after this list)
- Q-Q-plots of marginal distributions
- χ²-plots (β-plots)
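Scatter plot matrices like those on the following slides are produced with a single call:

    pairs(X)   # all pairwise scatter plots of the columns of X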

Assessing normality: Scatter plots

Here all variables are normal. The histograms resemble the normal density and the point clouds are elliptic.

[Figure: scatter plot matrix of four normal variables x1–x4, with histograms on the diagonal.]


Assessing normality: Scatter plots

Here X3 and X4 are non-normal. Their histograms are far from the normal density and the clouds are not elliptic.

[Figure: scatter plot matrix of four variables x1–x4, two of them non-normal.]

Assessing normality: Q-Q-plots

Normal samples. First row n = 15, second row n = 40 and third row n = 100.

[Figure: normal Q-Q plots of normal samples for the three sample sizes.]

Assessing normality: Q-Q-plots

First column: uniform distribution. Second column: exponential distribution. Third column: β(1/2, 1/2) distribution (bimodal).

[Figure: normal Q-Q plots of samples from these three non-normal distributions.]
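Marginal Q-Q-plots like these take one line each in R:

    qqnorm(X[, 1]); qqline(X[, 1])   # Q-Q plot of the first variable, with reference line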

Assessing normality: χ²-plots (β-plots)

For normal data, (X_j − \bar{X})' Σ^{-1} (X_j − \bar{X}) ∼ χ²_p.

Idea: for large n,

d_j² = (x_j − \bar{x})' S^{-1} (x_j − \bar{x})

should be approximately χ²_p-distributed. We could thus do a Q-Q-plot of the d_j² against the χ²_p quantiles. This reduces the p-dimensional data to just one dimension and a single Q-Q-plot instead of p plots.

Problem: convergence to χ²_p turns out to be slow. However, Gnanadesikan and Kettenring (1972) showed that

\frac{n \, d_j²}{(n − 1)²} ∼ β(p/2, (n − p − 1)/2),

and thus the quantiles from the beta distribution are more appropriate to use.

Gnanadesikan & Kettenring (1972), Robust Estimates, Residuals, and Outlier Detection with Multiresponse Data, Biometrics, 28, pp. 81–124.
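A β-plot along these lines takes a few lines in R (a sketch; the function name is ours):

    beta_plot <- function(X) {
      n <- nrow(X); p <- ncol(X)
      d2 <- mahalanobis(X, colMeans(X), cov(X))              # squared distances d_j^2
      u <- sort(n * d2 / (n - 1)^2)                          # scaled to lie in (0, 1)
      q <- qbeta(((1:n) - 0.5) / n, p / 2, (n - p - 1) / 2)  # beta quantiles
      plot(q, u, xlab = "Beta quantiles", ylab = "Scaled distances")
      abline(0, 1)   # points close to the line support normality
    }
    beta_plot(X)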

Assessing normality: Formal tests

- Univariate data:
  - The Shapiro-Wilk test.
  - Tests based on skewness and kurtosis.
  - Univariate tests can be used in multivariate analysis by looking at the marginal distributions one by one.
- Mardia's tests:
  - The tests are generalizations of tests based on skewness and kurtosis.


Assessing normality: Shapiro-Wilk

For a univariate sample, consider the order statistics x_{(1)} ≤ x_{(2)} ≤ \ldots ≤ x_{(n)}. The Shapiro-Wilk test is based on the statistic

W = \frac{\big( \sum_{i=1}^{n} a_i x_{(i)} \big)^2}{\sum_{i=1}^{n} (x_{(i)} − \bar{x})^2}

where the weights a_i are computed from the means and covariances of the order statistics of a standard normal sample (see the usage example below).

- Published 1965.
- Scale and location invariant.
- A formalization of Q-Q-plots.
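In R the test is available as shapiro.test():

    shapiro.test(rnorm(50))   # normal data: expect a large p-value
    shapiro.test(rexp(50))    # skewed data: expect a small p-value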


Assessing normality: Skewness and kurtosis

For a univariate random variable X the skewness is

γ = \frac{E(X − µ)³}{σ³}

and the kurtosis is

κ = \frac{E(X − µ)⁴}{σ⁴} − 3.

Both these quantities are 0 for the normal distribution, but non-zero for many other distributions. In particular, all symmetric distributions have γ = 0. κ is related to how heavy the tails of the distribution are, and to some extent to bimodality.

To use skewness and kurtosis for a test for normality, compute

\hat{γ} = \frac{\frac{1}{n} \sum_{i=1}^{n} (x_i − \bar{x})³}{\big( \frac{1}{n} \sum_{i=1}^{n} (x_i − \bar{x})² \big)^{3/2}}   and   \hat{κ} = \frac{\frac{1}{n} \sum_{i=1}^{n} (x_i − \bar{x})⁴}{\big( \frac{1}{n} \sum_{i=1}^{n} (x_i − \bar{x})² \big)²} − 3,

and reject the hypothesis of normality if the statistics are too far from 0.

Assessing normality: Univariate tests

- The skewness test based on \hat{γ} is sensitive against asymmetric distributions but not against kurtotic distributions.
- The kurtosis test based on \hat{κ} is sensitive against kurtotic distributions but not against asymmetric distributions.
- Should we use the skewness test or the kurtosis test? Rule of thumb: for inference about µ, we should worry more about asymmetric distributions. For inference about σ², deviations in kurtosis are more dangerous.
- The Shapiro-Wilk test is usually less sensitive than \hat{γ} and \hat{κ} against asymmetric and kurtotic alternatives, respectively, but has high average power against all classes of alternatives.
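The sample skewness and kurtosis defined on the previous slide are easy to compute directly (a sketch; the function name is ours):

    skew_kurt <- function(x) {
      m <- mean(x)
      m2 <- mean((x - m)^2)   # second central moment
      m3 <- mean((x - m)^3)   # third central moment
      m4 <- mean((x - m)^4)   # fourth central moment
      c(gamma_hat = m3 / m2^(3 / 2), kappa_hat = m4 / m2^2 - 3)
    }
    skew_kurt(rexp(200))   # exponential data: clearly positive skewness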


Assessing normality: Univariate tests

- A necessary, but not sufficient, condition for a distribution to be multivariate normal is that all marginal distributions are normal.
- A univariate normality test can thus be used for each of the marginal variables in the p-variate sample.
- However, the variables may be dependent. If so, the outcomes of the p tests will also be dependent!
- What is the joint significance level of the normality tests? How can we control this level?
- One way of handling this problem is to use Bonferroni's inequality; a sketch follows after this list. (We'll discuss this in the next lecture.)
- Some authors suggest reducing the dimension of the problem by performing a univariate normality test on \hat{e}_1' x_j, where \hat{e}_1 is the eigenvector corresponding to the largest eigenvalue of S. (More on this when we discuss PCA.)
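A common way to combine the p marginal tests is to Bonferroni-correct their p-values (a sketch, reusing the simulated X):

    pvals <- apply(X, 2, function(col) shapiro.test(col)$p.value)
    p.adjust(pvals, method = "bonferroni")   # adjusted marginal p-values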


Assessing normality: Mardia's multivariate tests

Mardia's tests for multivariate normality are based on the statistic

d_{ij} = (x_i − \bar{x})' S^{-1} (x_j − \bar{x}).

The test statistics

\hat{γ}²_p = \frac{1}{n²} \sum_{i,j=1}^{n} d_{ij}³   and   \hat{κ}_p = \frac{1}{n} \sum_{i=1}^{n} d_{ii}²

are generalizations of \hat{γ}² and \hat{κ}; a computational sketch follows after this list.

- Published 1970.
- Scale and location invariant.
- Extends the notions of skewness and kurtosis to the multivariate setting.
- Various generalizations exist, where the d_{ij} are used in slightly different ways.
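Both statistics can be computed from the matrix of d_ij values (a sketch; the helper name mardia_stats is ours):

    mardia_stats <- function(X) {
      n <- nrow(X)
      Xc <- scale(X, center = TRUE, scale = FALSE)   # rows are (x_i - xbar)'
      D <- Xc %*% solve(cov(X)) %*% t(Xc)            # D[i, j] = d_ij
      c(skewness = sum(D^3) / n^2, kurtosis = sum(diag(D)^2) / n)
    }
    mardia_stats(X)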


Assessing normality: Recommendations

When assessing multivariate normality, it is often a good idea to use more than one method in order to account for different possibilities of deviations from normality.

- Inspect scatter plots and univariate Q-Q-plots.
- Perform univariate tests of normality on each variable.
- Use methods based on dimension reduction: test some linear combination and look at the β-plot.
- Use a test for multivariate normality. Care must be taken to make sure that the joint significance level of the tests is reasonable!


Outliers

- Graphical methods
  - Scatter plots
  - Chernoff faces
  - Stars
  - Andrews curves
- Examine standardized observations z_{jk} = (x_{jk} − \bar{x}_k) / \sqrt{s_{kk}} and look for unusually large or small values.
- Examine d_j² = (x_j − \bar{x})' S^{-1} (x_j − \bar{x}) and look for unusually large values (a sketch of both numerical checks follows after this list).
- Wilks' test
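Both diagnostics are quick to compute (a sketch, reusing X):

    Z <- scale(X)                               # standardized observations z_jk
    d2 <- mahalanobis(X, colMeans(X), cov(X))   # squared distances d_j^2
    which(abs(Z) > 3, arr.ind = TRUE)           # unusually large standardized values
    head(order(d2, decreasing = TRUE))          # observations with the largest d_j^2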


Outliers: Wilks' test

Wilks' test for outliers in a multivariate normal sample is based on the statistic

Λ = \min_j(Λ_j),   where   Λ_j = 1 − \frac{n}{(n − 1)²} d_j²;

recall that d_j² = (x_j − \bar{x})' S^{-1} (x_j − \bar{x}). If Λ is small, there are likely outliers in the sample; a computational sketch follows after the notes below.

- Published in 1963.
- A formalization of β-plots.
- Equivalent to looking at max(d_j²).
- Related to the hat matrix in linear regression.
- A multitude of extensions and variations of the test exist.
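The statistic is immediate given the d_j² from before (a sketch):

    n <- nrow(X)
    d2 <- mahalanobis(X, colMeans(X), cov(X))
    Lambda_j <- 1 - n * d2 / (n - 1)^2   # one Lambda_j per observation
    min(Lambda_j)                        # Wilks' statistic: small values suggest outliers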

Transformations to normality

If the data is found to be non-normal, it is still possible that it can be transformed to normality.

A useful family of power transformations was described by Box and Cox in 1964, along with a method of determining which transformation to use.


Transformations: Box–Cox transformation

Assumption: x_{ik} > 0. Box and Cox (1964):

x_{ik}^{(λ)} = \begin{cases} \frac{x_{ik}^{λ} − 1}{λ} & \text{when } λ ≠ 0 \\ \ln(x_{ik}) & \text{when } λ = 0 \end{cases}

where i = 1, \ldots, n and k is fixed.

Note that, by L'Hôpital's rule, it can be shown that

\lim_{λ → 0} (x_{ik}^{λ} − 1)/λ = \ln(x_{ik}),

so the transformation is continuous in λ.


Transformations: Box–Cox transformation

Which λ should we choose? A maximum likelihood approach is to find the λ that maximizes g(λ) = −\frac{1}{2} n \ln(s²(λ)). Rewrite as:

g(λ) = (λ − 1) \sum_{i=1}^{n} \ln(x_{ik}) − \frac{n}{2} \ln[\hat{σ}²(λ)]

where

\hat{σ}²(λ) = \frac{1}{n} y^{(λ)\prime} (I − H) y^{(λ)}.

R function: boxcox (in library MASS).
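Usage is a one-liner; boxcox() plots the profile log-likelihood over a grid of λ values (a sketch with simulated positive data):

    library(MASS)
    y <- rexp(100)^2   # positive, right-skewed data
    boxcox(y ~ 1)      # profile log-likelihood for the Box-Cox parameter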

Transformations: Box–Cox transformation

[Figure: profile log-likelihood for λ from boxcox(), with a 95% confidence interval for λ marked.]

Transformations: Box–Cox method, notes

- Box–Cox gets upset by outliers: if one finds \hat{λ} = 5, that is probably the reason.
- If some x_{ik} < 0, sometimes adding a small constant to all x_{ik} can work.
- If max_i x_{ik} / min_i x_{ik} is small, Box–Cox will not do anything; power transforms are well approximated by linear transformations over short intervals.
- Should estimation of λ count as an extra parameter to be taken into account in the degrees of freedom? Difficult question: λ is not a linear parameter.



Summary

- Sample moments
  - Unbiasedness
  - Asymptotics
- Estimation for the multivariate normal distribution
  - Maximum likelihood estimation
  - Distributions of estimators
- Assessing normality
  - How to investigate the validity of the assumption of normality
- Outliers
- Transformations to normality