Multivariate Normal Distribution Arnab Maity NCSU Department of Statistics ~ 5240 SAS Hall ~ 919-515- 1937 ~ Amaity[At]Ncsu.Edu
Total Page:16
File Type:pdf, Size:1020Kb
Multivariate Normal Distribution Arnab Maity NCSU Department of Statistics ~ 5240 SAS Hall ~ 919-515- 1937 ~ amaity[at]ncsu.edu Contents Review of univariate normal distribution 2 Some properties 2 R functions 3 Assessing univariate normality 3 Bivariate and Multivariate normal distributions 5 Some properties of the multivariate normal distribution 7 Mahalanobis distance 7 Sampling distribution of X¯ and S 8 Checking multivariate normality 9 Check univariate normality 9 Check scatterplots 10 Check bivariate distributions 11 Construct a chi-square plot 11 Outlier detection 13 Bivariate boxplot 14 Bagplot 14 Steps to detect outliers 15 ST 437/537 multivariatenormaldistribution2 Review of univariate normal distribution We say that Y follows a normal distribution, that is, Y ∼ N(m, s2), if the pdf of Y is 2 1 1 2 fY(y; m, s ) = p expf− (y − m) g, − ¥ < y < ¥. 2ps2 2s2 We can show that E(Y) = m and Var(Y) = s2. PDF and CDF of normal distribution N(m, s2) for different values of m and s2 are shown in Figure 1. PDF of Normal distribution CDF of Normal distribution Figure 1: Normal PDF (left panel) and 1.00 CDF (right panel) for various choices for mean and variance. 0.75 Parameters: µ = 0, σ2 = 0.2 0.75 µ = 0, σ2 = 1 µ = 0, σ2 = 5 µ = − 2, σ2 = 0.5 0.50 0.50 Probability Probability Parameters: µ = 0, σ2 = 0.2 2 0.25 µ = 0, σ = 1 0.25 µ = 0, σ2 = 5 µ = − 2, σ2 = 0.5 0.00 0.00 −5.0 −2.5 0.0 2.5 5.0 −5.0 −2.5 0.0 2.5 5.0 y values y values Some properties There are some basic properties of normal distribution: • Standard Normal Distribution: Z ∼ N(0, 1). Any normal random variable Y ∼ N(m, s2) can be standardized using Z = s−1(Y − m). 0.4 • The function f(·) is often used to denote the pdf of the standard 0.3 normal distribution: p −1 −t2/2 0.2 f(t) = ( 2p) e . dnorm(x) 0.1 • The function F(·) is often used to denote the cdf of the standard normal (no closed form) 0.0 −3 −2 −1 0 1 2 3 Z z p 2 F(z) = P(Z ≤ z) = ( 2p)−1e−t /2dt. x −¥ If for some value zp, we have F(zp) = p, then zp is called the p- Figure 2: Normal CDF and quantiles quantile of the standard normal distribution. For example, the area of shaded region in Figure 2 is 0.4; this corresponds to F(-0.253). Thus -0.253 is the 0.4-quantile. Applied Multivariate and Longitudinal Data Analysis Dr. Arnab Maity, NCSU Statistics 5240 SAS Hall, amaity[at]ncsu.edu ST 437/537 multivariatenormaldistribution3 • Any normal distribution can be created from a standard normal distribution using Y = m + Zs. Specifically, if Z ∼ N(0, 1) then 0.4 m + Zs ∼ N(m, s2). 50% 0.3 • Each interval has an associated probality, see Figure 3 for some 70% 0.2 examples. PDF value 0.1 95% 0.0 R functions −4 −2 0 2 4 R requires that you specify the mean and standard deviation, rather x values than mean and variance. The R functions related to normal distribu- tion are Figure 3: Normal PDF and associated probabilities (a) PDF: dnorm(x, mean, sd, log = FALSE) (b) CDF: pnorm(q, mean, sd, lower.tail = TRUE, log.p = FALSE) (c) Quantiles: qnorm(p, mean, sd, lower.tail = TRUE, log.p = FALSE) (d) Random number: rnorm(n, mean, sd) Here the arguments are: • x, q: the value at which to compute the probability PMF of CDF • mean: mean m • sd: standard deviation s • p: probability, it must be between 0 and 1 • n: the number of times to repeat the experiment. • log, log.p: logical; if TRUE, probabilities p are given as log(p). • lower.tail: logical; if TRUE (default), probabilities are P[Y ≤ x], otherwise, P[Y > x]. Assessing univariate normality We can use graphical as well as hypothesis testing techniques to assess wheather the normality assumption is reasonable for a dataset. A common graphical technique to check for normality is to create a normal quantile-quantile plot (Q-Q plot). Normal quantile-quantile plot A scatterplot of the sorted data, x(1) ≤ ... ≤ x(n), against nor- mal quantiles, F−1f(1 − 0.5)/ng,..., F−1f(n − 0.5)/ng. If this plot shows a linear pattern, then assumption of normalty is supported.1 1 We can actually create Q-Q plots for Let us consider the sepal length of setosa flowers, and create a any distribution. We just need to use the quantiles of that distribution instead of normal Q-Q plot. We can use the qqnorm() function in R. normal. Applied Multivariate and Longitudinal Data Analysis Dr. Arnab Maity, NCSU Statistics 5240 SAS Hall, amaity[at]ncsu.edu ST 437/537 multivariatenormaldistribution4 # Extract only sepal.length of setosa flowers Normal Q−Q Plot SL <- iris$Sepal.Length[1:50] 5.5 # Q-Q plot to assess normality qqnorm(SL, pch = 19) 5.0 Sample Quantiles Note that the theoretical quantiles of a N(0, 1) distribution are ploted in the x-axis. Since the plot is fairly linear, normality assump- 4.5 tion seems reasonable in this case. −2 −1 0 1 2 We can also employ formal statistical tests to check for normality. Theoretical Quantiles • Shapiro-Wilks test, and Shapiro–Francia test: the later test is a Figure 4: Normal Q-Q plot of sepal simplification of the former; they show similar power to each length of setosa flowers. other. These two are among the more powerful normality tests. • Kolmogorov-Smirnov (K-S) test, and Lilliefors corrected K-S test: the later test is usually preferred among the two tests. • Cramer von Mises test, and Anderson-Darling test (a modifica- tion of the CVM test): based on weighted difference between the empirical and theoretical CDFs. • Person’s Chi-squared test: a goodness-of-fit test, not highly recom- mended for continuous distributions. • Jarque-Bera test and D’Agostino-Pearson omnibus tests: moment based tests. The [nortest] package in R implements a few of the tests men- tioned above. Overall, Shapiro-Wilk test shows a robust performance against a wide variety of alternatives.2 We can use the function 2 See the article [Yap and Sim (2011). shapiro.test() to perform this test. Comparisons of various types of nor- mality tests] for a numerical compari- son between various tests. shapiro.test(SL) ## ## Shapiro-Wilk normality test ## ## data: SL ## W = 0.9777, p-value = 0.4595 Since the p-value is large (e.g., larger than 5%), we can say that normality assumption for the data is pausible. Applied Multivariate and Longitudinal Data Analysis Dr. Arnab Maity, NCSU Statistics 5240 SAS Hall, amaity[at]ncsu.edu ST 437/537 multivariatenormaldistribution5 Bivariate and Multivariate normal distributions T The random vector X2×1 = (X1, X2) follows a bivariate normal T (Gaussian) distribution with mean vector m = (m1, m2) and variance- covariance (positive definite) matrix S and denoted as X ∼ N2(m, S) if its probability density function is3 3 Recall the PDF of univariate normal distribution, N(m, s2) is −1 −1/2 T −1 f (x) = (2p) jSj expf−(x − m) S (x − m)/2g. 2 1 1 2 fY (y; m, s ) = p expf− (y − m) g 2ps2 2s2 PDF of a bivariate normal distribution A random sample of size 100 2 1 density x2 0 −1 x2 x1 −2 −2 −1 0 1 2 x1 The shape of the PDF (and that of the scatterplot of a random sample generated from the distribution) is determined by S, the variance-covariance matrix of X. An easy was to visualize the PDF of a bivariate distribution is to plot the constant probability density contours. Constant probability density contours We define the constant probability density contour (also called constant-density contour) of a bivariate normal PDF to be the set of vectors x such that f (x) is constant, that is, fx : (x − m)TS−1(x − m) = cg c for a specific . These sets are ellipsesp that are centered around m, and the major and minor axes are c liei, where li are the eigenvalues and ei are the corresponding eigenvectors of S. T More generally, a random vector X = (X1,..., Xp) is said for follow a multivariate normal distribution Np(m, S), where m is a p × 1 vector and S is positive definite matrix, if the PDF of X is f (x) = (2p)−p/2jSj−1/2 expf−(x − m)TS−1(x − m)/2g. We can show that E(X) = m and that cov(X) = S. Applied Multivariate and Longitudinal Data Analysis Dr. Arnab Maity, NCSU Statistics 5240 SAS Hall, amaity[at]ncsu.edu ST 437/537 multivariatenormaldistribution6 PDF of a bivariate normal distribution Contour plot Figure 5: PDF and contours of a bivari- v(x1) = v(x2) = 1, cov(x1,x2) = 0 ate normal distribution with v(x1) = 3 v(x2) = 1, cov(x1,x2) = 0 2 0.04 0.06 1 0.1 0.14 0 density 0.12 −1 0.08 x2 0.02 −2 x1 −3 −3 −2 −1 0 1 2 3 PDF of a bivariate normal distribution Contour plot Figure 6: PDF and contours of a bivari- v(x1) = 1, v(x2) = 0.7, cov(x1,x2) = 0.5 ate normal distribution with v(x1) = 1, 3 v(x2) = 0.7, cov(x1,x2) = 0.5 2 0.02 0.04 0.08 1 0.12 0.16 0.18 0.22 0 density 0.2 0.14 −1 0.1 0.06 x2 −2 x1 −3 −3 −2 −1 0 1 2 3 PDF of a bivariate normal distribution v(x1) = 1, v(x2) = 1.3, cov Contour plot Figure 7: PDF and contours of a bivari- (x1,x2) = −0.5 ate normal distribution with v(x1) = 1, 3 v(x2) = 1.3, cov(x1,x2) = -0.5 0.02 2 0.04 0.08 1 0.12 0.14 0 density 0.1 −1 0.06 x2 −2 x1 −3 −3 −2 −1 0 1 2 3 Applied Multivariate and Longitudinal Data Analysis Dr.