Multivariate Normal Distribution
Arnab Maity, NCSU Department of Statistics, 5240 SAS Hall, 919-515-1937, amaity[at]ncsu.edu


ST 437/537: Applied Multivariate and Longitudinal Data Analysis

Contents:
• Review of univariate normal distribution
  - Some properties
  - R functions
  - Assessing univariate normality
• Bivariate and multivariate normal distributions
  - Some properties of the multivariate normal distribution
  - Mahalanobis distance
  - Sampling distribution of $\bar{X}$ and $S$
• Checking multivariate normality
  - Check univariate normality
  - Check scatterplots
  - Check bivariate distributions
  - Construct a chi-square plot
• Outlier detection
  - Bivariate boxplot
  - Bagplot
  - Steps to detect outliers

Review of univariate normal distribution

We say that $Y$ follows a normal distribution, that is, $Y \sim N(\mu, \sigma^2)$, if the pdf of $Y$ is

$$f_Y(y; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-\frac{1}{2\sigma^2}(y - \mu)^2\right\}, \qquad -\infty < y < \infty.$$

We can show that $E(Y) = \mu$ and $\mathrm{Var}(Y) = \sigma^2$. The PDF and CDF of the normal distribution $N(\mu, \sigma^2)$ for different values of $\mu$ and $\sigma^2$ are shown in Figure 1.

[Figure 1: Normal PDF (left panel) and CDF (right panel) for $(\mu, \sigma^2)$ equal to $(0, 0.2)$, $(0, 1)$, $(0, 5)$, and $(-2, 0.5)$.]

Some properties

There are some basic properties of the normal distribution:

• Standard normal distribution: $Z \sim N(0, 1)$. Any normal random variable $Y \sim N(\mu, \sigma^2)$ can be standardized using $Z = \sigma^{-1}(Y - \mu)$.

• The function $\phi(\cdot)$ is often used to denote the pdf of the standard normal distribution:

  $$\phi(t) = (\sqrt{2\pi})^{-1} e^{-t^2/2}.$$
• The function $\Phi(\cdot)$ is often used to denote the cdf of the standard normal (no closed form):

  $$\Phi(z) = P(Z \le z) = \int_{-\infty}^{z} (\sqrt{2\pi})^{-1} e^{-t^2/2} \, dt.$$

  If for some value $z_p$ we have $\Phi(z_p) = p$, then $z_p$ is called the $p$-quantile of the standard normal distribution. For example, the area of the shaded region in Figure 2 is 0.4; this corresponds to $\Phi(-0.253)$. Thus $-0.253$ is the 0.4-quantile.

  [Figure 2: Normal CDF and quantiles.]

• Any normal distribution can be created from a standard normal distribution using $Y = \mu + Z\sigma$. Specifically, if $Z \sim N(0, 1)$, then $\mu + Z\sigma \sim N(\mu, \sigma^2)$.

• Each interval has an associated probability; see Figure 3 for some examples.

  [Figure 3: Normal PDF and associated probabilities (central regions of probability 50%, 70%, and 95%).]

R functions

R requires that you specify the mean and standard deviation, rather than the mean and variance. The R functions related to the normal distribution are

(a) PDF: dnorm(x, mean, sd, log = FALSE)
(b) CDF: pnorm(q, mean, sd, lower.tail = TRUE, log.p = FALSE)
(c) Quantiles: qnorm(p, mean, sd, lower.tail = TRUE, log.p = FALSE)
(d) Random numbers: rnorm(n, mean, sd)

Here the arguments are:

• x, q: the value at which to compute the PDF or CDF
• mean: the mean $\mu$
• sd: the standard deviation $\sigma$
• p: a probability; it must be between 0 and 1
• n: the number of random values to generate
• log, log.p: logical; if TRUE, probabilities p are given as log(p)
• lower.tail: logical; if TRUE (default), probabilities are $P[Y \le x]$; otherwise, $P[Y > x]$

Assessing univariate normality

We can use graphical as well as hypothesis-testing techniques to assess whether the normality assumption is reasonable for a dataset. A common graphical technique to check for normality is to create a normal quantile-quantile plot (Q-Q plot).
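These R functions can be checked against the quantities discussed above. A minimal sketch (the choice $\mu = -2$, $\sigma^2 = 0.5$ is taken from Figure 1 purely for illustration):

```r
# Evaluate the N(mu, sigma^2) pdf by hand and compare with dnorm()
mu <- -2; sigma2 <- 0.5
y  <- -1.3
f_manual <- exp(-(y - mu)^2 / (2 * sigma2)) / sqrt(2 * pi * sigma2)
f_dnorm  <- dnorm(y, mean = mu, sd = sqrt(sigma2))  # note: sd, not variance
all.equal(f_manual, f_dnorm)                        # TRUE

# The 0.4-quantile of N(0, 1) is about -0.253, and pnorm() inverts qnorm()
z <- qnorm(0.4)
round(z, 3)                                         # -0.253
pnorm(z)                                            # 0.4

# Standardization: P(Y <= y) = P(Z <= (y - mu)/sigma)
all.equal(pnorm(y, mu, sqrt(sigma2)), pnorm((y - mu) / sqrt(sigma2)))  # TRUE
```

A common slip is passing the variance where R expects sd; the manual-versus-dnorm() comparison above catches exactly that mistake.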
Normal quantile-quantile plot: a scatterplot of the sorted data, $x_{(1)} \le \dots \le x_{(n)}$, against the normal quantiles $\Phi^{-1}\{(1 - 0.5)/n\}, \dots, \Phi^{-1}\{(n - 0.5)/n\}$. If this plot shows a linear pattern, then the assumption of normality is supported. (Side note: we can actually create Q-Q plots for any distribution; we just need to use the quantiles of that distribution instead of the normal ones.)

Let us consider the sepal length of setosa flowers, and create a normal Q-Q plot. We can use the qqnorm() function in R.

# Extract only Sepal.Length of setosa flowers
SL <- iris$Sepal.Length[1:50]
# Q-Q plot to assess normality
qqnorm(SL, pch = 19)

[Figure 4: Normal Q-Q plot of sepal length of setosa flowers.]

Note that the theoretical quantiles of a $N(0, 1)$ distribution are plotted on the x-axis. Since the plot is fairly linear, the normality assumption seems reasonable in this case.

We can also employ formal statistical tests to check for normality.

• Shapiro-Wilk test and Shapiro-Francia test: the latter is a simplification of the former; they show similar power to each other. These two are among the more powerful normality tests.
• Kolmogorov-Smirnov (K-S) test and Lilliefors-corrected K-S test: the latter is usually preferred of the two.
• Cramér-von Mises test and Anderson-Darling test (a modification of the CvM test): based on a weighted difference between the empirical and theoretical CDFs.
• Pearson's chi-squared test: a goodness-of-fit test, not highly recommended for continuous distributions.
• Jarque-Bera test and D'Agostino-Pearson omnibus test: moment-based tests.

The nortest package in R implements a few of the tests mentioned above.
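The Q-Q plot coordinates above can also be built by hand straight from the definition, which is a useful check on what qqnorm() is doing. A sketch using the same setosa sepal lengths:

```r
# Manual normal Q-Q plot for the setosa sepal lengths
SL <- iris$Sepal.Length[1:50]       # iris ships with base R
n  <- length(SL)
theo <- qnorm(((1:n) - 0.5) / n)    # Phi^{-1}{(i - 0.5)/n}, i = 1, ..., n
samp <- sort(SL)                    # sorted data x_(1) <= ... <= x_(n)
plot(theo, samp, pch = 19,
     xlab = "Theoretical Quantiles", ylab = "Sample Quantiles")
```

For $n > 10$, qqnorm() uses the same plotting positions $(i - 0.5)/n$ (see ppoints()), so this reproduces Figure 4.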
Overall, the Shapiro-Wilk test shows robust performance against a wide variety of alternatives. (See the article Yap and Sim (2011), "Comparisons of various types of normality tests," for a numerical comparison between various tests.) We can use the function shapiro.test() to perform this test.

shapiro.test(SL)

##
## Shapiro-Wilk normality test
##
## data: SL
## W = 0.9777, p-value = 0.4595

Since the p-value is large (e.g., larger than 5%), we can say that the normality assumption for the data is plausible.

Bivariate and Multivariate normal distributions

The random vector $X_{2 \times 1} = (X_1, X_2)^T$ follows a bivariate normal (Gaussian) distribution with mean vector $\mu = (\mu_1, \mu_2)^T$ and variance-covariance (positive definite) matrix $\Sigma$, denoted $X \sim N_2(\mu, \Sigma)$, if its probability density function is

$$f(x) = (2\pi)^{-1} |\Sigma|^{-1/2} \exp\{-(x - \mu)^T \Sigma^{-1} (x - \mu)/2\}.$$

(Recall the PDF of the univariate normal distribution $N(\mu, \sigma^2)$: $f_Y(y; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\{-\frac{1}{2\sigma^2}(y - \mu)^2\}$.)

[Figure: PDF of a bivariate normal distribution (left) and a scatterplot of a random sample of size 100 from it (right).]

The shape of the PDF (and that of the scatterplot of a random sample generated from the distribution) is determined by $\Sigma$, the variance-covariance matrix of $X$. An easy way to visualize the PDF of a bivariate distribution is to plot the constant probability density contours.

Constant probability density contours: we define the constant probability density contour (also called constant-density contour) of a bivariate normal PDF to be the set of vectors $x$ such that $f(x)$ is constant, that is,

$$\{x : (x - \mu)^T \Sigma^{-1} (x - \mu) = c\}$$

for a specific $c$. These sets are ellipses centered around $\mu$, and the major and minor axes are $\pm \sqrt{c} \sqrt{\lambda_i}\, e_i$, where $\lambda_i$ are the eigenvalues and $e_i$ the corresponding eigenvectors of $\Sigma$.
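The bivariate density and its contours can be evaluated directly from the formula. A minimal sketch (the Σ below reuses the variances and covariance of one of the contour figures purely for illustration; add-on packages such as mvtnorm provide the same computation via dmvnorm()):

```r
# Bivariate normal pdf computed from the formula
mu    <- c(0, 0)
Sigma <- matrix(c(1, 0.5, 0.5, 0.7), 2, 2)  # v(x1) = 1, v(x2) = 0.7, cov = 0.5
dbvn <- function(x, mu, Sigma) {
  d <- x - mu
  q <- drop(t(d) %*% solve(Sigma) %*% d)    # (x - mu)' Sigma^{-1} (x - mu)
  (2 * pi)^(-1) * det(Sigma)^(-1/2) * exp(-q / 2)
}
dbvn(mu, mu, Sigma)          # the density is largest at the mean

# Contour axes come from the eigendecomposition of Sigma:
# the ellipse {x : q = c} has half-axes sqrt(c * lambda_i) along e_i
ev <- eigen(Sigma)
ev$values                    # both positive: Sigma is positive definite
```

At the mean the quadratic form is zero, so the peak density is simply $(2\pi)^{-1}|\Sigma|^{-1/2}$, which matches the largest contour labels in the figures.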
More generally, a random vector $X = (X_1, \dots, X_p)^T$ is said to follow a multivariate normal distribution $N_p(\mu, \Sigma)$, where $\mu$ is a $p \times 1$ vector and $\Sigma$ is a positive definite matrix, if the PDF of $X$ is

$$f(x) = (2\pi)^{-p/2} |\Sigma|^{-1/2} \exp\{-(x - \mu)^T \Sigma^{-1} (x - \mu)/2\}.$$

We can show that $E(X) = \mu$ and that $\mathrm{cov}(X) = \Sigma$.

[Figure 5: PDF and contours of a bivariate normal distribution with v(x1) = v(x2) = 1, cov(x1, x2) = 0.]
[Figure 6: PDF and contours of a bivariate normal distribution with v(x1) = 1, v(x2) = 0.7, cov(x1, x2) = 0.5.]
[Figure 7: PDF and contours of a bivariate normal distribution with v(x1) = 1, v(x2) = 1.3, cov(x1, x2) = -0.5.]
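The moment claims $E(X) = \mu$ and $\mathrm{cov}(X) = \Sigma$ can be checked by simulation: if $Z$ has independent $N(0, 1)$ entries and $L$ is the lower-triangular Cholesky factor of $\Sigma$ (so $\Sigma = LL^T$), then $\mu + LZ \sim N_p(\mu, \Sigma)$. A sketch with illustrative parameters (this particular $\mu$ and $\Sigma$ are arbitrary choices, not from the notes):

```r
set.seed(537)                        # arbitrary seed, for reproducibility
p  <- 3
mu <- c(1, -2, 0)
Sigma <- matrix(c(1.0,  0.5,  0.2,
                  0.5,  1.3, -0.3,
                  0.2, -0.3,  0.8), p, p)
L <- t(chol(Sigma))                  # chol() returns the upper factor; transpose
n <- 1e5
Z <- matrix(rnorm(n * p), nrow = p)  # p x n matrix of independent N(0, 1) draws
X <- mu + L %*% Z                    # each column is one draw from N_p(mu, Sigma)
rowMeans(X)                          # approximately mu
cov(t(X))                            # approximately Sigma
```

This is essentially what simulators such as MASS::mvrnorm() do internally (via an eigendecomposition rather than Cholesky).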