Sampling the Multivariate Standard Normal Distribution Under a Weighted Sum Constraint

Risks 2018, 6, 64; doi:10.3390/risks6030064; www.mdpi.com/journal/risks

Technical Note

Sampling the Multivariate Standard Normal Distribution under a Weighted Sum Constraint

Frédéric Vrins
Louvain Finance Center & CORE, Université catholique de Louvain, 1348 Louvain-la-Neuve, Belgium; [email protected]

Received: 30 May 2018; Accepted: 22 June 2018; Published: 25 June 2018

Abstract: Statistical modeling techniques—and factor models in particular—are extensively used in practice, especially in the insurance and finance industry, where many risks have to be accounted for. In risk management applications, it might be important to analyze the situation when fixing the value of a weighted sum of factors, for example to a given quantile. In this work, we derive the $(n-1)$-dimensional distribution corresponding to an $n$-dimensional i.i.d. standard Normal vector $Z = (Z_1, Z_2, \ldots, Z_n)'$ subject to the weighted sum constraint $w'Z = c$, where $w = (w_1, w_2, \ldots, w_n)'$ and $w_i \neq 0$. This law is proven to be a Normal distribution, whose mean vector $m$ and covariance matrix $\Sigma$ are explicitly derived as a function of $(w, c)$. The derivation of the density relies on the analytical inversion of a very specific positive definite matrix. We show that it does not correspond to naive sampling techniques one could think of. This result is then used to design algorithms for sampling $Z$ under the constraint $w'Z = c$ or $w'Z \leq c$, and is illustrated on two applications dealing with Value-at-Risk and Expected Shortfall.

Keywords: conditional distribution; conditional sampling; multi-factor models

1. Introduction

Factor models are extensively used in statistical modeling. In banking and finance, for instance, it is a standard procedure to introduce a dependence structure among loans (see, e.g., Li's model Li (2016), but also Andersen and Sidenius (2004); Hull and White (2004); Laurent and Sestier (2016); Vrins (2009), just to name a few). In the popular case of a one-factor Gaussian copula model, the default of the $i$-th entity is jointly driven by a common risk factor, say $Y$, as well as an idiosyncratic risk, all being independent and Normally distributed. In multi-factor models, the common factor is replaced by a weighted sum of Normal risk factors $\tilde{Z} := (\tilde{Z}_1, \tilde{Z}_2, \ldots, \tilde{Z}_n)'$ representing different aspects of the global economy (like region, sector, etc.). The random vector $\tilde{Z}$ can typically be expressed (via a Cholesky decomposition) as a weighted sum of $n$ (or fewer) i.i.d. standard Normal factors $Z := (Z_1, Z_2, \ldots, Z_n)'$.

Interestingly, the asymptotic law (as the number of loans tends to infinity) of these portfolios can be derived analytically when only one systematic factor ($n = 1$) is considered (see Vasicek (1991) in the case of homogeneous pools and Gordy (2003) for an extension to heterogeneous pools). However, the case $n > 1$ requires numerical methods, typically Monte Carlo simulations. This raises the following question, whose practical interest will be illustrated with concrete examples: given a value $c$ of a common factor $Y = w'Z$, where $w = (w_1, w_2, \ldots, w_n)'$ is a vector of non-zero weights, what is the distribution of $Z$? In other words, how can we sample $Z$ conditional upon $Y = c$?

It is of course straightforward to sample a vector $Z$ of $n$ Normal variables such that the weighted sum is $c$. One possibility is to sample an $(n-1)$-dimensional i.i.d. standard Normal vector $(Z_1, Z_2, \ldots, Z_{n-1})$ and then set $Z_n = (c - \sum_{i=1}^{n-1} w_i Z_i)/w_n$ (Method 1). Another possibility would be to draw a sample of an $n$-dimensional i.i.d. standard Normal vector $Q = (Q_1, Q_2, \ldots, Q_n)'$ and set $Z_i = Q_i + (c - w'Q)/(n w_i)$ (Method 2). Alternatively, one could just rescale such a vector and take $Z = \frac{c}{w'Q} Q$ (Method 3). However, as discussed in Section 3, none of these approaches yields the correct answer.
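To make the three constructions concrete, here is a minimal sketch (NumPy assumed; the function names are mine, not the paper's). Each routine returns a vector whose weighted sum equals $c$, yet, as the paper argues, none of them reproduces the law of $Z$ conditional on $w'Z = c$.

```python
import numpy as np

rng = np.random.default_rng(0)

def method1(w, c):
    """Method 1: draw Z_1..Z_{n-1} i.i.d. N(0,1), then solve for Z_n from the constraint."""
    z = rng.standard_normal(len(w) - 1)
    return np.append(z, (c - w[:-1] @ z) / w[-1])

def method2(w, c):
    """Method 2: shift an unconstrained draw Q so that the weighted sum becomes c."""
    q = rng.standard_normal(len(w))
    return q + (c - w @ q) / (len(w) * w)

def method3(w, c):
    """Method 3: rescale an unconstrained draw Q so that the weighted sum becomes c."""
    q = rng.standard_normal(len(w))
    return c / (w @ q) * q

w = np.array([0.5, 0.3, 0.2])
for sampler in (method1, method2, method3):
    z = sampler(w, 2.0)
    print(sampler.__name__, np.round(z, 3), "w'z =", round(float(w @ z), 10))
```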
Conditional sampling methods have already been studied in the literature. One example is of course sampling random variables between two bounds, which is a trivial problem solved by the inverse transform; see, e.g., the reference books Glasserman (2010) or Rubinstein and Kroese (2017). In the context of data acquisition, Pham (2000) introduces a clever algorithm to draw $m$ samples from the $n$-dimensional Normal distribution $\mathcal{N}(m, \Sigma)$ subject to the constraint that the empirical (i.e., sample) mean vector $\hat{m}^{(m)}$ and covariance matrix $\hat{\Sigma}^{(m)}$ of these $m$ vectors agree with their theoretical counterparts $m$ and $\Sigma$, respectively. Very recently, Meyer (2018) introduced a method to draw $n$-dimensional samples from a given distribution conditional upon the fact that $q < n$ entries take known values. These works, however, do not provide an answer to the question we are dealing with. In particular, we do not put constraints on the average of sampled vectors, nor force some of their entries to take specific values, but instead work with a constraint on the weighted sum of the $n$ entries of each of these vectors.

In this paper, we derive the $(n-1)$-dimensional conditional distribution associated with the $(w'Z = c)$-slice of the $n$-dimensional standard Normal density when $w_i \neq 0$ for all $i \in \{1, 2, \ldots, n\}$, $n \geq 2$. More specifically, we restrict ourselves to deriving the joint distribution of $X = (X_1, X_2, \ldots, X_n)$, where $X_i := w_i Z_i$ (the distribution of $Z$ can be obtained by simply rescaling that of $X$, since $X = DZ$ with $D$ an invertible diagonal matrix satisfying $D_{i,i} = w_i$). The result derives from the analytical properties of a square positive definite matrix having a very specific form. We conclude the paper with two sampling algorithms and two illustrative examples.

2. Derivation of the Conditional Density

The conditional distribution is derived by first noting that the conditional density takes the general form

$$f_{X_1,\ldots,X_n}\!\left(x_1,\ldots,x_n \,\Big|\, \sum_{i=1}^{n} X_i = c\right) = \frac{f_{X_1,\ldots,X_{n-1},X_n}\!\left(x_1,\ldots,x_{n-1},\, c - \sum_{i=1}^{n-1} x_i\right)}{f_{\sum_{i=1}^{n} X_i}(c)}\,\delta\!\left(\sum_{i=1}^{n} x_i - c\right), \tag{1}$$

where $\delta$ is the Dirac measure centered at 0. This expression is likely to be familiar to most readers, but some technical details are provided in Appendix A.

We shall show that the random vector $X = (X_1, \ldots, X_n)'$ given $\sum_{i=1}^{n} X_i = c$ is distributed as $(\tilde{X}_1, \ldots, \tilde{X}_{n-1},\, c - \sum_{i=1}^{n-1} \tilde{X}_i)$, where the density of $\tilde{X} = (\tilde{X}_1, \ldots, \tilde{X}_{n-1})'$ is a multivariate Normal with mean vector $m(c, w)$ and covariance matrix $\Sigma(w)$ respectively given by

$$m_i(c, w) := \frac{c\, w_i^2}{\|w\|^2}, \tag{2}$$

$$\Sigma_{i,j}(w) := \frac{w_i^2}{\|w\|^2}\left[\delta_{ij}\left(\|w\|^2 - w_i^2\right) + (\delta_{ij} - 1)\, w_j^2\right]. \tag{3}$$

Note that, in these expressions, the indices $i, j$ belong to $\{1, 2, \ldots, n-1\}$ and $\delta_{ij}$ is the Kronecker symbol.
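Taking (2) and (3) at face value, a conditional sampler can be sketched as follows: draw $\tilde{X}$ from $\mathcal{N}(m(c, w), \Sigma(w))$, complete the last coordinate through the constraint, and rescale to recover $Z$. This is only an illustrative NumPy sketch of how the result can be used; the paper's own sampling algorithms appear in its later sections.

```python
import numpy as np

def conditional_mean_cov(w, c):
    """Mean m(c, w) and covariance Sigma(w) of (X_1, ..., X_{n-1}) given sum_i X_i = c,
    where X_i = w_i Z_i, as stated in Equations (2) and (3)."""
    w2 = np.asarray(w, dtype=float) ** 2
    norm2 = w2.sum()                                         # ||w||^2
    m = c * w2[:-1] / norm2                                  # Eq. (2)
    S = -np.outer(w2[:-1], w2[:-1]) / norm2                  # Eq. (3), off-diagonal part
    np.fill_diagonal(S, w2[:-1] * (1.0 - w2[:-1] / norm2))   # Eq. (3), diagonal part
    return m, S

def sample_z_given_weighted_sum(w, c, size, seed=None):
    """Draw Z ~ N(0, I_n) conditional on w'Z = c via the conditional law of X = DZ."""
    rng = np.random.default_rng(seed)
    w = np.asarray(w, dtype=float)
    m, S = conditional_mean_cov(w, c)
    x = rng.multivariate_normal(m, S, size=size)             # X_1, ..., X_{n-1}
    xn = c - x.sum(axis=1, keepdims=True)                    # X_n fixed by the constraint
    return np.hstack([x, xn]) / w                            # Z_i = X_i / w_i

w = np.array([0.5, 0.3, 0.2])
Z = sample_z_given_weighted_sum(w, 2.0, size=5, seed=1)
print(Z @ w)  # every entry equals 2.0 up to floating-point error
```

The last coordinate is still obtained deterministically from the constraint, but the first $n-1$ coordinates are drawn from the conditional law (2)-(3) rather than from independent standard Normals, which is the point on which this construction differs from Method 1.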
The denominator of Equation (1) collapses to the univariate centered Normal density with standard deviation $\|w\|$ evaluated at $c$, denoted $\phi(c; 0, \|w\|)$. Similarly, when $\sum_{i=1}^{n} x_i = c$, the numerator is just the product of univariate Normal densities. Using $x_n = c - \sum_{i=1}^{n-1} x_i$,

$$\prod_{i=1}^{n} \phi(x_i; 0, w_i) = \phi\!\left(c - \sum_{i=1}^{n-1} x_i;\, 0,\, w_n\right) \prod_{i=1}^{n-1} \phi(x_i; 0, w_i).$$

Hence, when the vector $x = (x_1, \ldots, x_n)'$ meets the constraint, (1) looks like an $(n-1)$-dimensional Normal pdf:

$$\frac{\prod_{i=1}^{n} \phi(x_i; 0, w_i)}{\phi(c; 0, \|w\|)} = k(w)\, e^{\frac{c^2}{2\|w\|^2}} \exp\left\{-\frac{1}{2}\sum_{i=1}^{n-1}\left[\left(\frac{1}{w_i^2} + \frac{1}{w_n^2}\right) x_i^2 + \frac{x_i}{w_n^2} \sum_{\substack{j=1 \\ j \neq i}}^{n-1} x_j - \frac{2c}{w_n^2}\, x_i\right] - \frac{c^2}{2 w_n^2}\right\}, \tag{4}$$

where

$$k(w) := \frac{1}{(\sqrt{2\pi})^{n-1}}\, \frac{\|w\|}{\prod_{i=1}^{n} w_i}.$$

On the other hand, the general expression of the Normal density of dimension $n-1$ with mean vector $m = (m_1, \ldots, m_{n-1})'$ and covariance matrix $\Sigma$, whose inverse has entries denoted $a_{i,j} = (\Sigma^{-1})_{i,j}$, can be obtained by expanding the matrix form of the multivariate Normal:

$$\phi(x; m, \Sigma) = K \exp\left\{-\frac{1}{2}\sum_{i=1}^{n-1}\left[a_{i,i}\, x_i^2 + x_i \sum_{\substack{j=1 \\ j \neq i}}^{n-1} a_{i,j}\, x_j - x_i \sum_{j=1}^{n-1} (a_{j,i} + a_{i,j})\, m_j + m_i \sum_{j=1}^{n-1} a_{i,j}\, m_j\right]\right\}, \tag{5}$$

where $K := 1/\sqrt{(2\pi)^{n-1} |\Sigma|}$.

To determine the expression of the covariance matrix and mean vector of the conditional density (4) (assuming it is indeed Normal), it remains to determine the entries of $m$ and $\Sigma^{-1}$ by inspection, comparing the expression of the conditional density in (4) with that of the multivariate Normal (5), and then to obtain $\Sigma$ by analytical inversion. Leaving only $k(w)$ as a factor in front of the exponential in (4), the independent term (i.e., the term that does not appear as a factor of any $x_i$) reads, w.l.o.g., as

$$\frac{c^2}{2\|w\|^2} - \frac{c^2}{2 w_n^2} = -\frac{c^2}{2 w_n^2}\, \frac{\sum_{i=1}^{n-1} w_i^2}{\|w\|^2} = -\frac{c^2}{2 w_n^2 \|w\|^2} \sum_{i=1}^{n-1} \gamma_i w_i^2$$

for any $(\gamma_1, \gamma_2, \ldots, \gamma_{n-1})$ satisfying $\sum_{i=1}^{n-1} \gamma_i w_i^2 = \sum_{i=1}^{n-1} w_i^2$ (note that the constant case $\gamma_i = 1$ might be a solution, but it is not guaranteed at this stage). Comparing (4) and (5), it follows that the expression

$$\left(\frac{1}{w_i^2} + \frac{1}{w_n^2}\right) x_i^2 + \frac{x_i}{w_n^2} \sum_{\substack{j=1 \\ j \neq i}}^{n-1} x_j - \frac{2c}{w_n^2}\, x_i + \frac{c^2 \gamma_i w_i^2}{w_n^2 \|w\|^2} \tag{6}$$

must agree with

$$a_{i,i}\, x_i^2 + x_i \sum_{\substack{j=1 \\ j \neq i}}^{n-1} a_{i,j}\, x_j - x_i \sum_{j=1}^{n-1} (a_{i,j} + a_{j,i})\, m_j + m_i \sum_{j=1}^{n-1} a_{i,j}\, m_j \tag{7}$$

for all $x_1, x_2, \ldots, x_{n-1}$.
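As an informal sanity check on (2) and (3) (not part of the paper, with band width and sample size chosen arbitrarily), one can compare the stated conditional moments with a brute-force rejection estimate that keeps unconditional draws whose weighted sum falls within a thin band around $c$; the empirical moments should agree with $m(c, w)$ and $\Sigma(w)$ up to Monte Carlo noise.

```python
import numpy as np

rng = np.random.default_rng(42)
w = np.array([0.5, 0.3, 0.2])
c, eps = 1.0, 0.02

# Brute force: keep i.i.d. N(0, I) draws whose weighted sum lies within eps of c,
# then look at the first n-1 coordinates of X = (w_1 Z_1, ..., w_n Z_n).
Z = rng.standard_normal((2_000_000, len(w)))
keep = np.abs(Z @ w - c) < eps
X = (Z[keep] * w)[:, :-1]

w2 = w ** 2
m_theory = c * w2[:-1] / w2.sum()                                    # Eq. (2)
S_theory = np.diag(w2[:-1]) - np.outer(w2[:-1], w2[:-1]) / w2.sum()  # Eq. (3)

print(np.round(X.mean(axis=0), 3), np.round(m_theory, 3))   # empirical vs. theoretical mean
print(np.round(np.cov(X, rowvar=False), 3))                 # empirical covariance
print(np.round(S_theory, 3))                                # theoretical covariance
```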